{"id":9503,"date":"2026-01-28T10:54:57","date_gmt":"2026-01-28T10:54:57","guid":{"rendered":"https:\/\/uplatz.com\/blog\/?p=9503"},"modified":"2026-01-28T10:54:57","modified_gmt":"2026-01-28T10:54:57","slug":"compaction-strategies-and-the-small-file-problem-in-object-storage-a-comprehensive-analysis-of-query-performance-optimization","status":"publish","type":"post","link":"https:\/\/uplatz.com\/blog\/compaction-strategies-and-the-small-file-problem-in-object-storage-a-comprehensive-analysis-of-query-performance-optimization\/","title":{"rendered":"Compaction Strategies and the Small File Problem in Object Storage: A Comprehensive Analysis of Query Performance Optimization"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">The modern data architecture has undergone a fundamental shift from local, block-based distributed file systems to cloud-native object storage. While this transition has enabled unprecedented scalability and decoupling of compute from storage, it has also amplified a structural inefficiency known as the small file problem. 
In distributed storage systems, particularly those supporting big data frameworks like Apache Spark, Trino, and Presto, a small file is typically defined as any file significantly smaller than the default block size, often 128 MB or 256 MB.<\/span><span style=\"font-weight: 400;\">1<\/span><span style=\"font-weight: 400;\"> The proliferation of these files\u2014driven by high-frequency streaming ingestion, IoT sensor telemetry, and over-partitioned directory structures\u2014introduces severe performance degradation across the entire data processing lifecycle.<\/span><span style=\"font-weight: 400;\">3<\/span><span style=\"font-weight: 400;\"> As analytics engines struggle to manage millions of discrete objects, compaction strategies have emerged as the primary mechanism for restructuring data to optimize metadata management, I\/O throughput, and query planning efficiency.<\/span><\/p>\n<h2><b>The Genesis and Mechanics of the Small File Problem<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">The emergence of the small file problem is rarely a result of intentional design but is rather an emergent property of contemporary data ingestion patterns. 
Many enterprises deal with streaming sources such as networking equipment, application logs, and server event streams that generate thousands of event logs per second.<\/span><span style=\"font-weight: 400;\">3<\/span><span style=\"font-weight: 400;\"> To achieve near-real-time data freshness, ingestion pipelines frequently commit these records to object storage in tiny increments, often as separate JSON, XML, or CSV files.<\/span><span style=\"font-weight: 400;\">1<\/span><span style=\"font-weight: 400;\"> Over time, this results in a fragmented storage layout where the logical data volume is dwarfed by the administrative overhead of managing its physical manifestation.<\/span><\/p>\n<h3><b>Metadata Inflation and System Bottlenecks<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">In traditional Hadoop Distributed File System (HDFS) environments, the small file problem was primarily a memory constraint on the NameNode. HDFS tracks every file, directory, and block as a metadata record in the NameNode&#8217;s memory, with each record occupying between 150 and 300 bytes.<\/span><span style=\"font-weight: 400;\">1<\/span><span style=\"font-weight: 400;\"> A hundred million small files can consume hundreds of gigabytes of memory, eventually exhausting the NameNode\u2019s capacity and forcing architectural workarounds like federation.<\/span><span style=\"font-weight: 400;\">3<\/span><span style=\"font-weight: 400;\"> Furthermore, because HDFS is optimized for large contiguous files, storing a 16 KB file consumes the same metadata resources as a 128 MB file, leading to massive inefficiencies in memory and storage utilization.<\/span><span style=\"font-weight: 400;\">3<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Cloud object storage providers like Amazon S3, Azure Blob Storage, and Google Cloud Storage (GCS) decouple metadata from a central node, yet the performance penalties remain significant. 
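<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A back-of-the-envelope sketch makes the metadata arithmetic concrete. The figures are illustrative: roughly 300 bytes per metadata record, as cited above, plus an assumed three records per file to cover the file inode, its block, and replica bookkeeping.<\/span><\/p>

```python
# Rough NameNode heap estimate for small-file metadata.
# Assumptions (illustrative): ~300 bytes per metadata record and
# ~3 records per file (file inode + block + replica bookkeeping).
BYTES_PER_RECORD = 300
RECORDS_PER_FILE = 3

def namenode_heap_gb(num_files: int) -> float:
    """Approximate NameNode heap consumed by file metadata, in GB."""
    return num_files * RECORDS_PER_FILE * BYTES_PER_RECORD / 1e9

for files in (1_000_000, 100_000_000, 1_000_000_000):
    print(f"{files:,} files: ~{namenode_heap_gb(files):,.1f} GB of heap")
```

<p><span style=\"font-weight: 400;\">Under these assumptions, one hundred million files already demand on the order of 90 GB of heap, and a billion files push toward a terabyte.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">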
The efficiency of a storage system is commonly measured in IOPS (Input\/Output Operations Per Second), where the cost of each operation comprises seek time, read time, and data transmission time.<\/span><span style=\"font-weight: 400;\">3<\/span><span style=\"font-weight: 400;\"> For mechanical media and even modern SSDs, sequential reads of large files are substantially faster than the random reads required to access thousands of small files.<\/span><span style=\"font-weight: 400;\">3<\/span><span style=\"font-weight: 400;\"> Every file access involves a distinct metadata request to the storage provider\u2019s API, incurring latency for connection establishment, authentication, and metadata retrieval.<\/span><span style=\"font-weight: 400;\">2<\/span><\/p>\n<h3><b>The Computational Cost of Fragmentation<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Distributed query engines are designed to parallelize workloads by splitting data into tasks. When a query scans a table composed of millions of small files, the engine must spawn an equivalent number of tasks, as each file often corresponds to at least one split.<\/span><span style=\"font-weight: 400;\">2<\/span><span style=\"font-weight: 400;\"> This creates a massive scheduling overhead where the time spent on task initialization, metadata fetching, and context switching exceeds the time spent on actual computation.<\/span><span style=\"font-weight: 400;\">1<\/span><\/p>\n<table>\n<tbody>\n<tr>\n<td><b>System Component<\/b><\/td>\n<td><b>Impact of Small Files<\/b><\/td>\n<td><b>Impact of Compacted Files<\/b><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Metadata Store<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Memory exhaustion (HDFS NameNode)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Low overhead, efficient caching<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Network I\/O<\/span><\/td>\n<td><span style=\"font-weight: 400;\">High latency from thousands of API 
calls<\/span><\/td>\n<td><span style=\"font-weight: 400;\">High throughput from sequential reads<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Task Scheduling<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Excessive task creation and imbalanced load<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Optimized parallel execution<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Cost Efficiency<\/span><\/td>\n<td><span style=\"font-weight: 400;\">High API request costs and billable minimums<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Optimized storage and reduced request fees<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"font-weight: 400;\">Data derived from architectural comparisons of HDFS and cloud object storage performance.<\/span><span style=\"font-weight: 400;\">1<\/span><\/p>\n<h2><b>Architectural Constraints and Throttling in Cloud Environments<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Cloud providers implement strict rate limits to maintain service stability across multi-tenant environments. In Amazon S3, request rates are typically limited per prefix, and high volumes of small-file reads can trigger 503 &#8220;Slow Down&#8221; errors.<\/span><span style=\"font-weight: 400;\">9<\/span><span style=\"font-weight: 400;\"> These throttling events force engines to implement exponential backoff, which compounds query latency. Similarly, Azure Storage accounts may experience throttling at the account level for standard tiers or the share level for premium tiers if I\/O operations per second (IOPS) or ingress\/egress limits are exceeded.<\/span><span style=\"font-weight: 400;\">11<\/span><\/p>\n<h3><b>The Economic Implications of Small Files<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The financial burden of small files is twofold: request costs and storage minimums. 
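<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The minimum-billable-size penalty is easy to quantify. The sketch below assumes the 128 KB minimum applied by tiers such as S3 Standard-IA and an illustrative per-GB rate, not a quoted provider price:<\/span><\/p>

```python
# Illustrative monthly cost of tiny objects in a storage tier with a
# 128 KB minimum billable object size. The per-GB-month rate is an
# example value, not a quoted provider price.
MIN_BILLABLE_BYTES = 128 * 1024
RATE_PER_GB_MONTH = 0.0125  # example rate, USD

def monthly_storage_cost(num_objects: int, object_bytes: int) -> float:
    billed_bytes = max(object_bytes, MIN_BILLABLE_BYTES) * num_objects
    return billed_bytes / 1024**3 * RATE_PER_GB_MONTH

tiny = monthly_storage_cost(1_000_000, 10 * 1024)  # 1M files of 10 KB
merged = monthly_storage_cost(77, 128 * 1024**2)   # roughly the same data, 77 x 128 MB
print(f"tiny files: ${tiny:.2f}/month, compacted: ${merged:.2f}/month")
```

<p><span style=\"font-weight: 400;\">The dataset is identical in both cases; the tenfold difference is pure billing overhead from the per-object minimum.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">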
Cloud providers charge per 1,000 requests, and a query engine reading 100,000 tiny files will incur far higher API costs than one reading 100 large files.<\/span><span style=\"font-weight: 400;\">8<\/span><span style=\"font-weight: 400;\"> Furthermore, storage tiers such as Amazon S3 Standard-Infrequent Access and Azure Blob Cool Tier often impose a minimum billable object size, typically 128 KB.<\/span><span style=\"font-weight: 400;\">13<\/span><span style=\"font-weight: 400;\"> If an organization stores millions of 10 KB files, they are billed for 128 KB per file, essentially paying for &#8220;ghost&#8221; data that does not exist.<\/span><span style=\"font-weight: 400;\">13<\/span><\/p>\n<table>\n<tbody>\n<tr>\n<td><b>Storage Tier \/ Feature<\/b><\/td>\n<td><b>Minimum Billable Size<\/b><\/td>\n<td><b>Small File Impact<\/b><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">S3 Standard-IA<\/span><\/td>\n<td><span style=\"font-weight: 400;\">128 KB<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Significant cost increase for &lt; 128 KB objects<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">S3 Glacier Instant Retrieval<\/span><\/td>\n<td><span style=\"font-weight: 400;\">128 KB<\/span><\/td>\n<td><span style=\"font-weight: 400;\">High overhead for small log archives<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">S3 Intelligent-Tiering<\/span><\/td>\n<td><span style=\"font-weight: 400;\">No minimum for storage<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Objects &lt; 128 KB always charged at Frequent Access rate<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Azure Blob Cool Tier<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Typically 64 KB &#8211; 128 KB<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Higher cost per actual byte stored<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"font-weight: 400;\">Economic analysis based on cloud provider billing models 
for tiered storage.<\/span><span style=\"font-weight: 400;\">13<\/span><\/p>\n<h2><b>Fundamental Compaction Strategies and Algorithms<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Compaction is the process of merging fragmented data files into a more efficient, ordered layout. This process is essential for maintaining read performance and reclaiming space in systems based on the Log-Structured Merge (LSM) tree architecture.<\/span><span style=\"font-weight: 400;\">15<\/span><span style=\"font-weight: 400;\"> LSM-based systems buffer writes in memory (MemTables) and flush them to disk as Sorted String Tables (SSTables). As these files accumulate, compaction merges overlapping sorted runs to bound read amplification and remove deleted or superseded data.<\/span><span style=\"font-weight: 400;\">15<\/span><\/p>\n<h3><b>Size-Tiered vs. Leveled Compaction<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The two primary historical compaction strategies are Size-Tiered Compaction Strategy (STCS) and Leveled Compaction Strategy (LCS), each representing a different point on the trade-off curve between write amplification, read amplification, and space amplification.<\/span><span style=\"font-weight: 400;\">16<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Size-Tiered Compaction (STCS):<\/b><span style=\"font-weight: 400;\"> This strategy groups SSTables into &#8220;tiers&#8221; based on their size. When a tier reaches a certain number of files (e.g., four), they are merged into a single larger file in the next tier.<\/span><span style=\"font-weight: 400;\">16<\/span><span style=\"font-weight: 400;\"> STCS is optimized for write-heavy workloads because it copy-merges data fewer times than leveled strategies, resulting in lower write amplification. 
However, it suffers from high read amplification, as query engines must search across multiple SSTables to find the latest version of a record.<\/span><span style=\"font-weight: 400;\">16<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Leveled Compaction (LCS):<\/b><span style=\"font-weight: 400;\"> This strategy organizes data into levels of exponentially increasing capacity. Each level (except L0) consists of non-overlapping files that cover the entire keyspace.<\/span><span style=\"font-weight: 400;\">17<\/span><span style=\"font-weight: 400;\"> When a level reaches its capacity, its files are merged into the overlapping files of the next level. LCS significantly reduces read amplification by ensuring that most reads only need to check one file per level. This comes at the cost of high write amplification, as the same data is rewritten multiple times during the leveling process.<\/span><span style=\"font-weight: 400;\">17<\/span><\/li>\n<\/ol>\n<h3><b>The Unified Compaction Strategy (UCS)<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The Unified Compaction Strategy (UCS) generalizes both tiered and leveled compaction by using a density-based sharding mechanism.<\/span><span style=\"font-weight: 400;\">18<\/span><span style=\"font-weight: 400;\"> Density is defined as the size of an SSTable divided by the width of the token range it covers. UCS uses a scaling parameter w to determine its behavior: positive values of w simulate tiered compaction (low write cost, high read cost), while negative values simulate leveled compaction (high write cost, high read efficiency).<\/span><span style=\"font-weight: 400;\">18<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In the UCS framework, tiered compaction is triggered when w &gt; 0, while leveled compaction is represented by w &lt; 0. 
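<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The trade-off that UCS interpolates between can be quantified with a standard first-order model of the two classical strategies (a deliberate simplification: write amplification counts how many times each byte is rewritten, read amplification how many sorted runs a point read may probe):<\/span><\/p>

```python
import math

# First-order LSM amplification model (a simplification that ignores
# key overlap, Bloom filters, and caching). levels is the number of
# tiers/levels between the memtable flush size and the total data size.
def amplification(total_mb: float, memtable_mb: float, fanout: int) -> dict:
    levels = round(math.log(total_mb / memtable_mb, fanout))
    return {
        # Tiered: a byte is rewritten ~once per tier promotion, but a
        # point read may probe every run in every tier.
        "tiered": {"write_amp": levels, "read_amp": levels * fanout},
        # Leveled: each promotion merges into a level ~fanout times
        # larger, but a read probes at most one run per level.
        "leveled": {"write_amp": levels * fanout, "read_amp": levels},
    }

print(amplification(total_mb=1_000_000, memtable_mb=100, fanout=10))
```

<p><span style=\"font-weight: 400;\">With roughly 1 TB of data, a 100 MB memtable, and a fanout of 10, the model yields four levels: tiered compaction costs about 4 rewrites per byte but up to 40 probes per read, while leveled compaction is the mirror image. This is exactly the spectrum the scaling parameter traverses.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">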
This mathematical unification allows for a stateless compaction process that can be parallelized across different shards of the keyspace, maximizing throughput on high-density storage nodes.<\/span><span style=\"font-weight: 400;\">18<\/span><\/p>\n<h2><b>Advanced Maintenance in Open Table Formats<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">The emergence of open table formats\u2014Apache Iceberg, Delta Lake, and Apache Hudi\u2014has revolutionized how compaction is managed in data lakes. Unlike traditional Hive tables, where compaction required manual ETL jobs and often resulted in &#8220;dirty reads&#8221; during the process, these formats use a metadata layer to provide ACID transactions and snapshot isolation.<\/span><span style=\"font-weight: 400;\">20<\/span><\/p>\n<h3><b>Apache Iceberg: Metadata-Centric Optimization<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Iceberg manages table state through a hierarchy of metadata files, manifest lists, and manifest files.<\/span><span style=\"font-weight: 400;\">23<\/span><span style=\"font-weight: 400;\"> This architecture allows Iceberg to perform &#8220;hidden partitioning,&#8221; where the engine identifies relevant files without relying on the physical directory structure.<\/span><span style=\"font-weight: 400;\">20<\/span><\/p>\n<h4><b>The rewriteDataFiles Action<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">Iceberg provides a powerful maintenance utility called rewriteDataFiles that supports three distinct compaction strategies:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Binpack:<\/b><span style=\"font-weight: 400;\"> This is the default and most efficient strategy. 
It combines small files into larger ones to reach a target size (e.g., 512 MB) without changing the order of the records.<\/span><span style=\"font-weight: 400;\">25<\/span><span style=\"font-weight: 400;\"> It is ideal for resolving the small file problem when query patterns are random or if the data is already adequately partitioned.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Sort:<\/b><span style=\"font-weight: 400;\"> This strategy combines files and sorts the data globally based on specified columns.<\/span><span style=\"font-weight: 400;\">25<\/span><span style=\"font-weight: 400;\"> Sorting significantly improves data skipping by narrowing the min\/max statistics stored in manifest files, allowing query engines to prune irrelevant files more aggressively.<\/span><span style=\"font-weight: 400;\">27<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Z-Order:<\/b><span style=\"font-weight: 400;\"> This technique interleaves data from multiple columns to create a multi-dimensional sort order.<\/span><span style=\"font-weight: 400;\">25<\/span><span style=\"font-weight: 400;\"> Z-Ordering is particularly effective for queries that filter on various combinations of columns, as it preserves spatial locality across multiple dimensions.<\/span><span style=\"font-weight: 400;\">27<\/span><\/li>\n<\/ul>\n<h3><b>Delta Lake: OPTIMIZE and Liquid Clustering<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Delta Lake, pioneered by Databricks, uses a transaction log (_delta_log) to track changes at the file level.<\/span><span style=\"font-weight: 400;\">23<\/span><span style=\"font-weight: 400;\"> It offers a set of automated and manual tools for file hygiene:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>OPTIMIZE Command:<\/b><span style=\"font-weight: 400;\"> This command restructures Delta tables by compacting many small files into larger ones (typically targeting 1 GB).<\/span><span style=\"font-weight: 
400;\">31<\/span><span style=\"font-weight: 400;\"> It can be combined with ZORDER BY to co-locate related data.<\/span><span style=\"font-weight: 400;\">33<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Auto-Optimize and Auto-Compact:<\/b><span style=\"font-weight: 400;\"> These features automate the compaction process. optimizeWrite ensures that the initial write produces larger, better-sized files, while autoCompact runs a post-write background job to merge any remaining tiny files.<\/span><span style=\"font-weight: 400;\">31<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Liquid Clustering:<\/b><span style=\"font-weight: 400;\"> A more recent innovation that replaces traditional partitioning and Z-Ordering with a flexible, incremental clustering mechanism.<\/span><span style=\"font-weight: 400;\">34<\/span><span style=\"font-weight: 400;\"> Liquid Clustering is &#8220;set and forget,&#8221; handling data organization continuously as ingestion patterns change, which is especially useful for streaming pipelines and large, fast-changing datasets.<\/span><span style=\"font-weight: 400;\">35<\/span><\/li>\n<\/ul>\n<h3><b>Apache Hudi: Copy-on-Write vs. Merge-on-Read<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Apache Hudi was designed for streaming ingestion and offers two table types that handle updates and compaction differently <\/span><span style=\"font-weight: 400;\">20<\/span><span style=\"font-weight: 400;\">:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Copy-on-Write (COW):<\/b><span style=\"font-weight: 400;\"> Every update results in the rewrite of entire Parquet files. 
This ensures optimal read performance but results in high write amplification and latency.<\/span><span style=\"font-weight: 400;\">38<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Merge-on-Read (MOR):<\/b><span style=\"font-weight: 400;\"> Updates are appended to row-based delta log files (Avro), while the base data remains in columnar Parquet files.<\/span><span style=\"font-weight: 400;\">37<\/span><span style=\"font-weight: 400;\"> This optimizes for write latency but increases read amplification, as the engine must merge logs and base files on-the-fly during a query.<\/span><span style=\"font-weight: 400;\">37<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">Hudi performs asynchronous <\/span><b>Compaction<\/b><span style=\"font-weight: 400;\"> to merge these delta logs into new base files, and <\/span><b>Clustering<\/b><span style=\"font-weight: 400;\"> to reorganize files and improve data locality.<\/span><span style=\"font-weight: 400;\">41<\/span><span style=\"font-weight: 400;\"> Clustering allows users to batch small files into fewer, larger ones and co-locate frequently queried data via sort keys.<\/span><span style=\"font-weight: 400;\">41<\/span><\/p>\n<h2><b>Impact of Compaction on Query Execution Engines<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">The primary goal of compaction is to improve the efficiency of query execution engines like Trino, Presto, and Apache Spark. 
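<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Before examining those gains, it helps to see how the maintenance actions described above are actually invoked. The sketch below shows representative Spark SQL for both formats; the catalog, table, and column names are placeholders, the Iceberg call assumes a catalog exposing the rewrite_data_files procedure, and the OPTIMIZE syntax is Delta&#8217;s.<\/span><\/p>

```sql
-- Iceberg: binpack small files toward a 512 MB target
-- (Spark SQL procedure; catalog and table names are placeholders).
CALL catalog.system.rewrite_data_files(
  table => 'db.events',
  strategy => 'binpack',
  options => map('target-file-size-bytes', '536870912')
);

-- Delta Lake: compact one partition and co-locate data that is
-- frequently filtered on multiple columns.
OPTIMIZE db.events
WHERE event_date = '2026-01-27'
ZORDER BY (customer_id, event_type);
```

<p><span style=\"font-weight: 400;\">Scoping the rewrite to a partition, as in the Delta example, bounds the cost of each maintenance run.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">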
The performance gains are achieved through three primary mechanisms: metadata pruning, task reduction, and sequential I\/O optimization.<\/span><\/p>\n<h3><b>Metadata Pruning and Data Skipping<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Modern columnar formats like Parquet and ORC store metadata at the footer of each file, including statistics like minimum and maximum values for each column.<\/span><span style=\"font-weight: 400;\">42<\/span><span style=\"font-weight: 400;\"> Compaction, especially when combined with sorting or Z-Ordering, ensures that these statistics are highly selective. For instance, if a table is sorted by timestamp, the query engine can skip reading files where the query&#8217;s time range does not overlap with the file&#8217;s min\/max bounds.<\/span><span style=\"font-weight: 400;\">27<\/span><span style=\"font-weight: 400;\"> This &#8220;data skipping&#8221; reduces the total volume of data transferred over the network, which is often the primary bottleneck in cloud-based analytics.<\/span><span style=\"font-weight: 400;\">24<\/span><\/p>\n<h3><b>Task Scheduling and Parallelism<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">As discussed, engines like Spark create a separate task for each file split. Compaction reduces the total task count, which minimizes the overhead on the driver or coordinator node.<\/span><span style=\"font-weight: 400;\">2<\/span><span style=\"font-weight: 400;\"> In a Trino environment, worker nodes process data in parallel threads. 
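<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The same task-and-file arithmetic governs the write side: a job can size its output by dividing the data volume by a target file size. A minimal sketch (pure arithmetic; the commented line indicates where the result would feed a Spark writer, with an illustrative path):<\/span><\/p>

```python
import math

# Choose an output-partition count so that each partition lands near a
# target file size (engines generally prefer files of 128-512 MB).
def target_partitions(total_bytes: int,
                      target_file_bytes: int = 512 * 1024**2) -> int:
    return max(1, math.ceil(total_bytes / target_file_bytes))

n = target_partitions(total_bytes=120 * 1024**3)  # ~120 GiB of input
print(n)  # -> 240
# In a Spark job the result would drive the writer, e.g. (illustrative):
#   df.repartition(n).write.mode("overwrite").parquet("s3://bucket/table/")
```

<p><span style=\"font-weight: 400;\">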
Reading fewer, larger files allows Trino to leverage its pipelined architecture more effectively, as workers spend more time on actual CPU-bound computation rather than waiting for I\/O from remote object storage.<\/span><span style=\"font-weight: 400;\">42<\/span><\/p>\n<table>\n<tbody>\n<tr>\n<td><b>Metric<\/b><\/td>\n<td><b>Spark Performance (Small Files)<\/b><\/td>\n<td><b>Spark Performance (Compacted)<\/b><\/td>\n<td><b>Gain \/ Reduction<\/b><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Task Count<\/span><\/td>\n<td><span style=\"font-weight: 400;\">10,000<\/span><\/td>\n<td><span style=\"font-weight: 400;\">100<\/span><\/td>\n<td><span style=\"font-weight: 400;\">100x reduction<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Metadata Fetch Time<\/span><\/td>\n<td><span style=\"font-weight: 400;\">45 seconds<\/span><\/td>\n<td><span style=\"font-weight: 400;\">2 seconds<\/span><\/td>\n<td><span style=\"font-weight: 400;\">22.5x faster<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Total Execution Time<\/span><\/td>\n<td><span style=\"font-weight: 400;\">120 seconds<\/span><\/td>\n<td><span style=\"font-weight: 400;\">15 seconds<\/span><\/td>\n<td><span style=\"font-weight: 400;\">8x faster<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Memory Pressure<\/span><\/td>\n<td><span style=\"font-weight: 400;\">High (GC overhead)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Low (Stable heap)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Substantial<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"font-weight: 400;\">Performance metrics based on comparative benchmarks of Spark jobs reading fragmented vs. 
optimized Parquet datasets.<\/span><span style=\"font-weight: 400;\">2<\/span><\/p>\n<h3><b>Bloom Filters and Puffin Files<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">For high-cardinality columns where min\/max statistics are insufficient for pruning (e.g., UUIDs or specific product IDs), Apache Iceberg supports Bloom filters.<\/span><span style=\"font-weight: 400;\">27<\/span><span style=\"font-weight: 400;\"> These are stored in &#8220;Puffin&#8221; files\u2014auxiliary binary containers that hold statistics and indices.<\/span><span style=\"font-weight: 400;\">24<\/span><span style=\"font-weight: 400;\"> By checking a Bloom filter, the engine can determine with high probability whether a file contains a specific value without reading the actual data file. Benchmarks show that point lookups on high-cardinality columns can be 50 to 100 times faster when Bloom filters are utilized.<\/span><span style=\"font-weight: 400;\">47<\/span><\/p>\n<h2><b>Implementation Strategies Across Cloud Providers<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Each major cloud provider offers specific tools and limits that influence how compaction should be implemented.<\/span><\/p>\n<h3><b>Amazon Web Services (AWS)<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">On AWS, compaction is often orchestrated using AWS Glue, Amazon EMR, or AWS Lambda.<\/span><span style=\"font-weight: 400;\">6<\/span><span style=\"font-weight: 400;\"> AWS Glue Data Catalog provides managed compaction for Iceberg tables, automatically merging small files in the background based on defined thresholds.<\/span><span style=\"font-weight: 400;\">28<\/span><span style=\"font-weight: 400;\"> For organizations requiring more control, an AWS Step Functions workflow can invoke Lambda functions to list and merge small objects in parallel.<\/span><span style=\"font-weight: 400;\">13<\/span><span style=\"font-weight: 400;\"> When querying through Amazon Athena, the partition structure and file sizing are critical; 
Athena performs best when files are between 128 MB and 256 MB, balanced against the overhead of manifest file processing.<\/span><span style=\"font-weight: 400;\">13<\/span><\/p>\n<h3><b>Microsoft Azure<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Azure Blob Storage, particularly with the Data Lake Gen2 hierarchical namespace, provides a more traditional filesystem-like experience.<\/span><span style=\"font-weight: 400;\">52<\/span><span style=\"font-weight: 400;\"> This helps with directory renames and metadata operations common in Hadoop-style workloads.<\/span><span style=\"font-weight: 400;\">52<\/span><span style=\"font-weight: 400;\"> For performance tuning in Azure, users can adjust registry values like DirectoryCacheEntrySizeMax on client machines to cache larger directory listings, reducing the frequency of querydirectory calls to the storage service.<\/span><span style=\"font-weight: 400;\">11<\/span><span style=\"font-weight: 400;\"> Azure also offers Lifecycle Management Policies to automatically move older, less-used files to Cool or Archive tiers, though these files should ideally be compacted before transition to avoid high per-object fees.<\/span><span style=\"font-weight: 400;\">8<\/span><\/p>\n<h3><b>Google Cloud Storage (GCS)<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Google Cloud Storage emphasizes a unified API across storage classes and strong integration with BigQuery and Vertex AI.<\/span><span style=\"font-weight: 400;\">54<\/span><span style=\"font-weight: 400;\"> GCS preserves custom metadata during transfers between buckets, which is essential for maintaining the lineage and statistics of compacted files.<\/span><span style=\"font-weight: 400;\">55<\/span><span style=\"font-weight: 400;\"> For real-time analytics, GCS is often paired with the BigQuery Omni or Dataproc platforms, which leverage metadata-based pruning similar to Iceberg and Delta Lake.<\/span><span style=\"font-weight: 400;\">54<\/span><\/p>\n<h2><b>Operational 
Best Practices and Economic Benchmarks<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">The success of a compaction strategy is determined by the balance between the cost of the maintenance operation and the performance benefit to the end users.<\/span><\/p>\n<h3><b>Preventative Measures in the Ingestion Path<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The most cost-effective way to solve the small file problem is to prevent it from occurring in the first place.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Batching at Ingestion:<\/b><span style=\"font-weight: 400;\"> Streaming applications should use larger window sizes or trigger thresholds to ensure that files committed to storage are appropriately sized.<\/span><span style=\"font-weight: 400;\">1<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Repartitioning before Write:<\/b><span style=\"font-weight: 400;\"> In Spark, using df.repartition(n) or df.coalesce(n) before the write() call ensures that each partition produces a single optimized file rather than many tiny ones.<\/span><span style=\"font-weight: 400;\">4<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Bucketing:<\/b><span style=\"font-weight: 400;\"> Dividing tables into a fixed number of hash buckets based on select columns can limit the number of output files and improve join performance.<\/span><span style=\"font-weight: 400;\">5<\/span><\/li>\n<\/ul>\n<h3><b>Reactive Compaction Best Practices<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">When reactive compaction is necessary, it should be targeted and resource-isolated.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Isolate Compute Resources:<\/b><span style=\"font-weight: 400;\"> Compaction is a compute-intensive operation. 
It is often beneficial to run compaction jobs on a separate cluster to avoid impacting the SLAs of production analytics workloads.<\/span><span style=\"font-weight: 400;\">39<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Target Hot Partitions:<\/b><span style=\"font-weight: 400;\"> Focus compaction efforts on frequently queried partitions or those that have recently received a high volume of small updates.<\/span><span style=\"font-weight: 400;\">45<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Monitor Write Amplification:<\/b><span style=\"font-weight: 400;\"> Be aware that aggressive compaction increases write amplification, which can lead to higher storage and compute costs if not managed correctly.<\/span><span style=\"font-weight: 400;\">59<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Maintain Time Order:<\/b><span style=\"font-weight: 400;\"> For time-series data, compaction should preserve the temporal order of records, which aids in data retention and whole-table expiration.<\/span><span style=\"font-weight: 400;\">18<\/span><\/li>\n<\/ul>\n<h3><b>Performance Gains and Cost Reduction Metrics<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Research conducted on large-scale Iceberg and Delta Lake deployments highlights the non-linear impact of compaction on both speed and cost.<\/span><\/p>\n<table>\n<tbody>\n<tr>\n<td><b>Optimization Action<\/b><\/td>\n<td><b>Query Latency Reduction<\/b><\/td>\n<td><b>Storage Cost Reduction<\/b><\/td>\n<td><b>Implementation Difficulty<\/b><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Simple Binpack Compaction<\/span><\/td>\n<td><span style=\"font-weight: 400;\">30% &#8211; 50%<\/span><\/td>\n<td><span style=\"font-weight: 400;\">10% &#8211; 20%<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Low<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Sort \/ Z-Ordering<\/span><\/td>\n<td><span style=\"font-weight: 400;\">60% 
&#8211; 80%<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Negligible<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Moderate<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Bloom Filter Implementation<\/span><\/td>\n<td><span style=\"font-weight: 400;\">90%+ (Point lookups)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">5%<\/span><\/td>\n<td><span style=\"font-weight: 400;\">High<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Storage Tiering (Post-Compaction)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">N\/A<\/span><\/td>\n<td><span style=\"font-weight: 400;\">40% &#8211; 70%<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Moderate<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"font-weight: 400;\">Data summarized from multi-engine performance studies (Spark, Trino, Athena) on cloud object storage.<\/span><span style=\"font-weight: 400;\">30<\/span><\/p>\n<h2><b>Future Trends: Autonomous Storage and Stateless Architectures<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">The trajectory of cloud data storage is moving toward higher levels of abstraction where the small file problem is managed autonomously by the storage layer itself.<\/span><\/p>\n<h3><b>Stateless Storage Architectures<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">New architectures, such as AutoMQ&#8217;s S3-based stateless design, leverage object storage to replace expensive local disks even for the most demanding streaming workloads.<\/span><span style=\"font-weight: 400;\">20<\/span><span style=\"font-weight: 400;\"> By adopting a storage-compute separation where the storage layer handles the persistence and reorganization of data in real-time, these systems can eliminate the traditional trade-off between ingestion latency and query performance.<\/span><span style=\"font-weight: 400;\">20<\/span><\/p>\n<h3><b>Predictive and Managed Optimization<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Managed table 
services, such as S3 Tables and Databricks Predictive Optimization, are beginning to use machine learning to analyze query patterns and automatically determine the optimal compaction and clustering strategy.<\/span><span style=\"font-weight: 400;\">28<\/span><span style=\"font-weight: 400;\"> This &#8220;set and forget&#8221; approach allows data engineers to focus on business logic while the infrastructure handles the low-level data layout, manifest pruning, and storage tiering.<\/span><span style=\"font-weight: 400;\">28<\/span><\/p>\n<h3><b>The Role of Vectorized Engines<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The performance of compacted data is further enhanced by vectorized execution engines like Photon (Databricks) and the new scan layer in Amazon Redshift.<\/span><span style=\"font-weight: 400;\">34<\/span><span style=\"font-weight: 400;\"> These engines process data in &#8220;batches&#8221; or &#8220;vectors&#8221; rather than one row at a time, making them exceptionally efficient at reading the large, contiguous blocks of data produced by modern compaction strategies.<\/span><span style=\"font-weight: 400;\">34<\/span><\/p>\n<h2><b>Summary of Architectural Conclusions<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">The small file problem is an inherent challenge of cloud-native data lakes, stemming from the fundamental mismatch between the high-frequency nature of modern data ingestion and the large-block optimizations of distributed query engines. While the problem manifests as increased metadata overhead, API throttling, and excessive task scheduling, it is effectively mitigated through a layered strategy of prevention and reactive compaction.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">As demonstrated by the Unified Compaction Strategy and the managed services provided by Apache Iceberg, Delta Lake, and Apache Hudi, the path to optimal query performance lies in the intelligent reorganization of data. 
Compaction must be viewed not as a one-time fix but as a continuous table maintenance service that balances the cost of rewriting data against the massive gains in query efficiency. For point lookups, Bloom filters and Puffin files provide surgical precision, while for large-scale analytical scans, sorting and Z-Ordering ensure that query engines process only the data strictly necessary for the result. Ultimately, as the industry moves toward autonomous storage layers, the burden of managing small files will shift from manual data engineering to intelligent, metadata-driven infrastructure, enabling the next generation of real-time, petabyte-scale analytics.<\/span><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2374],"tags":[],"class_list":["post-9503","post","type-post","status-publish","format-standard","hentry","category-deep-research"]}