The modern data architecture has undergone a fundamental shift from local, block-based distributed file systems to cloud-native object storage. While this transition has enabled unprecedented scalability and decoupling of compute from storage, it has also amplified a structural inefficiency known as the small file problem. In distributed storage systems, particularly those supporting big data frameworks like Apache Spark, Trino, and Presto, a small file is typically defined as any file significantly smaller than the default block size, often 128 MB or 256 MB.1 The proliferation of these files—driven by high-frequency streaming ingestion, IoT sensor telemetry, and over-partitioned directory structures—introduces severe performance degradation across the entire data processing lifecycle.3 As analytics engines struggle to manage millions of discrete objects, compaction strategies have emerged as the primary mechanism for restructuring data to optimize metadata management, I/O throughput, and query planning efficiency.
The Genesis and Mechanics of the Small File Problem
The emergence of the small file problem is rarely a result of intentional design but is rather an emergent property of contemporary data ingestion patterns. Many enterprises deal with streaming sources such as networking equipment, application logs, and server event streams that generate thousands of event logs per second.3 To achieve near-real-time data freshness, ingestion pipelines frequently commit these records to object storage in tiny increments, often as separate JSON, XML, or CSV files.1 Over time, this results in a fragmented storage layout where the logical data volume is dwarfed by the administrative overhead of managing its physical manifestation.
Metadata Inflation and System Bottlenecks
In traditional Hadoop Distributed File System (HDFS) environments, the small file problem was primarily a memory constraint on the NameNode. HDFS tracks every file, directory, and block as a metadata record in the NameNode’s memory, with each record occupying between 150 and 300 bytes.1 At those rates, a hundred million small files (each needing at least a file record and a block record) consumes tens of gigabytes of heap, and billions of files push the requirement into the hundreds of gigabytes, eventually exhausting the NameNode’s capacity and forcing architectural workarounds like federation.3 Furthermore, because HDFS is optimized for large contiguous files, storing a 16 KB file consumes the same metadata resources as a 128 MB file, leading to massive inefficiencies in memory and storage utilization.3
Cloud object storage providers like Amazon S3, Azure Blob Storage, and Google Cloud Storage (GCS) decouple metadata from a central node, yet the performance penalties remain significant. The efficiency of a storage system is largely measured by IOPS (Input/Output Operations Per Second), which includes seek time, read time, and data transmission time.3 For mechanical media and even modern SSDs, sequential reads of large files are substantially faster than the random reads required to access thousands of small files.3 Every file access involves a distinct metadata request to the storage provider’s API, incurring latency for connection establishment, authentication, and metadata retrieval.2
The Computational Cost of Fragmentation
Distributed query engines are designed to parallelize workloads by splitting data into tasks. When a query scans a table composed of millions of small files, the engine must spawn an equivalent number of tasks, as each file often corresponds to at least one split.2 This creates a massive scheduling overhead where the time spent on task initialization, metadata fetching, and context switching exceeds the time spent on actual computation.1
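A small PySpark sketch of how this shows up in practice: the number of input partitions, and therefore scheduled tasks, grows with the file count. The path is hypothetical, and Spark may pack several small files into a single split depending on spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes.

```python
# Minimal sketch (hypothetical path): inspect how many input partitions a
# fragmented dataset produces. Each partition becomes a scheduled task, so a
# directory with hundreds of thousands of tiny files yields a comparable task count.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-file-task-count").getOrCreate()

df = spark.read.parquet("s3://analytics-lake/events/")  # hypothetical location

print("Input partitions (approximate task count):", df.rdd.getNumPartitions())
```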
| System Component | Impact of Small Files | Impact of Compacted Files |
| Metadata Store | Memory exhaustion (HDFS NameNode) | Low overhead, efficient caching |
| Network I/O | High latency from thousands of API calls | High throughput from sequential reads |
| Task Scheduling | Excessive task creation and imbalanced load | Optimized parallel execution |
| Cost Efficiency | High API request costs and billable minimums | Optimized storage and reduced request fees |
Data derived from architectural comparisons of HDFS and cloud object storage performance.1
Architectural Constraints and Throttling in Cloud Environments
Cloud providers implement strict rate limits to maintain service stability across multi-tenant environments. In Amazon S3, request rates are typically limited per prefix, and high volumes of small-file reads can trigger 503 “Slow Down” errors.9 These throttling events force engines to implement exponential backoff, which compounds query latency. Similarly, Azure Storage accounts may experience throttling at the account level for standard tiers or the share level for premium tiers if I/O operations per second (IOPS) or ingress/egress limits are exceeded.11
The Economic Implications of Small Files
The financial burden of small files is twofold: request costs and storage minimums. Cloud providers charge per 1,000 requests, and a query engine reading 100,000 tiny files will incur far higher API costs than one reading 100 large files.8 Furthermore, storage tiers such as Amazon S3 Standard-Infrequent Access and Azure Blob Cool Tier often impose a minimum billable object size, typically 128 KB.13 If an organization stores millions of 10 KB files, it is billed for 128 KB per file, essentially paying for “ghost” data that does not exist.13
| Storage Tier / Feature | Minimum Billable Size | Small File Impact |
| S3 Standard-IA | 128 KB | Significant cost increase for < 128 KB objects |
| S3 Glacier Instant Retrieval | 128 KB | High overhead for small log archives |
| S3 Intelligent-Tiering | No minimum for storage | Objects < 128 KB always charged at Frequent Access rate |
| Azure Blob Cool Tier | Typically 64 KB – 128 KB | Higher cost per actual byte stored |
Economic analysis based on cloud provider billing models for tiered storage.13
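A back-of-the-envelope sketch of the billing effect described above, using purely illustrative figures rather than any provider’s actual rates:

```python
# Ghost-data arithmetic for a tier with a 128 KB minimum billable object size.
# All figures are illustrative assumptions.
object_count = 5_000_000          # hypothetical number of small objects
actual_size_kb = 10               # real payload per object
billable_minimum_kb = 128         # minimum billable size on the tier

actual_tb = object_count * actual_size_kb / 1024**3
billed_tb = object_count * billable_minimum_kb / 1024**3
print(f"Stored: {actual_tb:.2f} TB, billed as: {billed_tb:.2f} TB "
      f"({billed_tb / actual_tb:.1f}x overhead)")
```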
Fundamental Compaction Strategies and Algorithms
Compaction is the process of merging fragmented data files into a more efficient, ordered layout. This process is essential for maintaining read performance and reclaiming space in systems based on the Log-Structured Merge (LSM) tree architecture.15 LSM-based systems buffer writes in memory (MemTables) and flush them to disk as Sorted String Tables (SSTables). As these files accumulate, compaction merges overlapping sorted runs to bound read amplification and remove deleted or superseded data.15
Size-Tiered vs. Leveled Compaction
The two primary historical compaction strategies are Size-Tiered Compaction Strategy (STCS) and Leveled Compaction Strategy (LCS), each representing a different point on the trade-off curve between write amplification, read amplification, and space amplification.16
- Size-Tiered Compaction (STCS): This strategy groups SSTables into “tiers” based on their size. When a tier accumulates a certain number of similarly sized files (e.g., four), they are merged into a single larger file in the next tier.16 STCS is optimized for write-heavy workloads because it rewrites data fewer times than leveled strategies, resulting in lower write amplification. However, it suffers from high read amplification, as query engines must search across multiple SSTables to find the latest version of a record.16 (A simplified trigger sketch follows this list.)
- Leveled Compaction (LCS): This strategy organizes data into levels of exponentially increasing capacity. Each level (except L0) consists of non-overlapping files that cover the entire keyspace.17 When a level reaches its capacity, its files are merged into the overlapping files of the next level. LCS significantly reduces read amplification by ensuring that most reads only need to check one file per level. This comes at the cost of high write amplification, as the same data is rewritten multiple times during the leveling process.17
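The size-tiered trigger can be illustrated with a minimal sketch: SSTables are bucketed with others of similar size, and a bucket is compacted once it accumulates a minimum number of files. This is an illustrative simplification with hypothetical function and parameter names, not any engine’s actual implementation.

```python
# Simplified size-tiered trigger: group SSTables whose sizes are within a ratio of
# each other, then flag buckets that have accumulated enough files to merge.
def size_tiered_candidates(sstable_sizes_mb, min_threshold=4, bucket_ratio=2.0):
    buckets = []  # each bucket holds sizes within bucket_ratio of its smallest member
    for size in sorted(sstable_sizes_mb):
        for bucket in buckets:
            if size <= bucket[0] * bucket_ratio:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    # Buckets with at least min_threshold files are merged into the next tier.
    return [b for b in buckets if len(b) >= min_threshold]

print(size_tiered_candidates([4, 5, 5, 6, 40, 64, 512]))  # -> [[4, 5, 5, 6]]
```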
The Unified Compaction Strategy (UCS)
The Unified Compaction Strategy (UCS) generalizes both tiered and leveled compaction by using a density-based sharding mechanism.18 Density is defined as the size of an SSTable divided by the width of the token range it covers. UCS uses a scaling parameter, written here as w, to determine its behavior: positive values of w simulate tiered compaction (low write cost, high read cost), while negative values simulate leveled compaction (high write cost, high read efficiency).18
In the UCS framework, tiered compaction corresponds to w > 0, while leveled compaction corresponds to w < 0 (with w = 0 as the point where the two coincide). This mathematical unification allows for a stateless compaction process that can be parallelized across different shards of the keyspace, maximizing throughput on high-density storage nodes.18
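A minimal sketch of these two ideas, keeping the assumption above that the scaling parameter is written w: density normalizes an SSTable’s size by the token-range fraction it covers, and the sign of w selects tiered-like or leveled-like behavior. Function names are illustrative.

```python
# Density and mode selection in the spirit of UCS (illustrative, not the real engine).
def density(sstable_size_bytes, token_range_fraction):
    # token_range_fraction: share of the full token ring this SSTable covers (0..1]
    return sstable_size_bytes / token_range_fraction

def compaction_mode(w):
    if w > 0:
        return "tiered-like: lower write amplification, higher read amplification"
    if w < 0:
        return "leveled-like: higher write amplification, lower read amplification"
    return "middle ground between tiered and leveled"

print(compaction_mode(2))
print(compaction_mode(-2))
print(f"density = {density(256 * 1024**2, 0.25):,.0f} bytes per unit of token range")
```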
Advanced Maintenance in Open Table Formats
The emergence of open table formats—Apache Iceberg, Delta Lake, and Apache Hudi—has revolutionized how compaction is managed in data lakes. Unlike traditional Hive tables, where compaction required manual ETL jobs and often resulted in “dirty reads” during the process, these formats use a metadata layer to provide ACID transactions and snapshot isolation.20
Apache Iceberg: Metadata-Centric Optimization
Iceberg manages table state through a hierarchy of metadata files, manifest lists, and manifest files.23 This architecture allows Iceberg to perform “hidden partitioning,” where the engine identifies relevant files without relying on the physical directory structure.20
The rewriteDataFiles Action
Iceberg provides a powerful maintenance utility called rewriteDataFiles that supports three distinct compaction strategies (a usage sketch follows the list):
- Binpack: This is the default and most efficient strategy. It combines small files into larger ones to reach a target size (e.g., 512 MB) without changing the order of the records.25 It is ideal for resolving the small file problem when query patterns are random or if the data is already adequately partitioned.
- Sort: This strategy combines files and sorts the data globally based on specified columns.25 Sorting significantly improves data skipping by narrowing the min/max statistics stored in manifest files, allowing query engines to prune irrelevant files more aggressively.27
- Z-Order: This technique interleaves data from multiple columns to create a multi-dimensional sort order.25 Z-Ordering is particularly effective for queries that filter on various combinations of columns, as it preserves spatial locality across multiple dimensions.27
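In Spark deployments this action is typically invoked through the rewrite_data_files procedure. The sketch below assumes an Iceberg catalog named demo, a table db.events, and columns device_id and event_ts, all hypothetical; parameter and option names follow Iceberg’s documented Spark procedures.

```python
# Sketch: compacting an Iceberg table from Spark via the rewrite_data_files procedure.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-compaction").getOrCreate()

# Binpack (default): merge small files toward a target size without re-sorting.
spark.sql("""
  CALL demo.system.rewrite_data_files(
    table => 'db.events',
    strategy => 'binpack',
    options => map('target-file-size-bytes', '536870912')
  )
""")

# Sort strategy with a Z-order expression: re-cluster data while compacting.
spark.sql("""
  CALL demo.system.rewrite_data_files(
    table => 'db.events',
    strategy => 'sort',
    sort_order => 'zorder(device_id, event_ts)'
  )
""")
```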
Delta Lake: OPTIMIZE and Liquid Clustering
Delta Lake, pioneered by Databricks, uses a transaction log (_delta_log) to track changes at the file level.23 It offers a set of automated and manual tools for file hygiene (a usage sketch follows the list):
- OPTIMIZE Command: This command restructures Delta tables by compacting many small files into larger ones (typically targeting 1 GB).31 It can be combined with ZORDER BY to co-locate related data.33
- Auto-Optimize and Auto-Compact: These features automate the compaction process. optimizeWrite ensures that the initial write produces fewer, appropriately sized files, while autoCompact runs a post-write compaction pass to merge any remaining tiny files.31
- Liquid Clustering: A more recent innovation that replaces traditional partitioning and Z-Ordering with a flexible, incremental clustering mechanism.34 Liquid Clustering is “set and forget,” handling data organization continuously as ingestion patterns change, which is especially useful for streaming pipelines and large, fast-changing datasets.35
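A short sketch of the manual and property-driven paths in Spark SQL; the table name sales and the clustering column customer_id are hypothetical, and the property keys follow Delta Lake’s documented names.

```python
# Sketch: Delta Lake file hygiene via table properties and the OPTIMIZE command.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-maintenance").getOrCreate()

# Opt the table into optimized writes and post-write auto-compaction.
spark.sql("""
  ALTER TABLE sales SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")

# Manually compact existing small files and co-locate rows by a common filter column.
spark.sql("OPTIMIZE sales ZORDER BY (customer_id)")
```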
Apache Hudi: Copy-on-Write vs. Merge-on-Read
Apache Hudi was designed for streaming ingestion and offers two table types that handle updates and compaction differently:20
- Copy-on-Write (COW): Every update rewrites the affected Parquet files in full. This ensures optimal read performance but results in high write amplification and latency.38
- Merge-on-Read (MOR): Updates are appended to row-based delta log files (Avro), while the base data remains in columnar Parquet files.37 This optimizes for write latency but increases read amplification, as the engine must merge logs and base files on-the-fly during a query.37
Hudi performs asynchronous Compaction to merge these delta logs into new base files, and Clustering to reorganize files and improve data locality.41 Clustering allows users to batch small files into fewer, larger ones and co-locate frequently queried data via sort keys.41
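A sketch of a Merge-on-Read writer configuration with inline compaction and clustering enabled; the table, path, and column names are hypothetical, and the option keys follow Hudi’s documented configuration names.

```python
# Sketch: Hudi MOR writer options that keep delta logs and small files in check.
hudi_options = {
    "hoodie.table.name": "events_mor",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    # Merge accumulated delta log files into new base files every N delta commits.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
    # Periodically rewrite small files into larger, sort-clustered ones.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.plan.strategy.sort.columns": "device_id,event_ts",
}

# Applied on write, e.g. (assuming an existing DataFrame df):
# df.write.format("hudi").options(**hudi_options).mode("append") \
#   .save("s3://analytics-lake/events_mor/")
```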
Impact of Compaction on Query Execution Engines
The primary goal of compaction is to improve the efficiency of query execution engines like Trino, Presto, and Apache Spark. The performance gains are achieved through three primary mechanisms: metadata pruning, task reduction, and sequential I/O optimization.
Metadata Pruning and Data Skipping
Modern columnar formats like Parquet and ORC store metadata at the footer of each file, including statistics like minimum and maximum values for each column.42 Compaction, especially when combined with sorting or Z-Ordering, ensures that these statistics are highly selective. For instance, if a table is sorted by timestamp, the query engine can skip reading files where the query’s time range does not overlap with the file’s min/max bounds.27 This “data skipping” reduces the total volume of data transferred over the network, which is often the primary bottleneck in cloud-based analytics.24
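The pruning decision itself is simple; a minimal, self-contained sketch with hypothetical per-file statistics:

```python
# Sketch: min/max data skipping over per-file column statistics (as exposed by
# Parquet footers or table-format manifests). Only overlapping files are scanned.
from datetime import datetime

file_stats = [  # hypothetical statistics for an event_ts column
    {"path": "part-000.parquet", "min": datetime(2024, 1, 1), "max": datetime(2024, 1, 7)},
    {"path": "part-001.parquet", "min": datetime(2024, 1, 8), "max": datetime(2024, 1, 14)},
    {"path": "part-002.parquet", "min": datetime(2024, 2, 1), "max": datetime(2024, 2, 7)},
]

query_lo, query_hi = datetime(2024, 1, 10), datetime(2024, 1, 20)

# Keep only files whose [min, max] range intersects the query's time range.
to_scan = [f["path"] for f in file_stats if f["max"] >= query_lo and f["min"] <= query_hi]
print(to_scan)  # ['part-001.parquet']
```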
Task Scheduling and Parallelism
As discussed, engines like Spark create a separate task for each file split. Compaction reduces the total task count, which minimizes the overhead on the driver or coordinator node.2 In a Trino environment, worker nodes process data in parallel threads. Reading fewer, larger files allows Trino to leverage its pipelined architecture more effectively, as workers spend more time on actual CPU-bound computation rather than waiting for I/O from remote object storage.42
| Metric | Spark Performance (Small Files) | Spark Performance (Compacted) | Gain / Reduction |
| Task Count | 10,000 | 100 | 100x reduction |
| Metadata Fetch Time | 45 seconds | 2 seconds | 22.5x faster |
| Total Execution Time | 120 seconds | 15 seconds | 8x faster |
| Memory Pressure | High (GC overhead) | Low (Stable heap) | Substantial |
Performance metrics based on comparative benchmarks of Spark jobs reading fragmented vs. optimized Parquet datasets.2
Bloom Filters and Puffin Files
For high-cardinality columns where min/max statistics are insufficient for pruning (e.g., UUIDs or specific product IDs), Apache Iceberg supports Bloom filters.27 These are stored in “Puffin” files—auxiliary binary containers that hold statistics and indices.24 By checking a Bloom filter, the engine can determine with high probability whether a file contains a specific value without reading the actual data file. Benchmarks show that point lookups on high-cardinality columns can be 50 to 100 times faster when Bloom filters are utilized.47
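One common way to opt an Iceberg table into Bloom filters is a Parquet-level write property, so that subsequent writes and compaction rewrites emit the filter; the catalog, table, and column names below are hypothetical.

```python
# Sketch: enable a Parquet Bloom filter on a high-cardinality column of an Iceberg table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bloom-filter-setup").getOrCreate()

spark.sql("""
  ALTER TABLE demo.db.events SET TBLPROPERTIES (
    'write.parquet.bloom-filter-enabled.column.order_uuid' = 'true'
  )
""")
```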
Implementation Strategies Across Cloud Providers
Each major cloud provider offers specific tools and limits that influence how compaction should be implemented.
Amazon Web Services (AWS)
On AWS, compaction is often orchestrated using AWS Glue, Amazon EMR, or AWS Lambda.6 AWS Glue Data Catalog provides managed compaction for Iceberg tables, automatically merging small files in the background based on defined thresholds.28 For organizations requiring more control, an AWS Step Functions workflow can invoke Lambda functions to list and merge small objects in parallel.13 When querying through Amazon Athena, the partition structure and file sizing are critical; Athena performs best when files are between 128 MB and 256 MB, balanced against the overhead of manifest file processing.13
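For the self-managed path, the inventory step of such a workflow might resemble the sketch below, which lists objects under a prefix and flags those below a size threshold; bucket and prefix names are hypothetical, and only standard boto3 listing calls are used.

```python
# Sketch: find compaction candidates under a prefix before launching a merge job.
import boto3

SMALL_FILE_THRESHOLD = 128 * 1024 * 1024  # 128 MB, aligned with common target sizes

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

small_objects = []
for page in paginator.paginate(Bucket="analytics-lake", Prefix="events/dt=2024-01-15/"):
    for obj in page.get("Contents", []):
        if obj["Size"] < SMALL_FILE_THRESHOLD:
            small_objects.append((obj["Key"], obj["Size"]))

print(f"{len(small_objects)} objects below threshold; candidates for a merge job")
```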
Microsoft Azure
Azure Blob Storage, particularly with the Data Lake Gen2 hierarchical namespace, provides a more traditional filesystem-like experience.52 This helps with directory renames and metadata operations common in Hadoop-style workloads.52 For performance tuning in Azure, users can adjust registry values like DirectoryCacheEntrySizeMax on client machines to cache larger directory listings, reducing the frequency of QueryDirectory calls to the storage service.11 Azure also offers Lifecycle Management Policies to automatically move older, less-used files to Cool or Archive tiers, though these files should ideally be compacted before transition to avoid high per-object fees.8
Google Cloud Storage (GCS)
Google Cloud Storage emphasizes a unified API across storage classes and strong integration with BigQuery and Vertex AI.54 GCS preserves custom metadata during transfers between buckets, which is essential for maintaining the lineage and statistics of compacted files.55 For real-time analytics, GCS is often paired with the BigQuery Omni or Dataproc platforms, which leverage metadata-based pruning similar to Iceberg and Delta Lake.54
Operational Best Practices and Economic Benchmarks
The success of a compaction strategy is determined by the balance between the cost of the maintenance operation and the performance benefit to the end users.
Preventative Measures in the Ingestion Path
The most cost-effective way to solve the small file problem is to prevent it from occurring in the first place.
- Batching at Ingestion: Streaming applications should use larger window sizes or trigger thresholds to ensure that files committed to storage are appropriately sized.1
- Repartitioning before Write: In Spark, using df.repartition(n) or df.coalesce(n) before the write() call ensures that each partition produces a single, appropriately sized file rather than many tiny ones.4 (See the sketch after this list.)
- Bucketing: Dividing tables into a fixed number of hash buckets based on select columns can limit the number of output files and improve join performance.5
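A sketch of both preventative techniques in PySpark; paths, table and column names, and the partition and bucket counts are hypothetical and should be tuned toward the target file size.

```python
# Sketch: control output file counts at write time instead of compacting later.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-sizing").enableHiveSupport().getOrCreate()
df = spark.read.parquet("s3://analytics-lake/raw/events/")

# One output file per partition: pick n so each file lands near the target size.
(df.repartition(64)
   .write.mode("overwrite")
   .parquet("s3://analytics-lake/curated/events/"))

# Alternatively, bucket by a join key into a fixed number of files per partition.
(df.write.mode("overwrite")
   .bucketBy(32, "customer_id")
   .sortBy("customer_id")
   .saveAsTable("curated.events_bucketed"))
```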
Reactive Compaction Best Practices
When reactive compaction is necessary, it should be targeted and resource-isolated.
- Isolate Compute Resources: Compaction is a compute-intensive operation. It is often beneficial to run compaction jobs on a separate cluster to avoid impacting the SLAs of production analytics workloads.39
- Target Hot Partitions: Focus compaction efforts on frequently queried partitions or those that have recently received a high volume of small updates.45
- Monitor Write Amplification: Be aware that aggressive compaction increases write amplification, which can lead to higher storage and compute costs if not managed correctly.59
- Maintain Time Order: For time-series data, compaction should preserve the temporal order of records, which aids in data retention and efficient whole-file expiration.18
Performance Gains and Cost Reduction Metrics
Research conducted on large-scale Iceberg and Delta Lake deployments highlights the non-linear impact of compaction on both speed and cost.
| Optimization Action | Query Latency Reduction | Storage Cost Reduction | Implementation Difficulty |
| Simple Binpack Compaction | 30% – 50% | 10% – 20% | Low |
| Sort / Z-Ordering | 60% – 80% | Negligible | Moderate |
| Bloom Filter Implementation | 90%+ (Point lookups) | 5% | High |
| Storage Tiering (Post-Compaction) | N/A | 40% – 70% | Moderate |
Data summarized from multi-engine performance studies (Spark, Trino, Athena) on cloud object storage.30
Future Trends: Autonomous Storage and Stateless Architectures
The trajectory of cloud data storage is moving toward higher levels of abstraction where the small file problem is managed autonomously by the storage layer itself.
Stateless Storage Architectures
New architectures, such as AutoMQ’s S3-based stateless design, leverage object storage to replace expensive local disks even for the most demanding streaming workloads.20 By adopting a storage-compute separation where the storage layer handles the persistence and reorganization of data in real-time, these systems can eliminate the traditional trade-off between ingestion latency and query performance.20
Predictive and Managed Optimization
Managed table services, such as S3 Tables and Databricks Predictive Optimization, are beginning to use machine learning to analyze query patterns and automatically determine the optimal compaction and clustering strategy.28 This “set and forget” approach allows data engineers to focus on business logic while the infrastructure handles the low-level data layout, manifest pruning, and storage tiering.28
The Role of Vectorized Engines
The performance of compacted data is further enhanced by vectorized execution engines like Photon (Databricks) and the new scan layer in Amazon Redshift.34 These engines process data in “batches” or “vectors” rather than one row at a time, making them exceptionally efficient at reading the large, contiguous blocks of data produced by modern compaction strategies.34
Summary of Architectural Conclusions
The small file problem is an inherent challenge of cloud-native data lakes, stemming from the fundamental mismatch between the high-frequency nature of modern data ingestion and the large-block optimizations of distributed query engines. While the problem manifests as increased metadata overhead, API throttling, and excessive task scheduling, it is effectively mitigated through a layered strategy of prevention and reactive compaction.
As demonstrated by the Unified Compaction Strategy and the managed services provided by Apache Iceberg, Delta Lake, and Apache Hudi, the path to optimal query performance lies in the intelligent reorganization of data. Compaction must be viewed not as a one-time fix but as a continuous table maintenance service that balances the cost of rewriting data against the massive gains in query efficiency. For point lookups, Bloom filters and Puffin files provide surgical precision, while for large-scale analytical scans, sorting and Z-Ordering ensure that query engines process only the data strictly necessary for the result. Ultimately, as the industry moves toward autonomous storage layers, the burden of managing small files will shift from manual data engineering to intelligent, metadata-driven infrastructure, enabling the next generation of real-time, petabyte-scale analytics.
