The Convergence and Divergence of Open Table Formats: A 2025 Comprehensive Report on Apache Iceberg, Delta Lake, and Apache Hudi

1. Executive Summary: The State of the Lakehouse in 2025

The enterprise data landscape has undergone a radical architectural shift over the last half-decade, transitioning from the bifurcation of Data Lakes (low-cost, unstructured) and Data Warehouses (high-performance, governed) to the unified “Data Lakehouse.” Central to this unification is the Open Table Format (OTF)—a middleware layer that superimposes database-like reliability, transactional guarantees, and metadata management atop immutable object storage files (typically Parquet, ORC, or Avro).

As of 2025, the “format wars” that characterized the early 2020s have largely stabilized into a tripartite ecosystem dominated by Apache Iceberg, Delta Lake, and Apache Hudi. While early rhetoric suggested a “winner-take-all” outcome, the current reality reflects a sophisticated market segmentation where engineering teams select formats based on specific workload characteristics—streaming mutation, batch throughput, or ecosystem interoperability—rather than generic superiority.1 Furthermore, the emergence of interoperability layers such as Apache XTable (formerly OneTable) and Delta Lake UniForm has begun to commoditize the storage layer, allowing metadata translation between formats and reducing the penalty of vendor lock-in.3

This report provides an exhaustive analysis of these three formats, synthesizing performance benchmarks, architectural distinctives, and cloud provider integration maturity. The analysis indicates that while features are converging—with all formats now supporting ACID transactions, time travel, and schema evolution—their internal mechanisms create distinct performance profiles. Delta Lake maintains a stronghold in high-throughput batch analytics, particularly within the Spark and Databricks ecosystems, leveraging aggressive caching and compilation optimizations.2 Apache Iceberg has emerged as the de facto standard for interoperability and metadata scalability, favored by hyperscalers like Snowflake, AWS Athena, and Google BigQuery for its engine-agnostic design and O(1) partition pruning capabilities.6 Apache Hudi continues to dominate the streaming data niche, offering the most mature primitives for Change Data Capture (CDC), upserts, and near-real-time ingestion via its distinct “Database on the Lake” architecture.1

2. Theoretical Foundations and Architectural Anatomy

To understand the performance differentials and feature limitations of each format, one must deconstruct their architectural philosophies. The fundamental challenge all three solve is the “listing problem” of object storage (e.g., S3): directory listings are slow, non-atomic, and were historically only eventually consistent. By decoupling the “state” of a table from the physical file listing and moving it into a transactional metadata layer, OTFs provide ACID guarantees. However, how each format manages this metadata dictates its scalability and latency profile.

2.1 Apache Iceberg: The Hierarchical Snapshot Architecture

Apache Iceberg, originating from Netflix, was architected specifically to address the scalability bottlenecks of the Hive Metastore and the correctness issues associated with directory-level updates.6 Its design philosophy prioritizes correctness, safety, and strict separation of concerns between the storage format and the compute engine.

2.1.1 The Metadata Tree

Iceberg employs a three-tier hierarchical metadata structure that isolates the query planner from the physical file layout:

  1. Metadata File (metadata.json): This acts as the root of the table’s state. It stores the schema, partition specification, and a historical list of “snapshots.” Every commit (write operation) generates a new metadata file, replacing the pointer to the previous one. This ensures serializable isolation and enables atomic swaps of table state.9
  2. Manifest List: Each snapshot references a specific “Manifest List” (an Avro file). This file serves as an index of manifests, storing aggregate statistics (e.g., partition value ranges) for the manifest files it tracks. This intermediate layer allows the query engine to perform coarse-grained pruning, skipping entire groups of files without opening them.11
  3. Manifest File: These Avro files contain the actual list of data files (Parquet/ORC) and delete files. Crucially, they store fine-grained statistics (column bounds, null counts) for every data file.

2.1.2 The Advantage of Manifests

The use of Avro for manifest files is a strategic choice. Avro is compact and row-oriented, allowing the query engine to efficiently stream metadata and filter files based on predicates (e.g., WHERE timestamp > ‘2025-01-01’) without listing directories. This architecture enables “Hidden Partitioning,” where the relationship between the column value and the partition tuple is stored as a transform function within the metadata. The engine automatically translates queries on the source column into partition filters, decoupling the logical query from the physical layout.6 This solves the “stale metadata” problem inherent in Hive and allows partition schemes to evolve over time without rewriting petabytes of data.13
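
To ground this, the following PySpark sketch creates an Iceberg table with a hidden monthly partition transform and then queries it by filtering on the raw timestamp column. It assumes the iceberg-spark-runtime jar is on the classpath; the catalog name (demo), warehouse location, and table/column names are illustrative.

```python
from pyspark.sql import SparkSession

# Illustrative session with the Iceberg SQL extensions and a Hadoop-style
# catalog named "demo"; the warehouse location is a placeholder.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.events")

# The months(ts) transform lives in table metadata, not as a physical column.
spark.sql("""
    CREATE TABLE demo.events.clicks (
        event_id BIGINT,
        user_id  INT,
        ts       TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (months(ts))
""")

# Readers filter on ts directly; the planner maps the predicate onto the
# partition transform and prunes manifests and files without a directory listing.
spark.sql("""
    SELECT count(*) FROM demo.events.clicks
    WHERE ts >= TIMESTAMP '2025-01-01 00:00:00'
""").show()
```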

2.2 Delta Lake: The Transaction Log and Checkpointing

Delta Lake, developed by Databricks, utilizes a log-structured approach akin to a traditional database Write-Ahead Log (WAL). Its architecture is heavily optimized for the Apache Spark execution model, though recent efforts have broadened its compatibility.6

2.2.1 The _delta_log Protocol

Delta Lake records state changes in a sequential directory named _delta_log.

  1. JSON Commits: Every transaction creates a JSON file (e.g., 00000000000000000010.json). This file contains actions such as add (referencing a new Parquet file) or remove (logically deleting an existing file). The add action includes file-level statistics (min/max/nulls) used for data skipping.11
  2. Checkpointing: To prevent the cost of reading the log from growing linearly with the table’s history, Delta Lake automatically aggregates the state into a Parquet checkpoint file every 10 commits (configurable). A reader needs only to read the latest checkpoint and the subsequent JSON files to reconstruct the table state.9 A minimal sketch of this replay appears just after this list.
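
The following dependency-free Python sketch reconstructs the set of live data files by replaying the JSON commits in a local _delta_log directory. It is illustrative only: a production reader starts from the latest Parquet checkpoint rather than version zero and understands protocol features (deletion vectors, column mapping) that are ignored here, and the example path is hypothetical.

```python
import glob
import json
import os

def live_files(table_path: str) -> set:
    """Replay add/remove actions from _delta_log JSON commits (sketch only)."""
    log_dir = os.path.join(table_path, "_delta_log")
    files = set()
    # Commit files are zero-padded, so lexical order equals version order.
    for commit in sorted(glob.glob(os.path.join(log_dir, "*.json"))):
        with open(commit) as f:
            for line in f:                       # one JSON action per line
                action = json.loads(line)
                if "add" in action:              # data file added to the table
                    files.add(action["add"]["path"])
                elif "remove" in action:         # data file logically removed
                    files.discard(action["remove"]["path"])
    return files

# Hypothetical usage:
# print(sorted(live_files("/data/lake/orders")))
```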

2.2.2 Protocol Evolution

Delta Lake relies on “Protocol Versioning” to introduce new features. For instance, to support “Deletion Vectors” (a Merge-on-Read optimization) or “Column Mapping” (for schema evolution), the table’s protocol version must be upgraded. While this allows for rapid innovation, it can create compatibility friction; a reader running an older version of the Delta library cannot read a table upgraded to a newer protocol, enforcing a tighter coupling between the compute engine version and the storage format.15
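
As a sketch of what such an upgrade looks like in practice (assuming a SparkSession with delta-spark 3.x configured and an illustrative table named lake.orders), enabling deletion vectors is a table-property change that raises the required protocol versions:

```python
# Enabling a protocol-gated feature on an existing Delta table (sketch only).
spark.sql("""
    ALTER TABLE lake.orders
    SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")

# Inspect the resulting requirements: clients that do not understand the
# deletion-vectors table feature can no longer read this table.
spark.sql("DESCRIBE DETAIL lake.orders") \
    .select("minReaderVersion", "minWriterVersion") \
    .show()
```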

2.3 Apache Hudi: The Streaming Primitive

Apache Hudi (Hadoop Upserts, Deletes and Incrementals) is architecturally distinct in its “streaming-first” orientation. Originating at Uber, Hudi views a table not as a static state but as a continuous stream of events. It is designed primarily for mutable workloads where records are frequently updated, deleted, or compacted.11

2.3.1 The Timeline and File Layout

Hudi manages state via a “Timeline,” which tracks all actions (commits, rollbacks, compactions) on the table.

  1. File Groups and Slices: Hudi organizes data into “File Groups,” identified by a unique ID. Within a group, data is versioned into “File Slices.” A slice consists of a base file (Parquet) and a set of log files (Avro) that contain updates to records in that base file.14
  2. Table Types:
  • Copy-on-Write (COW): Updates trigger a rewrite of the referenced file group’s Parquet file. This maximizes read performance (no merging required) but incurs high write amplification.3
  • Merge-on-Read (MOR): Updates are appended to log files. The query engine merges the base file and log files at read time. This reduces write latency, making it ideal for streaming ingestion, but increases read latency unless asynchronous compaction is aggressively managed.3

2.3.2 Indexing Subsystem

Unlike Iceberg and Delta, which rely largely on file-level statistics (min/max), Hudi integrates a database-like indexing subsystem. It supports Bloom filters, Simple indexes, and a Record-level Index (RLI). The RLI allows Hudi to map a primary key to a specific file ID directly. During an upsert operation, Hudi uses this index to tag the incoming record with its file location, allowing it to touch only the relevant file group rather than scanning partitions or relying on heuristics. This capability is the cornerstone of its performance dominance in CDC workloads.8
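
A hedged PySpark sketch of this upsert path is shown below, using a MOR table keyed on order_id with the record-level index enabled (available from roughly Hudi 0.14 onward). It assumes a SparkSession built with the hudi-spark bundle; the path, field names, and incoming DataFrame are illustrative, and option keys should be verified against the Hudi release in use.

```python
# Illustrative batch of CDC rows to merge; order_id is the record key and
# updated_at the precombine (ordering) field.
incoming_changes_df = spark.createDataFrame(
    [(101, 29.99, "2025-06-01 10:00:00")],
    "order_id long, amount double, updated_at string",
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    # Record-level index: maps record keys directly to file groups.
    "hoodie.metadata.record.index.enable": "true",
    "hoodie.index.type": "RECORD_INDEX",
}

(incoming_changes_df
    .write
    .format("hudi")            # requires the hudi-spark bundle on the classpath
    .options(**hudi_options)
    .mode("append")            # append mode + upsert operation = merge semantics
    .save("s3://example-bucket/lake/orders"))
```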

3. Feature Completeness and Functional Analysis

While the basic definition of a “Table Format” suggests parity, the implementation of critical features like partition management, schema evolution, and concurrency control varies significantly, impacting operational overhead and system flexibility.

3.1 Partitioning Strategies and Evolution

Partitioning is the primary lever for performance in large-scale datasets, but it is also a source of rigidity.

Apache Iceberg is the leader in partition management due to Hidden Partitioning. In traditional systems (and early Delta Lake), partitioning was physical; if a user wanted to partition by day, they had to create a column date_str derived from timestamp. Iceberg abstracts this. The partition spec is a metadata property. If a user queries WHERE timestamp = ‘2024-01-01T12:00:00’, Iceberg’s split planner uses the transform definition to identify the relevant partition files transparently. Furthermore, this spec can be updated. A table can start partitioned by month and later switch to daily partitioning. The old data remains in monthly partitions, and new data is written to daily partitions. The planner handles this heterogeneity automatically.6
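
Continuing the illustrative demo.events.clicks table from Section 2.1.2, the monthly-to-daily switch described above is a pair of metadata-only DDL statements in Spark with the Iceberg extensions; no existing files are rewritten.

```python
# Evolve the partition spec: existing data stays in monthly partitions,
# new writes land in daily partitions, and the planner handles both.
spark.sql("ALTER TABLE demo.events.clicks DROP PARTITION FIELD months(ts)")
spark.sql("ALTER TABLE demo.events.clicks ADD PARTITION FIELD days(ts)")
```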

Delta Lake has historically relied on physical hive-style partitioning. However, in 2024/2025, it introduced Liquid Clustering. This feature replaces rigid directory-based partitions with a dynamic clustering technique (often based on Z-curves or Hilbert curves). Liquid Clustering automatically clusters data based on frequently filtered columns and adjusts the file layout incrementally. This is superior for high-cardinality columns or changing data volumes, as it avoids the “small file problem” inherent in over-partitioning and the “skew problem” of under-partitioning.18 However, Liquid Clustering is a background optimization process that must be managed (or paid for via Databricks’ Predictive Optimization).20
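
A minimal liquid clustering sketch follows. The CLUSTER BY syntax is available on Databricks and in recent Delta releases, so exact support depends on the runtime, and the table name is illustrative.

```python
# Create a Delta table that uses liquid clustering instead of
# directory-style partitioning.
spark.sql("""
    CREATE TABLE lake.clicks (
        event_id BIGINT,
        user_id  BIGINT,
        ts       TIMESTAMP
    )
    USING DELTA
    CLUSTER BY (user_id, ts)
""")

# OPTIMIZE incrementally re-clusters newly written files; on Databricks this
# can also be delegated to Predictive Optimization.
spark.sql("OPTIMIZE lake.clicks")
```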

Apache Hudi supports physical partitioning but enhances it with Bucket Indexing and Internal Clustering. Hudi’s clustering service allows users to rewrite data layouts asynchronously to optimize for query performance (e.g., sorting by timestamp) while ingestion continues. This allows Hudi to maintain tight file sizing and layout efficiency even in streaming environments where data arrives out of order.3

3.2 Schema Evolution and Enforcement

Apache Iceberg treats columns as unique IDs rather than names. This allows for full schema evolution: adding, dropping, renaming, and reordering columns, as well as widening types (e.g., int to long). Because the mapping is ID-based, renaming a column does not require rewriting the data files; the metadata simply maps the new name to the old ID. This ensures absolute correctness and prevents “zombie data” issues where a new column with an old name inherits incorrect data.6
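
For example, a rename, a type widening, and a column addition on the illustrative demo.events.clicks table are all metadata-only operations in Spark with the Iceberg extensions:

```python
# ID-based schema evolution: no data files are rewritten by these statements.
spark.sql("ALTER TABLE demo.events.clicks RENAME COLUMN user_id TO account_id")
spark.sql("ALTER TABLE demo.events.clicks ALTER COLUMN account_id TYPE bigint")  # int -> long widening
spark.sql("ALTER TABLE demo.events.clicks ADD COLUMNS (device_type string)")
```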

Delta Lake introduced “Column Mapping” to support renaming and dropping columns without rewrites. Similar to Iceberg, it uses metadata IDs internally. However, enabling this feature is a one-way operation that changes the table protocol version, potentially breaking compatibility with older readers. While functionally similar now, Iceberg’s implementation is natively foundational to the format, whereas Delta’s is an additive feature.21

Apache Hudi relies on Avro for schema validation. It supports schema evolution (add, drop, rename), but the experience is often more tightly coupled to the schema registry or the compute engine’s interpretation of the Avro schema. While robust for standard use cases, complex evolutions (like rebasing nested structures) can sometimes require more manual intervention compared to Iceberg’s type-safe ID system.23

3.3 Concurrency Control and Multi-Writer Support

Handling concurrent writes—such as a streaming ingest job running alongside a GDPR deletion job—is a critical differentiator.

  • Apache Hudi: Offers the most sophisticated concurrency model. It supports Optimistic Concurrency Control (OCC), where writers check for overlapping modifications before committing. Uniquely, Hudi provides Non-Blocking Concurrency Control (NBCC) for specific use cases. In NBCC, multiple writers can append to the table simultaneously without locking, provided they are writing to different file groups. This is critical for high-throughput streaming architectures.17 Hudi integrates with external lock providers (ZooKeeper, DynamoDB, Hive Metastore) to manage coordination.25 A configuration sketch of this coordination appears after this list.
  • Apache Iceberg: Utilizes OCC. Writers perform a “check-and-swap” operation on the metadata file. If two writers attempt to commit simultaneously, one will fail and must retry. Iceberg’s conflict detection is granular; it checks if the specific data files or partitions being modified overlap. If Writer A updates Partition X and Writer B updates Partition Y, both can succeed. This reduces contention compared to table-level locking.26
  • Delta Lake: Also employs OCC. In the Databricks environment, a proprietary commit service handles concurrency seamlessly. In the open-source ecosystem, concurrency relies on the atomic capabilities of the underlying storage (e.g., putIfAbsent in S3). While S3 is now strongly consistent, avoiding conflicts requires coordination. Delta’s conflict resolution logic is generally at the table or partition level, which can lead to higher retry rates in high-concurrency scenarios compared to Hudi’s NBCC.3
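
As an illustration of the coordination Hudi expects, below is a hedged sketch of the extra write options a multi-writer deployment typically adds when using a ZooKeeper lock provider. The keys follow the Hudi concurrency documentation but should be checked against the version in use, and the host and paths are placeholders.

```python
# Options merged into each concurrent writer's normal Hudi write config
# (e.g., the hudi_options dictionary from the upsert sketch in Section 2.3.2).
multi_writer_options = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",   # recommended for multi-writer
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk-host.example.internal",
    "hoodie.write.lock.zookeeper.port": "2181",
    "hoodie.write.lock.zookeeper.lock_key": "orders",
    "hoodie.write.lock.zookeeper.base_path": "/hudi/locks",
}
# Writers touching overlapping file groups conflict and the loser retries;
# non-overlapping writers proceed without blocking each other.
```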

4. Performance Benchmarks and Analysis

Performance in the Data Lakehouse is not a single metric. It encompasses Ingestion Throughput (how fast data can be written), Query Latency (how fast it can be read), and Data Freshness (how quickly new data is queryable).

4.1 Ingestion and Upsert Performance

Winner: Apache Hudi

For workloads involving heavy mutation (upserts, deletes) and streaming ingestion, Apache Hudi consistently outperforms peers in 2025 benchmarks.

  • Benchmark Evidence: In controlled tests simulating Change Data Capture (CDC) ingestion, Hudi (configured with the MOR table type and a Simple or Bloom index) demonstrated significantly higher throughput than Delta Lake (Merge) and Iceberg (Merge-on-Read). One cited benchmark notes that an optimized Iceberg ingestion (using OLake) was 2x faster than Databricks.27 For upsert-specific workloads, however, Hudi’s specialized indexing avoids the “scan overhead” that plagues the other formats.
  • The “Upsert” Trap: A critical nuance in benchmarking is the default configuration. Sources 3 and 28 highlight that Hudi defaults to upsert mode (which incurs index-lookup overhead), while Delta and Iceberg often default to append. When comparing apples-to-apples append throughput, all three are comparable (bounded by S3 I/O). However, in upsert scenarios, Hudi’s Record-Level Index allows it to identify exactly which file to update without scanning statistics, offering O(1) lookup behavior that Delta (relying on Z-Ordering) and Iceberg (relying on sort order) struggle to match without rewriting more data.8

4.2 Query Latency (Read Performance)

Winner: Delta Lake (with Ecosystem Caveats)

In standard Decision Support (TPC-DS) benchmarks, Delta Lake typically achieves the lowest query latency, particularly when running within the Databricks ecosystem or using Spark.

  • Benchmark Evidence: Benchmarks referenced in 13 indicate that Delta Lake outperformed Iceberg in TPC-DS queries, with some complex join queries (e.g., Query 72) executing up to 66x faster in unoptimized scenarios.
  • Why Delta Wins Here:
  1. Z-Ordering and Liquid Clustering: Delta’s ability to colocate related data reduces I/O significantly.
  2. Stats Collection: Delta collects stats for the first 32 columns by default, whereas Iceberg requires explicit configuration for column stats in some engines.
  3. Engine Coupling: Databricks’ Photon engine is hyper-optimized for the specific Parquet layout and compression schemes used by Delta.
  • The Iceberg Counter-Narrative: When benchmarks are run on engines like Trino or Snowflake, the gap disappears or reverses. The same source cites an independent benchmark in which Iceberg, ingested via optimized tooling, ran the full TPC-H suite 18% faster than Databricks.27 This suggests that “performance” is now less a property of the format and more a property of the engine’s integration with that format. Iceberg’s metadata structure (Manifest Lists) allows for faster planning on extremely large tables (millions of partitions) compared to the linear log scan required by Delta (unless V2 checkpoints are used).2

4.3 Metadata Scalability

Winner: Apache Iceberg

As dataset sizes scale to petabytes with millions of files, the time taken just to plan the query becomes the bottleneck.

  • Iceberg: Its hierarchical manifest structure allows the planner to prune files at the manifest level. A query filtering for “Yesterday” on a 10-year dataset reads only the specific manifest file covering “Yesterday.” This operation is O(1) relative to the table size. This makes Iceberg the preferred format for massive datasets in object storage.2
  • Delta Lake: Requires reading the checkpoint file and subsequent JSON logs. While highly optimized with V2 checkpoints and aggressive caching, extremely large tables can still incur significant driver memory pressure during planning. Liquid Clustering helps mitigate this by reducing the file count, but the fundamentally sequential replay of the log remains.9
  • Hudi: The Timeline allows for efficient incremental access, but managing the sheer volume of file groups in massive tables requires careful tuning of the clustering strategies. Hudi’s metadata table (an internal MOR table) stores file listings to avoid expensive S3 LIST operations, achieving rough parity with Iceberg’s approach.3

5. Cloud Provider Integration and Feature Completeness Matrices

The theoretical capabilities of these formats are often constrained by the specific cloud platforms hosting them. In 2025, the integration landscape is fragmented, with each major cloud provider implicitly or explicitly favoring a specific format.

5.1 Amazon Web Services (AWS)

AWS maintains a largely agnostic stance but shows a strong strategic leaning towards Apache Iceberg, particularly within its serverless analytics suite.

 

| Service | Apache Iceberg | Delta Lake | Apache Hudi |
| --- | --- | --- | --- |
| Amazon Athena | First-Class Citizen. Native read/write, time travel, and schema evolution. Uses Iceberg SDK for optimized planning. | Improving. Native support exists but historically lagged. Often requires manifest files or Glue sync. Limitations on time travel syntax in some versions. | Complex. Good for COW tables. MOR support requires syncing to Glue and can have high read latency due to lack of native log-merging optimizations in Presto/Athena versions. |
| AWS Glue | Managed Compaction. Glue offers native “automatic compaction” for Iceberg, a hands-off service to solve the small file problem.30 | Supported via Glue Spark jobs. No native “tick-box” managed compaction service akin to the Iceberg offering. | Supported via Glue Spark jobs. Requires users to manage compaction via Hudi’s internal configs. |
| Amazon EMR | Full support (Spark/Flink/Trino). | Full support. | Full support. EMR is a common home for Hudi streaming workloads due to updated Hudi bundles. |

Key AWS Insight: The introduction of “S3 Tables” (announced in late 2024), which provides a specialized bucket type with automatic maintenance for Iceberg tables, further cements AWS’s preference for Iceberg as the standard for serverless data lakes.32

5.2 Microsoft Azure

Azure’s data strategy is heavily interlocked with Delta Lake, driven by its deep partnership with Databricks and the architecture of Microsoft Fabric.

 

| Service | Apache Iceberg | Delta Lake | Apache Hudi |
| --- | --- | --- | --- |
| Microsoft Fabric | Virtualization. Fabric’s “OneLake” native format is Delta Parquet. It supports Iceberg by “shortcutting”—virtually mapping Iceberg metadata to Delta metadata so Fabric engines can read it. Write support is less integrated.33 | Native. The foundation of the entire platform. All Fabric engines (SQL, Spark, KQL) speak Delta natively. | Limited. Primarily supported via Spark ingestion. Reading via T-SQL endpoints often requires conversion or external table definitions. |
| Synapse Analytics | Limited. Serverless SQL pools have limited native support. Often requires external catalogs or manifest mappings. | Native. Serverless SQL pools have built-in optimizations (caching, stats) for Delta tables. | Limited. Similar to Iceberg; requires specific configurations or Spark pools to query effectively. |
| Azure Databricks | Supported via UniForm (allows reading Delta as Iceberg). | Gold Standard. Features like Photon, Liquid Clustering, and Predictive Optimization are often available here first before Open Source.20 | Supported via Spark libraries, but lacks the native optimizations provided for Delta. |

Key Azure Insight: Azure is a “Delta-First” cloud. While they support Iceberg, it is often through the lens of interoperability (converting/mapping to Delta) rather than native engine support.

5.3 Google Cloud Platform (GCP)

GCP’s strategy revolves around BigLake, a storage engine designed to unify data lakes and warehouses, providing fine-grained security over object storage.

 

| Service | Apache Iceberg | Delta Lake | Apache Hudi |
| --- | --- | --- | --- |
| BigQuery / BigLake | High Support. BigQuery can read Iceberg manifest files directly. Supports partition pruning, column security, and decent query performance.34 | Native (v3). BigQuery now supports Delta Lake natively (parsing the _delta_log directly) without requiring manifest files. Performance is good but slightly slower than native BigQuery storage.35 | Manifest Dependent. BigQuery integration for Hudi typically relies on the “Manifest File” approach (syncing Hudi state to a list of files). Real-time MOR querying is less performant than Spark-based engines.37 |
| Dataproc | Full support (Spark/Flink). | Full support. | Full support. |

Key GCP Insight: Google is pragmatically agnostic, aiming to be the “query engine for any data.” However, its support for Iceberg manifests is historically more mature than its support for Hudi’s timeline.

6. Ecosystem Integration: Beyond the Hyperscalers

The choice of format is often dictated not by the cloud provider, but by the query engine of choice.

6.1 Snowflake

Snowflake has aggressively adopted Apache Iceberg. Its “Iceberg Tables” feature allows Snowflake to read/write Iceberg tables in customer-owned storage (S3/Azure/GCS) with performance parity to native Snowflake tables. Snowflake acts as the catalog, managing the metadata directly. While Snowflake allows reading Delta Lake (often via UniForm), its architecture is optimized for the immutable file structure of Iceberg.7

6.2 Apache Flink

For stateful streaming processing, Apache Hudi is the clear leader. The Hudi-Flink connector is highly mature, supporting the “CDC Debezium” format natively. It allows Flink to stream changes into a Hudi table and stream changes out of a Hudi table to downstream systems, effectively turning the Data Lake into a streaming database.6 Iceberg’s Flink support has improved (supporting CDC reads), but Hudi’s non-blocking concurrency control makes it more stable for high-throughput streaming sinks.38
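
A hedged PyFlink sketch of the sink side is shown below: it registers a Hudi MOR table in Flink SQL so that a CDC stream can be inserted into it. It assumes the hudi-flink bundle jar is on the Flink classpath; the path, schema, and table name are illustrative.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming table environment; a real job would also register a CDC source
# (e.g., a Debezium-formatted Kafka topic) and run INSERT INTO orders_hudi.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE orders_hudi (
        order_id   BIGINT,
        amount     DOUBLE,
        updated_at TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector'  = 'hudi',
        'path'       = 's3://example-bucket/lake/orders',
        'table.type' = 'MERGE_ON_READ'
    )
""")
```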

6.3 Trino (Starburst)

Trino has historically favored Apache Iceberg. The Trino connector for Iceberg is one of the most developed, supporting advanced features like MERGE, partition evolution, and extensive predicate pushdown. Trino’s support for Delta Lake is robust but relies on the standalone Delta Kernel or native readers which may lag slightly behind Databricks-specific features.17

7. The Interoperability Revolution: XTable and UniForm

A critical development in 2024/2025 is the decoupling of “Table Format” from “Data Lock-in.” Two major technologies have emerged to render the “format war” partially obsolete.

  1. Apache XTable (Incubating): Formerly “OneTable” (created by Onehouse), this project acts as a translation layer. It allows a user to write data in one format (e.g., Hudi) and automatically generate the metadata for the other formats (Iceberg and Delta). The data files (Parquet) are not duplicated; only the metadata pointers are generated. This allows a pipeline to ingest via Hudi (for streaming efficiency) and query via Snowflake (using the Iceberg metadata) or Fabric (using the Delta metadata).1
  2. Delta Lake UniForm: Developed by Databricks, Universal Format (UniForm) allows Delta tables to automatically generate Iceberg metadata. When enabled, a Delta table becomes dual-format. This is Databricks’ strategy to keep users in the Delta ecosystem while allowing them to interact with Iceberg-native tools like Snowflake or Athena. However, limitations exist: UniForm functionality is often read-only for the Iceberg side and may lag behind the latest Iceberg spec features.4 A minimal enablement sketch appears after this list.
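
A minimal enablement sketch, assuming a Databricks runtime that supports UniForm and an illustrative Delta table; the property names follow current Databricks documentation and may evolve with the spec.

```python
# Turn on Iceberg metadata generation (UniForm) for an existing Delta table.
spark.sql("""
    ALTER TABLE lake.clicks SET TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
# Iceberg-native engines (Snowflake, Athena, Trino) can then read the table
# through its generated Iceberg metadata, typically read-only.
```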

Implications: These technologies suggest a future where the “Primary Format” is a write-side concern (optimized for ingestion), while the “Read Format” is a dynamic property chosen by the query engine.

8. Strategic Recommendations and Outlook

In 2025, the decision matrix for selecting an Open Table Format should no longer be based on a “winner takes all” mentality, but on specific architectural requirements.

8.1 Recommendations

  • Select Apache Iceberg if:
  • Interoperability is paramount: You have a heterogeneous stack (e.g., Snowflake for BI, Spark for ETL, Trino for ad-hoc). Iceberg is the “USB-C” of data formats—supported almost everywhere.5
  • Governance and Schema Evolution: You require strict schema evolution guarantees and type safety over long data lifecycles.
  • Scale: You have tables with millions of partitions where metadata planning time is a bottleneck.2
  • Select Delta Lake if:
  • The Spark Ecosystem is Central: You are heavily invested in Databricks or Azure Synapse. The integration depth, performance optimizations (Photon), and ease of use (Z-Ordering/Liquid Clustering) in this ecosystem are unmatched.5
  • Straightforward Batch Pipelines: Your workloads are primarily append-only or batch-merge patterns where ultra-low streaming latency is not the primary KPI.
  • Select Apache Hudi if:
  • Streaming Mutation: You are building a “Streaming Data Lakehouse.” You need to ingest CDC data from operational databases with sub-minute freshness.
  • Upsert Performance: You have heavy random update workloads. Hudi’s Record-Level Index and Non-Blocking Concurrency Control provide a throughput ceiling that the other formats struggle to match in mutable scenarios.5

8.2 Conclusion

The Data Lakehouse market has matured. The three formats, while converging on high-level features, have specialized deeply in their architectures. Delta Lake is the engine of the Spark-centric warehouse. Apache Iceberg is the universal interchange of the open data ecosystem. Apache Hudi is the streaming database for the lake. By leveraging new interoperability layers like XTable and UniForm, organizations can now design architectures that exploit the write-side strengths of one format without sacrificing the read-side compatibility of another, effectively ending the zero-sum game of the format wars.

Table 1: Detailed Technical Comparison Matrix (2025)

| Feature Category | Apache Iceberg | Delta Lake | Apache Hudi |
| --- | --- | --- | --- |
| Metadata Structure | Hierarchical (Metadata -> Manifest List -> Manifest). O(1) pruning. | Transaction Log (Sequential JSON + Parquet Checkpoints). | Timeline (LSM-Tree style). Instant-based state tracking. |
| Partitioning | Hidden Partitioning. Virtual; allows evolution without rewriting data. | Liquid Clustering (Dynamic Z-Curve) & Physical Partitioning. | Physical Partitioning + Internal Clustering/Bucket Index. |
| Schema Evolution | Full Fidelity. ID-based. Column rename/reorder/type promotion supported. | Column Mapping. Rename/drop supported via protocol upgrade. | Avro-based. Standard evolution (add/append), engine dependent. |
| Concurrency | Optimistic. Granular conflict detection at file/partition level. | Optimistic. Table/Partition level. Native locking in Databricks. | Optimistic + Non-Blocking. NBCC allows concurrent non-overlapping writes. |
| Merge-on-Read | Supported (Delete vectors / Position deletes). | Supported (Deletion Vectors in Delta 3.0+). | Native. Log files + Base files. Mature compaction services. |
| Indexing | Partition stats, Min/Max pruning. | Z-Order, Liquid Clustering (Data Skipping). | Bloom Filters, Record-Level Index (Global/Partitioned). |
| CDC Ingestion | Incremental Read (Append-heavy). | Change Data Feed (Must be enabled). | Incremental Query. Native support for streaming CDC out. |
| Primary Ecosystem | AWS Athena, Snowflake, Trino, Dremio. | Databricks, Azure Fabric, Spark, Microsoft Synapse. | Uber, ByteDance, AWS EMR, Flink/Streaming stacks. |

Table 2: Benchmark Performance Synthesis

| Workload Type | Apache Iceberg | Delta Lake | Apache Hudi |
| --- | --- | --- | --- |
| TPC-DS (Complex Read) | High (Excellent w/ Trino/Snowflake) | Very High (Best w/ Spark/Databricks) | Medium (Dependent on compaction tuning) |
| Bulk Ingestion (Append) | High (Parquet write speed) | High (Parquet write speed) | High (Parquet write speed) |
| Streaming Upsert | Medium (Overhead of position deletes) | Medium (Merge overhead) | Highest (Indexed lookups + Log Append) |
| Metadata Listing (1PB+) | Best (Manifest Lists) | Good (Liquid Clustering helps) | Good (Timeline Metadata Table) |