The Convergence of Lakehouse Architectures: A Comprehensive Analysis of Governance, Concurrency, and Interoperability in Open Table Formats

1. Introduction: The Evolution of Data Lake Consistency

The modern data architecture landscape has undergone a paradigm shift, moving from the rigid schemas of enterprise data warehouses to the scalable but chaotic data swamps of early Hadoop, and finally arriving at the structured, transactional Data Lakehouse. This evolution has been driven by a singular necessity: the requirement to bring database-like guarantees—specifically Atomicity, Consistency, Isolation, and Durability (ACID)—to the scalable, cost-effective storage tier of object stores like Amazon S3, Azure ADLS, and Google Cloud Storage (GCS).1

For nearly a decade, the Hive Metastore (HMS) served as the de facto standard for managing tabular data in lakes. However, HMS was fundamentally limited by its architecture, which tracked data at the folder level rather than the file level. This design choice introduced severe bottlenecks: planning a query over many partitions required expensive recursive directory listings, and the lack of atomic commits meant that readers could often see partial writes or inconsistent states. Furthermore, the eventual consistency models of early cloud object stores (e.g., S3 prior to 2020) exacerbated these issues, forcing engineers to rely on auxiliary consistency layers such as Netflix’s S3mper and Hadoop’s S3Guard.3

The resolution to these challenges emerged in the form of “Open Table Formats” (OTFs)—specifically Apache Iceberg, Delta Lake, and Apache Hudi. These formats effectively moved the database management system (DBMS) logic out of the engine and into the application layer, interacting directly with files in object storage to guarantee transactional integrity. By 2025, the industry discourse has shifted from a “war” over which format to choose, to a complex engineering challenge: managing governance across disparate engines (Spark, Trino, Flink) and ensuring interoperability in a heterogeneous environment where no single format rules every workload.5

This report provides an exhaustive analysis of the technical underpinnings of this new landscape. It dissects the concurrency control mechanisms that allow multiple engines to write to the same table simultaneously, the governance protocols that manage metadata at scale, and the translation layers that are beginning to dissolve the barriers between formats.

2. The Internal Architecture of Open Table Formats

To understand the interoperability challenges inherent in cross-engine workloads, one must first deconstruct the distinct architectural philosophies of the three dominant formats. While they share a common goal—metadata management over Parquet/Avro files—their internal implementations dictate their specific strengths, weaknesses, and compatibility limits.

2.1 Apache Iceberg: The Snapshot Isolation Model

Apache Iceberg was born at Netflix specifically to address the correctness issues of Hive on S3. Its core design philosophy is the complete isolation of table state into immutable snapshots.2

2.1.1 Hierarchical Metadata Structure

Iceberg utilizes a sophisticated three-tier metadata tree that allows engines to plan queries without listing object storage directories—a crucial optimization for performance on cloud storage.

  1. Metadata File (vN.metadata.json): This is the root of the table. It contains the table’s schema, partition specification, current snapshot ID, and a history of previous snapshots. Every commit to an Iceberg table produces a new metadata file, ensuring a linear history of state changes.4
  2. Manifest List (snap-ID.avro): Each snapshot references a specific Manifest List. This file contains a list of Manifest Files that make up the snapshot, along with partition-level statistics (e.g., min/max values for partition columns). This allows the query engine to perform “Manifest Skipping,” ignoring entire swaths of the table that do not match the query predicates.7
  3. Manifest Files (*.avro): These files track individual data files (Parquet/ORC). They contain the physical file path, partition tuple, and column-level statistics (min/max/null counts) for every column in the file. This granular metadata enables “Scan Planning,” where the engine can prune specific files based on filter predicates (e.g., WHERE timestamp > '2025-01-01').7
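
This metadata tree can be inspected directly from SQL. The following PySpark sketch assumes a Spark session already configured with the Iceberg runtime and a catalog named demo, plus an illustrative table db.events (all names are placeholders); it queries the built-in snapshots, manifests, and files metadata tables that expose each tier described above.

    from pyspark.sql import SparkSession

    # Assumes the Iceberg Spark runtime is on the classpath and a catalog named "demo" is configured.
    spark = SparkSession.builder.appName("iceberg-metadata-inspection").getOrCreate()

    # Tier 1/2: one row per snapshot recorded in the table metadata file.
    spark.sql("SELECT snapshot_id, operation, summary FROM demo.db.events.snapshots").show(truncate=False)

    # Tier 2: manifests referenced by the current snapshot, with the partition summaries used for skipping.
    spark.sql("SELECT path, added_data_files_count, partition_summaries FROM demo.db.events.manifests").show()

    # Tier 3: individual data files with partition tuples and record counts used for scan planning.
    spark.sql("SELECT file_path, partition, record_count FROM demo.db.events.files").show()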

2.1.2 Partition Evolution and Hidden Partitioning

A differentiating feature of Iceberg is “Hidden Partitioning.” Unlike Hive, which requires the user to create explicit partition columns (e.g., event_date) derived from the data, Iceberg defines partitions as transforms on existing columns (e.g., day(timestamp)). The metadata tracks the relationship between the raw column and the partition.

This architecture enables Partition Evolution: the partitioning scheme can be changed over time (e.g., from month to day) without rewriting old data. The metadata simply tracks which files belong to which partition spec version. This feature, while powerful, complicates interoperability, as translation layers like Apache XTable must map these logical transforms to physical columns in other formats.9
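
As a minimal illustration of hidden partitioning and partition evolution, the PySpark sketch below declares a transform-based partition and later changes its granularity without rewriting data. Catalog and table names are placeholders, and the Iceberg SQL extensions are assumed to be enabled for the ALTER statement.

    # Hidden partitioning: the partition is a transform on ts, not a separate physical column.
    spark.sql("""
        CREATE TABLE demo.db.events (id BIGINT, ts TIMESTAMP, payload STRING)
        USING iceberg
        PARTITIONED BY (months(ts))
    """)

    # Partition evolution: switch to daily granularity. Existing files keep the old spec;
    # only newly written files use the new one, so no data rewrite is required.
    spark.sql("ALTER TABLE demo.db.events REPLACE PARTITION FIELD months(ts) WITH days(ts)")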

2.2 Delta Lake: The Transaction Log Model

Delta Lake, developed by Databricks, centers its architecture on a sequential transaction log, the _delta_log. This log is the single source of truth: commits are totally ordered by the log file sequence, and the protocol relies on atomic file-system primitives (such as put-if-absent) to establish that order.7

2.2.1 The Delta Log Protocol

The Delta Log consists of a sequence of JSON files (000000.json, 000001.json, etc.), each representing an atomic transaction.

  • Actions: Each JSON file contains “actions” such as add (adding a data file), remove (logically deleting a file), or metaData (changing schema).
  • Checkpoints: To prevent the cost of reading the log from growing indefinitely, Delta automatically creates checkpoint files (in Parquet format) every 10 commits by default. A reader needing the current state of the table reads the latest checkpoint and plays forward any subsequent JSON logs.11
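
To make the log structure concrete, the short Python sketch below walks the JSON actions of a single commit by hand. It is purely illustrative: the path is hypothetical, and production readers should go through Spark, the Delta Kernel, or delta-rs rather than parsing the log directly.

    import json

    # Hypothetical table location; commit files are zero-padded JSON documents in _delta_log.
    LOG_DIR = "/data/sales_table/_delta_log"

    def read_commit(version: int) -> list[dict]:
        """Return the actions (add, remove, metaData, ...) recorded in one commit file."""
        path = f"{LOG_DIR}/{version:020d}.json"
        with open(path) as f:
            return [json.loads(line) for line in f if line.strip()]

    for action in read_commit(0):
        if "add" in action:
            print("data file added:", action["add"]["path"])
        elif "metaData" in action:
            print("schema definition:", action["metaData"]["schemaString"][:80])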

2.2.2 Protocol Versioning and Feature Flags

Delta Lake manages compatibility through strict protocol versioning (minReaderVersion, minWriterVersion). Advanced features are gated behind these versions:

  • Column Mapping: Allows renaming or dropping columns without rewriting Parquet files (by mapping logical names to physical UUIDs). This requires a reader protocol upgrade.
  • Deletion Vectors: Introduced in recent versions to optimize Merge-On-Read performance. Instead of rewriting an entire file to delete a single row, a small bitmap file is written indicating which rows are invalid. This significantly reduces write amplification but breaks compatibility with older readers (e.g., older Trino versions).13
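
The sketch below shows how these features are typically switched on through table properties in open-source Delta. The table path is a placeholder, and exact protocol-version requirements vary by release, so treat the property combinations as indicative rather than definitive.

    # Column mapping: bump the protocol and map logical names to physical names.
    spark.sql("""
        ALTER TABLE delta.`/data/sales_table` SET TBLPROPERTIES (
            'delta.minReaderVersion'   = '2',
            'delta.minWriterVersion'   = '5',
            'delta.columnMapping.mode' = 'name'
        )
    """)

    # Deletion vectors: a writer table feature; recent releases upgrade the protocol automatically.
    spark.sql("""
        ALTER TABLE delta.`/data/sales_table`
        SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
    """)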

2.3 Apache Hudi: The Timeline and Stream-Processing Model

Apache Hudi (Hadoop Upserts Deletes and Incrementals) was designed by Uber with a “streaming-first” mindset. It treats the table not just as a state, but as a sequence of events on a timeline.1

2.3.1 The Timeline Architecture

The core of Hudi is the .hoodie directory, which maintains a timeline of all actions performed on the table.

  • Instants: Actions are recorded as “instants” with specific states (REQUESTED, INFLIGHT, COMPLETED). Actions include COMMIT (batch write), DELTA_COMMIT (streaming write), CLEAN (file cleanup), and COMPACTION (merging logs to base files).15
  • File Layouts (COW vs. MOR):
    • Copy-On-Write (COW): Updates rewrite the entire Parquet file. This maximizes read performance but increases write latency.
    • Merge-On-Read (MOR): Updates are written to row-based log files (Avro). Readers must merge the base Parquet file with the delta logs at query time, which keeps write latency low but imposes a merge cost on readers. This dual structure poses the most significant challenge for cross-engine interoperability, as many engines (such as early versions of Trino) have struggled to implement the merging logic required for MOR snapshot queries efficiently.16 (A write-time configuration sketch follows this list.)
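
The layout is chosen at write time through table options. The PySpark sketch below writes a MOR table; switching the table type to COPY_ON_WRITE selects the other layout. Bucket path, table, and column names are placeholders, and the option keys follow the Hudi Spark datasource documentation.

    # `df` is an assumed upstream DataFrame of ride events.
    hudi_options = {
        "hoodie.table.name": "rides",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",   # or "COPY_ON_WRITE"
        "hoodie.datasource.write.recordkey.field": "ride_id",
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.operation": "upsert",
    }

    (df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://example-bucket/lake/rides"))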

3. Concurrency Control in Distributed Systems

The ability for multiple distributed systems—such as a Flink streaming job and a Spark compaction job—to safely modify the same table concurrently is the “hard problem” of data lake engineering. The solution depends heavily on the consistency guarantees of the underlying storage and the locking mechanisms implemented by the table format.12

3.1 The S3 Consistency Challenge

While Amazon S3 now offers strong read-after-write consistency, it has historically lacked atomic rename operations and put-if-absent conditional writes for existing objects. Without such primitives, two writers committing simultaneously could overwrite each other’s metadata files without realizing it, leading to a “lost update” and data corruption. Consequently, all three formats require an external locking provider or a specific catalog service to arbitrate commits on S3.19

3.2 Apache Iceberg: Optimistic Concurrency Control (OCC)

Iceberg employs Optimistic Concurrency Control. Writers assume they are the sole operator, prepare a new metadata file, and then attempt to atomically swap the table pointer to this new file.

3.2.1 The Atomic Swap Mechanism

  1. Base State: The writer notes the current snapshot ID (S1).
  2. Speculative Write: The writer creates a new snapshot (S2) based on S1 and writes the metadata file.
  3. Conflict Detection: Before committing, the writer checks if current-snapshot-id is still S1.
  • If yes, it swaps the pointer to S2.
  • If no (meaning another writer has already committed a newer snapshot, S3), the writer must retry, rebasing S2 on top of S3.
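
The retry loop can be summarized in a few lines of pseudocode. The Python sketch below is illustrative only; the catalog object and its methods are hypothetical stand-ins for Iceberg’s internal commit API.

    def commit_with_occ(catalog, table_name, pending_changes, max_retries=4):
        """Illustrative optimistic-commit loop; not Iceberg's actual implementation."""
        for _ in range(max_retries):
            base = catalog.current_snapshot_id(table_name)      # 1. note the base state (S1)
            metadata = pending_changes.rebase_onto(base)        # 2. build new metadata (S2)
            # 3. atomic compare-and-swap: succeeds only if the pointer still references `base`
            if catalog.compare_and_swap(table_name, expected=base, new=metadata):
                return metadata
            # another writer committed first; validate conflicts and retry on the new base
        raise RuntimeError("Commit failed: too much contention")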

3.2.2 The Role of the Catalog

On S3, the “Swap” operation is not atomic. Therefore, the Catalog serves as the synchronizer:

  • DynamoDB Lock Manager: The iceberg-aws module uses a DynamoDB table to acquire a lock on the table key. Only the lock holder can update the metadata location. This is a client-side locking mechanism.21
  • REST Catalog (Server-Side Locking): In the modern REST Catalog architecture, the client sends the “Swap” request to the server. The server (e.g., Tabular, Polaris, Nessie) uses its internal database transaction to ensure atomicity. This removes the complex locking logic from the client and is the preferred method for high-concurrency environments.23
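
The difference between the two approaches shows up in the Spark catalog configuration. The sketch below configures one Glue-backed catalog with the client-side DynamoDB lock manager and one REST catalog in which the server arbitrates commits. Catalog names, bucket, lock table, and URI are placeholders; property names follow the iceberg-aws documentation, and locking requirements for the Glue catalog have changed across releases, so verify against the version you run.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Option A: Glue catalog with client-side DynamoDB locking.
        .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.lake.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
        .config("spark.sql.catalog.lake.warehouse", "s3://example-bucket/warehouse")
        .config("spark.sql.catalog.lake.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
        .config("spark.sql.catalog.lake.lock-impl", "org.apache.iceberg.aws.dynamodb.DynamoDbLockManager")
        .config("spark.sql.catalog.lake.lock.table", "iceberg_commit_locks")
        # Option B: REST catalog; the server's own transaction makes the pointer swap atomic.
        .config("spark.sql.catalog.rest_lake", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.rest_lake.type", "rest")
        .config("spark.sql.catalog.rest_lake.uri", "https://catalog.example.com/api")
        .getOrCreate()
    )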

3.3 Delta Lake: LogStore Semantics and Multi-Cluster Writes

Delta Lake relies on the concept of a LogStore to abstract the file system specifics. The correctness of Delta relies on the storage system’s ability to fail a write if the file already exists (mutual exclusion).25

3.3.1 The S3 DynamoDB LogStore

Since S3 lacks “put-if-absent,” Open Source Delta Lake (OSS) cannot guarantee safety for concurrent writers from different clusters (e.g., Spark and Flink) out of the box.

  • Single Cluster Limitation: By default, Delta on S3 supports concurrent reads but requires all writes to originate from a single Spark driver to serialize commits in memory.
  • S3DynamoDBLogStore: To enable multi-cluster writes, users must configure the S3DynamoDBLogStore. This implementation inserts a record into DynamoDB for the target log file (e.g., 000002.json) before writing to S3. If the DynamoDB insert fails because the key exists, the write is rejected.
  • Configuration: spark.delta.logStore.s3.impl=io.delta.storage.S3DynamoDBLogStore.
  • Risk: If one cluster is configured with this LogStore and another is not, the non-configured cluster can silently overwrite the log, corrupting the table.25
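
A minimal sketch of the required Spark configuration follows; the DynamoDB table name and region are placeholders, and the property keys follow the open-source Delta Lake documentation. Every writer that touches the table, on every cluster, must carry the same settings.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Route Delta commit writes on S3 through the DynamoDB-backed LogStore.
        .config("spark.delta.logStore.s3.impl", "io.delta.storage.S3DynamoDBLogStore")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log_commits")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-east-1")
        .getOrCreate()
    )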

3.3.2 Databricks Proprietary Commit Service

It is important to note that within the Databricks platform, a proprietary commit service manages this concurrency, providing a seamless experience. The complexity of DynamoDB configuration is strictly a concern for open-source users managing their own infrastructure.26

3.4 Apache Hudi: Multi-Writer Locking

Hudi introduced multi-writer support via OCC in version 0.8.0. Like Iceberg, it separates the data write phase from the commit phase.28

3.4.1 Lock Providers

Hudi provides a pluggable locking interface. For S3 deployment, the DynamoDBBasedLockProvider is standard.

  • Locking Strategy: The writer acquires the lock only during the critical metadata update phase, minimizing the lock duration.
  • Conflict Resolution: Hudi checks for overlapping file writes. If Writer A and Writer B modify different file groups, the commit succeeds. If they modify the same file group, one fails.
  • Non-Blocking Concurrency Control (NBCC): Hudi has introduced experimental support for NBCC, allowing concurrent writes without strict locking in specific append-only or disjoint-update scenarios, leveraging the timeline’s ability to resolve state logically.15
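
A representative multi-writer configuration is sketched below; the lock table, partition key, and region values are placeholders, and the keys follow the Hudi concurrency-control documentation. These options are added to the normal write options of every concurrent writer.

    multi_writer_options = {
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        "hoodie.write.lock.provider": "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
        "hoodie.write.lock.dynamodb.table": "hudi_commit_locks",
        "hoodie.write.lock.dynamodb.partition_key": "rides",
        "hoodie.write.lock.dynamodb.region": "us-east-1",
    }

    # `df` and the base ingest options are assumed from the upstream pipeline; the lock is
    # held only around the commit metadata update, not for the duration of the data write.
    (df.write.format("hudi")
        .options(**multi_writer_options)
        .option("hoodie.table.name", "rides")
        .mode("append")
        .save("s3://example-bucket/lake/rides"))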

4. Governance and Catalogs: The Shift to Services

As data lakes scale to petabytes, the “file system as a catalog” model (referencing tables by path) has proven insufficient. The industry has standardized on Catalog Services that provide abstraction, security, and enhanced capabilities like branching.

4.1 The Iceberg REST Catalog Standard

The Apache Iceberg REST Catalog Specification is currently the most significant driver of interoperability. It decouples the engine from the catalog implementation, defining a standard OpenAPI contract for table operations.23

4.1.1 Mechanism and Benefits

  • Standardization: Any engine that implements the REST client (Trino, Spark, Flink) can communicate with any catalog that implements the REST server (Polaris, Unity, Nessie, Gravitino).
  • Security: The spec includes authentication (OAuth2) and allows the catalog to vend temporary storage credentials (e.g., scoped S3 session tokens) to the client, removing the need for long-lived static keys on the compute nodes.24
  • AWS S3 Tables: AWS has recently launched “S3 Tables,” a managed service that exposes an Iceberg REST endpoint. This service handles the physical storage layout and compaction automatically, presenting a pure Iceberg interface to the user.23
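
Because the contract is a plain HTTP API, even lightweight clients can participate. The PyIceberg sketch below connects to a REST catalog, authenticates with OAuth2 client credentials, and loads a table exactly as Spark or Trino would resolve it; the URI, credential, warehouse, and table names are placeholders.

    from pyiceberg.catalog import load_catalog

    catalog = load_catalog(
        "lake",
        **{
            "type": "rest",
            "uri": "https://catalog.example.com/api",
            "credential": "client-id:client-secret",   # exchanged for a token via OAuth2
            "warehouse": "prod",
        },
    )

    table = catalog.load_table("db.events")
    print(table.current_snapshot())   # the same pointer any REST-speaking engine would see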

4.2 Project Nessie: Git-Semantics for Data

Project Nessie extends the catalog concept to include Version Control System (VCS) semantics, enabling patterns like “Zero-Copy Isolation” and cross-table transactions.30

4.2.1 The Commit Model

In a standard catalog (e.g., Hive), operations are atomic only at the single-table level. Nessie tracks the state of the entire catalog as a commit hash.

  • Multi-Table Atomicity: A single Nessie commit can update the pointers for Table A, Table B, and Table C simultaneously. This is critical for ETL pipelines that must publish a consistent view of a dimensional model (facts + dimensions) to consumers.30

4.2.2 Branching and Merging in SQL

Nessie exposes Git-like operations via SQL extensions in engines like Trino and Spark. This allows for rigorous DataOps workflows:

  • Scenario: An engineer wants to test a new ETL logic without affecting production dashboards.
  • Step 1 (Create Branch): CREATE BRANCH dev_experiment FROM main IN nessie.32
  • Step 2 (Write): USE REFERENCE dev_experiment IN nessie; INSERT INTO sales…
  • Step 3 (Verify): Run validation queries on the dev_experiment branch.
  • Step 4 (Merge): MERGE BRANCH dev_experiment INTO main IN nessie.33
  • The MERGE command in Trino and Spark supports different behaviors (NORMAL, FORCE, DROP) to handle conflicts if main has moved forward since the branch was created.35

4.3 Unity Catalog and Federation

Unity Catalog (originally Databricks-proprietary) has moved towards openness, with its OSS version supporting the Iceberg REST API.

  • Federation: The emergence of “Catalog of Catalogs” (Federated Catalogs) like Apache Gravitino allows a central governance layer to manage multiple physical catalogs (Hive, Postgres, Glue). This is essential for large enterprises with fragmented data estates.30
  • External Access: Unity Catalog can manage external tables (e.g., tables in S3 not managed by Databricks) and expose them to third-party engines via the REST interface, centralizing lineage and audit logs even for non-Spark workloads.37

5. Cross-Engine Interoperability: Native vs. Translated

A primary requirement of the modern lakehouse is the ability to write data with one engine (e.g., Flink) and read it with another (e.g., Trino) regardless of the underlying format. This has led to two distinct approaches: translation layers (XTable) and native masquerading (UniForm).

5.1 Apache XTable (formerly OneTable)

Apache XTable serves as an omni-directional translator. It does not rewrite the data files (Parquet) but translates the metadata from one format to another.6

5.1.1 The Translation Process

XTable reads the source metadata (e.g., Hudi Timeline) and maps it to the target metadata structures (e.g., Delta Log and Iceberg Manifests).

  • Synchronization: It can run as a sidecar process or a post-write hook.
  • Schema Mapping: It handles type conversions between formats. However, this is not lossless. For example, Hudi’s LogFile format (Avro-based) used in Merge-On-Read tables cannot be directly mapped to Iceberg or Delta, which expect Parquet data files. Therefore, XTable currently supports Copy-On-Write (COW) or Read-Optimized views only. It cannot translate pending compaction logs.39
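
A typical sync is driven by a small dataset configuration and the bundled utilities jar; the Python sketch below shows the shape of such a post-write hook. The config keys and the --datasetConfig flag follow the Apache XTable (incubating) documentation at the time of writing but should be verified against the release actually deployed; the paths and jar name are placeholders.

    import subprocess
    import textwrap

    config = textwrap.dedent("""\
        sourceFormat: HUDI
        targetFormats:
          - ICEBERG
          - DELTA
        datasets:
          - tableBasePath: s3://example-bucket/lake/rides
            tableName: rides
    """)

    with open("xtable_sync.yaml", "w") as f:
        f.write(config)

    # The utilities jar reads the source timeline and writes Iceberg/Delta metadata next to
    # the existing Parquet files; the data itself is never rewritten.
    subprocess.run(
        ["java", "-jar", "xtable-utilities-bundled.jar", "--datasetConfig", "xtable_sync.yaml"],
        check=True,
    )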

5.1.2 Limitations and Trade-offs

  • Partitioning Complexity: Translating sophisticated partitioning schemes is difficult. Iceberg’s “Hidden Partitioning” (logical transforms) does not map one-to-one with Delta’s physical partitioning or generated columns. XTable may force the target table to appear as unpartitioned or require explicit physical columns to be added to the schema.10
  • Lag: Since translation is an asynchronous process, there is an inherent latency. The target formats will always be “eventually consistent” with the source.

5.2 Delta Lake UniForm (Universal Format)

UniForm is a uni-directional solution built directly into the Delta Lake writer. When enabled, the writer generates Iceberg metadata asynchronously alongside the Delta log.41

5.2.1 Mechanism

  • IcebergCompatV2: UniForm requires the Delta table to be configured with iceberg-compat-v2. This restricts certain Delta features that would produce Parquet files unreadable by standard Iceberg readers (e.g., certain timestamp encodings or non-standard types).42
  • The “Read-Only” Compromise: While UniForm allows Trino to read a Delta table as if it were Iceberg, Trino cannot write to this table through the Iceberg interface. The table remains a Delta table; the Iceberg metadata is a read-only view managed by the Delta writer.
  • Feature Gaps: Historically, enabling UniForm disabled Deletion Vectors and Liquid Clustering. However, recent updates in Delta 3.2+ and Databricks Runtime 14+ have begun to support Deletion Vectors with UniForm, provided the Iceberg reader supports the Puffin spec (which defines deletion vectors in Iceberg).43
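
Enabling UniForm amounts to setting two table properties at creation time, as in the sketch below; the table and column names are illustrative, and the property names follow the Delta Lake UniForm documentation.

    spark.sql("""
        CREATE TABLE sales_uniform (order_id BIGINT, amount DECIMAL(10,2), order_ts TIMESTAMP)
        USING DELTA
        TBLPROPERTIES (
            'delta.enableIcebergCompatV2'          = 'true',
            'delta.universalFormat.enabledFormats' = 'iceberg'
        )
    """)
    # Delta writers commit as usual; Iceberg metadata is generated asynchronously so that
    # Iceberg-speaking engines (e.g., Trino) can attach the same table read-only.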

6. Engine-Specific Integration Deep Dives

The theoretical capability of a table format often exceeds its practical support within specific compute engines. This section details the support matrix and integration architecture for Spark, Trino, and Flink as of 2025.

6.1 Apache Spark: The Reference Implementation

Spark is the most mature engine for all three formats. It is the only engine capable of performing all maintenance operations (compaction, clustering, expiration) across the board.

  • Iceberg on Spark:
    • Procedures: Spark is the primary interface for Iceberg stored procedures: CALL catalog.system.expire_snapshots(), CALL catalog.system.rewrite_data_files().
    • Merge-on-Read: Fully supported for both reads and writes. Spark’s Catalyst optimizer can push filters down into the Iceberg manifest reader efficiently.4
  • Delta on Spark:
    • Native Features: Supports all Delta features, including Liquid Clustering and Deletion Vectors. The OPTIMIZE command (used for clustering) is native to Spark, as illustrated in the maintenance sketch below.47
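
The maintenance sketch below illustrates the kind of housekeeping that only Spark currently performs across both formats; catalog and table names are placeholders, and the procedure and command syntax follows the Iceberg and Delta documentation.

    # Iceberg housekeeping via stored procedures.
    spark.sql("CALL demo.system.rewrite_data_files(table => 'db.events')")
    spark.sql("CALL demo.system.expire_snapshots(table => 'db.events', older_than => TIMESTAMP '2025-01-01 00:00:00')")

    # Delta housekeeping: compact/cluster data files and remove unreferenced ones.
    spark.sql("OPTIMIZE sales")
    spark.sql("VACUUM sales RETAIN 168 HOURS")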

6.2 Trino: The Interactive Analytics Workhorse

Trino (formerly PrestoSQL) prioritizes read performance and adherence to SQL standards.

6.2.1 Trino + Delta Lake

  • Deletion Vectors: Trino added support for reading Delta tables with Deletion Vectors in late 2023. This was a critical blocker, as Databricks defaults to enabling this feature.
  • Liquid Clustering: Trino can read tables that use Liquid Clustering (Z-order/Hilbert curves). It leverages the spatial locality for data skipping. However, Trino cannot write to these tables using the clustering layout, nor can it run the optimization job to cluster the data. This creates a functional asymmetry: data must be written/maintained by Spark to benefit from clustering in Trino.49
  • Write Limitations: By default, Trino refuses to write to S3-backed Delta tables (delta.enable-non-concurrent-writes=false) because it does not integrate with Delta’s locking schemes. Setting the flag to true enables writes, but safety then depends on guaranteeing that no other engine writes to the table concurrently.51

6.2.2 Trino + Iceberg

  • Maturity: Iceberg support in Trino is first-class. It supports UPDATE, DELETE, MERGE, and Time Travel.
  • Performance: Trino’s cost-based optimizer utilizes Iceberg’s column statistics (stored in Manifest files) for highly effective partition pruning and split generation.8
  • Time Travel Syntax: Trino standardizes the syntax:
     SELECT * FROM my_table FOR VERSION AS OF 123456789;
     SELECT * FROM my_table FOR TIMESTAMP AS OF TIMESTAMP '2025-01-25 10:00:00';

    This syntax works identically for Iceberg and Delta connectors.52

6.2.3 Trino + Hudi

  • Connector Evolution: Historically, Trino queried Hudi via the Hive connector (input format). A native Hudi connector now exists.
  • Limitations: The native connector excels at Copy-On-Write (COW) tables. Support for Merge-On-Read (MOR) snapshot queries (which require merging Avro logs with Parquet base files on the fly) is computationally expensive and less optimized than in Spark. For MOR tables, Trino often defaults to the “Read Optimized” mode, which reads only the base files, sacrificing data freshness for performance.17

6.3 Apache Flink: The Streaming Frontier

Flink is the engine of choice for Change Data Capture (CDC) and low-latency ingestion.

6.3.1 Flink + Iceberg

  • Checkpoint Integration: Flink’s sink integrates with Iceberg’s commit protocol via Flink’s checkpointing mechanism. Data files are written continuously, but the metadata commit (making files visible) occurs only when the Flink checkpoint completes. This ensures end-to-end exactly-once semantics.
  • Small File Problem: Streaming sinks generate many small files. Flink users must configure the write.distribution-mode and potentially run a concurrent Spark compaction job to maintain read health.46
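
A minimal PyFlink sketch of this commit-on-checkpoint pattern follows; the catalog name, URI, warehouse, and table names are placeholders, and the source table source_stream is assumed to already be registered.

    from pyflink.datastream import StreamExecutionEnvironment
    from pyflink.table import StreamTableEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    env.enable_checkpointing(60_000)   # files become visible only when a checkpoint completes
    t_env = StreamTableEnvironment.create(env)

    t_env.execute_sql("""
        CREATE CATALOG lake WITH (
          'type'         = 'iceberg',
          'catalog-type' = 'rest',
          'uri'          = 'https://catalog.example.com/api',
          'warehouse'    = 's3://example-bucket/warehouse'
        )
    """)

    # Commit cadence equals the checkpoint interval above; write.distribution-mode is set as a
    # table property on the Iceberg side, and a periodic Spark rewrite_data_files job compacts
    # the small files this sink inevitably produces.
    t_env.execute_sql("INSERT INTO lake.db.events SELECT id, ts, payload FROM source_stream")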

6.3.2 Flink + Delta Lake

  • Sink V3: The new Delta Sink V3 utilizes the Delta Kernel, a library designed to unify Delta logic across engines. This has improved startup performance and consistency.
  • Limitations: While Flink can write to Delta, it lags in supporting the newest write features (e.g., writing Liquid Clustered data directly). It relies on optimistic concurrency (via lock providers) for multi-cluster writes.13

6.3.3 The Rise of Apache Paimon

A significant development in the Flink ecosystem is Apache Paimon, which originated as the Flink Table Store sub-project before becoming a top-level Apache project.

  • Architecture: Unlike Iceberg and Delta, which lay out immutable columnar (Parquet) files optimized for batch scans, Paimon organizes its data files into an LSM-Tree (Log-Structured Merge Tree), similar to RocksDB.
  • Use Case: This structure enables significantly higher throughput for streaming upserts (updates/deletes) than standard OTFs. Paimon serves as a “Streaming Lakehouse” storage layer, often acting as the ingestion buffer that is later compacted into Iceberg/Delta for OLAP querying.16

7. Comparison Tables

7.1 Concurrency & Locking Matrix

| Feature | Apache Iceberg | Delta Lake | Apache Hudi |
| --- | --- | --- | --- |
| S3 Consistency | Requires Lock Provider (DynamoDB) or REST Catalog. | Requires S3DynamoDBLogStore (OSS) or Managed Service. | Requires Lock Provider (DynamoDB/ZooKeeper). |
| Locking Granularity | Table-level atomic swap. | Log file sequence (Optimistic). | Timeline commit lock (Optimistic). |
| Multi-Cluster Write | Supported (with shared lock). | Supported (with S3DynamoDBLogStore). | Supported (with shared lock). |
| Non-Blocking (NBCC) | No (Strict Serializability). | No (Serializable/WriteSerializable). | Yes (Experimental support for disjoint writes). |

7.2 Interoperability Matrix

| Source Format | Target via XTable | Target via UniForm |
| --- | --- | --- |
| Apache Iceberg | Delta, Hudi (COW) | N/A |
| Delta Lake | Iceberg, Hudi (COW) | Iceberg, Hudi (Read-Only View) |
| Apache Hudi | Iceberg, Delta | N/A |
| Limitations | No MOR support; feature loss (Generated Columns). | Write-locked to Delta; feature gating. |

8. Conclusion and Future Outlook

The landscape of table format governance in 2025 is defined by convergence. The rigorous competition of the “Format Wars” has produced three highly capable, technically mature formats that are increasingly interoperable.

8.1 Key Takeaways

  1. Catalog is the new Control Plane: The choice of Catalog (Iceberg REST, Nessie, Unity) is now more architecturally significant than the choice of format. The catalog dictates the governance capabilities (Branching, Federation, Security) available to the platform.
  2. Concurrency is the Operational Bottleneck: While engines can interoperate, safely writing to the same table from Flink and Spark requires meticulous configuration of Locking Providers (DynamoDB). The lack of native S3 atomic primitives remains a complexity tax on open-source architectures.
  3. Read vs. Write Asymmetry: We have achieved near-seamless read interoperability (Trino reading Delta via UniForm). However, write interoperability remains fragmented. Advanced write features (Liquid Clustering, Deletion Vectors) often lock the write path to a specific engine (Spark/Databricks), forcing other engines to remain read-only consumers.
  4. Tiered Storage Architectures: The emergence of Apache Paimon suggests a future where data lakes employ a tiered strategy: Paimon for hot, high-velocity streaming data, and Iceberg/Delta for warm, high-performance analytical data.

8.2 Strategic Recommendation

Organizations should prioritize implementing a robust REST-based Catalog layer. This provides the flexibility to switch engines and formats without re-architecting the access control and governance layer. Furthermore, utilizing translation layers like UniForm or XTable is recommended to bridge the gap between “Producer” engines (Spark/Flink) and “Consumer” engines (Trino), accepting the trade-off of a “Read-Optimized” consumption layer in exchange for architectural simplicity.