Apache Iceberg Pocket Book
ACID lakehouse tables • Schema/partition evolution • Time travel • Hidden partitioning • Maintenance & SQL snippets • Interview Q&A
1) What is Apache Iceberg?
Iceberg is an open table format for data lakes that brings database-like guarantees (ACID, snapshots) to files on S3/GCS/ADLS/HDFS. It separates metadata from data files so engines like Spark, Flink, Trino, and Hive can read/write consistently at scale.
# Spark shell with Iceberg (example coords)
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.5.0
2) Core Building Blocks
- Table metadata: schema, partition spec, properties, current snapshot.
- Snapshots: point-in-time views; each write creates a new snapshot.
- Manifests & manifest lists: indexes of data files and partitions.
- Data files: typically Parquet/ORC/Avro in object storage.
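Each layer is queryable through Iceberg's metadata tables; a quick way to inspect them (Spark SQL, assuming the lake.sales table created in section 5):
-- list snapshots (what was written, when, by which operation)
SELECT committed_at, snapshot_id, operation FROM lake.sales.snapshots;
-- list data files with row counts and sizes
SELECT file_path, record_count, file_size_in_bytes FROM lake.sales.files;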
3) Why Iceberg vs Hive tables?
Iceberg tracks files in manifests (not metastore partitions only), supports hidden partitioning, reliable deletes/updates, schema & partition evolution, and time travel with isolation — avoiding the “directory listing” pitfalls of classic Hive tables.
4) Catalogs
Tables live in a catalog that stores metadata locations: Hadoop, Hive, REST, Glue, Nessie, etc. Choose one and configure your engines to point to it for consistent reads/writes.
spark.sql("CREATE CATALOG lake USING 'org.apache.iceberg.spark.SparkCatalog' \
OPTIONS('type'='hadoop','warehouse'='s3://lakehouse/warehouse')")
5) Table Creation (Spark SQL)
CREATE TABLE lake.sales (
id BIGINT, user_id STRING, amount DOUBLE, ts TIMESTAMP, country STRING
) USING iceberg
PARTITIONED BY (years(ts), bucket(16, user_id));
Hidden partitioning means queries filter on the natural columns (e.g., ts); Iceberg applies the transforms and prunes partitions automatically.
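For example, this query prunes to the matching years(ts) partitions without referencing any partition column:
SELECT country, SUM(amount)
FROM lake.sales
WHERE ts >= TIMESTAMP '2025-01-01 00:00:00'
GROUP BY country;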
6) Insert / Update / Delete
INSERT INTO lake.sales VALUES (1,'u1',19.5, TIMESTAMP '2025-08-09 10:00:00','IN');
UPDATE lake.sales SET amount = 21.0 WHERE id = 1;
DELETE FROM lake.sales WHERE id = 1;
7) MERGE (Upsert)
MERGE INTO lake.sales t
USING lake.staging s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
8) Time Travel
Query any snapshot by id or timestamp for audits and debugging.
-- by snapshot id
SELECT * FROM lake.sales VERSION AS OF 1789456123456;
-- by timestamp
SELECT * FROM lake.sales TIMESTAMP AS OF '2025-08-01 09:30:00';
9) Incremental Reads
Read only the data committed between two snapshots for efficient downstream ETL. In Spark this is a DataFrame read option (the start snapshot is exclusive, the end inclusive):
# incremental read between two snapshot IDs (covers appended rows)
spark.read.format("iceberg")
  .option("start-snapshot-id", "1789000000000")
  .option("end-snapshot-id", "1789456123456")
  .load("lake.sales")
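Recent Iceberg Spark runtimes also ship a changelog-view procedure for row-level change feeds; a sketch (the view name defaults to <table>_changes):
CALL lake.system.create_changelog_view(
  table => 'lake.sales',
  options => map('start-snapshot-id','1789000000000',
                 'end-snapshot-id','1789456123456'));
SELECT * FROM sales_changes;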
10) Row-Level Deletes & Position Deletes
Iceberg supports equality deletes (by key) and position deletes (file/row positions). Engines merge deletes with data files at read time for consistent results.
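Which flavor a writer produces is controlled per table; a sketch of switching DML to merge-on-read (these modes require a format-version 2 table):
ALTER TABLE lake.sales SET TBLPROPERTIES (
  'write.delete.mode'='merge-on-read',
  'write.update.mode'='merge-on-read',
  'write.merge.mode'='merge-on-read');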
11) Hidden Partitioning
Define transforms such as years(ts), hours(ts), bucket(32, col), and truncate(N, col). Queries stay simple; pruning is automatic via manifests.
12) Partition Evolution
Change the partition spec over time without rewriting old data; Iceberg tracks specs per file so reads remain correct across generations.
ALTER TABLE lake.sales ADD PARTITION FIELD hours(ts);
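Fields can be removed the same way; old files keep the spec they were written with:
ALTER TABLE lake.sales DROP PARTITION FIELD bucket(16, user_id);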
13) Schema Evolution
Add, drop, rename, reorder columns safely. Field IDs ensure compatibility across writers and readers.
ALTER TABLE lake.sales ADD COLUMN channel STRING;
ALTER TABLE lake.sales RENAME COLUMN user_id TO customer_id;
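Drops and reorders are metadata-only operations; for illustration:
ALTER TABLE lake.sales ALTER COLUMN channel AFTER amount; -- reorder
ALTER TABLE lake.sales DROP COLUMN channel;               -- drop; no data rewrite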
14) Compaction (File Size Optimization)
Combine many small files into larger ones for faster scans.
CALL lake.system.rewrite_data_files(table => 'lake.sales', options => map('min-input-files','10'));
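rewrite_data_files can also cluster data while compacting; a sketch using the documented sort strategy:
CALL lake.system.rewrite_data_files(
  table => 'lake.sales',
  strategy => 'sort',
  sort_order => 'ts DESC NULLS LAST');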
15) Metadata & Manifest Maintenance
Rewrite manifests to improve pruning and clean old snapshots to control storage.
CALL lake.system.rewrite_manifests('lake.sales');
CALL lake.system.expire_snapshots('lake.sales', TIMESTAMP '2025-07-01 00:00:00');
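expire_snapshots does not touch files that no commit ever referenced; remove_orphan_files covers those (by default it only deletes files older than a safety interval):
CALL lake.system.remove_orphan_files(table => 'lake.sales');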
16) Spark Config (Example)
spark.sql.catalog.lake=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lake.type=hadoop
spark.sql.catalog.lake.warehouse=s3://lakehouse/warehouse
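Other catalog types swap in the same way; a sketch for a REST catalog (the URI is a placeholder for your service):
spark.sql.catalog.lake=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lake.type=rest
spark.sql.catalog.lake.uri=https://iceberg-rest.example.com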
17) Trino/Presto SQL
SELECT * FROM lake.default.sales WHERE ts > current_timestamp - INTERVAL '1' DAY;
-- snapshot expiry in Trino runs via ALTER TABLE ... EXECUTE
ALTER TABLE lake.default.sales EXECUTE expire_snapshots(retention_threshold => '30d');
18) Flink Streaming Writes
-- Flink SQL sink (simplified); partitioning comes from the Iceberg table's spec,
-- since Flink DDL does not accept transforms like years(ts)
CREATE TABLE sales_sink (
  id BIGINT, user_id STRING, amount DOUBLE, ts TIMESTAMP(3),
  WATERMARK FOR ts AS ts
) WITH ('connector'='iceberg','catalog-name'='lake','catalog-type'='hadoop',
        'catalog-database'='default','catalog-table'='sales',
        'warehouse'='s3://lakehouse/warehouse');
INSERT INTO sales_sink SELECT id, user_id, amount, ts FROM source_stream; -- source_stream: your upstream table
19) Governance & Security
Use catalog-level auth (Glue/Hive/REST), S3/ADLS IAM policies, encryption at rest, and object lock/versioning if required. Keep audit trails via snapshots and incremental/changelog reads.
20) Common Pitfalls
- Too many tiny files → schedule compactions.
- Mismatched engine versions/connectors.
- Forgetting snapshot expiry → metadata growth.
- Over-partitioning or skewed buckets → poor pruning.
21) Ops Checklist
- Automate expire_snapshots and rewrite_data_files.
- Watch table metrics: file count, avg file size, snapshots, manifests (see the queries below).
- Validate schema/partition changes in lower envs first.
- Back up catalog metadata and enable bucket versioning.
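A minimal health check against the metadata tables (assuming the lake.sales table from section 5):
-- file count and average file size in MB
SELECT COUNT(*) AS data_files,
       AVG(file_size_in_bytes) / (1024 * 1024) AS avg_mb
FROM lake.sales.files;
-- snapshot count (watch for unbounded growth)
SELECT COUNT(*) AS snapshots FROM lake.sales.snapshots;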
22) Performance Tips
- Prefer Parquet with column pruning plus ZSTD or Snappy compression.
- Aim for a 128–512 MB target file size (both are settable as table properties; see below).
- Use partition transforms that match query filters.
- Leverage vectorized reads in engines.
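A sketch of encoding the first two tips as table properties (property names are Iceberg's documented write options):
ALTER TABLE lake.sales SET TBLPROPERTIES (
  'write.target-file-size-bytes'='268435456',   -- 256 MB target
  'write.parquet.compression-codec'='zstd');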
23) Interview Q&A — 10 Quick Ones
1) Iceberg vs Delta vs Hudi? Iceberg excels at hidden + evolving partitions and multi-engine support; Delta has tight Databricks integration; Hudi focuses on fast upserts/incremental pulls.
2) What is a snapshot? Immutable table state pointer; queries reference the current snapshot unless time traveling.
3) How do deletes work? Equality or position deletes are merged at read time for consistency.
4) Why compaction? Fewer, larger files improve scan throughput and pruning.
5) Hidden partitioning? The engine applies the transforms; users query natural columns without writing partition predicates.
6) Partition evolution? New spec added; old files keep prior spec — reads remain correct.
7) How to do incremental ETL? Use incremental snapshot reads or a changelog view (engine-dependent) to fetch only modified rows.
8) Concurrency control? Optimistic commits with atomic metadata pointer updates.
9) Catalog choice? Use a centralized catalog (Glue/Hive/REST) shared by all engines.
10) Cost control? Expire snapshots, compact small files, and avoid over-partitioning.