Apache Iceberg Pocket Book
ACID lakehouse tables • Schema/partition evolution • Time travel • Hidden partitioning • Maintenance & SQL snippets • Interview Q&A
1) What is Apache Iceberg?
Iceberg is an open table format for data lakes that brings database-like guarantees (ACID, snapshots) to files on S3/GCS/ADLS/HDFS. It separates metadata from data files so engines like Spark, Flink, Trino, and Hive can read/write consistently at scale.
# Spark shell with Iceberg (example coords)
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.5.0
2) Core Building Blocks
- Table metadata: schema, partition spec, properties, current snapshot.
- Snapshots: point-in-time views; each write creates a new snapshot.
- Manifests & manifest lists: indexes of data files and partitions.
- Data files: typically Parquet/ORC/Avro in object storage.
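Each layer is queryable through Iceberg's metadata tables; a quick way to inspect them (Spark SQL, assuming the lake.sales table created in section 5):
-- list snapshots (what was written, when, by which operation)
SELECT committed_at, snapshot_id, operation FROM lake.sales.snapshots;
-- list data files with row counts and sizes
SELECT file_path, record_count, file_size_in_bytes FROM lake.sales.files;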
3) Why Iceberg vs Hive tables?
Iceberg tracks files in manifests (not metastore partitions only), supports hidden partitioning, reliable deletes/updates, schema & partition evolution, and time travel with isolation — avoiding the “directory listing” pitfalls of classic Hive tables.
4) Catalogs
Tables live in a catalog that stores metadata locations: Hadoop, Hive, REST, Glue, Nessie, etc. Choose one and configure your engines to point to it for consistent reads/writes.
spark.sql("CREATE CATALOG lake USING 'org.apache.iceberg.spark.SparkCatalog' \
OPTIONS('type'='hadoop','warehouse'='s3://lakehouse/warehouse')")
5) Table Creation (Spark SQL)
CREATE TABLE lake.sales (
id BIGINT, user_id STRING, amount DOUBLE, ts TIMESTAMP, country STRING
) USING iceberg
PARTITIONED BY (years(ts), bucket(16, user_id));
Hidden partitioning means queries filter on the natural columns (e.g., ts); Iceberg applies the transforms and prunes partitions automatically.
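For example, this query prunes to the matching years(ts) partitions without referencing any partition column:
SELECT country, SUM(amount)
FROM lake.sales
WHERE ts >= TIMESTAMP '2025-01-01 00:00:00'
GROUP BY country;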
6) Insert / Update / Delete
INSERT INTO lake.sales VALUES (1,'u1',19.5, TIMESTAMP '2025-08-09 10:00:00','IN');
UPDATE lake.sales SET amount = 21.0 WHERE id = 1;
DELETE FROM lake.sales WHERE id = 1;
7) MERGE (Upsert)
MERGE INTO lake.sales t
USING lake.staging s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
8) Time Travel
Query any snapshot by id or timestamp for audits and debugging.
-- by snapshot id
SELECT * FROM lake.sales VERSION AS OF 1789456123456;
-- by timestamp
SELECT * FROM lake.sales TIMESTAMP AS OF '2025-08-01 09:30:00';
9) Incremental Reads
Read only the data committed between two snapshots for efficient downstream ETL. In Spark this is a DataFrame read option (the start snapshot is exclusive, the end inclusive):
# incremental read between two snapshot IDs (covers appended rows)
spark.read.format("iceberg")
  .option("start-snapshot-id", "1789000000000")
  .option("end-snapshot-id", "1789456123456")
  .load("lake.sales")
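Recent Iceberg Spark runtimes also ship a changelog-view procedure for row-level change feeds; a sketch (the view name defaults to <table>_changes):
CALL lake.system.create_changelog_view(
  table => 'lake.sales',
  options => map('start-snapshot-id','1789000000000',
                 'end-snapshot-id','1789456123456'));
SELECT * FROM sales_changes;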
10) Row-Level Deletes & Position Deletes
Iceberg supports equality deletes (by key) and position deletes (file/row positions). Engines merge deletes with data files at read time for consistent results.
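Which flavor a writer produces is controlled per table; a sketch of switching DML to merge-on-read (these modes require a format-version 2 table):
ALTER TABLE lake.sales SET TBLPROPERTIES (
  'write.delete.mode'='merge-on-read',
  'write.update.mode'='merge-on-read',
  'write.merge.mode'='merge-on-read');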
11) Hidden Partitioning
Define transforms such as years(ts), hours(ts), bucket(32, col), and truncate(N, col). Queries stay simple; pruning is automatic via manifests.
12) Partition Evolution
Change the partition spec over time without rewriting old data; Iceberg tracks specs per file so reads remain correct across generations.
ALTER TABLE lake.sales ADD PARTITION FIELD hours(ts);
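Fields can be removed the same way; old files keep the spec they were written with:
ALTER TABLE lake.sales DROP PARTITION FIELD bucket(16, user_id);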
13) Schema Evolution
Add, drop, rename, reorder columns safely. Field IDs ensure compatibility across writers and readers.
ALTER TABLE lake.sales ADD COLUMN channel STRING;
ALTER TABLE lake.sales RENAME COLUMN user_id TO customer_id;
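Drops and reorders are metadata-only operations; for illustration:
ALTER TABLE lake.sales ALTER COLUMN channel AFTER amount; -- reorder
ALTER TABLE lake.sales DROP COLUMN channel;               -- drop; no data rewrite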
14) Compaction (File Size Optimization)
Combine many small files into larger ones for faster scans.
CALL lake.system.rewrite_data_files(table => 'lake.sales', options => map('min-input-files','10'));
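rewrite_data_files can also cluster data while compacting; a sketch using the documented sort strategy:
CALL lake.system.rewrite_data_files(
  table => 'lake.sales',
  strategy => 'sort',
  sort_order => 'ts DESC NULLS LAST');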
15) Metadata & Manifest Maintenance
Rewrite manifests to improve pruning and clean old snapshots to control storage.
CALL lake.system.rewrite_manifests('lake.sales');
CALL lake.system.expire_snapshots('lake.sales', TIMESTAMP '2025-07-01 00:00:00');
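expire_snapshots does not touch files that no commit ever referenced; remove_orphan_files covers those (by default it only deletes files older than a safety interval):
CALL lake.system.remove_orphan_files(table => 'lake.sales');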
16) Spark Config (Example)
spark.sql.catalog.lake=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lake.type=hadoop
spark.sql.catalog.lake.warehouse=s3://lakehouse/warehouse
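Other catalog types swap in the same way; a sketch for a REST catalog (the URI is a placeholder for your service):
spark.sql.catalog.lake=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lake.type=rest
spark.sql.catalog.lake.uri=https://iceberg-rest.example.com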
17) Trino/Presto SQL
SELECT * FROM lake.default.sales WHERE ts > current_timestamp - INTERVAL '1' DAY;
-- snapshot expiry in Trino runs via ALTER TABLE ... EXECUTE
ALTER TABLE lake.default.sales EXECUTE expire_snapshots(retention_threshold => '30d');
18) Flink Streaming Writes
-- Flink SQL sink (simplified); partitioning comes from the Iceberg table's spec,
-- since Flink DDL does not accept transforms like years(ts)
CREATE TABLE sales_sink (
  id BIGINT, user_id STRING, amount DOUBLE, ts TIMESTAMP(3),
  WATERMARK FOR ts AS ts
) WITH ('connector'='iceberg','catalog-name'='lake','catalog-type'='hadoop',
        'catalog-database'='default','catalog-table'='sales',
        'warehouse'='s3://lakehouse/warehouse');
INSERT INTO sales_sink SELECT id, user_id, amount, ts FROM source_stream; -- source_stream: your upstream table
19) Governance & Security
Use catalog-level auth (Glue/Hive/REST), S3/ADLS IAM policies, encryption at rest, and object lock/versioning if required. Keep audit trails via snapshots and incremental/changelog reads.
20) Common Pitfalls
- Too many tiny files → schedule compactions.
- Mismatched engine versions/connectors.
- Forgetting snapshot expiry → metadata growth.
- Over-partitioning or skewed buckets → poor pruning.
21) Ops Checklist
- Automate expire_snapshots and rewrite_data_files.
- Watch table metrics: file count, avg file size, snapshots, manifests (see the queries below).
- Validate schema/partition changes in lower envs first.
- Back up catalog metadata and enable bucket versioning.
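A minimal health check against the metadata tables (assuming the lake.sales table from section 5):
-- file count and average file size in MB
SELECT COUNT(*) AS data_files,
       AVG(file_size_in_bytes) / (1024 * 1024) AS avg_mb
FROM lake.sales.files;
-- snapshot count (watch for unbounded growth)
SELECT COUNT(*) AS snapshots FROM lake.sales.snapshots;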
22) Performance Tips
- Prefer Parquet with column pruning plus ZSTD or Snappy compression.
- Aim for a 128–512 MB target file size (both are settable as table properties; see below).
- Use partition transforms that match query filters.
- Leverage vectorized reads in engines.
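A sketch of encoding the first two tips as table properties (property names are Iceberg's documented write options):
ALTER TABLE lake.sales SET TBLPROPERTIES (
  'write.target-file-size-bytes'='268435456',   -- 256 MB target
  'write.parquet.compression-codec'='zstd');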
23) Interview Q&A — 10 Quick Ones
1) Iceberg vs Delta vs Hudi? Iceberg excels at hidden + evolving partitions and multi-engine support; Delta has tight Databricks integration; Hudi focuses on fast upserts/incremental pulls.
2) What is a snapshot? Immutable table state pointer; queries reference the current snapshot unless time traveling.
3) How do deletes work? Equality or position deletes are merged at read time for consistency.
4) Why compaction? Fewer, larger files improve scan throughput and pruning.
5) Hidden partitioning? The engine applies the transforms; users query natural columns without writing partition predicates.
6) Partition evolution? New spec added; old files keep prior spec — reads remain correct.
7) How to do incremental ETL? Use incremental snapshot reads or a changelog view (engine-dependent) to fetch only modified rows.
8) Concurrency control? Optimistic commits with atomic metadata pointer updates.
9) Catalog choice? Use a centralized catalog (Glue/Hive/REST) shared by all engines.
10) Cost control? Expire snapshots, compact small files, and avoid over-partitioning.