Delta Lake Pocket Book — Uplatz

50 deep-dive flashcards • Single column • Fewer scrolls • 20+ Interview Q&A • Readable code examples

Section 1 — Fundamentals

1) What is Delta Lake?

Delta Lake is an open-source storage layer that brings ACID transactions, schema enforcement, and reliability to big data lakes. It allows scalable batch and streaming data processing while maintaining strong consistency and data versioning.

# PySpark example: read a Delta table (assumes an active SparkSession `spark`)
df = spark.read.format("delta").load("/mnt/delta/events")

2) Why Delta Lake? Strengths & Tradeoffs

Strengths: ACID guarantees on data lakes, schema evolution, time travel, and unified batch/streaming processing at scale. Tradeoffs: full functionality has historically required a Spark runtime, and the transaction log adds metadata overhead compared to raw Parquet; in exchange, Delta delivers governance and reliability.

# Write data as Delta
df.write.format("delta").mode("overwrite").save("/mnt/delta/users")

3) Delta Log (_delta_log)

Delta Lake maintains a transaction log (_delta_log) of JSON commit files and Parquet checkpoint files. The log tracks commits, schema changes, and metadata, providing the foundation for ACID guarantees and reproducibility.

# View log files
dbutils.fs.ls("/mnt/delta/users/_delta_log")

4) ACID Transactions

Atomicity, Consistency, Isolation, Durability. Delta ensures concurrent writers/readers can operate safely, with optimistic concurrency control and conflict detection.

spark.sql("UPDATE users SET active=false WHERE last_login < '2022-01-01'")

5) Schema Enforcement

Delta enforces the table schema on write and rejects writes with mismatched types, preventing silent corruption. Schema evolution can be enabled explicitly to merge new columns.

df.write.option("mergeSchema","true").format("delta").mode("append").save("/mnt/delta/users")

6) Time Travel

Query older versions of Delta tables by version or timestamp for audits, debugging, and reproducible experiments.

spark.read.format("delta").option("versionAsOf",5).load("/mnt/delta/users")

7) Data Compaction (OPTIMIZE)

Delta supports compaction to combine small files into larger parquet files, improving read efficiency.

OPTIMIZE delta.`/mnt/delta/users`

8) Vacuum & Retention

VACUUM removes data files that are no longer referenced by the table and are older than the retention threshold (default: 7 days). Lowering the threshold limits time travel, so confirm audit and recovery requirements first.

VACUUM delta.`/mnt/delta/users` RETAIN 168 HOURS

9) Delta Format Internals

Delta tables store data as Parquet files plus the transaction log (_delta_log). JSON files record individual commits; periodic Parquet checkpoints summarize log state so readers can reconstruct the latest snapshot quickly.
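
Commit files are newline-delimited JSON, so Spark can inspect them directly; a small sketch (the table path and version-zero file name are illustrative):

# Each action type (add, remove, metaData, commitInfo) appears as a column
spark.read.json("/mnt/delta/users/_delta_log/00000000000000000000.json").printSchema()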

10) Q&A — “What problem does Delta solve?”

Answer: It solves the “data swamp” problem by adding ACID, schema, versioning, and reliability to cheap storage, enabling consistent analytics and ML pipelines on data lakes.

Section 2 — Core APIs & Modules

11) DeltaTable API

DeltaTable provides programmatic access to Delta features like history, vacuum, and updates via Python/Scala APIs.

from delta.tables import DeltaTable
dt = DeltaTable.forPath(spark, "/mnt/delta/users")
dt.history().show()

12) MERGE INTO

Delta SQL supports MERGE INTO for upserts (insert + update). Efficient for CDC pipelines and master data management.

MERGE INTO users u
USING updates s ON u.id = s.id
WHEN MATCHED THEN UPDATE SET u.email=s.email
WHEN NOT MATCHED THEN INSERT *
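
The same upsert is available programmatically; a sketch using the DeltaTable Python API, where `updates_df` is an assumed source DataFrame:

from delta.tables import DeltaTable

# Programmatic equivalent of the MERGE statement above
(DeltaTable.forName(spark, "users").alias("u")
    .merge(updates_df.alias("s"), "u.id = s.id")
    .whenMatchedUpdate(set={"email": "s.email"})
    .whenNotMatchedInsertAll()
    .execute())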

13) Streaming Reads & Writes

Delta unifies batch and streaming. Use Spark Structured Streaming to read/write from/to Delta tables reliably.

# Stream from one Delta table into another; the checkpoint gives the sink exactly-once semantics
stream = spark.readStream.format("delta").load("/mnt/delta/events")
query = (stream.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/ckpt")
    .start("/mnt/delta/output"))

14) DML Support

Delta supports DELETE, UPDATE, and MERGE SQL operations, unlike plain Parquet. Enables full CRUD on lake data.

DELETE FROM users WHERE active=false

15) Z-Ordering

ZORDER BY co-locates related information in storage to speed up queries with filters. Works with OPTIMIZE.

OPTIMIZE users ZORDER BY (country, signup_date)

16) Change Data Feed (CDF)

CDF exposes row-level changes between versions. Useful for incremental ETL and ML model refresh.

spark.read.format("delta").option("readChangeFeed","true").option("startingVersion",10).table("users")

17) Generate Manifest

Delta can generate symlink manifest files so engines without native Delta support (e.g., Presto or Athena via Hive external tables) can read a consistent snapshot of the table.

GENERATE symlink_format_manifest FOR TABLE users

18) Constraints

Define CHECK constraints to enforce data rules on Delta tables. Violating writes are rejected.

ALTER TABLE users ADD CONSTRAINT valid_age CHECK (age >= 0)

19) Table History

Use DESCRIBE HISTORY to audit changes: operation, user, timestamp, version, and notebook/job origin.

DESCRIBE HISTORY users

20) Q&A — “MERGE vs UPSERT?”

Answer: MERGE is the SQL operation in Delta that performs UPSERT logic: update when matched, insert when not. It unifies both in a single atomic operation.

Section 3 — Async, Patterns & Concurrency

21) Optimistic Concurrency

Delta uses optimistic concurrency: writers validate against the transaction log at commit time; if a conflicting commit has landed, the transaction aborts and can be retried. This keeps concurrent jobs correct without locking.
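
A minimal retry sketch, assuming the delta-spark package (recent releases expose conflict exceptions under delta.exceptions) and an existing DataFrame `df`:

from delta.exceptions import ConcurrentAppendException

# Retry the write a few times if another writer commits a conflicting change
for attempt in range(3):
    try:
        df.write.format("delta").mode("append").save("/mnt/delta/users")
        break
    except ConcurrentAppendException:
        pass  # conflict detected at commit; retry against the new snapshot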

22) Batch + Streaming Unification

One Delta table supports both batch queries and streaming queries simultaneously, unifying Lambda and Kappa architectures.
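
For example, the same table can back a batch report and a streaming job at once:

# One table, two access patterns
batch_df = spark.read.format("delta").load("/mnt/delta/events")
stream_df = spark.readStream.format("delta").load("/mnt/delta/events")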

23) Incremental ETL

Use MERGE with Change Data Feed for incremental ETL pipelines, reducing compute by only processing changed data.
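
A sketch of the pattern, with hypothetical users_source and users_target tables and an illustrative starting version:

# Pull row-level changes, keep final row images, and drop CDF metadata columns
changes = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 10)
    .table("users_source")
    .filter("_change_type IN ('insert', 'update_postimage')")
    .drop("_change_type", "_commit_version", "_commit_timestamp"))
changes.createOrReplaceTempView("updates")
spark.sql("""
    MERGE INTO users_target t
    USING updates s ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.email = s.email
    WHEN NOT MATCHED THEN INSERT *
""")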

24) Multi-Cluster Concurrency

Delta allows concurrent reads and writes from multiple Spark clusters. Snapshot isolation guarantees that readers always see a consistent table version, while commit-time conflict detection serializes writers.

25) Streaming Upserts

Implement CDC and streaming upserts with MERGE INTO in Structured Streaming pipelines, ensuring real-time lakehouse consistency.
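
The standard pattern is foreachBatch: run a MERGE per micro-batch. A sketch, where `cdc_stream` is an assumed streaming DataFrame keyed by id:

from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # MERGE each micro-batch into the target Delta table
    (DeltaTable.forPath(spark, "/mnt/delta/users").alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(cdc_stream.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/mnt/ckpt/upserts")
    .start())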

26) Idempotency

Delta's MERGE and atomic commits make ETL jobs idempotent and retry-safe: a failed job leaves no partial writes, and a rerun need not produce duplicate rows.
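
Recent Delta releases also support explicitly idempotent writes via writer options; a sketch, assuming Delta 2.0+ and an application-chosen ID and version:

# Delta skips commits it has already recorded for this (txnAppId, txnVersion) pair
(df.write.format("delta")
    .option("txnAppId", "nightly_etl")
    .option("txnVersion", 42)
    .mode("append")
    .save("/mnt/delta/users"))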

27) Data Skipping

Delta maintains statistics per file (min/max values) enabling data skipping. Reduces scan cost for selective queries.

28) Partitioning

Partition Delta tables by low-cardinality columns (date, region) so queries can prune partitions. Combine with OPTIMIZE and ZORDER to keep file sizes balanced and co-locate data within partitions.
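
A minimal partitioned write (path illustrative):

# Partition by a low-cardinality column for pruning
df.write.format("delta").partitionBy("country").save("/mnt/delta/users_by_country")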

29) DML Isolation

Concurrent DML queries (UPDATE, DELETE, MERGE) are isolated by versioned snapshots. Readers never see partial writes.

30) Q&A — “How does Delta handle concurrent writes?”

Answer: Through optimistic concurrency: a commit fails if the transaction log has received conflicting updates since the transaction began; the client then retries against the new snapshot.

Section 4 — Frameworks, Data & APIs

31) Delta + Databricks

Delta is tightly integrated with Databricks: SQL APIs, UI support for time travel, OPTIMIZE and ZORDER, and Delta Live Tables (DLT) for managed pipelines.

32) Delta + Presto/Trino

Delta connectors allow Presto/Trino to read Delta tables. Hive manifests improve compatibility for non-Spark engines.

33) Delta + ML

Use Delta tables as feature store input, enabling reproducible ML training with versioned data and time travel.

34) Delta Sharing

Open protocol for secure data sharing across clouds and orgs, without copying data. Supports fine-grained access.
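
Shared tables are consumable with the open delta-sharing client library; a sketch, where the profile file path and table coordinates are illustrative:

import delta_sharing

# The profile file holds the endpoint and bearer token issued by the provider
df = delta_sharing.load_as_pandas("/path/to/config.share#my_share.my_schema.my_table")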

35) Delta + BI Tools

Expose Delta tables via Databricks SQL, Presto, or manifests to connect PowerBI/Tableau. Provides consistent semantics for BI queries.

36) Delta + Structured Streaming

Delta tables are first-class sources and sinks for Structured Streaming: checkpointing plus the transaction log gives exactly-once sinks, and the same table can serve batch and streaming consumers simultaneously.
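
A useful addition over the basic pattern in card 13 is incremental backfill with the availableNow trigger, which processes all currently available data and then stops (Spark 3.3+; paths illustrative):

# Incremental batch over a Delta source
(spark.readStream.format("delta").load("/mnt/delta/events")
    .writeStream.format("delta")
    .trigger(availableNow=True)
    .option("checkpointLocation", "/mnt/ckpt/backfill")
    .start("/mnt/delta/events_out"))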

37) Delta Live Tables

Delta Live Tables (DLT) is Databricks' declarative pipeline framework: you define transformations and data-quality expectations, and DLT manages orchestration, retries, monitoring, and the underlying Delta tables.
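
A minimal sketch of a DLT dataset with an expectation (this runs only inside a DLT pipeline; table and column names are illustrative):

import dlt

@dlt.table(comment="Cleaned events")
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")  # drop rows failing the rule
def clean_events():
    return spark.read.format("delta").load("/mnt/delta/events")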

38) Delta OSS vs Databricks

The open-source project provides the core format, transaction log, and APIs; Databricks layers on managed and performance features such as DLT and Photon. Recent OSS releases have narrowed the gap, with OPTIMIZE, ZORDER, and Change Data Feed available since Delta 2.x.

39) Delta in Azure Synapse

Azure Synapse reads and writes Delta tables from its Spark pools, and serverless SQL pools can query Delta directly, making Delta usable across the Azure analytics stack.

40) Q&A

Answer: Delta integrates across the ecosystem: native in Databricks, readable from Presto/Trino and Synapse via connectors or manifests, shareable across organizations through Delta Sharing, and pluggable into streaming, ML, and BI workflows.

Section 5 — Security, Testing, Deployment, Observability & Interview Q&A

41) Governance

Apply role-based access control and table ACLs (e.g., Unity Catalog on Databricks), track lineage, and audit every change through the transaction log and DESCRIBE HISTORY.

42) Data Quality

Enforce quality in layers: CHECK and NOT NULL constraints at the table level, expectations in DLT pipelines, and validation jobs that quarantine bad records.
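
Alongside the CHECK constraint from card 18, NOT NULL is the other built-in constraint type; a minimal sketch:

# Reject writes with a null email
spark.sql("ALTER TABLE users ALTER COLUMN email SET NOT NULL")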

43) Testing Pipelines

Unit-test transformation logic against small local Delta tables (the OSS delta-spark package runs on a plain local SparkSession), and assert on schemas and row counts in CI before promoting pipelines.
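
A local Delta-enabled SparkSession for tests, following the OSS delta-spark setup:

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Local session with the Delta SQL extension and catalog registered
builder = (SparkSession.builder.appName("tests")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()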

44) Deployment

Treat pipelines as code: version notebooks and job definitions in Git, provision workspaces and clusters with Terraform, and promote changes through dev/staging/prod with CI/CD.

45) Observability

Monitor job-level metrics, streaming query progress, and table health (file counts and sizes via DESCRIBE DETAIL); DESCRIBE HISTORY doubles as an audit trail.
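
For streaming jobs, the query handle exposes per-batch metrics; a tiny sketch, assuming a running query handle `query`:

# Rows/sec, batch durations, and source offsets for the last micro-batch
print(query.lastProgress)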

46) Cost Optimization

Control costs with a VACUUM schedule, regular compaction via OPTIMIZE, sensible partitioning, and deliberate retention settings per table.
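
Retention is tunable per table; a sketch of the relevant properties (values illustrative):

# Shorter windows reduce storage but limit time travel
spark.sql("""
    ALTER TABLE users SET TBLPROPERTIES (
      delta.logRetentionDuration = 'interval 30 days',
      delta.deletedFileRetentionDuration = 'interval 7 days'
    )
""")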

47) Multi-cloud

Delta runs on all major object stores: S3, ADLS, and GCS. Note that multi-writer concurrency on S3 requires a coordinating LogStore (e.g., the DynamoDB-backed implementation in OSS Delta).

48) Prod Checklist

  • Retention and time-travel policy defined
  • CHECK / NOT NULL constraints in place
  • Scheduled VACUUM and OPTIMIZE
  • Audit via DESCRIBE HISTORY

49) Common Pitfalls

Common pitfalls: never vacuuming (storage bloat), vacuuming too aggressively (broken time travel), unmanaged schema drift, floods of small files from streaming writes, and over-partitioning on high-cardinality columns.

50) Interview Q&A — 20 Practical Questions

1) Why Delta vs Parquet?

2) Explain _delta_log…

3) What is time travel?

4) How does merge work?

5) Role of CDF?

6) Optimistic concurrency?

7) ZORDER vs partitioning?

8) When to VACUUM?

9) Idempotency in ETL?

10) Delta vs Hive ACID?

11) Streaming writes?

12) Governance features?

13) Delta Sharing protocol?

14) Schema evolution vs enforcement?

15) Pitfalls of small files?

16) What is OPTIMIZE?

17) Role of checkpoints?

18) What is Delta Live Tables?

19) How to integrate with BI?

20) Future of Lakehouse?