Delta Lake Pocket Book — Uplatz
50 deep-dive flashcards • Single column • Fewer scrolls • 20+ Interview Q&A • Readable code examples
1) What is Delta Lake?
Delta Lake is an open-source storage layer that brings ACID transactions, schema enforcement, and reliability to big data lakes. It allows scalable batch and streaming data processing while maintaining strong consistency and data versioning.
# PySpark example (delta-spark package; assumes an active SparkSession)
df = spark.read.format("delta").load("/mnt/delta/events")
2) Why Delta Lake? Strengths & Tradeoffs
Strengths: ACID guarantees on data lakes, schema evolution, time travel, and scalable streaming/batch unification. Tradeoffs: tied primarily to the Spark runtime (connectors for other engines exist) and transaction-log overhead compared to raw Parquet; in exchange you get governance and reliability.
# Write data as Delta
df.write.format("delta").mode("overwrite").save("/mnt/delta/users")
3) Delta Log (_delta_log)
Delta Lake maintains a transaction log (_delta_log) storing JSON commit files and checkpoint files. This log tracks commits, schema changes, and metadata, ensuring ACID guarantees and reproducibility.
# View log files
dbutils.fs.ls("/mnt/delta/users/_delta_log")
4) ACID Transactions
Atomicity, Consistency, Isolation, Durability. Delta ensures concurrent writers/readers can operate safely, with optimistic concurrency control and conflict detection.
spark.sql("UPDATE users SET active=false WHERE last_login < '2022-01-01'")
5) Schema Enforcement
Delta enforces the table schema on write and can optionally evolve it (merging new columns), preventing corruption from mismatched data types.
df.write.option("mergeSchema","true").format("delta").mode("append").save("/mnt/delta/users")
6) Time Travel
Query older versions of Delta tables by version or timestamp for audits, debugging, and reproducible experiments.
spark.read.format("delta").option("versionAsOf",5).load("/mnt/delta/users")
7) Data Compaction (OPTIMIZE)
Delta supports compaction to combine small files into larger parquet files, improving read efficiency.
OPTIMIZE delta.`/mnt/delta/users`
8) Vacuum & Retention
VACUUM permanently deletes data files that are no longer referenced by the table and are older than the retention threshold; this also limits how far back time travel can go. Default retention: 7 days (168 hours). Verify compliance and time-travel needs before lowering it.
VACUUM delta.`/mnt/delta/users` RETAIN 168 HOURS
9) Delta Format Internals
Delta tables store data as Parquet files + transaction log (_delta_log). JSON logs record commits; Parquet checkpoints speed recovery. Readers combine them for the latest snapshot.
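To see the layout, peek inside a commit file: each JSON line is one action (add, remove, metaData, commitInfo). A minimal sketch; the path and file name are illustrative.
# Each commit is a numbered JSON file of actions
commit = spark.read.json("/mnt/delta/users/_delta_log/00000000000000000000.json")
commit.printSchema()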
10) Q&A — “What problem does Delta solve?”
Answer: It solves the “data swamp” problem by adding ACID transactions, schema enforcement, versioning, and reliability on top of cheap storage, enabling consistent analytics and ML pipelines on data lakes.
11) DeltaTable API
DeltaTable provides programmatic access to Delta features like history, vacuum, and updates via Python/Scala APIs.
from delta.tables import DeltaTable
dt = DeltaTable.forPath(spark, "/mnt/delta/users")
dt.history().show()
12) MERGE INTO
Delta SQL supports MERGE INTO for upserts (insert + update). Efficient for CDC pipelines and master data management.
MERGE INTO users u
USING updates s ON u.id = s.id
WHEN MATCHED THEN UPDATE SET u.email=s.email
WHEN NOT MATCHED THEN INSERT *
13) Streaming Reads & Writes
Delta unifies batch and streaming. Use Spark Structured Streaming to read/write from/to Delta tables reliably.
stream = spark.readStream.format("delta").load("/mnt/delta/events")
query = stream.writeStream.format("delta").option("checkpointLocation","/mnt/ckpt").start("/mnt/delta/output")
14) DML Support
Delta supports DELETE, UPDATE, and MERGE SQL operations, unlike plain Parquet. Enables full CRUD on lake data.
DELETE FROM users WHERE active=false
15) Z-Ordering
ZORDER BY co-locates related information in storage to speed up queries with filters. Works with OPTIMIZE.
OPTIMIZE users ZORDER BY (country, signup_date)
16) Change Data Feed (CDF)
CDF exposes row-level changes between versions; enable it with the table property delta.enableChangeDataFeed = true. Useful for incremental ETL and ML model refresh.
spark.read.format("delta").option("readChangeFeed","true").option("startingVersion",10).table("users")
17) Generate Manifest
Delta can generate symlink manifests so engines without native Delta support (e.g., Presto or Athena via Hive) can read the table’s current Parquet files.
GENERATE symlink_format_manifest FOR TABLE users
18) Constraints
Define CHECK constraints to enforce data rules on Delta tables. Violating writes are rejected.
ALTER TABLE users ADD CONSTRAINT valid_age CHECK (age >= 0)
19) Table History
Use DESCRIBE HISTORY to audit changes: operation, user, timestamp, version, and notebook/job origin.
DESCRIBE HISTORY users
20) Q&A — “MERGE vs UPSERT?”
Answer: MERGE is the SQL operation in Delta that performs UPSERT logic: update when matched, insert when not. It unifies both in a single atomic operation.
21) Optimistic Concurrency
Delta uses optimistic concurrency control: writers validate against the transaction log at commit time; if a conflicting commit landed since the transaction began, the write fails and can be retried against the new snapshot. This keeps concurrent jobs correct without locks.
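A minimal retry sketch; the table path, condition, and exception choice are assumptions (see delta.exceptions in recent delta-spark releases for the full conflict hierarchy).
from delta.tables import DeltaTable
from delta.exceptions import ConcurrentAppendException

def deactivate_with_retry(path, attempts=3):
    # On conflict, re-read the latest snapshot and try again
    for attempt in range(attempts):
        try:
            DeltaTable.forPath(spark, path).update(
                condition="last_login < '2022-01-01'",
                set={"active": "false"})
            return
        except ConcurrentAppendException:
            if attempt == attempts - 1:
                raise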
22) Batch + Streaming Unification
One Delta table supports both batch queries and streaming queries simultaneously, unifying Lambda and Kappa architectures.
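For example, one table can back both access patterns at the same time (path illustrative):
# Same table, batch and streaming readers
batch_df = spark.read.format("delta").load("/mnt/delta/events")
stream_df = spark.readStream.format("delta").load("/mnt/delta/events")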
23) Incremental ETL
Use MERGE with Change Data Feed for incremental ETL pipelines, reducing compute by only processing changed data.
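A minimal incremental pass, assuming CDF is enabled on the source; table names, key, and version are illustrative (deletes omitted for brevity):
from delta.tables import DeltaTable

# Read only rows changed since version 10, keeping final row images
changes = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 10)
    .table("users")
    .filter("_change_type IN ('insert', 'update_postimage')")
    .drop("_change_type", "_commit_version", "_commit_timestamp"))

(DeltaTable.forName(spark, "users_gold").alias("t")
    .merge(changes.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())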
24) Multi-Cluster Concurrency
Delta allows concurrent reads and writes from multiple Spark clusters: readers always see a consistent snapshot, while writer conflicts are detected and rejected at commit time.
25) Streaming Upserts
Implement CDC and streaming upserts with MERGE INTO in Structured Streaming pipelines, ensuring real-time lakehouse consistency.
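The common pattern is a MERGE inside foreachBatch; paths and the join key are illustrative:
from delta.tables import DeltaTable

def upsert_batch(micro_df, batch_id):
    # Apply each micro-batch as one atomic MERGE into the target
    (DeltaTable.forPath(spark, "/mnt/delta/users").alias("t")
        .merge(micro_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.format("delta").load("/mnt/delta/user_updates")
    .writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/mnt/ckpt/upserts")
    .start())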
26) Idempotency
Delta’s MERGE and atomic transaction log make ETL jobs idempotent and retry-safe: a failed job leaves no partial commit, and a re-run produces no duplicate rows.
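Newer Delta releases also offer idempotent writer options; a sketch with illustrative names (verify availability for your version):
# Delta skips a (txnAppId, txnVersion) pair it already committed
(df.write.format("delta")
    .option("txnAppId", "nightly_etl")
    .option("txnVersion", 42)
    .mode("append")
    .save("/mnt/delta/users"))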
27) Data Skipping
Delta maintains statistics per file (min/max values) enabling data skipping. Reduces scan cost for selective queries.
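No query changes are needed; a selective predicate is enough (path and column illustrative):
# Files whose min/max range excludes the predicate are never read
(spark.read.format("delta").load("/mnt/delta/events")
    .filter("event_date = '2023-01-01'")
    .count())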
28) Partitioning
Partition Delta tables on low-cardinality columns (date, region) so queries prune whole directories; combine with ZORDER for fine-grained clustering within partitions and OPTIMIZE to keep file sizes healthy.
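A write-time sketch with an illustrative partition column:
# Partition on a low-cardinality column so queries prune whole directories
(df.write.format("delta")
    .partitionBy("event_date")
    .mode("overwrite")
    .save("/mnt/delta/events_by_date"))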
29) DML Isolation
Concurrent DML queries (UPDATE, DELETE, MERGE) are isolated by versioned snapshots. Readers never see partial writes.
30) Q&A — “How does Delta handle concurrent writes?”
Answer: Through optimistic concurrency: a commit fails if the transaction log gained conflicting updates after the transaction started; the client then retries against the new snapshot.
31) Delta + Databricks
Tightly integrated in Databricks with SQL APIs, UI for time travel, OPTIMIZE, ZORDER, and Delta Live Tables (DLT) for pipelines.
32) Delta + Presto/Trino
Delta connectors allow Presto/Trino to read Delta tables. Hive manifests improve compatibility for non-Spark engines.
33) Delta + ML
Use Delta tables as feature store input, enabling reproducible ML training with versioned data and time travel.
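For example, pin training data to an exact version so every retraining run sees identical input (version and path illustrative):
train_df = (spark.read.format("delta")
    .option("versionAsOf", 42)
    .load("/mnt/delta/features"))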
34) Delta Sharing
Open protocol for secure data sharing across clouds and orgs, without copying data. Supports fine-grained access.
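Recipients read shared tables with the open delta-sharing client; the profile file and table coordinates below are illustrative:
# pip install delta-sharing
import delta_sharing

# URL format: <profile-file>#<share>.<schema>.<table>
pdf = delta_sharing.load_as_pandas("config.share#sales.default.users")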
35) Delta + BI Tools
Expose Delta tables via Databricks SQL, Presto/Trino, or manifests to connect Power BI and Tableau. Provides consistent semantics for BI queries.
36) Delta + Structured Streaming
Delta tables are first-class sources and sinks for Spark Structured Streaming: replayable sources and checkpointed, exactly-once sinks, so one table serves both batch and streaming modes.
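A run-to-completion sketch (the availableNow trigger needs Spark 3.3+; paths are illustrative):
# Process everything currently in the source, then stop
(spark.readStream.format("delta").load("/mnt/delta/events")
    .writeStream.format("delta")
    .trigger(availableNow=True)
    .option("checkpointLocation", "/mnt/ckpt/events_copy")
    .start("/mnt/delta/events_copy"))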
37) Delta Live Tables
Databricks’ declarative pipeline framework built on Delta: you define tables and data-quality expectations; DLT manages orchestration, retries, and monitoring.
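A minimal table definition (DLT runs only on Databricks; names and the source path are illustrative):
import dlt

# Declarative table with a data-quality expectation
@dlt.table(comment="Cleaned users")
@dlt.expect_or_drop("valid_age", "age >= 0")
def users_clean():
    return spark.read.format("delta").load("/mnt/delta/users")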
38) Delta OSS vs Databricks
The open-source core (format, transaction-log protocol, Spark APIs) is Apache-licensed; Databricks adds proprietary layers such as Delta Live Tables, managed optimizations, and Unity Catalog governance. Some features (e.g., OPTIMIZE/ZORDER) shipped on Databricks first before landing in open-source Delta 2.0.
39) Delta in Azure Synapse
Synapse Spark pools ship with Delta support, so the same read/write code works against ADLS paths, and Synapse serverless SQL pools can query Delta tables directly.
40) Q&A — “How does Delta fit the wider ecosystem?”
Answer: Natively with Spark and Databricks; via connectors for Presto/Trino; via manifests for Hive-compatible engines; and via Delta Sharing for cross-organization access.
41) Governance
Combine table access controls/RBAC (e.g., Unity Catalog on Databricks) with lineage tooling and auditing; DESCRIBE HISTORY gives a built-in audit trail of who changed what, and when.
42) Data Quality
Enforce quality in the table itself: CHECK constraints and NOT NULL columns reject bad writes, and DLT expectations can drop or quarantine failing rows.
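A sketch adding a NOT NULL constraint via SQL (column name illustrative; verify the syntax for your Delta version):
# Reject future writes with a missing email
spark.sql("ALTER TABLE users ALTER COLUMN email SET NOT NULL")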
43) Testing Pipelines
Unit-test ETL transformations against small local Delta tables: delta-spark runs on a local SparkSession, and time travel keeps fixtures reproducible.
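A local test session wired for Delta, using the helper shipped with delta-spark (pin package versions in your project):
import pyspark
from delta import configure_spark_with_delta_pip

# SparkSession with Delta SQL extensions for fast local tests
builder = (pyspark.sql.SparkSession.builder.appName("delta-tests")
    .master("local[2]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()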
44) Deployment
Promote pipelines with CI/CD: keep notebooks and job definitions in git, provision workspaces and jobs with Terraform, and gate releases with integration tests against staging tables.
45) Observability
Track job metrics, streaming progress (query.lastProgress), table history, and file counts; alert on growing small-file counts, rising conflict retries, or failed expectations.
46) Cost Optimization
Control spend with scheduled OPTIMIZE (fewer, larger files), VACUUM to reclaim storage, partition pruning to cut scans, and right-sized clusters.
47) Multi-cloud
The same Delta format runs on S3 (AWS), ADLS (Azure), and GCS (GCP); only path schemes and committer configuration differ. Note that concurrent writers on S3 need extra LogStore configuration because S3 lacks the atomicity guarantees Delta relies on.
48) Prod Checklist
- Retention/VACUUM policy aligned with time-travel and compliance needs
- CHECK/NOT NULL constraints on critical columns
- Scheduled OPTIMIZE (and ZORDER where it pays off)
- Audit via DESCRIBE HISTORY and access logs
49) Common Pitfalls
Never running VACUUM (storage bloat), vacuuming too aggressively (broken time travel), unmanaged schema drift, floods of small files from streaming writes, and over-partitioning on high-cardinality columns.
50) Interview Q&A — 20 Practical Questions
1) Why Delta vs Parquet?
2) Explain _delta_log…
3) What is time travel?
4) How does merge work?
5) Role of CDF?
6) Optimistic concurrency?
7) ZORDER vs partitioning?
8) When to VACUUM?
9) Idempotency in ETL?
10) Delta vs Hive ACID?
11) Streaming writes?
12) Governance features?
13) Delta Sharing protocol?
14) Schema evolution vs enforcement?
15) Pitfalls of small files?
16) What is OPTIMIZE?
17) Role of checkpoints?
18) What is Delta Live Tables?
19) How to integrate with BI?
20) Future of Lakehouse?