Databricks Pocket Book — Uplatz
50 deep-dive flashcards • Single column • Fewer scrolls • 20+ Interview Q&A • Readable code examples
1) What is Databricks?
Databricks is a unified analytics platform (Lakehouse) built around Apache Spark that combines data engineering, data science, ML, and BI on one managed cloud service. It offers collaborative notebooks, scalable compute, Delta Lake ACID storage, SQL endpoints, MLflow, and governance via Unity Catalog.
# Install & configure Databricks CLI (v0.x style)
pip install databricks-cli
databricks configure --token
2) Why Databricks? Strengths & Tradeoffs
Strengths: scalable Spark, Delta Lake reliability, Lakehouse unification, strong ML lifecycle, collaborative UX, and multi-cloud. Tradeoffs: cost management, Spark expertise needed, governance setup complexity. Mitigate with governance (Unity Catalog), cluster policies, and workload-aware optimization.
# Minimal PySpark in a notebook
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("demo").getOrCreate()
3) Workspace & Notebooks
Workspaces house notebooks (Python, SQL, Scala, R), repos, dashboards, and jobs. Notebooks support built-in data visualization, widgets, dbutils, Git integration, and collaborative commenting/versioning.
# Read a sample dataset (in a Python notebook)
df = spark.read.csv("/databricks-datasets/flights/departuredelays.csv", header=True, inferSchema=True)
display(df.limit(5))
4) Clusters & Compute
Compute options: Interactive clusters (development), Jobs clusters (automation), SQL warehouses (BI), and Serverless (where available). Configure node types, autoscaling, spot/preemptible instances, Photon acceleration, and auto-termination.
# Create a cluster via API (classic)
databricks clusters create --json-file cluster.json
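A minimal sketch of what cluster.json might contain; node type, runtime version, and sizes are illustrative placeholders (field names follow the Clusters API).
# Write a hypothetical cluster.json for the command above (values are illustrative)
import json

cluster_spec = {
    "cluster_name": "dev-autoscale",
    "spark_version": "13.3.x-scala2.12",             # pin a supported DBR version
    "node_type_id": "i3.xlarge",                      # cloud-specific node type
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,                    # avoid idle cost
}
with open("cluster.json", "w") as f:
    json.dump(cluster_spec, f, indent=2)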
5) Lakehouse Architecture
Lakehouse sits on open data formats with Delta Lake on object storage, enabling ACID transactions, schema enforcement, and time travel. It unifies streaming + batch with SQL and ML on the same tables.
# Read a Delta table
df = spark.read.format("delta").load("/mnt/delta/events")
6) Databricks Runtime
Pre-built Databricks Runtimes (DBR) ship optimized Spark, Delta Lake, ML libraries, GPU support, and the Photon engine. Use ML runtimes for MLflow/scikit-learn; pick Photon-enabled runtimes for SQL speedups.
# Check versions (Python notebook)
spark.version, spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")
7) Data Access & Mounts
Use DBFS paths, or mount cloud storage (S3/ADLS/GCS) via dbutils.fs.mount. Prefer direct paths with secure credentials and IAM/credential passthrough when possible.
dbutils.fs.mount(
source="s3a://my-bucket/data",
mount_point="/mnt/data",
extra_configs={"fs.s3a.aws.credentials.provider":"com.amazonaws.auth.InstanceProfileCredentialsProvider"}
)
8) DBFS Essentials
DBFS is a virtual filesystem exposing object storage. Use it for libraries, checkpoints, temporary outputs. Prefer external locations for production tables (managed by Unity Catalog).
dbutils.fs.mkdirs("/mnt/checkpoints")
dbutils.fs.ls("/databricks-datasets")
9) Jobs & Workflows
Jobs chain notebooks, SQL, and Python tasks with retries, parameters, and dependencies. Use task libraries, cluster policies, and alerts. Workflows adds an orchestration UI with conditionals and task values.
# Create job (JSON spec)
databricks jobs create --json-file job.json
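A sketch of a single-task job.json; the notebook path, cluster id, and schedule are placeholders (structure follows the Jobs API 2.1).
# Write a hypothetical job.json for the command above (values are illustrative)
import json

job_spec = {
    "name": "nightly-etl",
    "tasks": [{
        "task_key": "ingest",
        "notebook_task": {"notebook_path": "/Repos/team/etl/ingest"},
        "existing_cluster_id": "1234-567890-abcde123",
        "max_retries": 2,
    }],
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}
with open("job.json", "w") as f:
    json.dump(job_spec, f, indent=2)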
10) Q&A — “When should I choose Databricks?”
Answer: Choose Databricks when you need reliable large-scale data engineering & ML on open formats, with unified governance (Unity Catalog) and strong collaboration, rather than separate lake + warehouse + ML stacks.
11) Delta Lake: ACID Tables
Delta Lake provides ACID transactions, schema enforcement/evolution, time travel, and efficient upserts (MERGE). It stores table versions via transaction logs (JSON) alongside parquet data.
# Create Delta table (SQL)
CREATE TABLE sales_delta (id BIGINT, amount DOUBLE) USING delta LOCATION '/mnt/delta/sales';
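A minimal PySpark equivalent, assuming the same path as the SQL example.
# Create and write a small Delta table from a DataFrame (sketch)
df = spark.createDataFrame([(1, 19.99), (2, 5.50)], "id BIGINT, amount DOUBLE")
df.write.format("delta").mode("overwrite").save("/mnt/delta/sales")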
12) Delta: Writes & Updates
Use MERGE INTO
for upserts; DELETE
/UPDATE
for mutations. Optimize files with OPTIMIZE
and cleanup with VACUUM
.
MERGE INTO tgt USING src ON tgt.id = src.id
WHEN MATCHED THEN UPDATE SET amount = src.amount
WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (src.id, src.amount);
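The same upsert via the Python DeltaTable API; a sketch where src_df is an assumed source DataFrame and the path matches the earlier example.
# Programmatic MERGE with the Delta Lake Python API
from delta.tables import DeltaTable

tgt = DeltaTable.forPath(spark, "/mnt/delta/sales")
(tgt.alias("tgt")
    .merge(src_df.alias("src"), "tgt.id = src.id")
    .whenMatchedUpdate(set={"amount": "src.amount"})
    .whenNotMatchedInsert(values={"id": "src.id", "amount": "src.amount"})
    .execute())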
13) Delta: Time Travel
Query older versions for audits or rollbacks using VERSION AS OF
or TIMESTAMP AS OF
. Use RESTORE
to revert a table to a prior version.
SELECT * FROM sales_delta VERSION AS OF 5;
RESTORE TABLE sales_delta TO VERSION AS OF 5;
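The same time travel from PySpark, as a sketch.
# Read older snapshots by version or by timestamp
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("/mnt/delta/sales")
df_old = spark.read.format("delta").option("timestampAsOf", "2025-08-01").load("/mnt/delta/sales")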
14) Autoloader (CloudFiles)
Autoloader incrementally ingests new files from object storage. It tracks discovered files with scalable file notification modes and supports schema inference & evolution.
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/ckpt/topic/schema")  # required when inferring schema
      .load("s3://raw/topic/"))
df.writeStream.format("delta").option("checkpointLocation", "/mnt/ckpt/topic").start("/mnt/delta/topic")
15) Structured Streaming
Process streams with Spark SQL/DataFrames. Sources: Kafka, files, Delta. Sinks: Delta, console, memory, Kafka. Exactly-once semantics with Delta sink and checkpointing.
query = (df.writeStream
.format("delta")
.option("checkpointLocation","/mnt/ckpt/stream")
.outputMode("append").start("/mnt/delta/stream_out"))
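A Kafka-source sketch; the broker address and topic name are placeholders.
# Read from Kafka and extract string key/value columns
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())
parsed = events.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")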
16) Delta Live Tables (DLT)
DLT defines reliable pipelines with declarative semantics, auto-testing, expectations, and lineage. Supports streaming + batch with automatic recovery.
-- DLT SQL example
CREATE STREAMING LIVE TABLE bronze AS
SELECT * FROM cloud_files("/raw/data","json");
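The same bronze table in DLT's Python API, as a sketch that runs inside a DLT pipeline.
# Python DLT definition of the bronze table
import dlt

@dlt.table(name="bronze")
def bronze():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/raw/data"))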
17) Unity Catalog (UC)
UC centralizes governance: catalogs → schemas → tables/views/functions/storage. Offers fine-grained permissions, data lineage, audit logs, row/column-level security, and external locations.
-- Grant on a UC table
GRANT SELECT ON TABLE main.sales.orders TO `analyst-group`;
18) Databricks SQL (DBSQL)
DBSQL provides SQL Warehouses for BI/analytics with Photon acceleration, dashboards, alerts, and query scheduling. Connect via JDBC/ODBC to BI tools (Power BI, Tableau, Looker).
-- Example SQL
SELECT customer_id, SUM(amount) AS total
FROM main.sales.orders
GROUP BY customer_id
ORDER BY total DESC;
19) MLflow Integration
MLflow tracks experiments (params, metrics, artifacts), packages models, and deploys to serving. Databricks tightly integrates MLflow with runs, model registry, and permissions.
import mlflow
with mlflow.start_run():
    mlflow.log_metric("rmse", 2.1)
    mlflow.sklearn.log_model(model, "model")  # model: a trained scikit-learn estimator
20) Q&A — “Delta vs Parquet?”
Answer: Delta adds ACID transactions, schema enforcement/evolution, time travel, efficient merges, and Z-Ordering on top of parquet files. It’s parquet-plus-transaction-log, not a new proprietary format.
21) Job Orchestration Patterns
Use Jobs/Workflows to chain tasks (notebooks, JARs, Python, SQL). Pass task values, add retries/backoff, and schedule with cron expressions. Separate dev and prod clusters via policies.
# Submit a run (CLI)
databricks runs submit --json-file run.json
22) Streaming Triggers & Checkpointing
Control micro-batch cadence with processing-time triggers. Checkpoints track offsets and state for exactly-once sinks. Use availableNow for catch-up batch processing on directories.
df.writeStream.trigger(processingTime="30 seconds").option("checkpointLocation","/mnt/ckpt").start(...)
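A catch-up sketch using availableNow; the output path is illustrative.
# Process everything currently available, then stop
(df.writeStream
   .trigger(availableNow=True)
   .option("checkpointLocation", "/mnt/ckpt")
   .start("/mnt/delta/stream_out"))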
23) Concurrency & Isolation
Use High Concurrency clusters for multi-tenant SQL/BI. Ensure query isolation, pool configs, and SQL warehouse scaling. For notebooks, prefer per-user or small shared dev clusters.
-- SQL warehouse scaling set in UI/API; use small min/max for bursty BI
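A sketch of creating a small auto-scaling SQL warehouse over the REST API; host, token, and sizing are placeholders, and field names should be checked against the current SQL Warehouses API.
# Create a SQL warehouse programmatically (illustrative values)
import requests

resp = requests.post(
    "https://<workspace-host>/api/2.0/sql/warehouses",
    headers={"Authorization": "Bearer <token>"},
    json={
        "name": "bi-warehouse",
        "cluster_size": "Small",
        "min_num_clusters": 1,
        "max_num_clusters": 3,
        "auto_stop_mins": 10,
    },
)
resp.raise_for_status()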
24) Caching & Photon
Cache frequently read tables; Photon accelerates vectorized execution for SQL. Use Delta OPTIMIZE
+ Z-ORDER to improve pruning and read performance.
CACHE SELECT * FROM main.sales.orders;
OPTIMIZE main.sales.orders ZORDER BY (customer_id);
25) Cost Control
Right-size clusters, enable autoscaling & auto-terminate, use spot/preemptible nodes where safe, optimize file sizes, and cache hot data. Monitor with tags, budgets, and system tables.
-- Use cluster policies to cap nodes, runtime versions, and autotermination
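A sketch of a cluster policy definition that caps size and enforces auto-termination; attribute names follow the cluster policies schema, and the node types are cloud-specific placeholders.
# Policy definition to paste into the policy UI or API
import json

policy = {
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 60},
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
}
print(json.dumps(policy, indent=2))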
26) Performance Tuning
Use EXPLAIN
, AQE, broadcast hints, bucketing on stable keys, and correct shuffle partitions. Keep files 128–512 MB, avoid tiny files, and reduce skew via salting.
SET spark.sql.shuffle.partitions=auto;
EXPLAIN SELECT ...;
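A broadcast-hint sketch, assuming orders_df is large and dim_customers_df is small.
# Broadcast the small dimension table to avoid shuffling it
from pyspark.sql import functions as F

joined = orders_df.join(F.broadcast(dim_customers_df), "customer_id")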
27) Reliability & Idempotency
Design idempotent pipelines (upserts, overwrite partitions), use checkpoints and exactly-once sinks. Store run parameters & versions with MLflow or table audit columns.
-- Partition overwrite
INSERT OVERWRITE TABLE t PARTITION (dt='2025-08-01') SELECT ...;
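The same idempotent partition overwrite from PySpark using Delta's replaceWhere; daily_df and the path are assumptions.
# Rewrite exactly one partition; reruns produce the same result
(daily_df.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", "dt = '2025-08-01'")
    .save("/mnt/delta/t"))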
28) Z-Order & File Compaction
Z-Order co-locates related data to reduce reads for common filters. Compact small files with OPTIMIZE
; clean old data with VACUUM
respecting retention policies.
OPTIMIZE main.sales.events ZORDER BY (user_id, event_date);
VACUUM main.sales.events RETAIN 168 HOURS;
29) Multi-Hop Medallion
Bronze (raw) → Silver (cleaned) → Gold (aggregated/serving). Enforce contracts between layers, track lineage, and promote via DLT/Workflows. Keep transformations idempotent.
-- Example: silver from bronze
CREATE OR REPLACE TABLE main.sales.silver AS
SELECT * FROM main.sales.bronze WHERE _is_valid;
30) Q&A — “Batch vs Streaming in Delta?”
Answer: With Delta, both use the same table. Streaming writes append with checkpoints; batch jobs can read consistent snapshots. Unifying tables simplifies serving and governance.
31) Databricks SQL + BI
Expose tables/views to BI via SQL Warehouses. Use service principals and catalog grants. Materialize views for dashboards; schedule refresh queries and alerts.
CREATE VIEW main.sales.top_customers AS
SELECT customer_id, SUM(amount) AS total
FROM main.sales.orders GROUP BY customer_id;
32) Power BI, Tableau, Looker
Connect through ODBC/JDBC to SQL Warehouses. For Power BI, use the Databricks connector; for Tableau/Looker, set OAuth/SCIM where available and align roles with UC.
-- Keep semantic layers in BI, governance in UC; cache extracts strategically
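Beyond BI tools, the same warehouse can be queried programmatically; a sketch with the databricks-sql-connector package, where the hostname, HTTP path, and token are placeholders.
# pip install databricks-sql-connector
from databricks import sql

conn = sql.connect(server_hostname="<workspace-host>",
                   http_path="/sql/1.0/warehouses/<warehouse-id>",
                   access_token="<token>")
cur = conn.cursor()
cur.execute("SELECT customer_id, SUM(amount) AS total FROM main.sales.orders GROUP BY customer_id LIMIT 10")
print(cur.fetchall())
cur.close(); conn.close()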
33) Feature Store
Centralize and reuse ML features with lineage and point-in-time correctness. Serve features online with low-latency stores; ensure training/serving parity.
from databricks.feature_store import FeatureStoreClient
fs = FeatureStoreClient()
fs.create_table(name="main.fs.customer_features", primary_keys=["customer_id"], schema=...)
34) AutoML
AutoML generates baseline models, code, and MLflow tracking for tabular/time-series problems. Use as a starting point; productionize best candidates with guardrails.
# In UI: AutoML > New experiment > Select dataset/target
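AutoML can also be driven from Python; in this sketch, train_df and the target column are assumptions and the API surface may vary by runtime.
# Run an AutoML classification experiment programmatically
from databricks import automl

summary = automl.classify(dataset=train_df, target_col="churn", timeout_minutes=30)
print(summary.best_trial.mlflow_run_id)  # inspect/register the best candidate via MLflow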
35) Model Registry & Serving
Promote MLflow models through Staging → Production with approvals. Serve models via Databricks Model Serving or external endpoints. Track versions & rollbacks.
import mlflow
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage("churn-model","3","Production")
36) Vector Search / Retrieval
Embed data and store vectors in a managed index or Delta table. Use for semantic search/RAG, with pipelines to refresh embeddings and maintain consistency.
-- Pseudocode: store embeddings in Delta; serve via SQL/UDTF or API
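A minimal sketch of the Delta-backed approach, where embed() and docs are hypothetical (your embedding function and an iterable of (id, text) pairs).
# Persist precomputed embeddings alongside ids in a Delta table
rows = [(doc_id, embed(text)) for doc_id, text in docs]
emb_df = spark.createDataFrame(rows, "doc_id STRING, embedding ARRAY<DOUBLE>")
emb_df.write.format("delta").mode("overwrite").saveAsTable("main.search.doc_embeddings")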
37) UDFs & UDAFs (Py/Scala/SQL)
Define UDFs for custom logic; prefer SQL functions or built-in expressions for performance. Use Pandas UDFs (vectorized) for better throughput on Python code.
import pandas as pd
import pyspark.sql.functions as F

@F.pandas_udf("double")
def zscore(col: pd.Series) -> pd.Series:
    return (col - col.mean()) / col.std()
38) Libraries & Repos
Use Repos to sync with Git, manage notebooks + packages. Install wheel/egg libraries on clusters or use init scripts. Keep dependency graphs slim and pinned.
# Install a wheel at cluster start (UI) or %pip in notebook:
%pip install mypkg==1.2.3
39) Data Quality & Expectations
Use DLT expectations, Deequ, Great Expectations, or SQL constraints. Fail or quarantine bad records; log metrics to tables and dashboards.
-- DLT expectation: drop rows that violate the constraint
CREATE OR REFRESH STREAMING LIVE TABLE silver (
  CONSTRAINT non_negative_amount EXPECT (amount >= 0) ON VIOLATION DROP ROW
)
TBLPROPERTIES ("quality" = "silver")
AS SELECT * FROM STREAM(LIVE.bronze);
40) Q&A — “How to avoid tiny files?”
Answer: Use Auto Loader with cloudFiles.maxFilesPerTrigger
, set proper batch sizes, write using foreachBatch
with coalesce/repartition, and run periodic OPTIMIZE
compaction.
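For instance, a foreachBatch sketch that bounds output file counts per micro-batch; paths match the Auto Loader example and the coalesce factor is illustrative.
# Control file counts per micro-batch before writing to Delta
def write_batch(batch_df, batch_id):
    (batch_df.coalesce(8)
        .write.format("delta").mode("append").save("/mnt/delta/topic"))

(df.writeStream
   .foreachBatch(write_batch)
   .option("checkpointLocation", "/mnt/ckpt/topic")
   .start())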
41) Unity Catalog Security
Define catalogs/schemas/tables with grants to groups, service principals, and users. Use data masking, row/column-level filters, and external locations with storage credentials.
GRANT USAGE ON CATALOG main TO `analytics-team`;
GRANT SELECT ON TABLE main.sales.orders TO `bi-readers`;
42) Secrets & Credentials
Manage secrets with Databricks Secrets (backed by cloud KMS). Access via dbutils.secrets.get
. Use SCIM for identity, and service principals for automation.
token = dbutils.secrets.get(scope="prod", key="api-key")
43) Table Lineage & Audit
Unity Catalog captures lineage across notebooks, jobs, and SQL. Combine with audit logs/system tables for governance, incident response, and cost analysis.
-- View lineage in UC UI; query system tables for usage & query_history
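A usage-summary sketch against system tables; it requires system tables to be enabled, and the table/column names follow system.billing.usage and may differ by release.
# Summarize last week's DBU usage per SKU
usage = spark.sql("""
  SELECT usage_date, sku_name, SUM(usage_quantity) AS dbus
  FROM system.billing.usage
  WHERE usage_date >= current_date() - INTERVAL 7 DAYS
  GROUP BY usage_date, sku_name
  ORDER BY usage_date
""")
display(usage)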
44) Testing & CI/CD
Unit test PySpark with pytest, use Databricks Repos for GitOps, and promote by mapping branches to per-environment jobs. Validate SQL with lightweight checks and data contracts per layer.
# Run tests in job with small dev cluster; block PR on pass
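A minimal pytest sketch for a PySpark transformation on a local SparkSession; add_total is a hypothetical helper under test.
# test_transforms.py — runs locally or on a small dev cluster
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

def add_total(df):
    return df.withColumn("total", df.quantity * df.unit_price)

def test_add_total(spark):
    df = spark.createDataFrame([(2, 5.0)], "quantity INT, unit_price DOUBLE")
    assert add_total(df).first().total == 10.0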
45) Deployment Strategies
Parameterize jobs per env, pin runtimes, and use cluster policies. Package Python libs in wheels; manage notebooks via Repos. Export/import via Terraform/Workspace APIs where applicable.
# Example: %pip install . (from repo root) to install your package
46) Observability & Monitoring
Track cluster metrics, Spark UI, query profiles, MLflow runs, and system tables. Emit custom logs/metrics to cloud monitoring. Set SLOs for latency, freshness, and cost.
-- Query system.information_schema for usage; schedule cost dashboards
47) Reliability Runbooks
Create playbooks for streaming backfills, checkpoint resets, schema evolution, and hotspot mitigation. Automate with Workflows and safe toggles in parameters.
-- Keep "safe backfill" notebooks with parameterized date ranges
48) Production Checklist
- Unity Catalog governance & least privilege
- Autoscaling & auto-termination enabled
- Delta OPTIMIZE/Z-ORDER; VACUUM retention set
- Streaming checkpoints & idempotent writes
- Cost tags, budgets, and monitoring dashboards
- Runbooks & on-call escalation paths
49) Common Pitfalls
Tiny files explosion, missing Z-ORDER, under/over-partitioning, ungoverned mounts, no autotermination, lack of schema contracts, and ad-hoc notebooks bypassing CI/CD.
50) Interview Q&A — 20 Practical Questions (Expanded)
1) Why Lakehouse? Unifies lake flexibility with warehouse performance via Delta, cutting duplicate stacks.
2) Delta advantages? ACID, schema handling, time travel, MERGE, OPTIMIZE/Z-ORDER.
3) Unity Catalog value? Central governance, lineage, fine-grained permissions, external locations.
4) Avoid tiny files? Batch writes, coalesce/repartition, Auto Loader tuning, periodic OPTIMIZE.
5) Batch vs streaming? Same Delta tables; streaming uses checkpoints & exactly-once sinks.
6) Photon when? SQL-heavy analytics and BI workloads; significant speedups.
7) Optimize joins? Broadcast small tables, partition/bucket large ones, Z-ORDER on filters.
8) Cost controls? Autoscaling, autoterminate, spot nodes, policies, monitoring with tags.
9) DLT benefits? Declarative pipelines, expectations, lineage, auto-recovery.
10) Schema evolution? Use Delta schema evolution flags; validate in silver before gold.
11) Data quality? DLT expectations, Great Expectations, constraints & quarantine.
12) ML lifecycle? MLflow for tracking, registry, serving, and A/B rollouts.
13) BI connectivity? Use SQL Warehouses + JDBC/ODBC connectors; manage grants in UC.
14) Medallion design? Raw bronze, cleaned silver, aggregated gold; contracts between layers.
15) Streaming recovery? Durable checkpoints, replayable sources, availableNow for catch-up.
16) Multi-env promotion? Repos + branches, jobs per env, pinned runtimes, Terraform/API.
17) Row/column security? UC row filters & column masks; test with least-privilege roles.
18) File layout? Partition by selective columns, target 128–512MB file sizes, avoid skew.
19) Debugging performance? Use Spark UI, query profile, AQE, and EXPLAIN.
20) When not Databricks? Tiny datasets, single-node ELT, or when a simpler DB/warehouse suffices.