Databricks Pocket Book — Uplatz
50 deep-dive flashcards • Single column • Fewer scrolls • 20+ Interview Q&A • Readable code examples
1) What is Databricks?
Databricks is a unified analytics platform (Lakehouse) built around Apache Spark that combines data engineering, data science, ML, and BI on one managed cloud service. It offers collaborative notebooks, scalable compute, Delta Lake ACID storage, SQL endpoints, MLflow, and governance via Unity Catalog.
# Install & configure Databricks CLI (v0.x style)
pip install databricks-cli
databricks configure --token
2) Why Databricks? Strengths & Tradeoffs
Strengths: scalable Spark, Delta Lake reliability, Lakehouse unification, strong ML lifecycle, collaborative UX, and multi-cloud. Tradeoffs: cost management, Spark expertise needed, governance setup complexity. Mitigate with governance (Unity Catalog), cluster policies, and workload-aware optimization.
# Minimal PySpark in a notebook
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("demo").getOrCreate()
3) Workspace & Notebooks
Workspaces house notebooks (Python, SQL, Scala, R), repos, dashboards, and jobs. Notebooks support built-in data visualization, widgets, dbutils, Git integration, and collaborative commenting/versioning.
# Read a sample dataset (in a Python notebook)
df = spark.read.csv("/databricks-datasets/flights/departuredelays.csv", header=True, inferSchema=True)
display(df.limit(5))
4) Clusters & Compute
Compute options: Interactive clusters (development), Jobs clusters (automation), SQL warehouses (BI), and Serverless (where available). Configure node types, autoscaling, spot/preemptible instances, Photon acceleration, and auto-termination.
# Create a cluster via API (classic)
databricks clusters create --json-file cluster.json
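A minimal sketch of what cluster.json might contain; node type, runtime version, and sizes are illustrative placeholders (field names follow the Clusters API).
# Write a hypothetical cluster.json for the command above (values are illustrative)
import json

cluster_spec = {
    "cluster_name": "dev-autoscale",
    "spark_version": "13.3.x-scala2.12",             # pin a supported DBR version
    "node_type_id": "i3.xlarge",                      # cloud-specific node type
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,                    # avoid idle cost
}
with open("cluster.json", "w") as f:
    json.dump(cluster_spec, f, indent=2)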
5) Lakehouse Architecture
Lakehouse sits on open data formats with Delta Lake on object storage, enabling ACID transactions, schema enforcement, and time travel. It unifies streaming + batch with SQL and ML on the same tables.
# Read a Delta table
df = spark.read.format("delta").load("/mnt/delta/events")
6) Databricks Runtime
Pre-built Databricks Runtimes (DBR) ship optimized Spark, Delta Lake, ML libraries, GPU support, and the Photon engine. Use ML runtimes for MLflow/scikit-learn; pick Photon-enabled runtimes for SQL speedups.
# Check versions (Python notebook)
spark.version, spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")
7) Data Access & Mounts
Use DBFS paths, or mount cloud storage (S3/ADLS/GCS) via dbutils.fs.mount. Prefer direct paths with secure credentials and IAM/credential passthrough when possible.
dbutils.fs.mount(
source="s3a://my-bucket/data",
mount_point="/mnt/data",
extra_configs={"fs.s3a.aws.credentials.provider":"com.amazonaws.auth.InstanceProfileCredentialsProvider"}
)
8) DBFS Essentials
DBFS is a virtual filesystem exposing object storage. Use it for libraries, checkpoints, temporary outputs. Prefer external locations for production tables (managed by Unity Catalog).
dbutils.fs.mkdirs("/mnt/checkpoints")
dbutils.fs.ls("/databricks-datasets")
9) Jobs & Workflows
Jobs chain notebooks, SQL, and Python tasks with retries, parameters, and dependencies. Use task libraries, cluster policies, and alerts. Workflows adds an orchestration UI with conditionals and task values.
# Create job (JSON spec)
databricks jobs create --json-file job.json
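A sketch of a single-task job.json; the notebook path, cluster id, and schedule are placeholders (structure follows the Jobs API 2.1).
# Write a hypothetical job.json for the command above (values are illustrative)
import json

job_spec = {
    "name": "nightly-etl",
    "tasks": [{
        "task_key": "ingest",
        "notebook_task": {"notebook_path": "/Repos/team/etl/ingest"},
        "existing_cluster_id": "1234-567890-abcde123",
        "max_retries": 2,
    }],
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}
with open("job.json", "w") as f:
    json.dump(job_spec, f, indent=2)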
10) Q&A — “When should I choose Databricks?”
Answer: Choose Databricks when you need reliable large-scale data engineering & ML on open formats, with unified governance (Unity Catalog) and strong collaboration, rather than separate lake + warehouse + ML stacks.
11) Delta Lake: ACID Tables
Delta Lake provides ACID transactions, schema enforcement/evolution, time travel, and efficient upserts (MERGE). It stores table versions via transaction logs (JSON) alongside parquet data.
# Create Delta table (SQL)
CREATE TABLE sales_delta (id BIGINT, amount DOUBLE) USING delta LOCATION '/mnt/delta/sales';
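A minimal PySpark equivalent, assuming the same path as the SQL example.
# Create and write a small Delta table from a DataFrame (sketch)
df = spark.createDataFrame([(1, 19.99), (2, 5.50)], "id BIGINT, amount DOUBLE")
df.write.format("delta").mode("overwrite").save("/mnt/delta/sales")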
12) Delta: Writes & Updates
Use MERGE INTO
for upserts; DELETE
/UPDATE
for mutations. Optimize files with OPTIMIZE
and cleanup with VACUUM
.
MERGE INTO tgt USING src ON tgt.id = src.id
WHEN MATCHED THEN UPDATE SET amount = src.amount
WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (src.id, src.amount);
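The same upsert via the Python DeltaTable API; a sketch where src_df is an assumed source DataFrame and the path matches the earlier example.
# Programmatic MERGE with the Delta Lake Python API
from delta.tables import DeltaTable

tgt = DeltaTable.forPath(spark, "/mnt/delta/sales")
(tgt.alias("tgt")
    .merge(src_df.alias("src"), "tgt.id = src.id")
    .whenMatchedUpdate(set={"amount": "src.amount"})
    .whenNotMatchedInsert(values={"id": "src.id", "amount": "src.amount"})
    .execute())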
13) Delta: Time Travel
Query older versions for audits or rollbacks using VERSION AS OF
or TIMESTAMP AS OF
. Use RESTORE
to revert a table to a prior version.
SELECT * FROM sales_delta VERSION AS OF 5;
RESTORE TABLE sales_delta TO VERSION AS OF 5;
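The same time travel from PySpark, as a sketch.
# Read older snapshots by version or by timestamp
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("/mnt/delta/sales")
df_old = spark.read.format("delta").option("timestampAsOf", "2025-08-01").load("/mnt/delta/sales")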
14) Autoloader (CloudFiles)
Autoloader incrementally ingests new files from object storage. It tracks discovered files with scalable file notification modes and supports schema inference & evolution.
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/ckpt/topic/schema")  # required when inferring schema
      .load("s3://raw/topic/"))
df.writeStream.format("delta").option("checkpointLocation", "/mnt/ckpt/topic").start("/mnt/delta/topic")
15) Structured Streaming
Process streams with Spark SQL/DataFrames. Sources: Kafka, files, Delta. Sinks: Delta, console, memory, Kafka. Exactly-once semantics with Delta sink and checkpointing.
query = (df.writeStream
.format("delta")
.option("checkpointLocation","/mnt/ckpt/stream")
.outputMode("append").start("/mnt/delta/stream_out"))
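A Kafka-source sketch; the broker address and topic name are placeholders.
# Read from Kafka and extract string key/value columns
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())
parsed = events.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")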
16) Delta Live Tables (DLT)
DLT defines reliable pipelines with declarative semantics, auto-testing, expectations, and lineage. Supports streaming + batch with automatic recovery.
-- DLT SQL example
CREATE STREAMING LIVE TABLE bronze AS
SELECT * FROM cloud_files("/raw/data","json");
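The same bronze table in DLT's Python API, as a sketch that runs inside a DLT pipeline.
# Python DLT definition of the bronze table
import dlt

@dlt.table(name="bronze")
def bronze():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/raw/data"))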
17) Unity Catalog (UC)
UC centralizes governance: catalogs → schemas → tables/views/functions/storage. Offers fine-grained permissions, data lineage, audit logs, row/column-level security, and external locations.
-- Grant on a UC table
GRANT SELECT ON TABLE main.sales.orders TO `analyst-group`;
18) Databricks SQL (DBSQL)
DBSQL provides SQL Warehouses for BI/analytics with Photon acceleration, dashboards, alerts, and query scheduling. Connect via JDBC/ODBC to BI tools (Power BI, Tableau, Looker).
-- Example SQL
SELECT customer_id, SUM(amount) AS total
FROM main.sales.orders
GROUP BY customer_id
ORDER BY total DESC;
19) MLflow Integration
MLflow tracks experiments (params, metrics, artifacts), packages models, and deploys to serving. Databricks tightly integrates MLflow with runs, model registry, and permissions.
import mlflow
with mlflow.start_run():
    mlflow.log_metric("rmse", 2.1)
    mlflow.sklearn.log_model(model, "model")  # model: a trained scikit-learn estimator
20) Q&A — “Delta vs Parquet?”
Answer: Delta adds ACID transactions, schema enforcement/evolution, time travel, efficient merges, and Z-Ordering on top of parquet files. It’s parquet-plus-transaction-log, not a new proprietary format.
21) Job Orchestration Patterns
Use Jobs/Workflows to chain tasks (notebooks, JARs, Python, SQL). Pass task values, add retries/backoff, and schedule with cron expressions. Separate dev and prod clusters via policies.
# Submit a run (CLI)
databricks runs submit --json-file run.json
22) Streaming Triggers & Checkpointing
Control micro-batch cadence with processing-time triggers. Checkpoints track offsets and state for exactly-once sinks. Use availableNow for catch-up batch processing on directories.
df.writeStream.trigger(processingTime="30 seconds").option("checkpointLocation","/mnt/ckpt").start(...)
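A catch-up sketch using availableNow; the output path is illustrative.
# Process everything currently available, then stop
(df.writeStream
   .trigger(availableNow=True)
   .option("checkpointLocation", "/mnt/ckpt")
   .start("/mnt/delta/stream_out"))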
23) Concurrency & Isolation
Use High Concurrency clusters for multi-tenant SQL/BI. Ensure query isolation, pool configs, and SQL warehouse scaling. For notebooks, prefer per-user or small shared dev clusters.
-- SQL warehouse scaling set in UI/API; use small min/max for bursty BI
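A sketch of creating a small auto-scaling SQL warehouse over the REST API; host, token, and sizing are placeholders, and field names should be checked against the current SQL Warehouses API.
# Create a SQL warehouse programmatically (illustrative values)
import requests

resp = requests.post(
    "https://<workspace-host>/api/2.0/sql/warehouses",
    headers={"Authorization": "Bearer <token>"},
    json={
        "name": "bi-warehouse",
        "cluster_size": "Small",
        "min_num_clusters": 1,
        "max_num_clusters": 3,
        "auto_stop_mins": 10,
    },
)
resp.raise_for_status()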
24) Caching & Photon
Cache frequently read tables; Photon accelerates vectorized execution for SQL. Use Delta OPTIMIZE
+ Z-ORDER to improve pruning and read performance.
CACHE SELECT * FROM main.sales.orders;
OPTIMIZE main.sales.orders ZORDER BY (customer_id);
25) Cost Control
Right-size clusters, enable autoscaling & auto-terminate, use spot/preemptible nodes where safe, optimize file sizes, and cache hot data. Monitor with tags, budgets, and system tables.
-- Use cluster policies to cap nodes, runtime versions, and autotermination
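A sketch of a cluster policy definition that caps size and enforces auto-termination; attribute names follow the cluster policies schema, and the node types are cloud-specific placeholders.
# Policy definition to paste into the policy UI or API
import json

policy = {
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 60},
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
}
print(json.dumps(policy, indent=2))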
26) Performance Tuning
Use EXPLAIN
, AQE, broadcast hints, bucketing on stable keys, and correct shuffle partitions. Keep files 128–512 MB, avoid tiny files, and reduce skew via salting.
SET spark.sql.shuffle.partitions=auto;
EXPLAIN SELECT ...;
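A broadcast-hint sketch, assuming orders_df is large and dim_customers_df is small.
# Broadcast the small dimension table to avoid shuffling it
from pyspark.sql import functions as F

joined = orders_df.join(F.broadcast(dim_customers_df), "customer_id")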
27) Reliability & Idempotency
Design idempotent pipelines (upserts, overwrite partitions), use checkpoints and exactly-once sinks. Store run parameters & versions with MLflow or table audit columns.
-- Partition overwrite
INSERT OVERWRITE TABLE t PARTITION (dt='2025-08-01') SELECT ...;
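The same idempotent partition overwrite from PySpark using Delta's replaceWhere; daily_df and the path are assumptions.
# Rewrite exactly one partition; reruns produce the same result
(daily_df.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", "dt = '2025-08-01'")
    .save("/mnt/delta/t"))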
28) Z-Order & File Compaction
Z-Order co-locates related data to reduce reads for common filters. Compact small files with OPTIMIZE
; clean old data with VACUUM
respecting retention policies.
OPTIMIZE main.sales.events ZORDER BY (user_id, event_date);
VACUUM main.sales.events RETAIN 168 HOURS;
29) Multi-Hop Medallion
Bronze (raw) → Silver (cleaned) → Gold (aggregated/serving). Enforce contracts between layers, track lineage, and promote via DLT/Workflows. Keep transformations idempotent.
-- Example: silver from bronze
CREATE OR REPLACE TABLE main.sales.silver AS
SELECT * FROM main.sales.bronze WHERE _is_valid;
30) Q&A — “Batch vs Streaming in Delta?”
Answer: With Delta, both use the same table. Streaming writes append with checkpoints; batch jobs can read consistent snapshots. Unifying tables simplifies serving and governance.
31) Databricks SQL + BI
Expose tables/views to BI via SQL Warehouses. Use service principals and catalog grants. Materialize views for dashboards; schedule refresh queries and alerts.
CREATE VIEW main.sales.top_customers AS
SELECT customer_id, SUM(amount) AS total
FROM main.sales.orders GROUP BY customer_id;
32) Power BI, Tableau, Looker
Connect through ODBC/JDBC to SQL Warehouses. For Power BI, use the Databricks connector; for Tableau/Looker, set OAuth/SCIM where available and align roles with UC.
-- Keep semantic layers in BI, governance in UC; cache extracts strategically
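Beyond BI tools, the same warehouse can be queried programmatically; a sketch with the databricks-sql-connector package, where the hostname, HTTP path, and token are placeholders.
# pip install databricks-sql-connector
from databricks import sql

conn = sql.connect(server_hostname="<workspace-host>",
                   http_path="/sql/1.0/warehouses/<warehouse-id>",
                   access_token="<token>")
cur = conn.cursor()
cur.execute("SELECT customer_id, SUM(amount) AS total FROM main.sales.orders GROUP BY customer_id LIMIT 10")
print(cur.fetchall())
cur.close(); conn.close()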
33) Feature Store
Centralize and reuse ML features with lineage and point-in-time correctness. Serve features online with low-latency stores; ensure training/serving parity.
from databricks.feature_store import FeatureStoreClient
fs = FeatureStoreClient()
fs.create_table(name="main.fs.customer_features", primary_keys=["customer_id"], schema=...)
34) AutoML
AutoML generates baseline models, code, and MLflow tracking for tabular/time-series problems. Use as a starting point; productionize best candidates with guardrails.
# In UI: AutoML > New experiment > Select dataset/target
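AutoML can also be driven from Python; in this sketch, train_df and the target column are assumptions and the API surface may vary by runtime.
# Run an AutoML classification experiment programmatically
from databricks import automl

summary = automl.classify(dataset=train_df, target_col="churn", timeout_minutes=30)
print(summary.best_trial.mlflow_run_id)  # inspect/register the best candidate via MLflow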
35) Model Registry & Serving
Promote MLflow models through Staging → Production with approvals. Serve models via Databricks Model Serving or external endpoints. Track versions & rollbacks.
import mlflow
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage("churn-model","3","Production")
36) Vector Search / Retrieval
Embed data and store vectors in a managed index or Delta table. Use for semantic search/RAG, with pipelines to refresh embeddings and maintain consistency.
-- Pseudocode: store embeddings in Delta; serve via SQL/UDTF or API
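A minimal sketch of the Delta-backed approach, where embed() and docs are hypothetical (your embedding function and an iterable of (id, text) pairs).
# Persist precomputed embeddings alongside ids in a Delta table
rows = [(doc_id, embed(text)) for doc_id, text in docs]
emb_df = spark.createDataFrame(rows, "doc_id STRING, embedding ARRAY<DOUBLE>")
emb_df.write.format("delta").mode("overwrite").saveAsTable("main.search.doc_embeddings")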
37) UDFs & UDAFs (Py/Scala/SQL)
Define UDFs for custom logic; prefer SQL functions or built-in expressions for performance. Use Pandas UDFs (vectorized) for better throughput on Python code.
import pandas as pd
import pyspark.sql.functions as F

@F.pandas_udf("double")
def zscore(col: pd.Series) -> pd.Series:
    return (col - col.mean()) / col.std()
38) Libraries & Repos
Use Repos to sync with Git, manage notebooks + packages. Install wheel/egg libraries on clusters or use init scripts. Keep dependency graphs slim and pinned.
# Install a wheel at cluster start (UI) or %pip in notebook:
%pip install mypkg==1.2.3
39) Data Quality & Expectations
Use DLT expectations, Deequ, Great Expectations, or SQL constraints. Fail or quarantine bad records; log metrics to tables and dashboards.
-- DLT expectation: drop rows that violate the constraint
CREATE OR REFRESH STREAMING LIVE TABLE silver (
  CONSTRAINT non_negative_amount EXPECT (amount >= 0) ON VIOLATION DROP ROW
)
TBLPROPERTIES ("quality" = "silver")
AS SELECT * FROM STREAM(LIVE.bronze);
40) Q&A — “How to avoid tiny files?”
Answer: Use Auto Loader with cloudFiles.maxFilesPerTrigger
, set proper batch sizes, write using foreachBatch
with coalesce/repartition, and run periodic OPTIMIZE
compaction.
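For instance, a foreachBatch sketch that bounds output file counts per micro-batch; paths match the Auto Loader example and the coalesce factor is illustrative.
# Control file counts per micro-batch before writing to Delta
def write_batch(batch_df, batch_id):
    (batch_df.coalesce(8)
        .write.format("delta").mode("append").save("/mnt/delta/topic"))

(df.writeStream
   .foreachBatch(write_batch)
   .option("checkpointLocation", "/mnt/ckpt/topic")
   .start())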
41) Unity Catalog Security
Define catalogs/schemas/tables with grants to groups, service principals, and users. Use data masking, row/column-level filters, and external locations with storage credentials.
GRANT USAGE ON CATALOG main TO `analytics-team`;
GRANT SELECT ON TABLE main.sales.orders TO `bi-readers`;
42) Secrets & Credentials
Manage secrets with Databricks Secrets (backed by cloud KMS). Access via dbutils.secrets.get
. Use SCIM for identity, and service principals for automation.
token = dbutils.secrets.get(scope="prod", key="api-key")
43) Table Lineage & Audit
Unity Catalog captures lineage across notebooks, jobs, and SQL. Combine with audit logs/system tables for governance, incident response, and cost analysis.
-- View lineage in UC UI; query system tables for usage & query_history
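A usage-summary sketch against system tables; it requires system tables to be enabled, and the table/column names follow system.billing.usage and may differ by release.
# Summarize last week's DBU usage per SKU
usage = spark.sql("""
  SELECT usage_date, sku_name, SUM(usage_quantity) AS dbus
  FROM system.billing.usage
  WHERE usage_date >= current_date() - INTERVAL 7 DAYS
  GROUP BY usage_date, sku_name
  ORDER BY usage_date
""")
display(usage)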
44) Testing & CI/CD
Unit test PySpark with pytest, use Databricks Repos for GitOps, and promote by mapping branches to per-environment jobs. Validate SQL with lightweight checks and data contracts per layer.
# Run tests in job with small dev cluster; block PR on pass
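A minimal pytest sketch for a PySpark transformation on a local SparkSession; add_total is a hypothetical helper under test.
# test_transforms.py — runs locally or on a small dev cluster
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

def add_total(df):
    return df.withColumn("total", df.quantity * df.unit_price)

def test_add_total(spark):
    df = spark.createDataFrame([(2, 5.0)], "quantity INT, unit_price DOUBLE")
    assert add_total(df).first().total == 10.0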
45) Deployment Strategies
Parameterize jobs per env, pin runtimes, and use cluster policies. Package Python libs in wheels; manage notebooks via Repos. Export/import via Terraform/Workspace APIs where applicable.
# Example: %pip install . (from repo root) to install your package
46) Observability & Monitoring
Track cluster metrics, Spark UI, query profiles, MLflow runs, and system tables. Emit custom logs/metrics to cloud monitoring. Set SLOs for latency, freshness, and cost.
-- Query system.information_schema for usage; schedule cost dashboards
47) Reliability Runbooks
Create playbooks for streaming backfills, checkpoint resets, schema evolution, and hotspot mitigation. Automate with Workflows and safe toggles in parameters.
-- Keep "safe backfill" notebooks with parameterized date ranges
48) Production Checklist
- Unity Catalog governance & least privilege
- Autoscaling & auto-termination enabled
- Delta OPTIMIZE/Z-ORDER; VACUUM retention set
- Streaming checkpoints & idempotent writes
- Cost tags, budgets, and monitoring dashboards
- Runbooks & on-call escalation paths
49) Common Pitfalls
Tiny files explosion, missing Z-ORDER, under/over-partitioning, ungoverned mounts, no autotermination, lack of schema contracts, and ad-hoc notebooks bypassing CI/CD.
50) Interview Q&A — 20 Practical Questions (Expanded)
1) Why Lakehouse? Unifies lake flexibility with warehouse performance via Delta, cutting duplicate stacks.
2) Delta advantages? ACID, schema handling, time travel, MERGE, OPTIMIZE/Z-ORDER.
3) Unity Catalog value? Central governance, lineage, fine-grained permissions, external locations.
4) Avoid tiny files? Batch writes, coalesce/repartition, Auto Loader tuning, periodic OPTIMIZE.
5) Batch vs streaming? Same Delta tables; streaming uses checkpoints & exactly-once sinks.
6) Photon when? SQL-heavy analytics and BI workloads; significant speedups.
7) Optimize joins? Broadcast small tables, partition/bucket large ones, Z-ORDER on filters.
8) Cost controls? Autoscaling, autoterminate, spot nodes, policies, monitoring with tags.
9) DLT benefits? Declarative pipelines, expectations, lineage, auto-recovery.
10) Schema evolution? Use Delta schema evolution flags; validate in silver before gold.
11) Data quality? DLT expectations, Great Expectations, constraints & quarantine.
12) ML lifecycle? MLflow for tracking, registry, serving, and A/B rollouts.
13) BI connectivity? Use SQL Warehouses + JDBC/ODBC connectors; manage grants in UC.
14) Medallion design? Raw bronze, cleaned silver, aggregated gold; contracts between layers.
15) Streaming recovery? Durable checkpoints, replayable sources, availableNow for catch-up.
16) Multi-env promotion? Repos + branches, jobs per env, pinned runtimes, Terraform/API.
17) Row/column security? UC row filters & column masks; test with least-privilege roles.
18) File layout? Partition by selective columns, target 128–512MB file sizes, avoid skew.
19) Debugging performance? Use Spark UI, query profile, AQE, and EXPLAIN.
20) When not Databricks? Tiny datasets, single-node ELT, or when a simpler DB/warehouse suffices.