Apache Spark Pocket Book — Uplatz
40 deep-dive flashcards • Wide layout • SQL & Streaming • Performance & Tuning • Interview Q&A
Cheat-friendly snippets • Clear mental models • Production-oriented tips
1) What is Apache Spark?
Distributed compute engine for ETL, analytics, ML, and streaming. Optimizes query plans with Catalyst, executes with Tungsten codegen, and exposes RDD/DataFrame/Dataset/SQL APIs in Scala, Python, Java, and R.
spark-shell --master local[*]
val df = spark.read.parquet("s3://bucket/events/")
df.groupBy("country").count().show()
2) RDD vs DataFrame vs Dataset
RDD = low-level/untyped; DataFrame = columnar + optimized; Dataset = typed DataFrame (Scala/Java). Prefer DataFrame/Dataset for pushdown & planner optimizations; use RDDs for custom logic.
df.select("a","b").where($"a" > 10).groupBy("b").count()
3) Lazy Evaluation & Actions
Transformations are lazy; actions (count, collect, write) trigger execution. Stage boundaries appear at shuffles; reason about performance via explain().
val t = df.repartition(200,$"country").groupBy("country").count()
t.explain(true)
4) Catalyst & Tungsten
Catalyst builds/optimizes logical→physical plans; Tungsten improves memory management and cache locality and generates bytecode via whole-stage codegen. Prefer built-ins; UDFs can block optimization.
import org.apache.spark.sql.functions._
val cleaned = df.withColumn("name", regexp_replace($"name","\\s+"," "))
5) Partitions, Shuffles, Skew
Shuffles are expensive (disk + network). Reduce via broadcast joins, partition pruning, bucketing, and salting skewed keys. Enable AQE skew optimization.
spark.conf.set("spark.sql.adaptive.enabled","true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled","true")
6) Cluster Managers
Standalone, YARN, Kubernetes, and cloud distros (EMR/Dataproc). Driver placement matters (client vs cluster mode) for data proximity & network reachability.
spark-submit --master k8s://... --deploy-mode cluster --class com.app.Job app.jar
7) Config Hierarchy
Defaults → spark-defaults.conf → submit-time --conf → session spark.conf.set. Prefer infra-as-code (Helm/Terraform) for prod reproducibility.
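Minimal sketch of the precedence (values illustrative): a spark-defaults.conf value is overridden at submit time, and the session setting wins at runtime.
spark-submit --conf spark.sql.shuffle.partitions=300 app.jar
spark.conf.set("spark.sql.shuffle.partitions","200")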
8) File Layout & Small Files
Prefer large-ish Parquet files (e.g., 256MB). Too many small files kill performance. Compact with table maintenance (OPTIMIZE/VACUUM depending on format).
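A compaction sketch for a plain Parquet path (paths and target partition count are illustrative; Delta/Iceberg offer OPTIMIZE-style maintenance instead):
spark.read.parquet("/wh/raw_events")          // hypothetical small-file directory
  .repartition(64)                            // target file count ≈ partition count
  .write.mode("overwrite").parquet("/wh/raw_events_compacted")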
9) Spark SQL Essentials
Register DataFrames as views, query with ANSI SQL. Projection/predicate pushdown accelerates scans on columnar formats.
spark.read.parquet("/data/sales").createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) amt FROM sales GROUP BY region")
10) Parquet / ORC
Columnar, compressed, splittable. Match compression (snappy/zstd) to workload. Keep schemas stable; evolve deliberately.
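Setting the codec explicitly at write time (path is illustrative; zstd is available on recent Spark/Parquet versions):
df.write.option("compression","snappy").parquet("/wh/events_pq")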
11) Partitioning Strategy
Partition by commonly filtered columns with adequate cardinality. Avoid over-partitioning. Use partitionBy() at write; prune at read.
df.write.partitionBy("dt","country").mode("overwrite").parquet("/wh/sales")
12) Table Formats (Delta/Iceberg/Hudi)
ACID transactions, time travel, schema evolution. Pick based on multi-engine support, CDC, and governance needs.
spark.read.format("delta").load("/delta/events")
13) Joins: Broadcast vs Sort-Merge
Broadcast small side for small→big; sort-merge for big→big. Use stats or hints; watch skew. Bucketing helps repeatable joins.
import org.apache.spark.sql.functions.broadcast
val out = big.join(broadcast(dim), Seq("id"))
14) Window Functions
Powerful for analytics (rank, running totals). Partition wisely to control shuffle cost and state size.
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy("country").orderBy($"ts")
df.withColumn("rn",row_number.over(w))
15) UDFs & pandas UDFs
Built-ins are Catalyst-aware. pandas/Arrow UDFs vectorize in PySpark; standard UDFs can be slower and block optimization—use sparingly.
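A quick Scala contrast, assuming a simple name cleanup (normalize is a hypothetical helper): built-ins stay visible to Catalyst, UDFs are opaque.
import org.apache.spark.sql.functions._
val viaBuiltin = df.withColumn("name", trim(lower($"name")))                      // preferred: optimizer-aware
val normalize = udf((s: String) => if (s == null) null else s.trim.toLowerCase)
val viaUdf = df.withColumn("name", normalize($"name"))                            // opaque: only when no built-in fits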
16) Caching & Persistence
Cache hot DataFrames; always unpersist() when done. Persist with storage levels for expensive recomputation.
df.cache(); df.count(); df.unpersist()
17) Structured Streaming Model
Micro-batch (default) or continuous processing. Exactly-once with checkpoints + idempotent/transactional sinks. Watermarks bound state.
val q = df.withWatermark("event_time","10 minutes")
.groupBy(window($"event_time","5 minutes"),$"user").count()
.writeStream.format("delta").option("checkpointLocation","/chk/s1").start("/out/s1")
18) Sources & Sinks
Sources: Kafka, files, sockets, Kinesis. Sinks: files, Kafka, Delta/Iceberg, memory (dev). Favor transactional table formats for exactly-once.
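A Kafka source sketch (broker and topic are placeholders):
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers","broker1:9092")
  .option("subscribe","events")
  .load()
  .selectExpr("CAST(key AS STRING)","CAST(value AS STRING)")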
19) Watermarking
Tells Spark how late events can arrive; drops state beyond threshold to control memory/latency.
df.withWatermark("ts","15 minutes")
.groupBy(window($"ts","10 minutes"),$"k").count()
20) State Store & Checkpoints
State kept per key/window on disk; checkpoints store progress/metadata. Put them on fast, reliable storage; don’t share between jobs.
21) Triggers & Throughput
Trigger.ProcessingTime for micro-batches; tune batch interval to balance latency vs cost. Use rate limits and backpressure on sources.
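A sketch combining a source-side rate limit with a fixed trigger interval (broker, topic, and values are illustrative):
import org.apache.spark.sql.streaming.Trigger
val limited = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers","broker1:9092")
  .option("subscribe","events")
  .option("maxOffsetsPerTrigger","10000")     // cap rows pulled per micro-batch
  .load()
val q = limited.writeStream.format("console")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()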
22) Exactly-Once Gotchas
Sinks must be idempotent or transactional. Avoid side-effecting UDFs; include unique keys for dedupe where needed.
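A streaming dedupe sketch, assuming a streaming DataFrame events with a unique event_id column:
events.withWatermark("ts","10 minutes")
  .dropDuplicates("event_id","ts")            // state older than the watermark is evicted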
23) Streaming Joins
Stream-static joins (common) vs stream-stream (requires state & watermarks on both sides). Watch memory and late data handling.
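A stream-static enrichment sketch (streamDf, path, and user_id are illustrative):
val users = spark.read.parquet("/wh/users")                  // static dimension table
val enriched = streamDf.join(users, Seq("user_id"), "left")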
24) Monitoring Streaming Apps
Track input rows/sec, batch duration, state size, and watermark progress via Spark UI / metrics. Alert on stalled batches and growing state.
25) Adaptive Query Execution (AQE)
Coalesces partitions, switches join strategy, mitigates skew dynamically. Enabled by default since Spark 3.2; keep it on.
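The companion partition-coalescing switch (on by default alongside AQE in recent releases):
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled","true")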
26) Shuffle Tuning
Tune spark.sql.shuffle.partitions to match cluster parallelism; use external shuffle service on legacy clusters; avoid excessive repartitions.
spark.conf.set("spark.sql.shuffle.partitions","200")
27) Skew Handling
Detect skewed keys (Spark UI, histograms). Mitigate using salting, pre-aggregation, or AQE skew join.
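A salting sketch reusing big/dim from card 13 (bucket count of 10 is illustrative):
import org.apache.spark.sql.functions._
val buckets = 10
val saltedBig = big.withColumn("salt", (rand() * buckets).cast("int"))                   // spread hot keys randomly
val saltedDim = dim.withColumn("salt", explode(array((0 until buckets).map(lit): _*)))   // replicate small side per bucket
val joined = saltedBig.join(saltedDim, Seq("id","salt"))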
28) Memory & GC
Balance executor memory vs cores. Fewer, larger executors reduce overhead but risk GC pauses; observe GC metrics and adjust.
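Illustrative sizing flags (values depend on node shape; adjust from GC and spill metrics):
spark-submit --executor-memory 8g --executor-cores 4 --conf spark.executor.memoryOverhead=2g app.jar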
29) Speculation & Retries
Enable speculative execution for stragglers. Configure task/retry limits and timeouts for resilience.
spark.speculation=true
spark.task.maxFailures=4
30) Predicate Pushdown & Pruning
Keep filters sargable; push computation to the source; prune partitions early via partition columns and metadata.
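A pruning sketch against the partitioned layout from card 11 (dt/country are the partition columns; values illustrative):
spark.read.parquet("/wh/sales")
  .where($"dt" === "2024-01-01" && $"country" === "US")     // only matching partitions are scanned
  .select("amount")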
31) Bucketing
Improves repeated joins/aggregations by pre-hashing on keys; works best when both sides share bucket spec. Requires management discipline.
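A write-side sketch (bucket count and table name are illustrative; bucketBy requires saveAsTable):
df.write.bucketBy(32, "id").sortBy("id").mode("overwrite").saveAsTable("events_bucketed")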
32) File I/O Best Practices
Avoid coalesce(1) in production. Use append mode for streaming sinks, and periodic compaction for read efficiency.
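To control output file count without collapsing to one file, repartition before the write (target and path are illustrative):
df.repartition(64).write.mode("append").parquet("/wh/out")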
33) Packaging & Submit
Shade/fat JARs for Scala/Java; zip/whl for PySpark; keep Python env versions consistent across driver/executors (use containers).
spark-submit --class com.app.Main --conf spark.executor.instances=10 app.jar
34) Observability
Enable event logs, use History Server; export Dropwizard metrics to Prometheus; make Grafana dashboards for shuffle, GC, and stage times.
--conf spark.eventLog.enabled=true
--conf spark.metrics.conf=metrics.properties
35) Security Basics
Restrict data access in the warehouse, enable encryption at rest/in transit, and sanitize UDF inputs. Handle secrets via env/secret stores (not code).
36) Cost Controls
Right-size clusters; turn on auto-scaling; cache wisely; materialize expensive queries; schedule off-peak batch windows.
37) Testing Strategy
Unit test transforms with local sessions; golden-file tests for SQL; contract tests for schemas; sample production data in pre-prod.
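A minimal local-session test sketch (myTransform is a hypothetical function under test):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local[2]").appName("unit-test").getOrCreate()
import spark.implicits._
val input = Seq(("DE", 10), ("US", 20)).toDF("country", "amount")
assert(myTransform(input).count() == 2)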
38) Common Pitfalls
Small-file explosions, over-partitioning, unbounded state in streaming, heavy Python UDFs, missing checkpoints, mixing table formats without plan.
39) Interview Q&A — Quick Hits (1/2)
How to reduce shuffles? Broadcast small tables, prune early, bucket for repeated joins, enable AQE.
Why avoid standard UDFs? They can block Catalyst optimizations; prefer built-ins or pandas UDFs.
Exactly-once streaming? Checkpoint + idempotent/transactional sinks; unique keys for dedupe.
Skew mitigation? Salting, pre-agg, AQE skew join, data model changes.
40) Interview Q&A — Quick Hits (2/2)
RDD vs DataFrame? Use DataFrame for planner optimizations; RDD for custom logic.
Delta vs Iceberg vs Hudi? Choose by ecosystem breadth, CDC support, and multi-engine reads.
Why History Server? Post-mortem analysis of completed apps; capacity planning.
When to bucket? Stable, repeated joins on same keys; accept write-time constraints for read wins.