Apache Spark Pocket Book — Uplatz

40 deep-dive flashcards • Wide layout • SQL & Streaming • Performance & Tuning • Interview Q&A

Cheat-friendly snippets • Clear mental models • Production-oriented tips

Section 1 — Fundamentals

1) What is Apache Spark?

Distributed compute engine for ETL, analytics, ML, and streaming. Optimizes query plans with Catalyst, executes with Tungsten code generation, and exposes RDD/DataFrame/Dataset/SQL APIs in Scala, Python, Java, and R.

spark-shell --master local[*]
val df = spark.read.parquet("s3://bucket/events/")
df.groupBy("country").count().show()

2) RDD vs DataFrame vs Dataset

RDD = low-level/untyped; DataFrame = columnar + optimized; Dataset = typed DataFrame (Scala/Java). Prefer DataFrame/Dataset for pushdown & planner optimizations; use RDDs for custom logic.

df.select("a","b").where($"a" > 10).groupBy("b").count()

3) Lazy Evaluation & Actions

Transformations are lazy; actions (count, collect, write) trigger execution. Stage boundaries appear at shuffles; reason about performance via explain().

val t = df.repartition(200,$"country").groupBy("country").count()
t.explain(true)

4) Catalyst & Tungsten

Catalyst builds and optimizes logical→physical plans; Tungsten improves memory management and cache locality and generates JVM bytecode (whole-stage codegen). Prefer built-ins; UDFs can block optimization.

import org.apache.spark.sql.functions._
val cleaned = df.withColumn("name", regexp_replace($"name", "\\s+", " "))

5) Partitions, Shuffles, Skew

Shuffles are expensive (disk + network). Reduce via broadcast joins, partition pruning, bucketing, and salting skewed keys. Enable AQE skew optimization.

spark.conf.set("spark.sql.adaptive.enabled","true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled","true")

6) Cluster Managers

Standalone, YARN, Kubernetes, and cloud distros (EMR/Dataproc). Driver placement matters (client vs cluster mode) for data proximity & network reachability.

spark-submit --master k8s://... --deploy-mode cluster --class com.app.Job app.jar

7) Config Hierarchy

Defaults → spark-defaults.conf → submit-time --conf → session spark.conf.set; later levels override earlier ones, but static configs (e.g., executor sizing) can't change at session level. Prefer infra-as-code (Helm/Terraform) for prod reproducibility.
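
A quick precedence sketch (values are placeholders): the session-level set overrides the submit-time flag, which overrides spark-defaults.conf.

spark-submit --conf spark.sql.shuffle.partitions=400 app.jar
spark.conf.set("spark.sql.shuffle.partitions","200")   // session-level set wins for this session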

8) File Layout & Small Files

Prefer large-ish Parquet files (e.g., 256MB). Too many small files kill performance. Compact with table maintenance (OPTIMIZE/VACUUM depending on format).
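
A minimal compaction sketch for plain Parquet (paths and target partition count are assumptions; Delta/Iceberg expose their own OPTIMIZE/rewrite commands):

val events = spark.read.parquet("/wh/events")
events.repartition(64).write.mode("overwrite").parquet("/wh/events_compacted")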

Section 2 — Spark SQL & Storage

9) Spark SQL Essentials

Register DataFrames as views, query with ANSI SQL. Projection/predicate pushdown accelerates scans on columnar formats.

spark.read.parquet("/data/sales").createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) amt FROM sales GROUP BY region")

10) Parquet / ORC

Columnar, compressed, splittable. Match compression (snappy/zstd) to workload. Keep schemas stable; evolve deliberately.
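
Choosing a codec at write time (path is a placeholder; zstd availability depends on your Spark/Parquet version):

df.write.option("compression","zstd").parquet("/wh/sales_zstd")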

11) Partitioning Strategy

Partition by commonly filtered columns with adequate cardinality. Avoid over-partitioning. Use partitionBy() at write; prune at read.

df.write.partitionBy("dt","country").mode("overwrite").parquet("/wh/sales")

12) Table Formats (Delta/Iceberg/Hudi)

ACID transactions, time travel, schema evolution. Pick based on multi-engine support, CDC, and governance needs.

spark.read.format("delta").load("/delta/events")

13) Joins: Broadcast vs Sort-Merge

Broadcast small side for small→big; sort-merge for big→big. Use stats or hints; watch skew. Bucketing helps repeatable joins.

import org.apache.spark.sql.functions.broadcast
val out = big.join(broadcast(dim), Seq("id"))

14) Window Functions

Powerful for analytics (rank, running totals). Partition wisely to control shuffle and state.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val w = Window.partitionBy("country").orderBy($"ts")
df.withColumn("rn", row_number().over(w))

15) UDFs & pandas UDFs

Built-ins are Catalyst-aware. pandas/Arrow UDFs vectorize in PySpark; standard UDFs can be slower and block optimization—use sparingly.
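
A Scala sketch contrasting a built-in with a hand-rolled UDF (df and the name column are assumptions); the built-in stays visible to Catalyst, the UDF is opaque:

import org.apache.spark.sql.functions.{udf, upper}
val upperUdf = udf((s: String) => if (s == null) null else s.toUpperCase)
df.withColumn("name_uc", upper($"name"))     // preferred: optimizer-aware built-in
df.withColumn("name_uc", upperUdf($"name"))  // works, but blocks pushdown/codegen around it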

16) Caching & Persistence

Cache hot DataFrames; always unpersist() when done. Persist with storage levels for expensive recomputation.

df.cache(); df.count(); df.unpersist()

Section 3 — Structured Streaming & State

17) Model

Micro-batch (default) or continuous processing. Exactly-once with checkpoints + idempotent/transactional sinks. Watermarks bound state.

import org.apache.spark.sql.functions.window
val q = df.withWatermark("event_time","10 minutes")
  .groupBy(window($"event_time","5 minutes"),$"user").count()
  .writeStream.format("delta").option("checkpointLocation","/chk/s1").start("/out/s1")

18) Sources & Sinks

Sources: Kafka, files, sockets, Kinesis. Sinks: files, Kafka, Delta/Iceberg, memory (dev). Favor transactional table formats for exactly-once.
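
A minimal Kafka source sketch (broker address and topic are placeholders):

val raw = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers","broker:9092")
  .option("subscribe","events")
  .load()
  .selectExpr("CAST(key AS STRING)","CAST(value AS STRING)")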

19) Watermarking

Tells Spark how late events can arrive; drops state beyond threshold to control memory/latency.

df.withWatermark("ts","15 minutes")
  .groupBy(window($"ts","10 minutes"),$"k").count()

20) State Store & Checkpoints

State is kept per key/window in the executors' state store (in-memory HDFS-backed by default, RocksDB optional); checkpoints store offsets, progress, and metadata. Put both on fast, reliable storage; never share a checkpoint location between queries.
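
Illustrative settings (paths are placeholders; the RocksDB provider requires Spark 3.2+):

spark.conf.set("spark.sql.streaming.stateStore.providerClass",
  "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
df.writeStream.format("delta").option("checkpointLocation","/chk/job-a").start("/out/a")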

21) Triggers & Throughput

Trigger.ProcessingTime for micro-batches; tune the interval to balance latency vs cost. Cap per-batch input with source rate limits (e.g., maxOffsetsPerTrigger for Kafka, maxFilesPerTrigger for files).
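
A sketch pairing a processing-time trigger with a Kafka rate limit (interval, limit, and paths are illustrative):

import org.apache.spark.sql.streaming.Trigger
df.writeStream
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .format("delta").option("checkpointLocation","/chk/s2").start("/out/s2")
// on the source side, .option("maxOffsetsPerTrigger","100000") caps Kafka rows per micro-batch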

22) Exactly-Once Gotchas

Sinks must be idempotent or transactional. Avoid side-effecting UDFs; include unique keys for dedupe where needed.
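
A dedupe sketch, assuming stream is a streaming DataFrame whose events carry a unique event_id and an event_time column:

val deduped = stream
  .withWatermark("event_time","30 minutes")
  .dropDuplicates("event_id","event_time")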

23) Streaming Joins

Stream-static joins (common) vs stream-stream (requires state & watermarks on both sides). Watch memory and late data handling.
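
A stream-stream join sketch following the standard pattern (stream and column names are placeholders; both sides need watermarks plus a time-bounded condition):

import org.apache.spark.sql.functions.expr
val imps = impressions.withWatermark("impressionTime","2 hours")
val clks = clicks.withWatermark("clickTime","3 hours")
imps.join(clks, expr(
  "impressionAdId = clickAdId AND clickTime BETWEEN impressionTime AND impressionTime + interval 1 hour"))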

24) Monitoring Streaming Apps

Track input rows/sec, batch duration, state size, and watermark progress via Spark UI / metrics. Alert on stalled batches and growing state.
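
Programmatic per-query metrics, assuming q is the handle returned by start():

println(q.status)        // waiting for data vs. processing
println(q.lastProgress)  // JSON with input rate, batch duration, state rows, watermark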

Section 4 — Performance, Tuning & Reliability

25) Adaptive Query Execution (AQE)

Coalesces shuffle partitions, switches join strategies, and mitigates skew at runtime. Enabled by default since Spark 3.2; keep it on.
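
Related knobs worth knowing (values are illustrative; see card 5 for the skew-join toggle):

spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled","true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes","128m")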

26) Shuffle Tuning

Tune spark.sql.shuffle.partitions to match cluster parallelism; use external shuffle service on legacy clusters; avoid excessive repartitions.

spark.conf.set("spark.sql.shuffle.partitions","200")

27) Skew Handling

Detect skewed keys (Spark UI, histograms). Mitigate using salting, pre-aggregation, or AQE skew join.
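
A manual salting sketch (salt factor, facts/dim names, and the id key are assumptions; try AQE skew join first):

import org.apache.spark.sql.functions._
val salted  = facts.withColumn("salt", (rand() * 16).cast("int"))
val dimExpl = dim.withColumn("salt", explode(array((0 until 16).map(i => lit(i)): _*)))
val joined  = salted.join(dimExpl, Seq("id","salt"))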

28) Memory & GC

Balance executor memory vs cores. Fewer, larger executors reduce overhead but risk GC pauses; observe GC metrics and adjust.
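
Illustrative sizing flags (actual values depend on node size and workload):

spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memoryOverhead=2g \
  app.jar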

29) Speculation & Retries

Enable speculative execution for stragglers. Configure task/retry limits and timeouts for resilience.

spark.speculation=true
spark.task.maxFailures=4

30) Predicate Pushdown & Pruning

Keep filters sargable; push computation to the source; prune partitions early via partition columns and metadata.
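
Checking pruning and pushdown in the plan (dt is assumed to be a partition column; path reuses the earlier example):

val day = spark.read.parquet("/wh/sales").where($"dt" === "2024-01-01" && $"amount" > 100)
day.explain(true)   // look for PartitionFilters / PushedFilters in the scan node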

31) Bucketing

Improves repeated joins/aggregations by pre-hashing on keys; works best when both sides share bucket spec. Requires management discipline.
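
A bucketed-table write sketch (bucket count and table name are placeholders); both join sides should share the bucket spec:

df.write.bucketBy(32,"id").sortBy("id").mode("overwrite").saveAsTable("sales_bucketed")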

32) File I/O Best Practices

Avoid coalesce(1) for production. Use append mode for streaming sinks, and periodic compaction for read efficiency.
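
Capping output file size at write time (the threshold and path are illustrative):

df.write.option("maxRecordsPerFile", 1000000L).mode("append").parquet("/wh/out")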

Section 5 — Deployment, Ops & Interview Q&A

33) Packaging & Submit

Shade/fat JARs for Scala/Java; zip/whl for PySpark; keep Python env versions consistent across driver/executors (use containers).

spark-submit --class com.app.Main --conf spark.executor.instances=10 app.jar

34) Observability

Enable event logs, use History Server; export Dropwizard metrics to Prometheus; make Grafana dashboards for shuffle, GC, and stage times.

--conf spark.eventLog.enabled=true
--conf spark.metrics.conf=metrics.properties

35) Security Basics

Restrict data access in the warehouse, enable encryption at rest/in transit, and sanitize UDF inputs. Handle secrets via env/secret stores (not code).

36) Cost Controls

Right-size clusters; turn on auto-scaling; cache wisely; materialize expensive queries; schedule off-peak batch windows.

37) Testing Strategy

Unit test transforms with local sessions; golden-file tests for SQL; contract tests for schemas; sample production data in pre-prod.
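
A minimal local-session test sketch (no test framework shown; data and assertion are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum
val spark = SparkSession.builder.master("local[2]").appName("unit-test").getOrCreate()
import spark.implicits._
val input  = Seq(("us", 10), ("us", 5)).toDF("country", "amount")
val totals = input.groupBy("country").agg(sum("amount").as("total")).collect()
assert(totals.head.getLong(1) == 15L)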

38) Common Pitfalls

Small-file explosions, over-partitioning, unbounded state in streaming, heavy Python UDFs, missing checkpoints, and mixing table formats without a plan.

39) Interview Q&A — Quick Hits (1/2)

How to reduce shuffles? Broadcast small tables, prune early, bucket for repeated joins, enable AQE.

Why avoid standard UDFs? They can block Catalyst optimizations; prefer built-ins or pandas UDFs.

Exactly-once streaming? Checkpoint + idempotent/transactional sinks; unique keys for dedupe.

Skew mitigation? Salting, pre-agg, AQE skew join, data model changes.

40) Interview Q&A — Quick Hits (2/2)

RDD vs DataFrame? Use DataFrame for planner optimizations; RDD for custom logic.

Delta vs Iceberg vs Hudi? Choose by ecosystem breadth, CDC support, and multi-engine reads.

Why History Server? Post-mortem analysis of completed apps; capacity planning.

When to bucket? Stable, repeated joins on same keys; accept write-time constraints for read wins.