Apache Flink Pocket Book

Apache Flink Pocket Book

Low-latency streaming • Stateful processing • Time & windows • Checkpoints & savepoints • CEP • Connectors • Deploy & tune

Section 1 — Fundamentals

1) What is Apache Flink?

Apache Flink is a distributed engine for stateful stream and batch processing with event-time semantics, exactly-once state consistency, and sub-second latency. It powers real-time analytics, fraud detection, anomaly alerts, and ETL.

# Quick start (local binary)
# download from flink.apache.org, then:
bin/start-cluster.sh
# UI on http://localhost:8081

2) Core Concepts

DataStream API (streaming), Table/SQL API (relational), DataSet (legacy batch).
Event time, processing time, ingestion time.
State (keyed/operator), checkpoints, savepoints.
Parallelism, slots, task managers.

3) Minimal Streaming Job (Java)

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.fromElements("a","b","a")
   .keyBy(v -> v)
   .map(v -> Tuple2.of(v, 1))
   .keyBy(t -> t.f0)
   .sum(1)
   .print();
env.execute("WordCount");

4) Minimal Table SQL

TableEnvironment tEnv = TableEnvironment.create(...);
tEnv.executeSql("CREATE TEMPORARY TABLE clicks (user STRING, ts TIMESTAMP(3), WATERMARK FOR ts AS ts - INTERVAL '5' SECOND) WITH (...)");
tEnv.executeSql("SELECT user, TUMBLE_START(ts, INTERVAL '1' MINUTE), COUNT(*) FROM clicks GROUP BY user, TUMBLE(ts, INTERVAL '1' MINUTE)");

5) Connectors

Common sources/sinks: Kafka, Kinesis, Pulsar, Filesystem, JDBC, Elasticsearch, Iceberg, Hive. Use the corresponding connector JARs and matching versions.

Section 2 — Time, Windows & State

6) Event-time with Watermarks

env.getConfig().setAutoWatermarkInterval(1000);
WatermarkStrategy<MyEvent> wm = WatermarkStrategy
  .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(10))
  .withTimestampAssigner((e, ts) -> e.getTsMillis());

Pick out-of-orderness budget carefully; larger = more latency but fewer late events.

7) Windows

stream.keyBy(e -> e.user)
  .window(TumblingEventTimeWindows.of(Time.minutes(1)))
  .reduce(new MyReducer());

Also: sliding, session, global windows. Use triggers & evictors for custom emission.

8) Keyed State

class Dedup extends KeyedProcessFunction<String, Event, Event> {
  private transient ValueState<Long> lastSeen;
  public void open(OpenContext ctx){ lastSeen = getRuntimeContext()
    .getState(new ValueStateDescriptor<>("last", Long.class)); }
  public void processElement(Event e, Context ctx, Collector<Event> out) throws Exception {
    Long prev = lastSeen.value();
    if (prev == null || e.ts > prev) { lastSeen.update(e.ts); out.collect(e); }
  }
}

9) Timers

ctx.timerService().registerEventTimeTimer(e.ts + 60000);

10) Exactly-once with Checkpoints

env.enableCheckpointing(60000, CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30000);
env.getCheckpointConfig().setCheckpointTimeout(120000);

Use a persistent state backend (RocksDB) for large state.

Section 3 — Table/SQL & CDC

11) Table API Joins & Windows

Table result = tEnv.sqlQuery("
 SELECT user, COUNT(*) cnt
 FROM clicks
 WINDOW T AS (TUMBLE(ts, INTERVAL '1' MINUTE))
 GROUP BY user, TUMBLE_START(ts, INTERVAL '1' MINUTE)");

12) CDC with Debezium

Ingest change streams from MySQL/Postgres via Debezium + Kafka into Flink SQL for materialized views and real-time ETL.

13) Upserts & Exactly-once Sinks

Use upsert-kafka, JDBC upserts, or file sinks with atomic commit. Ensure primary keys for correct changelog semantics.

14) Flink SQL Catalogs

Integrate with Hive/Iceberg catalogs for schema and table management; enables time travel and ACID in lakes.

Section 4 — CEP, Patterns & ML

15) Complex Event Processing (CEP)

Pattern<Event, ?> p = Pattern.<Event>begin("a")
  .where(e -> e.type.equals("login"))
  .next("b").where(e -> e.type.equals("transfer"))
  .within(Time.minutes(10));

Apply to a keyed stream and select matches for fraud/anomaly scenarios.

16) Side Outputs

Split streams for errors/late data using OutputTag<T> and ProcessFunction.

17) ML & Feature Pipelines

Use Flink for real-time feature computation/aggregation and serve to online stores; model scoring can run via custom operators or external services.

Section 5 — Deploy, Operate & Tune

18) Deployment Modes

Standalone: simple clusters.
YARN/Kubernetes: session or per-job clusters.
Native K8s: Helm charts, operator-based management.

19) Savepoints & Upgrades

Create savepoints for planned upgrades and schema changes. Keep state compatibility (serializers, schema evolution) stable across versions.

bin/flink savepoint :jobId s3://state/savepoints/ -yid :yarnAppId

20) Observability & Tuning

Monitor backpressure, checkpoint duration, busy time, GC.
Right-size parallelism and task slots; avoid hot keys.
Use RocksDB state backend for large state; tune memory/IO.

# Submit a job (fat JAR)
bin/flink run -d -p 8 -c com.acme.Job target/app-1.0.jar

21) Interview Q&A — 8 Quick Ones

1) Why Flink over Spark Structured Streaming? Lower latency, fine-grained event-time control, rich state/timers, CEP.

2) Exactly-once semantics? Checkpointed state + transactional sinks or idempotent upserts.

3) Backpressure? Propagates upstream; monitor and scale/slow sources, increase resources, or optimize operators.

4) Late data handling? Watermarks + allowed lateness + side outputs; or retractions in Table API.

5) State explosion? TTL on state, compaction, partitioning strategy, RocksDB.

6) Per-job vs session cluster? Per-job isolates dependencies; session improves utilization.

7) Scaling? Rescale with savepoint; adjust parallelism and repartition keys.

8) Schema evolution? Use formats with evolution (Avro/Protobuf), manage serializers carefully.

Cutting-edge Technology Courses by Uplatz

Apache Flink Pocket Book

1) What is Apache Flink?

2) Core Concepts

3) Minimal Streaming Job (Java)

4) Minimal Table SQL

5) Connectors

6) Event-time with Watermarks

7) Windows

8) Keyed State

9) Timers

10) Exactly-once with Checkpoints

11) Table API Joins & Windows

12) CDC with Debezium

13) Upserts & Exactly-once Sinks

14) Flink SQL Catalogs

15) Complex Event Processing (CEP)

16) Side Outputs

17) ML & Feature Pipelines

18) Deployment Modes

19) Savepoints & Upgrades

20) Observability & Tuning

21) Interview Q&A — 8 Quick Ones