Apache Flink Pocket Book
Low-latency streaming • Stateful processing • Time & windows • Checkpoints & savepoints • CEP • Connectors • Deploy & tune
1) What is Apache Flink?
Apache Flink is a distributed engine for stateful stream and batch processing with event-time semantics, exactly-once state consistency, and sub-second latency. It powers real-time analytics, fraud detection, anomaly alerts, and ETL.
# Quick start (local binary)
# download from flink.apache.org, then:
bin/start-cluster.sh
# UI on http://localhost:8081
2) Core Concepts
- DataStream API (streaming), Table/SQL API (relational), DataSet (legacy batch).
- Event time, processing time, ingestion time.
- State (keyed/operator), checkpoints, savepoints.
- Parallelism, slots, task managers.
3) Minimal Streaming Job (Java)
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.fromElements("a","b","a")
.keyBy(v -> v)
.map(v -> Tuple2.of(v, 1))
.keyBy(t -> t.f0)
.sum(1)
.print();
env.execute("WordCount");
4) Minimal Table SQL
TableEnvironment tEnv = TableEnvironment.create(...);
tEnv.executeSql("CREATE TEMPORARY TABLE clicks (user STRING, ts TIMESTAMP(3), WATERMARK FOR ts AS ts - INTERVAL '5' SECOND) WITH (...)");
tEnv.executeSql("SELECT user, TUMBLE_START(ts, INTERVAL '1' MINUTE), COUNT(*) FROM clicks GROUP BY user, TUMBLE(ts, INTERVAL '1' MINUTE)");
5) Connectors
Common sources/sinks: Kafka, Kinesis, Pulsar, Filesystem, JDBC, Elasticsearch, Iceberg, Hive. Use the corresponding connector JARs and matching versions.
6) Event-time with Watermarks
env.getConfig().setAutoWatermarkInterval(1000);
WatermarkStrategy<MyEvent> wm = WatermarkStrategy
.<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(10))
.withTimestampAssigner((e, ts) -> e.getTsMillis());
Pick out-of-orderness budget carefully; larger = more latency but fewer late events.
7) Windows
stream.keyBy(e -> e.user)
.window(TumblingEventTimeWindows.of(Time.minutes(1)))
.reduce(new MyReducer());
Also: sliding, session, global windows. Use triggers & evictors for custom emission.
8) Keyed State
class Dedup extends KeyedProcessFunction<String, Event, Event> {
private transient ValueState<Long> lastSeen;
public void open(OpenContext ctx){ lastSeen = getRuntimeContext()
.getState(new ValueStateDescriptor<>("last", Long.class)); }
public void processElement(Event e, Context ctx, Collector<Event> out) throws Exception {
Long prev = lastSeen.value();
if (prev == null || e.ts > prev) { lastSeen.update(e.ts); out.collect(e); }
}
}
9) Timers
Register event-time or processing-time timers for delayed actions (e.g., session close, reminders).
ctx.timerService().registerEventTimeTimer(e.ts + 60000);
10) Exactly-once with Checkpoints
env.enableCheckpointing(60000, CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30000);
env.getCheckpointConfig().setCheckpointTimeout(120000);
Use a persistent state backend (RocksDB) for large state.
11) Table API Joins & Windows
Table result = tEnv.sqlQuery("
SELECT user, COUNT(*) cnt
FROM clicks
WINDOW T AS (TUMBLE(ts, INTERVAL '1' MINUTE))
GROUP BY user, TUMBLE_START(ts, INTERVAL '1' MINUTE)");
12) CDC with Debezium
Ingest change streams from MySQL/Postgres via Debezium + Kafka into Flink SQL for materialized views and real-time ETL.
13) Upserts & Exactly-once Sinks
Use upsert-kafka, JDBC upserts, or file sinks with atomic commit. Ensure primary keys for correct changelog semantics.
14) Flink SQL Catalogs
Integrate with Hive/Iceberg catalogs for schema and table management; enables time travel and ACID in lakes.
15) Complex Event Processing (CEP)
Pattern<Event, ?> p = Pattern.<Event>begin("a")
.where(e -> e.type.equals("login"))
.next("b").where(e -> e.type.equals("transfer"))
.within(Time.minutes(10));
Apply to a keyed stream and select matches for fraud/anomaly scenarios.
16) Side Outputs
Split streams for errors/late data using OutputTag<T>
and ProcessFunction
.
17) ML & Feature Pipelines
Use Flink for real-time feature computation/aggregation and serve to online stores; model scoring can run via custom operators or external services.
18) Deployment Modes
- Standalone: simple clusters.
- YARN/Kubernetes: session or per-job clusters.
- Native K8s: Helm charts, operator-based management.
19) Savepoints & Upgrades
Create savepoints for planned upgrades and schema changes. Keep state compatibility (serializers, schema evolution) stable across versions.
bin/flink savepoint :jobId s3://state/savepoints/ -yid :yarnAppId
20) Observability & Tuning
- Monitor backpressure, checkpoint duration, busy time, GC.
- Right-size parallelism and task slots; avoid hot keys.
- Use RocksDB state backend for large state; tune memory/IO.
# Submit a job (fat JAR)
bin/flink run -d -p 8 -c com.acme.Job target/app-1.0.jar
21) Interview Q&A — 8 Quick Ones
1) Why Flink over Spark Structured Streaming? Lower latency, fine-grained event-time control, rich state/timers, CEP.
2) Exactly-once semantics? Checkpointed state + transactional sinks or idempotent upserts.
3) Backpressure? Propagates upstream; monitor and scale/slow sources, increase resources, or optimize operators.
4) Late data handling? Watermarks + allowed lateness + side outputs; or retractions in Table API.
5) State explosion? TTL on state, compaction, partitioning strategy, RocksDB.
6) Per-job vs session cluster? Per-job isolates dependencies; session improves utilization.
7) Scaling? Rescale with savepoint; adjust parallelism and repartition keys.
8) Schema evolution? Use formats with evolution (Avro/Protobuf), manage serializers carefully.