Apache Kafka Pocket Book

Apache Kafka Pocket Book — Uplatz

50 Expanded Cards • One-Column Colorful Layout • Fundamentals · Ops · Dev · Streams · Connect · Security · 20 Interview Q&A

1) What is Apache Kafka?

Kafka is a distributed event streaming platform for publishing, subscribing, storing, and processing streams of records at scale. It powers real-time pipelines (ETL/ELT), event-driven microservices, and analytics. Its append-only commit logs, partitioning, and replication enable high throughput and fault tolerance.

2) Core Building Blocks

  • Topic — named stream of records.
  • Partition — ordered, immutable log per topic shard (ordering guaranteed per partition).
  • Producer — app that publishes records.
  • Consumer — app that reads records.
  • Broker — Kafka server storing partitions.
  • Cluster — group of brokers.

3) Why Kafka vs. Traditional MQ?

Classic MQs focus on transient queues and immediate delivery. Kafka decouples producers/consumers with durable logs, long retention, replayability, and horizontal scale. It’s optimized for throughput and event streaming rather than request/response RPC.

4) Partitioning & Keys

Producers optionally send a key per record. A partitioner maps key → partition (hash by default). All records with the same key land on the same partition, preserving their relative order.

# Java Producer: keyed send
producer.send(new ProducerRecord<>("payments", userId, payload));

5) Offsets & Consumer Groups

Each record in a partition has a sequential offset. Consumers in the same group share partitions (each partition assigned to exactly one consumer at a time), enabling parallelism and scalability. Offsets are committed so consumption can resume after restarts.

6) Delivery Semantics

  • At-most-once — commit before processing; no dupes, possible loss.
  • At-least-once — commit after processing; no loss, possible dupes.
  • Exactly-once — idempotent + transactional processing; no loss, no dupes.
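
Minimal sketch of the at-most-once variant (commit before processing); the at-least-once pattern is the commit-after loop in card 13. Assumes the card 12 consumer and an illustrative process() handler:

while (true) {
  ConsumerRecords<String,String> recs = consumer.poll(Duration.ofMillis(500));
  consumer.commitSync();                                     // commit first: offsets advance...
  for (ConsumerRecord<String,String> r : recs) process(r);   // ...so a crash here loses this batch
}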

7) Retention & Compaction

Time/size retention keeps records for a configured period or total size (retention.ms / retention.bytes). Log compaction retains only the latest record per key (ideal for changelog/state topics). The two can be combined (cleanup.policy=compact,delete) so compacted topics still expire old segments.
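
A sketch of creating a compacted changelog topic with the Admin API (topic name, partition count, and dirty-ratio value are illustrative):

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
try (Admin admin = Admin.create(props)) {
  NewTopic t = new NewTopic("user-profiles", 6, (short) 3)        // 6 partitions, RF=3
      .configs(Map.of("cleanup.policy", "compact",
                      "min.cleanable.dirty.ratio", "0.1"));       // compact more eagerly
  admin.createTopics(List.of(t)).all().get();
}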

8) Brokers, Leaders & Replicas

Each partition has a leader replica that handles reads/writes and follower replicas for redundancy. If a leader fails, an in-sync follower becomes leader (controller manages elections).

9) KRaft (No ZooKeeper)

Kafka can run in KRaft mode (production-ready since 3.3; ZooKeeper support is removed in Kafka 4.0), replacing ZooKeeper with a built-in Raft-based metadata quorum. Benefits: simpler ops, fewer moving parts, consistent metadata handling.

10) Typical Use Cases

  • Streaming ETL (DB → Kafka → S3/DWH)
  • Event sourcing & CQRS in microservices
  • Log/metrics collection at scale
  • Real-time recommendations & fraud detection

11) Producer Essentials (Java)

Properties p = new Properties();
p.put("bootstrap.servers","localhost:9092");
p.put("key.serializer","org.apache.kafka.common.serialization.StringSerializer");
p.put("value.serializer","org.apache.kafka.common.serialization.StringSerializer");
p.put("acks","all");                   // strongest durability
p.put("retries","10");                 // retry on transient failures
p.put("enable.idempotence","true");    // broker dedupes retried sends
KafkaProducer<String,String> producer = new KafkaProducer<>(p);

Use acks=all + enable.idempotence=true for durability + no duplicates at source.

12) Consumer Essentials (Java)

Properties p = new Properties();
p.put("bootstrap.servers","localhost:9092");
p.put("group.id","billing-service");
p.put("key.deserializer","org.apache.kafka.common.serialization.StringDeserializer");
p.put("value.deserializer","org.apache.kafka.common.serialization.StringDeserializer");
p.put("enable.auto.commit","false");   // manual commits
KafkaConsumer<String,String> c = new KafkaConsumer<>(p);
c.subscribe(Arrays.asList("payments"));

Prefer manual (sync/async) commits after successful processing to avoid data loss.

13) Manual Offset Commit Pattern

while (true) {
  ConsumerRecords<K,V> recs = consumer.poll(Duration.ofMillis(500));
  for (ConsumerRecord<K,V> r : recs) process(r);
  consumer.commitSync(); // after processing batch
}

Commit after handling a batch to balance reliability and throughput.

14) Rebalancing & Sticky Assignor

When group membership changes, partitions are reassigned. Use sticky assignors to reduce churn and leverage cooperative rebalancing for smoother transitions.
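
A sketch of the consumer setting that enables this (the assignor class ships with the Kafka client; p is the consumer Properties from card 12):

p.put("partition.assignment.strategy",
      "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");  // sticky + cooperative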

15) Handling Poison Messages

For bad records, avoid blocking the stream. Send invalid messages to a DLQ (dead letter queue) topic with error metadata; monitor and reprocess later.

// sketch: dlqProducer is a separate KafkaProducer for the DLQ topic
try { process(record); }
catch (Exception e) {
  ProducerRecord<String,String> dlq = new ProducerRecord<>("payments.DLQ", record.key(), record.value());
  dlq.headers().add("error", e.toString().getBytes());   // error metadata travels as a header
  dlqProducer.send(dlq); }

16) Throughput Levers

  • Producers: linger.ms, batch.size, compression (lz4, snappy)
  • Consumers: poll loop cadence, max poll records, async processing
  • Topic layout: more partitions (up to a point) for parallelism
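
A sketch of the producer-side levers (values are illustrative starting points, not recommendations; p is the producer Properties from card 11):

p.put("linger.ms", "20");            // wait up to 20 ms to fill larger batches
p.put("batch.size", "65536");        // 64 KB batch target per partition
p.put("compression.type", "lz4");    // compress whole batches on the wire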

17) Ordering Guarantees

Ordering is per partition. To maintain order across related events, use a stable key (e.g., accountId). Avoid changing partition counts frequently on strictly ordered topics.

18) Idempotent Producers

With enable.idempotence=true, brokers dedupe producer re-sends using producerId/sequence numbers, preventing duplicates on retry.

19) Transactions & Exactly-Once

Use producer transactions to atomically write to multiple partitions/topics and commit consumer offsets in the same transaction (read-process-write pattern) for EOS.

producer.initTransactions();           // requires transactional.id in producer config
producer.beginTransaction();
// read from input, process, send to output
producer.sendOffsetsToTransaction(offsets, consumerGroupId);
producer.commitTransaction();          // downstream consumers should use isolation.level=read_committed

20) Serialization & Schema Management

Use Avro/Protobuf/JSON Schema with a Schema Registry to enforce compatibility (backward/forward/full). This avoids breaking consumers on schema evolution.
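
A sketch of producer serializer settings when values are Avro and a Confluent Schema Registry is used (registry URL is a placeholder; the value serializer assumes the Confluent serializer dependency is on the classpath):

p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
p.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
p.put("schema.registry.url", "http://schema-registry:8081");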

21) Topic Design & Sizing

  • Estimate events/sec × record size → throughput.
  • Decide partitions for consumer parallelism headroom.
  • Replication factor (3 typical) for HA.
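
Rough worked example (numbers are illustrative): 20,000 events/sec × 1 KB ≈ 20 MB/s produced; at RF=3 that is ~60 MB/s of write bandwidth across brokers, plus consumer fetch traffic. If one consumer instance handles ~2 MB/s, plan for at least 10 partitions, with headroom for growth.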

22) Retention Policies

Set retention.ms / retention.bytes per topic. Use tiered storage (if available in your distro) for cheaper long-term retention without overloading brokers.
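
A sketch of changing retention on an existing topic via the Admin API (admin is an org.apache.kafka.clients.admin.Admin client as in the card 7 sketch; topic and value are illustrative):

ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "payments");
AlterConfigOp op = new AlterConfigOp(
    new ConfigEntry("retention.ms", "604800000"),   // 7 days
    AlterConfigOp.OpType.SET);
admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();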

23) Log Compaction Use Cases

Perfect for changelog/state topics: user profiles, account balances, last known device state. Combine with snapshots for fast rebuilds.

24) Monitoring Must-Haves

  • Broker health: CPU, disk, network, GC
  • Lag per consumer group/partition
  • Request/response rates, produce/fetch latency
  • Under-replicated partitions (URP) & offline replicas

25) Security (AuthN/Z & Encryption)

  • TLS in transit
  • SASL (SCRAM/OAuth) for authentication
  • ACLs for topic/cluster authorization
  • Principle of least privilege
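
A sketch of client-side settings for TLS in transit plus SCRAM authentication (mechanism, username, and password are placeholders; p is the client Properties):

p.put("security.protocol", "SASL_SSL");
p.put("sasl.mechanism", "SCRAM-SHA-512");
p.put("sasl.jaas.config",
    "org.apache.kafka.common.security.scram.ScramLoginModule required " +
    "username=\"app\" password=\"secret\";");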

26) Capacity & Scale

Add brokers to scale storage/IO; rebalance partitions; consider rack awareness to spread replicas across failure domains.

27) Multi-Region & DR

Use cluster-to-cluster replication (MirrorMaker 2 or similar tools) for DR and regional reads. Choose active-active or active-passive patterns based on consistency and failover needs.

28) Upgrades & Rolling Restarts

Perform rolling upgrades broker-by-broker; monitor URP/latency; validate inter-broker protocol compatibility. Back up critical metadata.

29) Troubleshooting Lag

  • Increase consumer parallelism (more instances/partitions)
  • Speed processing (batch, async, faster sinks)
  • Tune max.poll.interval.ms, max.poll.records
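
A sketch of the consumer settings involved (values are illustrative; p is the consumer Properties from card 12):

p.put("max.poll.records", "200");          // smaller batches when processing is slow
p.put("max.poll.interval.ms", "600000");   // allow up to 10 min between polls
p.put("fetch.min.bytes", "1048576");       // fetch in larger chunks for throughput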

30) Cost Awareness

Right-size partitions, apply compression, avoid tiny messages (batch), offload cold data, and watch cross-zone transfer fees in cloud.

31) Kafka Streams: KTable vs KStream

KStream is an unbounded stream of records. KTable is a changelog stream representing the latest value per key (like a materialized view). Joins and aggregations differ between them.

32) Streams DSL Example (Java)

StreamsBuilder b = new StreamsBuilder();
KStream<String,String> in = b.stream("orders");
KTable<String,Long> counts = in
  .groupByKey()
  .count(Materialized.as("order-counts"));
counts.toStream().to("orders.counts", Produced.with(Serdes.String(), Serdes.Long()));

State stores back this aggregation; compacted changelogs ensure fault tolerance.

33) Windowing & Time Semantics

Use hopping/tumbling/session windows; choose event time and configure a grace period for late data. The grace period bounds how late a record may arrive and still update its window.
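
A sketch of a 5-minute tumbling-window count with a 1-minute grace period, continuing the card 32 stream (names are illustrative):

KTable<Windowed<String>, Long> windowed = in
  .groupByKey()
  .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(5), Duration.ofMinutes(1)))
  .count();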

34) Interactive Queries

Expose Streams state stores (read-only) via a service to query materialized views (e.g., aggregate per customer). Ensure metadata routing for the right host.
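
A sketch of querying the card 32 store from a running KafkaStreams instance (streams is that instance; the key is illustrative):

ReadOnlyKeyValueStore<String, Long> store = streams.store(
    StoreQueryParameters.fromNameAndType("order-counts",
        QueryableStoreTypes.keyValueStore()));
Long count = store.get("order-42");   // read the materialized view locally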

35) Error Handling in Streams

Use deserialization exception handlers, production exception handlers, and DLQs. Consider processing.guarantee=exactly_once_v2 for exactly-once processing.
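
A sketch of the relevant Streams configs (both the handler class and the guarantee value ship with kafka-streams; props is the Streams configuration):

props.put("default.deserialization.exception.handler",
    "org.apache.kafka.streams.errors.LogAndContinueExceptionHandler");  // skip bad records
props.put("processing.guarantee", "exactly_once_v2");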

36) Kafka Connect Basics

Connectors are Source (external → Kafka) or Sink (Kafka → external). Distributed workers scale out; configs can be REST-managed. Use transforms (SMTs) to tweak payloads.

37) JDBC Source & S3 Sink (Example)

// POST /connectors
{
  "name": "jdbc-orders-src",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db:5432/app",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "db.orders."
  }
}
// Sink to S3: "connector.class":"io.confluent.connect.s3.S3SinkConnector"

38) Schema Registry & Compatibility

Register subject versions and enforce compatibility (BACKWARD/FORWARD/FULL). Add fields with defaults; avoid removing required fields to keep consumers happy.

39) ksqlDB (SQL on Streams)

Define streams/tables with SQL, perform joins and aggregations without writing Java. Good for rapid prototyping and analytics on Kafka data.

40) Local Dev & Testing

Use Docker Compose to spin up Kafka + Schema Registry + Connect. For integration tests, Testcontainers can start a real broker in Docker; for unit tests, Kafka Streams' TopologyTestDriver runs topologies without a broker and lets you advance time to test windowing.
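
A sketch of a TopologyTestDriver unit test for the card 32 topology (no broker needed; props holds the usual application.id/bootstrap.servers, serde choices are illustrative):

TopologyTestDriver driver = new TopologyTestDriver(b.build(), props);
TestInputTopic<String,String> orders = driver.createInputTopic(
    "orders", new StringSerializer(), new StringSerializer());
orders.pipeInput("user-1", "order-1");
TestOutputTopic<String,Long> out = driver.createOutputTopic(
    "orders.counts", new StringDeserializer(), new LongDeserializer());
System.out.println(out.readKeyValue());   // KeyValue(user-1, 1)
driver.close();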

41) Q: Kafka vs RabbitMQ?

A: Kafka is a distributed log for high-throughput streaming and long retention; RabbitMQ is a broker for traditional queues and immediate delivery. Kafka excels at replay, partitioned scale, and stream processing ecosystems.

42) Q: How does Kafka ensure durability?

A: Replication (RF≥3), commit to leader + in-sync replicas (acks=all), fsync policies, and recovery via logs. Idempotent producers prevent dupes on retry.

43) Q: What is consumer lag and how to reduce it?

A: Lag = latest offset − committed offset. Reduce by increasing consumer instances/partitions, optimizing processing (batch/async), and tuning poll/commit parameters.
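
A sketch of computing lag with the Admin API (admin is an org.apache.kafka.clients.admin.Admin client; the group name is illustrative):

Map<TopicPartition, OffsetAndMetadata> committed = admin
    .listConsumerGroupOffsets("billing-service")
    .partitionsToOffsetAndMetadata().get();
Map<TopicPartition, OffsetSpec> latest = new HashMap<>();
committed.keySet().forEach(tp -> latest.put(tp, OffsetSpec.latest()));
admin.listOffsets(latest).all().get().forEach((tp, info) ->
    System.out.println(tp + " lag=" + (info.offset() - committed.get(tp).offset())));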

44) Q: When to use log compaction?

A: For changelog/state topics where only the latest value per key matters (e.g., account state). It enables fast rebuilds of state stores/materialized views.

45) Q: Exactly-Once Processing setup?

A: Enable idempotent + transactional producer; wrap read-process-write and offset commits in a single transaction using sendOffsetsToTransaction. Streams can enable processing.guarantee=exactly_once_v2.

46) Q: Ensuring message ordering?

A: Use a stable key so related messages go to the same partition; avoid concurrent producers for the same key if strict order is critical; minimize partition count changes.

47) Q: Hot partition — what and why?

A: When skewed keys route most traffic to a single partition causing imbalance. Fix via better key selection (hash of multiple fields) or custom partitioner; sometimes increase partitions.
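
A sketch of a custom Partitioner that spreads a single known hot key across partitions (the hot key is illustrative, null keys are omitted for brevity, and scattering sacrifices per-key ordering for that key):

public class HotKeyPartitioner implements Partitioner {
  public int partition(String topic, Object key, byte[] keyBytes,
                       Object value, byte[] valueBytes, Cluster cluster) {
    int n = cluster.partitionsForTopic(topic).size();
    if ("mega-tenant".equals(key))                           // known hot key: scatter it
      return ThreadLocalRandom.current().nextInt(n);
    return Utils.toPositive(Utils.murmur2(keyBytes)) % n;    // hash-style routing for other keys
  }
  public void close() {}
  public void configure(Map<String, ?> configs) {}
}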

48) Q: Common production pitfalls?

A: Auto-commit before processing (data loss risk), no schema compatibility, too many small messages (overhead), ignoring lag/URP alerts, insufficient partitions for growth.

49) Q: Multi-region replication strategies?

A: Active-active (bidirectional with conflict strategy) for read locality; active-passive for simpler DR. Use cluster replication tools, filter topics by need, and test failover regularly.

50) Q: Quick design for payments pipeline?

A: Topics: payments.in (validated events), payments.auth, payments.settlement, payments.DLQ. Keys: customer/account id. Semantics: idempotent + transactional producer for EOS. Schema: Avro with Registry (backward). Ops: RF=3, acks=all, lag/URP monitoring, compaction for account.state changelog.