Logstash Pocket Book — Uplatz

50 in-depth cards • Wide layout • Real-world configs • 20-question interview Q&A included

Section 1 — Foundations

1) What is Logstash?

Open-source data processing pipeline that ingests from many sources, transforms with filters, and ships to outputs like Elasticsearch, S3, Kafka, DBs.

Flow: input { } → filter { } → output { }
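
A minimal sketch of that flow (stdin/stdout chosen purely for illustration):

input  { stdin { } }
filter { mutate { add_field => { "demo" => "true" } } }
output { stdout { codec => rubydebug } }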

2) Core Concepts

Events are JSON-like documents. Pipelines define the processing stages. Plugins (inputs/filters/outputs) do the heavy lifting. Logstash runs on the JVM, with optional disk-backed persistent queues.

3) Install & Run

Install via packages/Docker. Validate configs before running.

bin/logstash -t -f pipeline.conf
bin/logstash -f pipeline.conf

4) Config Structure

When a pipeline points at multiple config files, they are concatenated into one pipeline in lexicographical filename order. Filter order matters across that concatenation, so number the files to make the ordering explicit.
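
A common layout, with numeric prefixes making the concatenation order explicit (paths and names are illustrative):

/etc/logstash/conf.d/10-inputs.conf
/etc/logstash/conf.d/50-filters.conf
/etc/logstash/conf.d/90-outputs.conf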

5) Multiple Pipelines

Define many independent pipelines in pipelines.yml for isolation and scaling.

- pipeline.id: web
  path.config: pipelines/web/*.conf

6) Event Fields

Each event carries fields and metadata, with @timestamp, @version, and optionally tags.
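
Roughly what an event looks like when printed with the rubydebug codec (field values are illustrative):

{
       "message" => "GET /health 200",
    "@timestamp" => 2025-01-15T10:30:00.000Z,
      "@version" => "1",
          "tags" => ["web", "parsed"]
}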

7) Codec Basics

Codecs (json, line, multiline, avro) encode/decode data at inputs/outputs, shaping the event stream.
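
For example, decoding newline-delimited JSON at a TCP input and pretty-printing at the output (port is illustrative):

input  { tcp { port => 5000 codec => json_lines } }
output { stdout { codec => rubydebug } }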

8) Performance Knobs

-w pipeline workers, -b batch size, persistent queues, JVM heap, and filter placement affect throughput/latency.
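
The same knobs can be set in logstash.yml; the values below are starting points to benchmark against, not recommendations:

pipeline.workers: 4
pipeline.batch.size: 250
pipeline.batch.delay: 50
queue.type: persisted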

9) Observability

Enable monitoring APIs/metrics, dead letter queues, and log to file. Use stdout { codec => rubydebug } for debugging.
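
The monitoring API listens on port 9600 by default, e.g.:

curl -s http://localhost:9600/_node/stats/pipelines?pretty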

10) Q&A — “Why Logstash vs Beats?”

Answer: Beats ship logs efficiently; Logstash performs heavy parsing/enrichment, complex routing, aggregation, and multi-sink fan-out.

Section 2 — Inputs, Outputs & Routing

11) File & Multiline

Tail files; stitch stack traces with multiline codec.

input {
  file { path => "/var/log/app.log" start_position => "beginning"
    codec => multiline { pattern => "^\s" what => "previous" }
  }
}

12) Beats Input

Receive from Filebeat/Winlogbeat over the Lumberjack protocol.

input { beats { port => 5044 } }

13) Kafka Input/Output

Kafka provides buffering and scale; set group id, topics, and serialization codec.

input { kafka { bootstrap_servers => "k1:9092" topics => ["logs"] } }
output { kafka { topic_id => "parsed" } }

14) HTTP/TCP/UDP

Ingest via HTTP API or raw TCP/UDP; useful for custom sources.

input { http { port => 8080 } tcp { port => 5000 codec => json } }

15) JDBC Input

Poll databases and stream rows as events; track last_run metadata for incremental ingestion.

input {
  jdbc { jdbc_connection_string => "jdbc:postgresql://db/app"
         jdbc_user => "ro" jdbc_driver_class => "org.postgresql.Driver"
         # driver jar supplied via jdbc_driver_library or the classpath
         schedule => "*/5 * * * *"
         statement => "SELECT * FROM orders WHERE updated_at > :sql_last_value" }
}

16) Elasticsearch Output

Index to Elasticsearch with index pattern, action, and ILM compatibility.

output { elasticsearch { hosts => ["http://es:9200"] index => "app-%{+YYYY.MM.dd}" } }

17) S3 Output

Archive events to S3 with time-based key prefixes; enable gzip compression (the output's encoding option) to cut storage costs.

output { s3 { bucket => "logs-raw" prefix => "app/%{+YYYY}/%{+MM}/" codec => "json_lines" } }

18) Conditionals & Tags

Route by fields, tags, and regex matches.

output {
  if "error" in [tags] { elasticsearch { index => "errors-%{+YYYY.MM}" } }
  else { kafka { topic_id => "clean" } }
}

19) Dead Letter Queue (DLQ)

Capture events that failed to index (mapping errors). Reprocess later with a DLQ input pipeline.
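
Enable it in logstash.yml (the path shown is the package default; adjust for your install):

dead_letter_queue.enable: true
path.dead_letter_queue: /var/lib/logstash/dead_letter_queue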

20) Q&A — “Kafka vs direct ES?”

Answer: Kafka adds durability and decoupling at the cost of ops complexity; direct ES is simpler but less resilient to spikes.

Section 3 — Parsing, Enrichment & Transformation

21) Grok Basics

Parse unstructured logs into fields using patterns (COMMONAPACHELOG, COMBINEDAPACHELOG, custom).

filter { grok { match => { "message" => "%{COMBINEDAPACHELOG}" } } }

22) Custom Patterns

Extend grok with your own token definitions.

filter { grok { patterns_dir => ["./patterns"] match => { "message" => "%{MYAPP:msg}" } } }

23) Date Filter

Convert timestamp strings to @timestamp with timezone.

filter { date { match => ["time","dd/MMM/yyyy:HH:mm:ss Z"] target => "@timestamp" } }

24) Mutate Filter

Rename, remove, convert, add fields, or lowercase/uppercase values.

filter { mutate { rename => {"host" => "source.host"} convert => {"bytes" => "integer"} } }

25) JSON & KV

Parse JSON or key=value pairs embedded in messages.

filter { json { source => "message" } kv { source => "kv" field_split => " " } }

26) GeoIP & UA

Enrich IPs with GeoIP data; parse user agents for device/browser/OS.

filter { geoip { source => "client_ip" } useragent { source => "agent" } }

27) Dissect vs Grok

Dissect is faster, delimiter-based; use for well-structured tokens, grok for regex-heavy parsing.

filter { dissect { mapping => { "message" => "%{ts} %{level} %{msg}" } } }

28) Translate & DNS

Lookup/translate codes via dictionary files; resolve hostnames/IPs.

filter { translate { field => "status" destination => "status_text" dictionary => { "200" => "OK" } } }
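
For the DNS side, a minimal sketch (the field name is illustrative):

filter { dns { resolve => ["source_host"] action => "replace" } }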

29) Drop, Throttle, Clone

Drop noisy events, throttle by rate, or clone for multiple processing branches.

filter { if [level] == "debug" { drop { } } clone { clones => ["to_kafka"] } }

30) Q&A — “Grok too slow?”

Answer: Prefer dissect for simple separators, reduce regex backtracking, pre-filter with conditionals, and benchmark patterns.

Section 4 — Operations, Scaling & Reliability

31) Pipelines.yml

Isolate concerns: one pipeline per source or per tenant; easier to scale and deploy independently.
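
A sketch with two isolated pipelines (ids, paths, and settings are illustrative):

- pipeline.id: beats_ingest
  path.config: "/etc/logstash/pipelines/beats/*.conf"
  pipeline.workers: 2
- pipeline.id: kafka_bridge
  path.config: "/etc/logstash/pipelines/kafka/*.conf"
  queue.type: persisted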

32) Persistent Queues

Enable disk-backed queues to survive restarts and absorb spikes; tune capacity and checkpointing.

queue.type: persisted
queue.max_bytes: 4gb

33) JVM & GC

Set heap in jvm.options; monitor GC pauses; avoid over-allocating heap which can increase GC overhead.
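
In jvm.options, keep min and max heap equal (4g is only an example; size to your workload):

-Xms4g
-Xmx4g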

34) Pipeline Workers & Batch

Increase -w for CPU-bound filters; adjust -b for I/O-bound outputs; measure end-to-end latency.
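
For example (the values are starting points to measure against, not targets):

bin/logstash -f pipeline.conf -w 8 -b 500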

35) DLQ Reprocessing

Build a pipeline to read from DLQ, fix mappings, and reindex. Tag DLQ events for audit.

36) Backpressure & Retries

Outputs may block (ES bulk). Use retries, exponential backoff, and circuit-breaker routing to Kafka/S3.

37) High Availability

Run multiple Logstash instances behind LB or via Kafka fan-in; stateless designs ease scaling.

38) Security

TLS for Beats/Kafka/HTTP; mTLS where possible; secrets via keystore; limit network exposure.

bin/logstash-keystore create
bin/logstash-keystore add S3_SECRET
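
The stored secret can then be referenced in any pipeline config via ${...} expansion, e.g. in an S3 output (access key handling is analogous):

output {
  s3 { bucket => "logs-raw" secret_access_key => "${S3_SECRET}" }
}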

39) Monitoring

Use X-Pack Monitoring or Prometheus exporters; track events in/out, queue sizes, filter durations.
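
Besides pipeline stats, the hot-threads API helps attribute CPU time to specific filters:

curl -s http://localhost:9600/_node/hot_threads?pretty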

40) Q&A — “Why are events delayed?”

Answer: Blocking outputs, oversized batches, slow regex filters, or GC pauses. Address by tuning outputs, using dissect, and right-sizing heap.

Section 5 — Practical Recipes & Interview Q&A

41) Recipe: Nginx → ES

Parse access logs, set @timestamp, add geoip, index per day.

input { file { path => "/var/log/nginx/access.log" } }
filter {
  grok { match => { "message" => "%{COMBINEDAPACHELOG}" } }
  date { match => ["timestamp","dd/MMM/yyyy:HH:mm:ss Z"] }
  geoip { source => "clientip" }
}
output { elasticsearch { index => "nginx-%{+YYYY.MM.dd}" } }

42) Recipe: Beats fan-in + S3 archive

Ingest from Beats, enrich, and fan out to ES and a long-term S3 archive.

input { beats { port => 5044 } }
filter { mutate { add_tag => ["ingested_by_logstash"] } }
output {
  elasticsearch { index => "beats-%{+YYYY.MM.dd}" }
  s3 { bucket => "archive-logs" prefix => "beats/%{+YYYY}/%{+MM}/" }
}

43) Recipe: App JSON logs

JSON decode, drop debug, route errors.

filter {
  json { source => "message" }
  if [level] == "debug" { drop {} }
  if [level] == "error" { mutate { add_tag => ["error"] } }
}

44) Recipe: Enrich from CSV

Translate codes to names from a static CSV.

filter {
  translate {
    field => "country_code" destination => "country_name"
    dictionary_path => "/etc/logstash/countries.csv" exact => true
  }
}

45) Recipe: Mask PII

Use mutate/gsub to anonymize emails and card numbers.

filter {
  mutate { gsub => ["message","[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}","[EMAIL]"] }
}
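
A rough sketch for card numbers (the pattern is deliberately broad and will also catch other 13-16 digit runs; tighten it to the formats you actually log):

filter {
  mutate { gsub => ["message","\b(?:\d[ -]?){13,16}\b","[CARD]"] }
}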

46) Recipe: DLQ Reader

Read from DLQ, fix mapping, reindex.

input { dead_letter_queue { path => "/var/lib/logstash/dead_letter_queue" commit_offsets => true } }
output { elasticsearch { index => "recovered-%{+YYYY.MM.dd}" } }

47) Recipe: Kafka bridge

Ingest raw topic, parse, fan-out to clean topic and ES.

input { kafka { topics => ["raw-logs"] } }
filter { dissect { mapping => { "message" => "%{ts} %{lvl} %{msg}" } } }
output { kafka { topic_id => "clean-logs" } elasticsearch { index => "logs-%{+YYYY.MM.dd}" } }

48) Common Pitfalls

Regex-heavy grok without anchors, unbounded multiline, ignoring backpressure, huge heap causing long GC, single pipeline doing too much.
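
For instance, anchoring a grok pattern avoids expensive backtracking on lines that do not match:

filter { grok { match => { "message" => "^%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}$" } } }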

49) 30-Day Adoption Plan

Week 1: baseline ingestion • Week 2: parsing/PII masking • Week 3: S3 archive/Kafka decoupling • Week 4: HA, monitoring, DLQ.

50) Interview Q&A — 20 Practical Questions (Expanded)

1) Logstash vs Beats? Beats ship/forward; Logstash parses/enriches/routes and supports complex ETL.

2) When to use persistent queues? To survive ES/Kafka outages and absorb bursts; enables reliable at-least-once processing.

3) Grok vs Dissect? Dissect for delimiter-based fast parsing; grok for regex flexibility. Prefer dissect when possible.

4) How to handle multiline logs? Join them as early as possible: use the multiline codec on file/tcp inputs, or configure multiline in Filebeat itself (the beats input does not support the multiline codec); avoid patterns that glue unrelated lines.

5) What is DLQ? Dead Letter Queue stores events that failed output (e.g., ES mapping). Reprocess later.

6) How to prevent data loss? Persistent queues, Kafka buffering, idempotent outputs, retries, and checkpointing (JDBC last_run).

7) Scale strategies? Multiple pipelines, horizontal instances, Kafka fan-in, selective enrichment only where needed.

8) ES index naming best practice? Include app + date; align with ILM; avoid too many small indices.

9) Why are pipelines slow? Heavy grok, blocking outputs, tiny batch size, insufficient workers, GC pauses—profile and tune each.

10) Secure inputs? TLS/mTLS on Beats/HTTP/Kafka; keystore for secrets; network policies; JVM updates.

11) How to enrich with external data? translate filter, jdbc_streaming filter, or enrich in Kafka/ES ingest pipelines.

12) Ordering guarantees? Not strictly guaranteed end-to-end; if needed, group by key and use single-threaded paths.

13) Backpressure symptoms? Growing queues, rising latency, ES bulk rejections; add capacity, throttle, or buffer to Kafka.

14) JSON parsing errors? Validate source, use json { skip_on_invalid_json => true }, send invalid to a quarantine index.

15) Why split pipelines? Isolation and easier scaling/troubleshooting; avoid “god pipeline”.

16) Zero-downtime changes? Rolling deploy multiple instances; use feature flags/conditionals; validate with -t first.

17) How to mask PII? mutate gsub, anonymize fields, or custom ruby filter; ensure compliance and audit.

18) Monitoring must-haves? Event rates, queue depth, filter durations, JVM heap/GC, output failures.

19) Typical troubleshooting flow? Reproduce with stdin/stdout, add rubydebug codec, disable filters progressively, check logs/metrics.

20) When not to use Logstash? If only lightweight shipping needed (Filebeat) or when stream processing/joins/windows are required (use Kafka Streams/Flink).