Logstash Pocket Book — Uplatz
50 in-depth cards • Wide layout • Real-world configs • 20-question interview Q&A included
1) What is Logstash?
Open-source data processing pipeline that ingests from many sources, transforms with filters, and ships to outputs like Elasticsearch, S3, Kafka, DBs.
Flow: input { } → filter { } → output { }
2) Core Concepts
Events are JSON-like. Pipelines define stages. Plugins (inputs/filters/outputs) do the heavy lifting. Runs on JVM with persistent queues option.
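A minimal sketch of one working pipeline (stdin/stdout chosen purely for illustration):
input  { stdin { } }
filter { mutate { add_field => { "pipeline" => "demo" } } }   # every event is a JSON-like map of fields
output { stdout { codec => rubydebug } }                      # prints the full event structure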
3) Install & Run
Install via packages/Docker. Validate configs before running.
bin/logstash -t -f pipeline.conf    # test the config (parse/compile only), then exit
bin/logstash -f pipeline.conf       # run the pipeline
4) Config Structure
Config files passed via path.config are concatenated (in lexicographical order) into a single pipeline, so filter and output order matters across files; guard unrelated flows with conditionals or tags.
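Because events flow through everything that gets concatenated, a common sketch (tag names assumed) guards each file with a conditional:
# 10-input-web.conf
input  { beats { port => 5044 tags => ["web"] } }
# 90-output-web.conf
output { if "web" in [tags] { elasticsearch { hosts => ["http://es:9200"] index => "web-%{+YYYY.MM.dd}" } } }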
5) Multiple Pipelines
Define many independent pipelines in pipelines.yml for isolation and scaling.
- pipeline.id: web
  path.config: pipelines/web/*.conf
6) Event Fields
Each event carries fields and metadata, with @timestamp, @version, and optionally tags.
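A sketch of nested field references and @metadata, which stays inside the pipeline and is never shipped to outputs (field names are assumptions):
filter { mutate { add_field => { "[@metadata][target_index]" => "app-%{+YYYY.MM.dd}" } } }
output { elasticsearch { index => "%{[@metadata][target_index]}" } }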
7) Codec Basics
Codecs (json, line, multiline, avro) encode/decode data at inputs/outputs, shaping the event stream.
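For example, json_lines decodes one JSON document per line on the way in and re-encodes on the way out (port is an assumption):
input  { tcp { port => 5000 codec => json_lines } }   # decode at the input
output { stdout { codec => json_lines } }             # encode at the output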
8) Performance Knobs
-w (pipeline workers), -b (batch size), persistent queues, JVM heap, and filter placement all affect throughput and latency.
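The same knobs in logstash.yml (values are illustrative starting points, not recommendations):
pipeline.workers: 4        # -w: threads running filters and outputs
pipeline.batch.size: 250   # -b: events each worker pulls per iteration
pipeline.batch.delay: 50   # ms to wait before flushing a partial batch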
9) Observability
Enable monitoring APIs/metrics, dead letter queues, and log to file. Use stdout { codec => rubydebug } for debugging.
10) Q&A — “Why Logstash vs Beats?”
Answer: Beats ship logs efficiently; Logstash performs heavy parsing/enrichment, complex routing, aggregation, and multi-sink fan-out.
11) File & Multiline
Tail files; stitch stack traces with multiline codec.
input {
file { path => "/var/log/app.log" start_position => "beginning"
codec => multiline { pattern => "^\s" what => "previous" }
}
}
12) Beats Input
Receive events from Filebeat/Winlogbeat over the Lumberjack protocol.
input { beats { port => 5044 } }
13) Kafka Input/Output
Kafka provides buffering and scale; set group id, topics, and serialization codec.
input { kafka { bootstrap_servers => "k1:9092" topics => ["logs"] } }
output { kafka { topic_id => "parsed" } }
14) HTTP/TCP/UDP
Ingest via HTTP API or raw TCP/UDP; useful for custom sources.
input { http { port => 8080 } tcp { port => 5000 codec => json } }
15) JDBC Input
Poll databases and stream rows as events; track last_run metadata for incremental ingestion.
input {
jdbc { jdbc_connection_string => "jdbc:postgresql://db/app"
jdbc_user => "ro" schedule => "*/5 * * * *"
statement => "SELECT * FROM orders WHERE updated_at > :sql_last_value" }
}
16) Elasticsearch Output
Index to Elasticsearch with index pattern, action, and ILM compatibility.
output { elasticsearch { hosts => ["http://es:9200"] index => "app-%{+YYYY.MM.dd}" } }
17) S3 Output
Archive events to S3 with time-based key prefixes; prefer gzip encoding to cut storage costs.
output { s3 { bucket => "logs-raw" prefix => "app/%{+YYYY}/%{+MM}/" codec => "json_lines" } }
18) Conditionals & Tags
Route by fields, tags, and regex matches.
output {
if "error" in [tags] { elasticsearch { index => "errors-%{+YYYY.MM}" } }
else { kafka { topic_id => "clean" } }
}
19) Dead Letter Queue (DLQ)
Capture events that failed to index (mapping errors). Reprocess later with a DLQ input pipeline.
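DLQ is off by default; a logstash.yml sketch (the path shown is typical for package installs and matches card 46):
dead_letter_queue.enable: true
dead_letter_queue.max_bytes: 1gb
path.dead_letter_queue: /var/lib/logstash/dead_letter_queue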
20) Q&A — “Kafka vs direct ES?”
Answer: Kafka adds durability and decoupling at the cost of ops complexity; direct ES is simpler but less resilient to spikes.
21) Grok Basics
Parse unstructured logs into fields using patterns (COMMONAPACHELOG, COMBINEDAPACHELOG, custom).
filter { grok { match => { "message" => "%{COMBINEDAPACHELOG}" } } }
22) Custom Patterns
Extend grok with your own token definitions.
filter { grok { patterns_dir => ["./patterns"] match => { "message" => "%{MYAPP:msg}" } } }
23) Date Filter
Convert timestamp strings to @timestamp, handling the timezone.
filter { date { match => ["time","dd/MMM/yyyy:HH:mm:ss Z"] target => "@timestamp" } }
24) Mutate Filter
Rename, remove, convert, add fields, or lowercase/uppercase values.
filter { mutate { rename => {"host" => "[source][host]"} convert => {"bytes" => "integer"} } }
25) JSON & KV
Parse JSON or key=value pairs embedded in messages.
filter { json { source => "message" } kv { source => "kv" field_split => " " } }
26) GeoIP & UA
Enrich IPs with GeoIP data; parse user agents for device/browser/OS.
filter { geoip { source => "client_ip" } useragent { source => "agent" } }
27) Dissect vs Grok
Dissect is faster and delimiter-based; use it for well-structured tokens and grok for regex-heavy parsing.
filter { dissect { mapping => { "message" => "%{ts} %{level} %{msg}" } } }
28) Translate & DNS
Lookup/translate codes via dictionary files; resolve hostnames/IPs.
filter { translate { field => "status" destination => "status_text" dictionary => { "200" => "OK" } } }
29) Drop, Throttle, Clone
Drop noisy events, throttle by rate, or clone for multiple processing branches.
filter { if [level] == "debug" { drop { } } clone { clones => ["to_kafka"] } }
30) Q&A — “Grok too slow?”
Answer: Prefer dissect for simple separators, reduce regex backtracking, pre-filter with conditionals, and benchmark patterns.
31) Pipelines.yml
Isolate concerns: one pipeline per source or per tenant; easier to scale and deploy independently.
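A pipelines.yml sketch with two isolated pipelines (ids, paths, and overrides are assumptions):
- pipeline.id: web
  path.config: pipelines/web/*.conf
- pipeline.id: audit
  path.config: pipelines/audit/*.conf
  queue.type: persisted   # per-pipeline override; only this flow pays the disk cost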
32) Persistent Queues
Enable disk-backed queues to survive restarts and absorb spikes; tune capacity and checkpointing.
queue.type: persisted
queue.max_bytes: 4gb
33) JVM & GC
Set heap in jvm.options; monitor GC pauses; avoid over-allocating heap, which can increase GC overhead.
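A jvm.options sketch: set initial and max heap equal (4g is illustrative, not a recommendation):
-Xms4g
-Xmx4g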
34) Pipeline Workers & Batch
Increase -w for CPU-bound filters; adjust -b for I/O-bound outputs; measure end-to-end latency.
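The same knobs as CLI flags for a quick benchmark run (values illustrative):
bin/logstash -f pipeline.conf -w 8 -b 250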
35) DLQ Reprocessing
Build a pipeline to read from DLQ, fix mappings, and reindex. Tag DLQ events for audit.
36) Backpressure & Retries
Outputs may block (ES bulk). Use retries, exponential backoff, and circuit-breaker routing to Kafka/S3.
37) High Availability
Run multiple Logstash instances behind LB or via Kafka fan-in; stateless designs ease scaling.
38) Security
TLS for Beats/Kafka/HTTP; mTLS where possible; secrets via keystore; limit network exposure.
bin/logstash-keystore create
bin/logstash-keystore add S3_SECRET
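A TLS-enabled Beats input plus a keystore reference might look like this (paths are placeholders; SSL option names vary slightly across plugin versions):
input {
  beats {
    port => 5044
    ssl  => true                                    # newer plugin versions call this ssl_enabled
    ssl_certificate => "/etc/logstash/tls/ls.crt"
    ssl_key         => "/etc/logstash/tls/ls.key"
  }
}
# keystore entries are referenced as ${NAME}, e.g. secret_access_key => "${S3_SECRET}"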
39) Monitoring
Use X-Pack Monitoring or Prometheus exporters; track events in/out, queue sizes, filter durations.
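The node stats API (default port 9600) exposes per-pipeline event counts, plugin timings, and queue depth:
curl -s http://localhost:9600/_node/stats/pipelines?pretty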
40) Q&A — “Why are events delayed?”
Answer: Blocking outputs, oversized batches, slow regex filters, or GC pauses. Address by tuning outputs, using dissect, and right-sizing heap.
41) Recipe: Nginx → ES
Parse access logs, set @timestamp, add geoip, index per day.
input { file { path => "/var/log/nginx/access.log" } }
filter {
grok { match => { "message" => "%{COMBINEDAPACHELOG}" } }
date { match => ["timestamp","dd/MMM/yyyy:HH:mm:ss Z"] }
geoip { source => "clientip" }
}
output { elasticsearch { index => "nginx-%{+YYYY.MM.dd}" } }
42) Recipe: Beats fan-in + S3 archive
Ingest from Beats, enrich, and ship to ES plus a long-term S3 archive.
input { beats { port => 5044 } }
filter { mutate { add_tag => ["ingested_by_logstash"] } }
output {
elasticsearch { index => "beats-%{+YYYY.MM.dd}" }
s3 { bucket => "archive-logs" prefix => "beats/%{+YYYY}/%{+MM}/" }
}
43) Recipe: App JSON logs
JSON decode, drop debug, route errors.
filter {
json { source => "message" }
if [level] == "debug" { drop {} }
if [level] == "error" { mutate { add_tag => ["error"] } }
}
44) Recipe: Enrich from CSV
Translate codes to names from a static CSV.
filter {
translate {
field => "country_code" destination => "country_name"
dictionary_path => "/etc/logstash/countries.csv" exact => true
}
}
45) Recipe: Mask PII
Use mutate/gsub to anonymize emails and card numbers.
filter {
mutate { gsub => ["message","[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}","[EMAIL]"] }
}
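To also mask card numbers as mentioned above, a rough second pass (the 13-16 digit regex is an assumption; tune and test it against your real formats):
filter {
  mutate { gsub => ["message","\b(?:\d[ -]?){13,16}\b","[CARD]"] }   # crude PAN mask; verify before relying on it
}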
46) Recipe: DLQ Reader
Read from DLQ, fix mapping, reindex.
input { dead_letter_queue { path => "/var/lib/logstash/dead_letter_queue" commit_offsets => true } }
output { elasticsearch { index => "recovered-%{+YYYY.MM.dd}" } }
47) Recipe: Kafka bridge
Ingest raw topic, parse, fan-out to clean topic and ES.
input { kafka { topics => ["raw-logs"] } }
filter { dissect { mapping => { "message" => "%{ts} %{lvl} %{msg}" } } }
output { kafka { topic_id => "clean-logs" } elasticsearch { index => "logs-%{+YYYY.MM.dd}" } }
48) Common Pitfalls
Regex-heavy grok without anchors, unbounded multiline, ignoring backpressure, huge heap causing long GC, single pipeline doing too much.
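On the first pitfall: anchoring grok to the start of the line avoids costly backtracking on non-matching events (pattern is a generic sketch):
filter { grok { match => { "message" => "^%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}" } } }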
49) 30-Day Adoption Plan
Week 1: baseline ingestion • Week 2: parsing/PII masking • Week 3: S3 archive/Kafka decoupling • Week 4: HA, monitoring, DLQ.
50) Interview Q&A — 20 Practical Questions (Expanded)
1) Logstash vs Beats? Beats ship/forward; Logstash parses/enriches/routes and supports complex ETL.
2) When to use persistent queues? To survive ES/Kafka outages and absorb bursts; enables reliable at-least-once processing.
3) Grok vs Dissect? Dissect for delimiter-based fast parsing; grok for regex flexibility. Prefer dissect when possible.
4) How to handle multiline logs? Stitch lines as early as possible: Filebeat's multiline settings at the shipper, or the multiline codec on file/tcp inputs; avoid patterns that glue unrelated lines together.
5) What is DLQ? Dead Letter Queue stores events that failed output (e.g., ES mapping). Reprocess later.
6) How to prevent data loss? Persistent queues, Kafka buffering, idempotent outputs, retries, and checkpointing (JDBC last_run).
7) Scale strategies? Multiple pipelines, horizontal instances, Kafka fan-in, selective enrichment only where needed.
8) ES index naming best practice? Include app + date; align with ILM; avoid too many small indices.
9) Why are pipelines slow? Heavy grok, blocking outputs, tiny batch size, insufficient workers, GC pauses—profile and tune each.
10) Secure inputs? TLS/mTLS on Beats/HTTP/Kafka; keystore for secrets; network policies; JVM updates.
11) How to enrich with external data? translate filter, jdbc_streaming filter, or enrich in Kafka/ES ingest pipelines.
12) Ordering guarantees? Not strictly guaranteed end-to-end; if needed, group by key and use single-threaded paths.
13) Backpressure symptoms? Growing queues, rising latency, ES bulk rejections; add capacity, throttle, or buffer to Kafka.
14) JSON parsing errors? Validate source, use json { skip_on_invalid_json => true }, send invalid events to a quarantine index.
15) Why split pipelines? Isolation and easier scaling/troubleshooting; avoid “god pipeline”.
16) Zero-downtime changes? Rolling deploy multiple instances; use feature flags/conditionals; validate with -t first.
17) How to mask PII? mutate gsub, anonymize fields, or custom ruby filter; ensure compliance and audit.
18) Monitoring must-haves? Event rates, queue depth, filter durations, JVM heap/GC, output failures.
19) Typical troubleshooting flow? Reproduce with stdin/stdout, add the rubydebug codec, disable filters progressively, check logs/metrics.
20) When not to use Logstash? If only lightweight shipping needed (Filebeat) or when stream processing/joins/windows are required (use Kafka Streams/Flink).