Splunk Pocket Book — Uplatz
50 cards total • Readable examples • Interview Q&A included
1) What is Splunk?
Splunk is a platform for searching, monitoring, and analyzing machine data (logs, metrics, events) at scale. It ingests data from forwarders, APIs, HEC, or files, indexes it for fast retrieval, and lets you query via SPL (Search Processing Language). Core pieces are indexers (store & search), search heads (UI & query federation), and forwarders (data shippers). Use cases include security analytics (SIEM), IT operations (ITOM), observability (APM/logs/metrics), and business analytics. Splunk Enterprise is self-managed; Splunk Cloud is managed. Data goes through parsing, indexing, and search pipelines with knowledge objects enriching context.
# Example search (last 15m)
index=main sourcetype=nginx_access status>=500 | stats count by status, uri
2) Splunk Architecture Overview
In distributed Splunk, forwarders send data to indexers; search heads coordinate searches across indexers and merge results. Deployer pushes app configs to search head cluster members; Cluster Master (Cluster Manager) manages indexer clustering, replication, and fixups. Decomposition: UF/HF (forwarders) → Indexers (bucket storage) → Search Heads (SPL/UI) → Deployment components (Deployer/DS/LM). Buckets move through hot → warm → cold → frozen; retention is governed per index. Plan network paths, load balancing, and security at ingress.
# View peers from a search head (UI preferred)
| rest /services/cluster/master/peers | table label status
3) Indexes, Buckets & Retention
An index is a logical store with its own retention, volume, and access controls. Each index consists of buckets (hot/warm/cold/frozen). Hot = actively written; warm = searchable on disk; cold = older searchable data; frozen = deleted or archived to S3/HDFS per policy. Sizing indexes involves EPS (events per second), average event size, compression, and retention days. Plan for tsidx disk, raw data, and replication factors.
# indexes.conf snippet
[web]
homePath = $SPLUNK_DB/web/db
coldPath = $SPLUNK_DB/web/colddb
thawedPath = $SPLUNK_DB/web/thaweddb
maxTotalDataSizeMB = 500000
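Age-based retention sits alongside the size cap in the same stanza; a minimal sketch, assuming a ~90-day policy:
# same [web] stanza — roll buckets to frozen after ~90 days
frozenTimePeriodInSecs = 7776000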
4) Universal vs Heavy Forwarder
Universal Forwarder (UF) is a lightweight agent that forwards data; it cannot parse/transform beyond basic line-breaking/timestamps. Heavy Forwarder (HF) is a full Splunk instance used when you need parsing, filtering, routing, or modular inputs at the edge (it can run apps/add-ons and apply props.conf/transforms.conf). Prefer UF for most sources; use HF for heavy manipulation or when sourcetypes must be corrected before indexing.
# outputs.conf (UF/HF)
[tcpout]
defaultGroup = idx_group
[tcpout:idx_group]
server = idx01:9997, idx02:9997
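For guaranteed delivery, indexer acknowledgment can be enabled on the output group; whether to turn it on depends on your throughput and durability needs:
# optional, in [tcpout:idx_group]
useACK = true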
5) Data Onboarding Flow
1) Identify source (files, syslog, HEC, DB). 2) Assign sourcetype, host, and index. 3) Validate line breaking, timestamps, and character encoding. 4) Normalize fields via props.conf & transforms.conf. 5) Apply CIM mappings and lookups. 6) Validate with sample searches, dashboards, and report acceleration. Proper sourcetyping is critical; it drives field extractions, tags, and knowledge reuse across apps (ES/ITSI).
# inputs.conf (UF)
[monitor:///var/log/nginx]
sourcetype = nginx:access
index = web
6) Parsing Basics: props & transforms
props.conf defines line-breaking, timestamp extraction, and field extractions; transforms.conf performs routing, filtering, and field transforms using regex or external lookups. Placement matters: index-time rules must live on indexers/HFs; search-time extractions can live on search heads.
# props.conf
[nginx:access]
TIME_PREFIX = \[
TIME_FORMAT = %d/%b/%Y:%H:%M:%S %z
MAX_TIMESTAMP_LOOKAHEAD = 32
REPORT-extractions = nginx_fields
# transforms.conf
[nginx_fields]
REGEX = ^(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<uri>\S+) \S+" (?P<status>\d+)
7) Timestamp, Line Breaking & Character Sets
Accurate timestamps ensure correct time-based searches and bucketing. Configure TIME_PREFIX, TIME_FORMAT, and MAX_TIMESTAMP_LOOKAHEAD. For multiline events (Java stack traces), use SHOULD_LINEMERGE=false with LINE_BREAKER or event boundary logic. Specify CHARSET for encodings (e.g., UTF-16).
# props.conf (multiline)
[app:java]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)\d{4}-\d{2}-\d{2}\s
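Encoding overrides follow the same pattern; a minimal sketch with a hypothetical sourcetype:
# props.conf (encoding)
[app:utf16]
CHARSET = UTF-16LE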
8) Knowledge Objects
Knowledge objects include field extractions, lookups, tags, event types, macros, data models, and saved searches. They add semantics at search time. Manage sharing (private, app, global) and permissions carefully. Use naming conventions and version control via apps for team collaboration and reliable deployments.
# macro example (macros.conf)
[http_5xx]
definition = status>=500 status<600
iseval = 0
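Event types and tags are declared the same way; a sketch with hypothetical names:
# eventtypes.conf
[web_error]
search = index=web status>=500
# tags.conf
[eventtype=web_error]
error = enabled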
9) Role-Based Access Control
RBAC controls which indexes and knowledge objects users can access. Roles inherit capabilities and index list. Use least-privilege: separate search access (read on indexes) from admin capabilities. For multitenancy/security, isolate PII into dedicated indexes and restrict via roles and index constraints.
# authorize.conf (conceptual)
[role_engineering]
srchIndexesAllowed = web;auth;infra
srchFilter = index=web OR index=infra
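Role inheritance is declared with importRoles in the same file; the value below is the built-in user role:
# inherit search-level capabilities
importRoles = user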
10) Q&A — “Why does sourcetype matter so much?”
Answer: Sourcetype determines line-breaking, timestamp rules, default field extractions, CIM mappings, and app knowledge (ES/ITSI) that expect specific fields. A wrong sourcetype breaks parsing and makes searches unreliable, dashboards empty, and correlation rules ineffective. Always standardize sourcetypes and test with sample data before bulk onboarding.
11) SPL Basics: Search, Filter, Pipe
SPL is a pipeline language: search (retrieve events) → filter/transform → aggregate/visualize. Use base searches to scope data (index/sourcetype/time), then pipe commands like stats, eval, rex, lookup, and timechart. Keep early filters selective to reduce data volume and speed up the pipeline.
index=web sourcetype=nginx:access status>=400
| eval is_error = if(status>=500, 1, 0)
| stats count as hits, sum(is_error) as errors by uri
| eval error_rate = round(errors/hits*100,2)
12) Fields, eval & where
fields limits the fields carried forward to speed later stages. eval creates/transforms fields; where filters by expressions. Use coalesce, if, and case for flexible logic. Don’t over-compute at event scope if a later aggregation can do it once.
... | fields host, uri, status, bytes
| eval mb = bytes/1024/1024
| where status>=500 AND mb > 1
13) stats, eventstats, streamstats
stats aggregates into new events (sum, count, avg). eventstats computes aggregates and appends them to each original event (useful for ratios). streamstats computes running stats over sliding windows (event counts or time). Choose the right one to avoid unnecessary event inflation.
... | stats count as hits, avg(bytes) as avg_b by uri
... | eventstats avg(bytes) as global_avg
... | streamstats window=5 avg(bytes) as moving_avg
14) timechart & bin
timechart aggregates over _time buckets. Use span= to control bucketing and bin to pre-bucket custom fields. Use per_second or rate techniques for normalized time series.
index=web error=1
| timechart span=5m count as errors
15) rex & regex extraction
rex extracts fields using regular expressions at search time. Use named capture groups, test regex on sample searches, and prefer search-time extractions for flexibility unless index-time is mandatory.
... | rex field=_raw "user=(?<user>\w+)\s+ip=(?<ip>[\d\.]+)"
16) lookup, inputlookup & outputlookup
Lookups enrich events with external data (CSVs, KV Store, external scripts). Use automatic lookups in props.conf for sourcetype-specific enrichment. inputlookup reads lookup tables directly; outputlookup writes to them (role-restricted).
... | lookup geoip ip as client_ip OUTPUTNEW city country
| stats count by country
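The automatic-lookup variant mentioned above is configured in props.conf; a sketch assuming the geoip lookup is already defined in transforms.conf:
# props.conf (automatic lookup per sourcetype)
[nginx:access]
LOOKUP-geo = geoip ip AS client_ip OUTPUTNEW city country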
17) join, append, appendcols
Joins are expensive; avoid them for large sets. Prefer lookup, stats with by keys, or summary indexes. append unions results; appendcols aligns rows by order (fragile). If you must join, pre-filter both sides and limit fields.
index=a ... | fields id, x
| join type=inner id [ search index=b ... | fields id, y ]
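The stats-based alternative usually scales far better; a sketch assuming both indexes share an id field:
index=a OR index=b
| stats values(x) as x, values(y) as y by id
| where isnotnull(x) AND isnotnull(y)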
18) tstats & Data Models
tstats leverages accelerated data models for ultra-fast stats from tsidx files. You must build a data model and enable acceleration. Used heavily by ES & ITSI for speed on big data. Great for rollups and known schemas.
| tstats count from datamodel=Authentication where nodename=Authentication by _time span=5m, Authentication.user
19) Summary Indexing & Report Acceleration
Summary indexing stores scheduled search results into a summary index for faster reporting (e.g., hourly rollups). Report acceleration transparently accelerates certain reports. Choose summary indexes when you control schema and need reproducible aggregates with retention independent of raw data.
# savedsearches.conf (concept)
action.summary_index = 1
action.summary_index._name = summaries
20) Q&A — “When to use join vs lookup vs tstats?”
Answer: Use lookup for static/dimension data keyed by a field. Use tstats over accelerated data models for high-speed aggregates across big data. Use join only for small, filtered sets when neither lookup nor tstats fits; otherwise it’s slow and memory-hungry.
21) Indexer Clustering
Indexer clusters provide HA and scalability with replication (RF) and search factors (SF). The Cluster Manager coordinates peer indexers, bucket fixups, and rolling updates. Design RF≥2, SF≥2 for HA; consider site awareness for multi-DC. Monitor fixup queues and cluster health views to avoid search degradation.
# server.conf (concept)
[clustering]
mode = manager
replication_factor = 3
search_factor = 2
22) Search Head Clustering
SHC gives HA for the UI and search artifacts. A Deployer pushes apps to members; captain orchestrates scheduled searches. Keep captaincy stable; avoid manual app edits on members. Use KV Store replication and artifact replication carefully; version your apps.
# shcluster-config (splunk init)
splunk init shcluster-config -mgmt_uri https://sh1:8089 -replication_port 8181 -auth admin:pwd
23) Deployment Server & Deployer
Deployment Server (DS) pushes configs to forwarders (serverclasses). Deployer pushes apps to SHC members. Keep clear separation: DS for UFs/HFs; Deployer for SHC. Test app changes in dev, then promote.
# serverclass.conf
[serverClass:linux_ufs:app:web_inputs]
whitelist.0 = *web*
24) HTTP Event Collector (HEC)
HEC ingests JSON over HTTP/HTTPS, ideal for cloud apps and microservices. Supports batched events, token-based auth, and acknowledgments. Map JSON keys to fields; set a sourcetype like _json or a custom one. Validate timestamps and host fields for accuracy.
curl -k https://splunk:8088/services/collector -H "Authorization: Splunk TOKEN" \
-d '{"event":{"msg":"hello"},"sourcetype":"app:json","host":"svc1","time":1699999999}'
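Batching concatenates JSON event objects in a single request body; a sketch against the same endpoint and token:
curl -k https://splunk:8088/services/collector -H "Authorization: Splunk TOKEN" \
  -d '{"event":"e1","sourcetype":"app:json"}{"event":"e2","sourcetype":"app:json"}'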
25) Ingest Budgets & License
Splunk licenses by daily ingest (GB/day). Monitor license usage, peaks, and violations. Control ingest: filter noisy sources at UF/HF, blacklist unneeded paths, and sample verbose logs. Consider metrics indexes or OpenTelemetry for high-cardinality time series.
# transforms.conf (drop noisy)
[drop_healthchecks]
REGEX = GET /health
DEST_KEY = queue
FORMAT = nullQueue
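The drop only takes effect when the transform is referenced from props.conf on the parsing tier; a sketch assuming the nginx:access sourcetype:
# props.conf (indexers/HFs)
[nginx:access]
TRANSFORMS-drop = drop_healthchecks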
26) Performance Tuning: Searches
Constrain by index/time, reduce fields early, prefer stats over transaction, avoid large joins, and leverage tstats. Use the search job inspector to find slow stages. Summarize periodic heavy jobs to summary indexes and read from summaries during the day.
index=web earliest=-24h latest=now | fields _time, host, uri, status | stats count by status
27) Metrics vs Logs
Metrics indexes store numeric time series efficiently; use mstats and metric metadata (metric_name, dimensions). For high-cardinality labels, model dimensions carefully. Logs remain best for unstructured text and troubleshooting context. Use both for observability.
| mstats avg(cpu.utilization) span=1m by host
28) Data Model Acceleration & CIM
Splunk’s Common Information Model (CIM) normalizes fields across sources. Data models (often CIM-aligned) can be accelerated to speed searches dramatically using tstats. Maintain acceleration summaries and rebuild on schema changes. ES depends on well-mapped CIM data.
# dataset names depend on your data model (CIM Web shown)
| datamodel Web Web search
| tstats count from datamodel=Web by Web.url
29) Dashboards: SimpleXML & Dashboard Studio
SimpleXML is classic; Dashboard Studio provides modern visuals and layout control. Use base searches with post-processing to share results across panels and save compute. Apply tokens, drilldowns, and time pickers for flexible analysis. Keep searches efficient (accelerate if needed).
<search base="base1"><query>| stats count by status</query></search>
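The base itself is defined once with an id and shared by all panels; a minimal pairing (id and query are illustrative):
<search id="base1"><query>index=web sourcetype=nginx:access</query></search>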
30) Q&A — “How do I make slow dashboards fast?”
Answer: Use base searches with post-processing, scope time tightly, reduce fields early, and prefer tstats or summary indexes for panels with heavy stats. Turn on report acceleration for eligible reports, and avoid per-panel expensive joins. Cache lookups in the KV Store when appropriate.
31) Splunk Enterprise Security (ES)
ES is Splunk’s SIEM app providing correlation searches, risk-based alerting (RBA), notable events, and dashboards over CIM-normalized data. Success depends on high-quality data onboarding, CIM mapping, and data model acceleration. Tune correlation rules to your environment; reduce alert fatigue with risk scoring and suppression.
# RBA example (conceptual)
| eval risk_score = if(severity="high", 80, 30)
32) IT Service Intelligence (ITSI)
ITSI provides service models, KPIs, episodes, and predictive analytics for IT operations. KPIs roll up to services with thresholds and notable event grouping (“episodes”). Use service analyzers for impact visualization and glass tables for executive overviews. Good data hygiene (metrics/logs) is key.
# KPI search skeleton
index=infra sourcetype=telegraf:cpu | stats avg(usage_idle) as idle by host
33) Alerting & Notables
Saved searches trigger alerts via email, webhooks, ServiceNow, or ES notable events. Use throttling to reduce noise, and include rich context in payloads (host, user, drilldown URL). For high-volume alerts, group into episodes (ITSI) or RBA strategies (ES).
# savedsearches.conf (concept)
action.email = 1
action.webhook = 1
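Throttling lives in the same stanza; these are real savedsearches.conf settings with illustrative values:
alert.suppress = 1
alert.suppress.period = 10m
alert.suppress.fields = host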
34) KV Store & Lookups
KV Store (MongoDB-backed) powers dynamic lookups and stateful apps. Use for enrichment tables, watchlists, or cache. Size appropriately, back up, and secure access. Prefer KV Store for mutable data; CSV lookups for static small lists.
| inputlookup assets_kv | stats count by owner
35) REST API & SDKs
Automate Splunk via REST (management port 8089) to create searches, manage knowledge objects, or push configs. SDKs exist for Python/JS/Java. Secure with tokens and RBAC; avoid embedding admin creds in scripts.
curl -k -u admin:pwd https://splunk:8089/services/search/jobs -d search="search index=web | head 10"
36) Apps, Add-ons & TA Strategy
Splunkbase provides vendor TAs (Technology Add-ons) for common sources, delivering sourcetypes, extractions, and CIM mappings. Install the correct TA version, read release notes, and localize configs under local/ to survive upgrades. Version control your apps and promote through environments.
# App layout
default/
local/
metadata/
37) Security Hardening
Enforce TLS on management & indexing ports, rotate admin passwords, restrict management to trusted subnets/VPN, and audit role capabilities. Lock down HEC tokens. Keep Splunk and TAs patched. Separate duties: admins vs power users vs viewers. Enable auditing to track changes.
# server.conf (concept)
[sslConfig]
enableSplunkdSSL = true
38) Backups & DR
Back up configuration ($SPLUNK_HOME/etc), apps, and the KV Store. For data, rely on indexer cluster replication and frozen copies (S3/HDFS). Document restore procedures and test them. Keep Deployer/DS backups to recreate deployments quickly.
# KV Store backup (UI/CLI options exist)
splunk backup kvstore
39) Cost Optimization
Control ingest at source; drop noise early with transforms. Use metrics indexes where appropriate. Compress summaries, right-size retention per index, and archive frozen buckets to cheap storage. Educate users to write efficient searches and enforce quotas.
# props.conf (route less critical data to a low-retention index; stanza = your sourcetype, hypothetical here)
[noisy:app]
TRANSFORMS-route_low = route_low_index
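The referenced transform rewrites the index key at parse time; a sketch with hypothetical regex and target index:
# transforms.conf
[route_low_index]
REGEX = loglevel=(DEBUG|TRACE)
DEST_KEY = _MetaData:Index
FORMAT = low_retention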
40) Q&A — “Splunk Cloud vs Enterprise?”
Answer: Splunk Cloud offloads infra ops, upgrades, and scaling, with SaaS-level SLAs and guardrails; Enterprise gives full control on-prem/your cloud but you own ops. Choose Cloud for speed/ops simplicity; Enterprise for strict data residency/control or deep customization needs.
41) Q1–4: Architecture & Ingest
Q1: UF vs HF? UF is lightweight shipper; no heavy parsing. HF can parse/transform/route at edge and run apps. Prefer UF; use HF where pre-index transforms are required.
Q2: Index vs sourcetype? Index = storage & retention boundary; sourcetype = data format semantics. Both are required on every event; sourcetype drives parsing/fields.
Q3: Buckets lifecycle? Hot→Warm→Cold→Frozen. Hot = writing; warm/cold searchable; frozen deleted/archived. Retention & size are per-index.
Q4: HEC best practices? TLS, scoped tokens, batching with ack, set correct host/source/sourcetype, validate timestamps, backoff on 503.
42) Q5–8: Parsing & CIM
Q5: Index-time vs search-time extraction? Index-time affects storage (rare; risky to change later). Search-time is flexible and safer. Do index-time only when essential.
Q6: Multiline handling? Disable SHOULD_LINEMERGE, define LINE_BREAKER; or use event boundaries. Test with sample files.
Q7: Why CIM? Normalizes fields so apps (ES/ITSI) can work across vendors. Speeds correlation and dashboards.
Q8: tstats vs stats? tstats reads accelerated summaries/tsidx—much faster. stats scans raw events; flexible but slower on big data.
43) Q9–12: SPL Performance
Q9: Speed up slow searches? Constrain by index/time, reduce fields, avoid join/transaction when possible, use tstats/summary indexes.
Q10: When use transaction? To stitch events lacking a common ID (e.g., start/stop). Prefer stats+eval if you have IDs; transaction is expensive.
Q11: Lookup vs KV Store? CSV for static small tables; KV Store for mutable, larger, or API-driven enrichments.
Q12: Base search + post-process? Share a heavy base search; panels run lightweight post-processing to save compute.
44) Q13–16: Admin & Scaling
Q13: RF/SF meaning? Replication Factor (copies of raw data) and Search Factor (searchable copies). RF≥2, SF≥2 for HA.
Q14: SHC app deployment? Use Deployer; never edit members directly. Keep app versions in VCS and promote via pipelines.
Q15: License violations? Exceed daily ingest → violation window; repeated violations restrict searching. Fix by reducing ingest or increasing license, and wait out the window.
Q16: Data privacy? Route PII to restricted indexes, mask at ingest with transforms, and enforce RBAC roles.
45) Q17–20: Dashboards, Alerts, Ops
Q17: Fast dashboards? Base searches, post-processing, tstats/acceleration, and narrow time ranges. Avoid per-panel big joins.
Q18: Alert noise? Throttle, deduplicate, group episodes (ITSI), and use RBA in ES. Include rich context for triage.
Q19: Metrics vs logs choice? Metrics for numeric TS with fixed dims (fast & cheap); logs for detailed context. Use both.
Q20: Common onboarding mistakes? Wrong sourcetypes, missing timestamps, multiline errors, no CIM mapping, and unbounded ingest causing license pain.
46) Cheat: Time Constraints
Use earliest/latest for speed: earliest=-15m latest=now, @d for day boundaries. Favor relative times in saved searches.
index=web earliest=-1h@h latest=@h | stats count
47) Cheat: Field Normalization
Use coalesce and rename to normalize variants across sources before aggregation or joins.
... | eval user=coalesce(user, username, usr)
| rename clientip as src_ip
48) Cheat: Thresholding & Anomaly Hints
Compute moving baselines with streamstats/eventstats and trigger alerts on deviations to reduce static-threshold noise.
... | timechart span=5m count as c
| streamstats window=12 avg(c) as avg stdev(c) as sd
| where c > avg + 3*sd
49) Cheat: Summary Index Pattern
Schedule heavy job hourly to write summaries; dashboards read summaries for sub-second panels while raw data remains for drilldowns.
# write summary
... | stats count by uri | collect index=summaries sourcetype=web_sum
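Dashboards then read the rollup instead of raw events; a sketch assuming the index/sourcetype above:
# read summary
index=summaries sourcetype=web_sum | stats sum(count) as hits by uri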
50) Final Tips
Get sourcetypes and timestamps perfect, design indexes with clear retention, use CIM for app compatibility, prefer tstats/acceleration for scale, and keep searches scoped. Treat Splunk as a shared platform: version control apps, guard ingest, and build a catalog of reusable knowledge objects and macros.
# macro usage
`http_5xx`
| stats count by uri