Splunk Pocket Book — Uplatz

50 cards total • Wide 3-column layout • Readable examples • Interview Q&A included

Section 1 — Splunk Core Concepts

1) What is Splunk?

Splunk is a platform for searching, monitoring, and analyzing machine data (logs, metrics, events) at scale. It ingests data from forwarders, APIs, HEC, or files, indexes it for fast retrieval, and lets you query via SPL (Search Processing Language). Core pieces are indexers (store & search), search heads (UI & query federation), and forwarders (data shippers). Use cases include security analytics (SIEM), IT operations (ITOM), observability (APM/logs/metrics), and business analytics. Splunk Enterprise is self-managed; Splunk Cloud is managed. Data goes through parsing, indexing, and search pipelines with knowledge objects enriching context.

# Example search (last 15m)
index=main sourcetype=nginx_access status>=500 earliest=-15m | stats count by status, uri

2) Splunk Architecture Overview

In distributed Splunk, forwarders send data to indexers; search heads coordinate searches across indexers and merge results. Deployer pushes app configs to search head cluster members; Cluster Master (Cluster Manager) manages indexer clustering, replication, and fixups. Decomposition: UF/HF (forwarders) → Indexers (bucket storage) → Search Heads (SPL/UI) → Deployment components (Deployer/DS/LM). Buckets move through hot → warm → cold → frozen; retention is governed per index. Plan network paths, load balancing, and security at ingress.

# View peers from a search head (UI preferred)
| rest /services/cluster/master/peers | table label, status

3) Indexes, Buckets & Retention

An index is a logical store with its own retention, volume, and access controls. Each index consists of buckets (hot/warm/cold/frozen). Hot = actively written; warm = searchable on disk; cold = older searchable data; frozen = deleted or archived to S3/HDFS per policy. Sizing indexes involves EPS (events per second), average event size, compression, and retention days. Plan for tsidx disk, raw data, and replication factors.

# indexes.conf snippet
[web]
homePath   = $SPLUNK_DB/web/db
coldPath   = $SPLUNK_DB/web/colddb
thawedPath = $SPLUNK_DB/web/thaweddb
maxTotalDataSizeMB = 500000
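
A back-of-envelope sizing sketch (assumed numbers; ~50% on disk is the common rule of thumb for compressed rawdata plus tsidx):

# 2,000 EPS x 400 bytes avg      ≈ 69 GB/day raw ingest
# ~50% on disk (rawdata + tsidx) ≈ 35 GB/day per copy
# x 90 days retention x RF=2     ≈ 6.3 TB cluster disk
# retention knob in indexes.conf: frozenTimePeriodInSecs = 7776000 (90 days)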

4) Universal vs Heavy Forwarder

Universal Forwarder (UF) is a lightweight agent that forwards raw data; it performs no event parsing (structured inputs using INDEXED_EXTRACTIONS are the exception). Heavy Forwarder (HF) is a full Splunk instance used when you need parsing, filtering, routing, or modular inputs at the edge (it can run apps/add-ons and apply props.conf/transforms.conf). Prefer UF for most sources; use HF for heavy manipulation or when sourcetypes must be corrected before indexing.

# outputs.conf (UF/HF)
[tcpout]
defaultGroup = idx_group
[tcpout:idx_group]
server = idx01:9997, idx02:9997

5) Data Onboarding Flow

1) Identify the source (files, syslog, HEC, DB).
2) Assign sourcetype, host, and index.
3) Validate line breaking, timestamps, and character encoding.
4) Normalize fields via props.conf & transforms.conf.
5) Apply CIM mappings and lookups.
6) Validate with sample searches, dashboards, and report acceleration.

Proper sourcetyping is critical; it drives field extractions, tags, and knowledge reuse across apps (ES/ITSI).

# inputs.conf (UF)
[monitor:///var/log/nginx]
sourcetype = nginx:access
index = web

6) Parsing Basics: props & transforms

props.conf defines line-breaking, timestamp extraction, and field extractions; transforms.conf performs routing, filtering, and field transforms using regex or external lookups. Placement matters: index-time rules must live on indexers/HFs; search-time extractions can live on search heads.

# props.conf
[nginx:access]
TIME_PREFIX = \[
TIME_FORMAT = %d/%b/%Y:%H:%M:%S %z
MAX_TIMESTAMP_LOOKAHEAD = 32
REPORT-extractions = nginx_fields

# transforms.conf
[nginx_fields]
REGEX = ^(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<uri>\S+) \S+" (?P<status>\d+)

7) Timestamp, Line Breaking & Character Sets

Accurate timestamps ensure correct time-based searches and bucketing. Configure TIME_PREFIX, TIME_FORMAT, and MAX_TIMESTAMP_LOOKAHEAD. For multiline events (e.g., Java stack traces), set SHOULD_LINEMERGE=false and define LINE_BREAKER, or use event-boundary settings such as BREAK_ONLY_BEFORE. Specify CHARSET for non-default encodings (e.g., UTF-16).

# props.conf (multiline)
[app:java]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)\d{4}-\d{2}-\d{2}\s
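
And for a non-default encoding, a minimal sketch (the stanza name is illustrative):

# props.conf (encoding)
[vendor:utf16_export]
CHARSET = UTF-16LE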

8) Knowledge Objects

Knowledge objects include field extractions, lookups, tags, event types, macros, data models, and saved searches. They add semantics at search time. Manage sharing (private, app, global) and permissions carefully. Use naming conventions and version control via apps for team collaboration and reliable deployments.

# macro example (macros.conf)
[http_5xx]
definition = status>=500 status<600
iseval = 0

9) Role-Based Access Control

RBAC controls which indexes and knowledge objects users can access. Roles inherit capabilities and index list. Use least-privilege: separate search access (read on indexes) from admin capabilities. For multitenancy/security, isolate PII into dedicated indexes and restrict via roles and index constraints.

# authorize.conf (conceptual)
[role_engineering]
srchIndexesAllowed = web;auth;infra
# srchFilter can further restrict results within the allowed indexes:
srchFilter = index=web OR index=infra

10) Q&A — “Why does sourcetype matter so much?”

Answer: Sourcetype determines line-breaking, timestamp rules, default field extractions, CIM mappings, and app knowledge (ES/ITSI) that expect specific fields. A wrong sourcetype breaks parsing and makes searches unreliable, dashboards empty, and correlation rules ineffective. Always standardize sourcetypes and test with sample data before bulk onboarding.

Section 2 — SPL Fundamentals & Patterns

11) SPL Basics: Search, Filter, Pipe

SPL is a pipeline language: search (retrieve events) → filter/transform → aggregate/visualize. Use base searches to scope data (index/sourcetype/time). Then pipe commands like stats, eval, rex, lookup, and timechart. Keep early filters selective to reduce data volume and speed up the pipeline.

index=web sourcetype=nginx:access status>=400
| eval is_error = if(status>=500, 1, 0)
| stats count as hits, sum(is_error) as errors by uri
| eval error_rate = round(errors/hits*100,2)

12) Fields, eval & where

fields limits fields to speed later stages. eval creates/transforms fields; where filters by expressions. Use coalesce, if, and case for flexible logic. Don’t over-compute at event scope if a later aggregation can do it once.

... | fields host, uri, status, bytes
| eval mb = bytes/1024/1024
| where status>=500 AND mb > 1
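
A sketch of case for tiered logic (true() is the catch-all default; mb follows the example above):

... | eval size_tier = case(mb>100, "huge", mb>1, "large", true(), "small")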

13) stats, eventstats, streamstats

stats aggregates events into a new result set (sum, count, avg). eventstats computes aggregates and appends them to each original event (useful for ratios). streamstats computes running stats over a sliding window of preceding events. Choose the right one to avoid unnecessary event inflation.

... | stats count as hits, avg(bytes) as avg_b by uri
... | eventstats avg(bytes) as global_avg
... | streamstats window=5 avg(bytes) as moving_avg

14) timechart & bin

timechart aggregates over _time buckets. Use span= to control bucketing and bin to pre-bucket for custom fields. Use per_second or rate techniques for normalized time series.

index=web error=1
| timechart span=5m count as errors
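
Two sketches for custom-field bucketing and normalized rates (a numeric bytes field is assumed):

... | bin bytes span=1000 | stats count by bytes
index=web | timechart span=1m per_second(bytes) as bytes_per_sec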

15) rex & regex extraction

rex extracts fields using regular expressions at search time. Use named capture groups, test regex in sample searches, and prefer search-time extractions for flexibility unless index-time is mandatory.

... | rex field=_raw "user=(?<user>\w+)\s+ip=(?<ip>[\d\.]+)"

16) lookup, inputlookup & outputlookup

Lookups enrich events with external data (CSVs, KV Store, external scripts). Use automatic lookups in props.conf for sourcetype-specific enrichment. inputlookup reads lookup tables directly; outputlookup writes to them (role-restricted).

... | lookup geoip ip as client_ip OUTPUTNEW city country
| stats count by country
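
The automatic version enriches every nginx:access event without an explicit lookup command (a lookup definition named geoip is assumed):

# props.conf
[nginx:access]
LOOKUP-geo = geoip ip AS client_ip OUTPUTNEW city country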

17) join, append, appendcols

Joins are expensive; avoid them for large sets. Prefer lookup, stats with by keys, or summary indexes. append unions results; appendcols aligns rows by order (fragile). If you must join, pre-filter both sides and limit fields.

index=a ... | fields id, x
| join type=inner id [ search index=b ... | fields id, y ]
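
The stats-based alternative usually scales far better than join (a shared id key on both sides is assumed):

(index=a ...) OR (index=b ...)
| fields id, x, y
| stats values(x) as x, values(y) as y by id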

18) tstats & Data Models

tstats computes stats directly from tsidx files, including accelerated data model summaries, making it far faster than scanning raw events. For data model queries, build the model and enable acceleration first. ES & ITSI rely on it heavily for speed on big data. Great for rollups and known schemas.

| tstats count from datamodel=Authentication by _time span=5m, Authentication.user

19) Summary Indexing & Report Acceleration

Summary indexing stores scheduled search results into a summary index for faster reporting (e.g., hourly rollups). Report acceleration transparently accelerates certain reports. Choose summary indexes when you control schema and need reproducible aggregates with retention independent of raw data.

# savedsearches.conf (concept)
action.summary_index = 1
action.summary_index._name = summaries

20) Q&A — “When to use join vs lookup vs tstats?”

Answer: Use lookup for static/dimension data keyed by a field. Use tstats over accelerated data models for high-speed aggregates across big data. Use join only for small, filtered sets when neither lookup nor tstats fits; otherwise it’s slow and memory-hungry.

Section 3 — Admin, Ingest, Performance & Reliability

21) Indexer Clustering

Indexer clusters provide HA and scalability with replication (RF) and search factors (SF). The Cluster Manager coordinates peer indexers, bucket fixups, and rolling updates. Design RF≥2, SF≥2 for HA; consider site awareness for multi-DC. Monitor fixup queues and cluster health views to avoid search degradation.

# server.conf (concept)
[clustering]
mode = manager
replication_factor = 3
search_factor = 2
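
Peers point at the manager with a matching secret; a minimal sketch (hostname and key are placeholders):

# server.conf on each indexer peer
[clustering]
mode = peer
manager_uri = https://cm:8089
pass4SymmKey = changeme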

22) Search Head Clustering

SHC gives HA for the UI and search artifacts. A Deployer pushes apps to members; captain orchestrates scheduled searches. Keep captaincy stable; avoid manual app edits on members. Use KV Store replication and artifact replication carefully; version your apps.

# shcluster-config (splunk init)
splunk init shcluster-config -mgmt_uri https://sh1:8089 -replication_port 8181 -auth admin:pwd
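
After initializing every member, bootstrap the first captain once (member list and credentials are placeholders):

splunk bootstrap shcluster-captain -servers_list "https://sh1:8089,https://sh2:8089,https://sh3:8089" -auth admin:pwd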

23) Deployment Server & Deployer

Deployment Server (DS) pushes configs to forwarders (serverclasses). Deployer pushes apps to SHC members. Keep clear separation: DS for UFs/HFs; Deployer for SHC. Test app changes in dev, then promote.

# serverclass.conf
[serverClass:linux_ufs:app:web_inputs]
whitelist.0 = *web*

24) HTTP Event Collector (HEC)

HEC ingests JSON over HTTP/HTTPS, ideal for cloud apps and microservices. Supports batched events, token-based auth, and acknowledgments. Map JSON keys to fields; set a sourcetype like _json or custom. Validate timestamps and host fields for accuracy.

curl -k https://splunk:8088/services/collector -H "Authorization: Splunk TOKEN" \
 -d '{"event":{"msg":"hello"},"sourcetype":"app:json","host":"svc1","time":1699999999}'

25) Ingest Budgets & License

Splunk licenses by daily ingest (GB/day). Monitor license usage, peaks, and violations. Control ingest: filter noisy sources at UF/HF, blacklist unneeded paths, and sample verbose logs. Consider metrics indexes or OpenTelemetry for high-cardinality time series.

# transforms.conf (drop noisy events; note: no quotes around the regex)
[drop_healthchecks]
REGEX = GET /health
DEST_KEY = queue
FORMAT = nullQueue

# props.conf (bind the transform to a sourcetype)
[nginx:access]
TRANSFORMS-drop = drop_healthchecks
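
To watch daily usage against your license, the internal license log is the standard source (b is bytes ingested):

index=_internal source=*license_usage.log type=Usage
| timechart span=1d sum(b) as bytes
| eval GB = round(bytes/1024/1024/1024, 2)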

26) Performance Tuning: Searches

Constrain by index/time, reduce fields early, prefer stats over transaction, avoid large joins, and leverage tstats. Use search job inspector to find slow stages. Summarize periodic heavy jobs to summary indexes and read from summaries during the day.

index=web earliest=-24h latest=now | fields _time, host, uri, status | stats count by status

27) Metrics vs Logs

Metrics indexes store numeric time series efficiently; use mstats and metric metadata (metric_name, dimensions). For high-cardinality labels, use careful dimension modeling. Logs remain best for unstructured text and troubleshooting context. Use both for observability.

# assumes a metrics index named "metrics"
| mstats avg(cpu.utilization) WHERE index=metrics span=1m BY host

28) Data Model Acceleration & CIM

Splunk’s Common Information Model (CIM) normalizes fields across sources. Data models (often CIM-aligned) can be accelerated to speed searches dramatically using tstats. Maintain acceleration summaries and rebuild on schema changes. ES depends on well-mapped CIM data.

| datamodel Web Web search
| tstats count from datamodel=Web by Web.url

29) Dashboards: SimpleXML & Dashboard Studio

SimpleXML is classic; Dashboard Studio provides modern visuals and layout control. Use base searches with post-processing to share results across panels and save compute. Apply tokens, drilldowns, and time pickers for flexible analysis. Keep searches efficient (accelerate if needed).

<search base="base1"><query>index=web | stats count by status</query></search>

30) Q&A — “How do I make slow dashboards fast?”

Answer: Use base searches with post-processing, scope time tightly, reduce fields early, and prefer tstats or summary indexes for panels with heavy stats. Turn on report acceleration for eligible reports, and avoid per-panel expensive joins. Cache lookups as KV Store when appropriate.

Section 4 — Security & IT Ops Apps, Extensibility

31) Splunk Enterprise Security (ES)

ES is Splunk’s SIEM app providing correlation searches, risk-based alerting (RBA), notable events, and dashboards over CIM-normalized data. Success depends on high-quality data onboarding, CIM mapping, and data model acceleration. Tune correlation rules to your environment; reduce alert fatigue with risk scoring and suppression.

# RBA example (conceptual)
| eval risk_score = if(severity="high", 80, 30)

32) IT Service Intelligence (ITSI)

ITSI provides service models, KPIs, episodes, and predictive analytics for IT operations. KPIs roll up to services with thresholds and notable event grouping (“episodes”). Use service analyzers for impact visualization and glass tables for executive overviews. Good data hygiene (metrics/logs) is key.

# KPI search skeleton
index=infra sourcetype=telegraf:cpu | stats avg(usage_idle) as idle by host

33) Alerting & Notables

Saved searches trigger alerts via email, webhooks, ServiceNow, or ES notable events. Use throttling to reduce noise, and include rich context in payloads (host, user, drilldown URL). For high-volume alerts, group into episodes (ITSI) or RBA strategies (ES).

# savedsearches.conf (concept)
action.email = 1
action.webhook = 1
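
Throttling lives alongside the actions in savedsearches.conf; a sketch that suppresses repeat alerts per host for 10 minutes:

alert.suppress = 1
alert.suppress.period = 10m
alert.suppress.fields = host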

34) KV Store & Lookups

KV Store (MongoDB-backed) powers dynamic lookups and stateful apps. Use for enrichment tables, watchlists, or cache. Size appropriately, back up, and secure access. Prefer KV Store for mutable data; CSV lookups for static small lists.

| inputlookup assets_kv | stats count by owner
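
Writing back is just as direct; a sketch assuming the same KV Store lookup definition (append=true adds/updates rather than replacing):

... | outputlookup assets_kv append=true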

35) REST API & SDKs

Automate Splunk via REST (management port 8089) to create searches, manage knowledge objects, or push configs. SDKs exist for Python/JS/Java. Secure with tokens and RBAC; avoid embedding admin creds in scripts.

curl -k -u admin:pwd https://splunk:8089/services/search/jobs -d search="search index=web | head 10"

36) Apps, Add-ons & TA Strategy

Splunkbase provides vendor TAs (Technology Add-ons) for common sources, delivering sourcetypes, extractions, and CIM mappings. Install the correct TA version, read release notes, and localize configs under local/ to survive upgrades. Version control your apps and promote through environments.

# App layout
default/
local/
metadata/

37) Security Hardening

Enforce TLS on management & indexing ports, rotate admin passwords, restrict management to trusted subnets/VPN, and audit role capabilities. Lock down HEC tokens. Keep Splunk and TAs patched. Separate duties: admins vs power users vs viewers. Enable auditing to track changes.

# server.conf (concept)
[sslConfig]
enableSplunkdSSL = true

38) Backups & DR

Back up configuration ($SPLUNK_HOME/etc), apps, and KV Store. For data, rely on indexer cluster replication and frozen copies (S3/HDFS). Document restore procedures and test them. Keep Deployer/DS backups to recreate deployments quickly.

# KV Store backup (UI/CLI options exist)
splunk backup kvstore

39) Cost Optimization

Control ingest at source; drop noise early with transforms. Use metrics indexes where appropriate. Compress summaries, right-size retention per index, and archive frozen buckets to cheap storage. Educate users to write efficient searches and enforce quotas.

# props.conf — route a less critical sourcetype (illustrative) to a low-retention index
[app:debug]
TRANSFORMS-route_low = route_low_index
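
The referenced transforms stanza rewrites the index key at parse time; a minimal sketch (assumes an index named low_retention exists):

# transforms.conf
[route_low_index]
REGEX = .
DEST_KEY = _MetaData:Index
FORMAT = low_retention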

40) Q&A — “Splunk Cloud vs Enterprise?”

Answer: Splunk Cloud offloads infra ops, upgrades, and scaling, with SaaS-level SLAs and guardrails; Enterprise gives full control on-prem/your cloud but you own ops. Choose Cloud for speed/ops simplicity; Enterprise for strict data residency/control or deep customization needs.

Section 5 — Interview Q&A (20 Questions)

41) Q1–4: Architecture & Ingest

Q1: UF vs HF? UF is lightweight shipper; no heavy parsing. HF can parse/transform/route at edge and run apps. Prefer UF; use HF where pre-index transforms are required.

Q2: Index vs sourcetype? Index = storage & retention boundary; sourcetype = data format semantics. Both are required on every event; sourcetype drives parsing/fields.

Q3: Buckets lifecycle? Hot→Warm→Cold→Frozen. Hot = writing; warm/cold searchable; frozen deleted/archived. Retention & size are per-index.

Q4: HEC best practices? TLS, scoped tokens, batching with ack, set correct host/source/sourcetype, validate timestamps, backoff on 503.

42) Q5–8: Parsing & CIM

Q5: Index-time vs search-time extraction? Index-time affects storage (rare; risky to change later). Search-time is flexible and safer. Do index-time only when essential.

Q6: Multiline handling? Disable SHOULD_LINEMERGE, define LINE_BREAKER; or use event boundaries. Test with sample files.

Q7: Why CIM? Normalizes fields so apps (ES/ITSI) can work across vendors. Speeds correlation and dashboards.

Q8: tstats vs stats? tstats reads accelerated summaries/tsidx—much faster. stats scans raw events; flexible but slower on big data.

43) Q9–12: SPL Performance

Q9: Speed up slow searches? Constrain by index/time, reduce fields, avoid join/transaction when possible, use tstats/summary indexes.

Q10: When use transaction? To stitch events lacking a common ID (e.g., start/stop). Prefer stats+eval if you have IDs; transaction is expensive.

Q11: Lookup vs KV Store? CSV for static small tables; KV Store for mutable, larger, or API-driven enrichments.

Q12: Base search + post-process? Share a heavy base search; panels run lightweight post-processing to save compute.

44) Q13–16: Admin & Scaling

Q13: RF/SF meaning? Replication Factor (copies of raw data) and Search Factor (searchable copies). RF≥2, SF≥2 for HA.

Q14: SHC app deployment? Use Deployer; never edit members directly. Keep app versions in VCS and promote via pipelines.

Q15: License violations? Exceed daily ingest → violation window; repeated violations restrict searching. Fix by reducing ingest or increasing license, and wait out the window.

Q16: Data privacy? Route PII to restricted indexes, mask at ingest with transforms, and enforce RBAC roles.

45) Q17–20: Dashboards, Alerts, Ops

Q17: Fast dashboards? Base searches, post-processing, tstats/acceleration, and narrow time ranges. Avoid per-panel big joins.

Q18: Alert noise? Throttle, deduplicate, group episodes (ITSI), and use RBA in ES. Include rich context for triage.

Q19: Metrics vs logs choice? Metrics for numeric TS with fixed dims (fast & cheap); logs for detailed context. Use both.

Q20: Common onboarding mistakes? Wrong sourcetypes, missing timestamps, multiline errors, no CIM mapping, and unbounded ingest causing license pain.

46) Cheat: Time Constraints

Use earliest/latest for speed: earliest=-15m latest=now, @d for day boundaries. Favor relative times in saved searches.

index=web earliest=-1h@h latest=@h | stats count

47) Cheat: Field Normalization

Use coalesce and rename to normalize variants across sources before aggregation or joins.

... | eval user=coalesce(user, username, usr)
| rename clientip as src_ip

48) Cheat: Thresholding & Anomaly Hints

Compute moving baselines with streamstats/eventstats and trigger alerts on deviations to reduce static threshold noise.

... | timechart span=5m count as c
| streamstats window=12 current=false avg(c) as mean stdev(c) as sd
| where c > mean + 3*sd

49) Cheat: Summary Index Pattern

Schedule heavy job hourly to write summaries; dashboards read summaries for sub-second panels while raw data remains for drilldowns.

# write summary
... | stats count by uri | collect index=summaries sourcetype=web_sum
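
The read side stays cheap because it scans only the rollups (field names follow the collect example above):

# read summary (dashboard panel)
index=summaries sourcetype=web_sum | stats sum(count) as hits by uri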

50) Final Tips

Get sourcetypes and timestamps perfect, design indexes with clear retention, use CIM for app compatibility, prefer tstats/acceleration for scale, and keep searches scoped. Treat Splunk as a shared platform: version control apps, guard ingest, and build a catalog of reusable knowledge objects and macros.

# macro usage
index=web `http_5xx`
| stats count by uri