Datadog Pocket Book — Uplatz
50 deep-dive flashcards • Single column • Fewer scrolls • 20+ Interview Q&A • Readable code examples
1) What is Datadog?
Datadog is a cloud-first observability and security platform providing metrics, traces (APM), logs, RUM, synthetics, infrastructure monitoring, security monitoring, and dashboards in one place. It integrates with clouds, containers, databases, and apps with rich tagging for slicing and correlation.
# Install agent (Linux quickstart)
DD_API_KEY=<key> bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"
2) Why Datadog? Strengths & Tradeoffs
Strengths: unified telemetry, strong integrations, fast UI, powerful tagging, anomaly detection, SLOs, and managed backend. Tradeoffs: cost at scale, vendor lock-in, sampling tradeoffs. Mitigate with tags hygiene, retention tiers, and sampling/filters.
# Verify agent status
sudo datadog-agent status
3) Core Data Types
Metrics (gauges, counters, histograms), Logs (ingest, process, index), Traces (APM spans), Profiles, RUM (browser/app), and Events. Correlate via tags like service, env, version, host, and region.
# DogStatsD metric (Python)
from datadog import initialize, statsd
initialize(statsd_host='127.0.0.1', statsd_port=8125)
statsd.increment('orders.created', tags=['service:checkout','env:prod'])
4) Tags & Facets
Tags are key-value labels for grouping/filtering across telemetry. Standardize service, env, version, and team. Promote important attributes to facets in logs for fast filtering and analytics.
# Agent config (datadog.yaml)
tags:
- env:prod
- region:ap-south-1
5) Dashboards
Build timeboards/screenboards with widgets (timeseries, query values, toplists, heatmaps, geomaps). Use template variables bound to tags for reuse across services/envs.
# Create dashboard (API snippet)
curl -X POST "https://api.datadoghq.com/api/v1/dashboard" \
-H "DD-API-KEY: $DD_API_KEY" -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
-H "Content-Type: application/json" -d @dashboard.json
6) Monitors
Monitors alert on metrics, logs, traces, synthetics, RUM, and more. Use threshold, anomaly, forecast, outlier, composite monitors. Route notifications to on-call and ChatOps channels.
# Simple metric monitor (API)
curl -X POST "https://api.datadoghq.com/api/v1/monitor" \
-H "DD-API-KEY: $DD_API_KEY" -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
-d '{"type":"metric alert","query":"avg(last_5m):avg:system.cpu.user{env:prod}>80","name":"High CPU","message":"@pagerduty"}'
7) APM Tracing
Datadog APM captures distributed traces, spans, and service maps. Auto-instrument popular frameworks; set service, env, and version for deployment tracking and correlation to logs/metrics.
# Node.js tracer
npm i dd-trace
DD_SERVICE=checkout DD_ENV=prod node -r dd-trace/init app.js
8) Log Management
Ship logs via agent or forwarders, parse with pipelines, remap fields, mask PII, and index selectively. Use exclusion filters and archives (S3/GCS) to control cost.
# Agent log config (conf.d/myapp.d/conf.yaml)
logs:
  - type: file
    path: /var/log/myapp/*.log
    service: checkout
    source: nodejs
9) Synthetics & RUM
Synthetics: HTTP/browser tests, multistep user journeys, and APIs. RUM: real-user browser/mobile telemetry (errors, resources, vitals). Correlate RUM sessions to backend traces.
<!-- RUM Browser SDK -->
<script src="https://www.datadoghq-browser-agent.com/datadog-rum.js"></script>
<script>DD_RUM.init({ applicationId:'...', clientToken:'...', site:'datadoghq.com', service:'web', env:'prod' })</script>
10) Q&A — “What makes Datadog ‘unified’?”
Answer: A single backend and UI for metrics, traces, logs, and user telemetry with consistent tags enables cross-correlation: find a spike in metrics, jump to related traces, then to logs and RUM sessions—all filtered by the same tags.
11) Agent & Integrations
Agent runs on hosts/containers and collects metrics/logs/traces. 700+ integrations (Nginx, Postgres, Kafka, Redis). Use conf.d/<integration>.d/ to configure checks and autodiscovery on Kubernetes.
# Postgres integration (conf.d/postgres.d/conf.yaml)
init_config:
instances:
  - host: 127.0.0.1
    port: 5432
    dbm: true
12) Kubernetes
Deploy the Datadog cluster agent + node agents via Helm/manifest. Autodiscovery via annotations; tag with kube_* and pod_labels. Collect container logs, metrics, APM, and process checks.
# Helm install (example)
helm repo add datadog https://helm.datadoghq.com
helm install dd datadog/datadog -f values.yaml
13) DogStatsD & StatsD
Submit custom metrics via UDP/UDS to the local agent. Supports histograms/distributions for percentiles. Always include consistent tags for aggregation.
# Go example
statsd, _ := statsd.New("unix:///var/run/datadog/dsd.socket")
statsd.Histogram("latency.ms", 123, []string{"service:api","env:prod"}, 1)
14) APM Sampling & Retention
Use head-based sampling at the agent; configure rules to keep errors and high-latency traces. Enable ingestion controls and retention filters for cost and compliance.
# datadog.yaml (APM)
apm_config:
  analyzed_spans:
    checkout|http.request: 1
15) Log Pipelines
Parse, remap, and enrich logs. Grok/JSON processors, status remapper, date remapper, geo-IP, and URL parsers. Promote fields as facets; mask PII with redaction.
# Example pipeline grok rule (conceptual)
access_rule %{ip:client} - - \[%{date("dd/MMM/yyyy:HH:mm:ss Z"):timestamp}\] "%{word:method} %{notSpace:path} HTTP/%{number:http_version}" %{number:status}
16) SLOs & SLIs
Define SLOs on monitors or events (time-based or request-based). Track burn rate with multi-window, multi-burn alerts. Use tag-scoped SLOs per service/env.
# API sketch: create SLO on monitor IDs
curl -X POST https://api.datadoghq.com/api/v1/slo -d @slo.json
17) Security Monitoring
Detect threats from logs/cloud/configs with rules and signals. Use out-of-the-box rules for common threats and customize for your environment. Route to SIEM/SOAR.
# Example detection rule (conceptual JSON)
{ "name":"Suspicious Login", "query":"source:auth status:failed country:!IN", "type":"log_detection" }
18) Database Monitoring (DBM)
Deep visibility into DB performance: query samples, plans, wait events. Requires agent integration and permissions. Tag queries by service/env.
# Enable DBM in postgres conf
instances:
  - host: 127.0.0.1
    port: 5432
    dbm: true
19) Error Tracking & Profiling
Aggregate exceptions from APM/logs and link to code versions and deployments. Continuous Profiler captures CPU/memory/lock profiles in prod with low overhead.
# Python profiler
pip install ddtrace
DD_PROFILING_ENABLED=true ddtrace-run gunicorn app:app
20) Q&A — “Metrics vs Logs vs Traces?”
Answer: Metrics are numeric time-series for fast aggregates and alerts; logs are detailed records for forensics; traces show request flow across services. Use all three with consistent tags for rapid triage.
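A small illustrative sketch of the same point (assuming the datadog and ddtrace Python packages and a local agent; names are hypothetical), emitting one checkout failure as a metric, a log, and a span carrying the same tags:
import logging
from datadog import initialize, statsd   # DogStatsD client (metrics)
from ddtrace import tracer               # APM tracer (traces)

initialize(statsd_host='127.0.0.1', statsd_port=8125)
log = logging.getLogger('checkout')
TAGS = ['service:checkout', 'env:prod']

with tracer.trace('checkout.charge', service='checkout') as span:          # trace: request flow
    span.set_tag('env', 'prod')
    statsd.increment('checkout.charge.failed', tags=TAGS)                  # metric: fast aggregate/alert
    log.error('charge failed', extra={'service': 'checkout', 'env': 'prod'})  # log: forensic detail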
21) Alert Design Patterns
Prefer symptom-based alerts (user impact) over noise (CPU blips). Use anomaly/forecast monitors for seasonality, composite monitors to gate alerts, and mute windows during maintenance.
# Composite monitor idea
(A) errors_rate > 5% AND (B) p99_latency > 1s AND (C) traffic > 100 rps
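A hedged API sketch of the same gate (monitor IDs 111/222/333 and the @oncall handle are hypothetical placeholders for the error-rate, latency, and traffic monitors):
import os
import requests

# Composite monitor: alert only when all three sub-monitors are triggered
composite = {
    'type': 'composite',
    'name': 'Checkout degradation (errors AND latency AND traffic)',
    'query': '111 && 222 && 333',
    'message': 'Checkout degraded for real traffic. @oncall',
}
resp = requests.post(
    'https://api.datadoghq.com/api/v1/monitor',
    headers={'DD-API-KEY': os.environ['DD_API_KEY'],
             'DD-APPLICATION-KEY': os.environ['DD_APP_KEY'],
             'Content-Type': 'application/json'},
    json=composite, timeout=30,
)
resp.raise_for_status()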
22) Anomaly & Forecast
Datadog models time series to detect deviations and forecast future levels, handling weekly/daily seasonality. Great for capacity planning and early warning.
# Anomaly query example
anomalies(avg:system.load.1{env:prod}, 'basic', 2)
23) Sampling Strategies
For high-traffic services, sample traces/logs while keeping error/slow traces at 100%. Use agent-side filters and retain full fidelity during incidents via dynamic rules.
# datadog.yaml (APM trace filtering)
apm_config:
  ignore_resources: ["GET /healthz"]
# Log sampling is typically applied via index exclusion filters (which can keep a percentage of matching logs)
24) Incident Management
Use Monitors → Incidents with templates, roles, and timelines. Capture evidence (graphs, logs, traces). Automate Slack/PagerDuty notifications and postmortem exports.
# API: create an incident (sketch)
curl -X POST https://api.datadoghq.com/api/v2/incidents -d @incident.json
25) SLO Burn Rate Alerts
Multi-window burn alerts detect fast/slow burns. Example: 1h/6h windows with different thresholds—alerts only when both breach to reduce noise but react quickly.
# SLO burn idea
burn_rate(1h) > 2 AND burn_rate(6h) > 1
26) Canary & Deploy Tracking
Tag telemetry with version and deployment_id. Compare error/latency between canary and baseline. Roll back if canary degrades SLOs or error budgets.
DD_VERSION=1.12.3 DD_ENV=prod DD_SERVICE=checkout ddtrace-run node app.js
27) Auto-Scaling Signals
Export Datadog metrics to autoscalers (KEDA/HPA) for scale decisions based on RPS, queue depth, or p95 latency. Avoid scaling from noisy CPU alone.
# KEDA ScaledObject referencing Datadog (conceptual)
triggers:
  - type: datadog
    metadata: { query: "avg:requests{service:api}", queryValue: "100" }
28) Synthetic CI Patterns
Run API/browser tests in CI before deploy; publish results to Datadog. Block releases if core journeys fail. Tag tests by app/version for rollout insights.
# CI step (datadog-ci)
npx @datadog/datadog-ci synthetics run-tests --config tests.synthetics.json
29) Cost Guardrails
Tiered retention, log rehydration from archives, exclusion of chatty logs, indexes focused on high-value logs, and metrics-from-logs for lightweight KPIs.
# Index exclusion filter idea (pseudocode): drop DEBUG logs in prod
if (level == "DEBUG" && env == "prod") drop()
30) Q&A — “How to reduce alert fatigue?”
Answer: Alert on user-impact with composite conditions, use anomalies/forecasts, apply mute windows, deduplicate via composites, and route by severity. Review weekly and prune noisy monitors.
31) Cloud Integrations
Connect AWS/Azure/GCP for CloudWatch/Monitor/Stackdriver metrics, logs, cost, and resource inventory. Tag propagation from cloud to Datadog enables unified slices.
# Terraform sketch (AWS)
resource "datadog_integration_aws" "main" {
account_id = "..."; role_name = "datadog-integration"
}
32) Serverless (Lambda, Cloud Functions)
Use Datadog forwarder/extensions for metrics, traces, and logs. Wrap functions with layers; correlate cold starts and errors to releases.
# Lambda layer (Node)
npm i dd-trace
require('dd-trace').init({})
exports.handler = async (e) => { ... }
33) CI Visibility
Ingest test results and pipeline spans. Identify flaky tests, slow steps, and failure hotspots. Tag by repo/branch/commit to map trends over time.
# datadog-ci for JUnit uploads
npx @datadog/datadog-ci junit upload --service web --env prod reports/*.xml
34) Network & Real User Monitoring
NPM/NPM+ provide flow/topology and eBPF visibility. RUM shows Core Web Vitals, errors, and resources per session; link RUM spans to backend traces for end-to-end.
// RUM React snippet (conceptual)
DD_RUM.addAction("add_to_cart",{ sku, price })
35) Application Security (ASM)
Datadog ASM adds runtime protection (WAF/IAST) integrated with APM. Detect attacks, block patterns, and tie to services/versions for remediation.
DD_APPSEC_ENABLED=true ddtrace-run python app.py
36) Data Ingestion API
Push custom events, metrics, logs, and spans directly to Datadog HTTP APIs. Authenticate with API/app keys, respect rate limits, and batch requests for efficiency.
curl -X POST "https://http-intake.logs.datadoghq.com/api/v2/logs" \
-H "DD-API-KEY: $DD_API_KEY" -H "Content-Type: application/json" \
-d '[{"message":"hello","ddsource":"custom","service":"api"}]'
37) Terraform & Config as Code
Manage monitors, dashboards, SLOs, and synthetics via IaC. Enable review/PRs, drift detection, and consistent environments.
resource "datadog_monitor" "cpu" {
name = "High CPU"
type = "metric alert"
query = "avg(last_5m):avg:system.cpu.user{env:prod} > 80"
}
38) Notebooks & Dashboards-as-Docs
Use Datadog Notebooks to combine text, graphs, and queries for investigations and runbooks. Link incidents and postmortems for learning loops.
# API sketch
curl -X POST https://api.datadoghq.com/api/v1/notebooks -d @nb.json
39) Metrics from Logs
Create aggregated metrics from logs without indexing every log line. Good for reducing storage while keeping high-level KPIs available for monitors/dashboards.
# Log-based metric (conceptual; created via Generate Metrics rather than a pipeline processor)
create_metric("orders.fail.rate", group_by=["service","env"])
40) Q&A — “When to use synthetics vs RUM?”
Answer: Synthetics simulate scripted user journeys for proactive checks and SLAs; RUM reflects real users in production for actual UX, errors, and performance by device/geo. Use both for coverage.
41) Access Control & RBAC
Use fine-grained roles for read/write/modify monitors, dashboards, and org settings. Apply team-based ownership and approval flows for changes via IaC.
# Assign roles via API (sketch)
curl -X POST https://api.datadoghq.com/api/v2/roles/{role_id}/users -d @users.json
42) Sensitive Data & PII
Mask/redact sensitive fields at ingestion. Use log pipelines to hash or drop PII. Limit access via facets/permissions. Comply with retention and data residency policies.
# Redaction rule idea
if (field == "email") redact()
43) Testing Observability
Write unit tests for parsers, schemas for logs, and golden dashboards. Validate monitors in staging and use synthetic CI gates. Snapshot queries in notebooks for regression.
# Example: test a grok pattern with sample logs in CI
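A minimal sketch of that idea, using a plain Python regex and unittest as a local stand-in for the grok rule (the pattern, sample line, and field names are hypothetical):
import re
import unittest

# Hypothetical stand-in for the access-log grok rule from card 15
ACCESS_LOG = re.compile(
    r'(?P<client>\S+) - - \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<path>\S+) HTTP/(?P<version>\S+)" (?P<status>\d{3})'
)

class TestAccessLogParser(unittest.TestCase):
    def test_parses_sample_line(self):
        line = '10.0.0.1 - - [12/Jan/2025:10:00:00 +0000] "GET /api/orders HTTP/1.1" 500'
        m = ACCESS_LOG.match(line)
        self.assertIsNotNone(m)
        self.assertEqual(m.group('method'), 'GET')
        self.assertEqual(m.group('status'), '500')

if __name__ == '__main__':
    unittest.main()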
44) Deploy & Promote
Use Terraform/Datadog provider to promote dashboards/monitors across environments. Tag resources with env, team, and owner. Keep runbooks linked from monitors.
# Workspace variables pattern:
variable "env" {} # dev/stage/prod
45) Observability KPIs
Track MTTA/MTTR, alert volume, SLO burn, error budget spend, and dashboard usage. Review weekly with teams to prune noise and improve runbooks.
# System tables/usage APIs (conceptual) to feed KPI dashboards
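A hedged sketch of one input, assuming access to the account-wide usage summary endpoint (month and key handling are illustrative):
import os
import requests

# Pull a monthly usage summary to feed KPI/cost dashboards (illustrative)
resp = requests.get(
    'https://api.datadoghq.com/api/v1/usage/summary',
    headers={'DD-API-KEY': os.environ['DD_API_KEY'],
             'DD-APPLICATION-KEY': os.environ['DD_APP_KEY']},
    params={'start_month': '2025-01'},  # YYYY-MM
    timeout=30,
)
resp.raise_for_status()
print(resp.json().keys())  # inspect available usage fields before charting them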
46) Troubleshooting Playbook
Start at service overview → latency/errors → related logs → traces → infrastructure maps → recent deploys. Use tag filters (service, env, version) and time compare (now vs last release).
# Quick tag filter set: service:api env:prod version:1.12.*
47) Cost Visibility
Break down by product (logs/APM/metrics), index, and team tags. Alert on ingestion spikes; auto-open incidents on runaway logs (looping errors) with rate caps.
# Monitor idea: logs intake rate anomaly for service:api
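A hedged sketch of that monitor idea, assuming the estimated-usage metric below is available in the org (metric scope, threshold, and the @ handle are illustrative):
import os
import requests

# Anomaly monitor on estimated log ingestion for one service (illustrative)
monitor = {
    'type': 'query alert',
    'name': 'Log intake anomaly for service:api',
    'query': "avg(last_4h):anomalies(sum:datadog.estimated_usage.logs.ingested_bytes{service:api}, 'agile', 2) >= 1",
    'message': 'Log ingestion for service:api deviates from its usual pattern. @slack-observability',
    'tags': ['team:observability', 'cost:logs'],
}
resp = requests.post(
    'https://api.datadoghq.com/api/v1/monitor',
    headers={'DD-API-KEY': os.environ['DD_API_KEY'],
             'DD-APPLICATION-KEY': os.environ['DD_APP_KEY'],
             'Content-Type': 'application/json'},
    json=monitor, timeout=30,
)
resp.raise_for_status()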
48) Production Checklist
- Agent healthy on all nodes/containers
- Golden dashboards per service with templates
- Critical SLOs + burn alerts
- Error tracking + profiling enabled
- Parsing pipelines + PII redaction
- Terraform-managed monitors & synthetics
49) Common Pitfalls
Inconsistent tags, unbounded log ingestion, no sampling, alerting on infrastructure-only symptoms, ignoring seasonality, and missing linkage between logs/traces/metrics.
50) Interview Q&A — 20 Practical Questions (Expanded)
1) Why tags? They enable slicing/correlation across metrics, logs, and traces—critical for multi-service views.
2) Control log cost? Index selectively, archive, create metrics-from-logs, and drop chatty DEBUG in prod.
3) APM vs logs for errors? Use APM for rates/latency and service maps; logs for stack traces and context.
4) What is burn rate? Speed at which an SLO’s error budget is consumed; alert on multi-window burns.
5) When anomaly vs threshold? Anomaly for seasonal metrics; thresholds for hard constraints (e.g., 5xx > 1%).
6) Sampling best practice? Always keep errors and slow traces; sample healthy traffic.
7) Dashboards vs notebooks? Dashboards for real-time ops; notebooks for investigations/runbooks.
8) What are facets? Indexed log attributes to filter/aggregate quickly.
9) Reduce toil? Composite monitors, mute windows, and auto-remediation runbooks.
10) Trace-log correlation? Inject trace IDs into logs and configure pipelines to parse them (see the sketch after this list).
11) CI Visibility value? Detect flaky tests, slow stages, and correlate to prod regressions.
12) DBM when? For DB bottlenecks: lock waits, slow queries, and plan analysis.
13) Network monitoring? NPM/eBPF for flows, latency, and dependencies across hosts/pods.
14) RUM KPIs? LCP, FID/INP, CLS; segment by device/geo and release.
15) ASM benefit? Block threats at runtime with APM-integrated context.
16) Golden signals? Latency, traffic, errors, saturation—dashboards + alerts per service.
17) On-call hygiene? Rotations, escalation, actionable alerts, and postmortems.
18) Multi-tenant tagging? Add team and owner tags; restrict via RBAC.
19) Infra drift? Use agent inventory and cloud integrations to detect missing coverage.
20) When not Datadog? Very small apps or tight budgets where built-in cloud monitors suffice.
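For Q10 above, a minimal Python sketch of trace-log correlation, assuming ddtrace log injection (run under ddtrace-run with DD_LOGS_INJECTION=true; the format string follows the pattern Datadog documents):
import logging

# With DD_LOGS_INJECTION=true, ddtrace adds dd.trace_id / dd.span_id (plus service/env/version)
# to log records; including them in the format lets log pipelines link each log to its trace.
FORMAT = ('%(asctime)s %(levelname)s [%(name)s] '
          '[dd.service=%(dd.service)s dd.env=%(dd.env)s dd.version=%(dd.version)s '
          'dd.trace_id=%(dd.trace_id)s dd.span_id=%(dd.span_id)s] %(message)s')
logging.basicConfig(format=FORMAT)
logging.getLogger(__name__).warning('payment declined')  # carries the active trace/span IDs when traced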