Datadog Pocket Book — Uplatz

50 deep-dive flashcards • Single column • Fewer scrolls • 20+ Interview Q&A • Readable code examples

Section 1 — Fundamentals

1) What is Datadog?

Datadog is a cloud-first observability and security platform providing metrics, traces (APM), logs, RUM, synthetics, infrastructure monitoring, security monitoring, and dashboards in one place. It integrates with clouds, containers, databases, and apps with rich tagging for slicing and correlation.

# Install agent (Linux quickstart)
DD_API_KEY=<key> bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"

2) Why Datadog? Strengths & Tradeoffs

Strengths: unified telemetry, strong integrations, fast UI, powerful tagging, anomaly detection, SLOs, and a managed backend. Tradeoffs: cost at scale, vendor lock-in, and sampling compromises at high volume. Mitigate with tag hygiene, retention tiers, and sampling/filters.

# Verify agent status
sudo datadog-agent status

3) Core Data Types

Metrics (gauges, counters, histograms), Logs (ingest, process, index), Traces (APM spans), Profiles, RUM (browser/app), and Events. Correlate via tags like service, env, version, host, region.

# DogStatsD metric (Python)
from datadog import initialize, statsd
initialize(statsd_host='127.0.0.1', statsd_port=8125)
statsd.increment('orders.created', tags=['service:checkout','env:prod'])

4) Tags & Facets

Tags are key-value labels for grouping/filtering across telemetry. Standardize service, env, version, team. Promote important attributes to facets in logs for fast filtering and analytics.

# Agent config (datadog.yaml)
tags:
  - env:prod
  - region:ap-south-1

5) Dashboards

Build timeboards/screenboards with widgets (timeseries, query values, toplists, heatmaps, geomaps). Use template variables bound to tags for reuse across services/envs.

# Create dashboard (API snippet)
curl -X POST "https://api.datadoghq.com/api/v1/dashboard" \
 -H "DD-API-KEY: $DD_API_KEY" -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
 -H "Content-Type: application/json" -d @dashboard.json

6) Monitors

Monitors alert on metrics, logs, traces, synthetics, RUM, and more. Use threshold, anomaly, forecast, outlier, composite monitors. Route notifications to on-call and ChatOps channels.

# Simple metric monitor (API)
curl -X POST "https://api.datadoghq.com/api/v1/monitor" \
 -H "DD-API-KEY: $DD_API_KEY" -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
 -d '{"type":"metric alert","query":"avg(last_5m):avg:system.cpu.user{env:prod}>80","name":"High CPU","message":"@pagerduty"}'

7) APM Tracing

Datadog APM captures distributed traces, spans, and service maps. Auto-instrument popular frameworks; set service, env, version for deployment tracking and correlation to logs/metrics.

# Node.js tracer
npm i dd-trace
DD_SERVICE=checkout DD_ENV=prod node -r dd-trace/init app.js

8) Log Management

Ship logs via agent or forwarders, parse with pipelines, remap fields, mask PII, and index selectively. Use exclusion filters and archives (S3/GCS) to control cost.

# Agent log config (conf.d/myapp.d/conf.yaml)
logs:
  - type: file
    path: /var/log/myapp/*.log
    service: checkout
    source: nodejs

9) Synthetics & RUM

Synthetics: API/HTTP tests, browser tests, and multistep user journeys for proactive checks. RUM: real-user browser/mobile telemetry (errors, resources, Web Vitals). Correlate RUM sessions with backend traces.

<!-- RUM Browser SDK -->
<script src="https://www.datadoghq-browser-agent.com/datadog-rum.js"></script>
<script>DD_RUM.init({ applicationId:'...', clientToken:'...', site:'datadoghq.com', service:'web', env:'prod' })</script>

10) Q&A — “What makes Datadog ‘unified’?”

Answer: A single backend and UI for metrics, traces, logs, and user telemetry with consistent tags enables cross-correlation: find a spike in metrics, jump to related traces, then to logs and RUM sessions—all filtered by the same tags.

Section 2 — Core APIs & Modules

11) Agent & Integrations

Agent runs on hosts/containers and collects metrics/logs/traces. 700+ integrations (Nginx, Postgres, Kafka, Redis). Configure checks under conf.d/<integration>.d/ on hosts, or via Autodiscovery annotations on Kubernetes.

# Postgres integration (conf.d/postgres.d/conf.yaml)
init_config:
instances:
  - host: 127.0.0.1
    port: 5432
    dbm: true

12) Kubernetes

Deploy the Datadog cluster agent + node agents via Helm/manifest. Autodiscovery via annotations; tag with kube_* and pod_labels. Collect container logs, metrics, APM, and process checks.

# Helm install (example)
helm repo add datadog https://helm.datadoghq.com
helm install dd datadog/datadog -f values.yaml

13) DogStatsD & StatsD

Submit custom metrics via UDP/UDS to the local agent. Supports histograms/distributions for percentiles. Always include consistent tags for aggregation.

# Go example
client, _ := statsd.New("unix:///var/run/datadog/dsd.socket")  // avoid shadowing the statsd package
client.Histogram("latency.ms", 123, []string{"service:api", "env:prod"}, 1)

14) APM Sampling & Retention

Use head-based sampling at the agent; configure rules to keep errors and high-latency traces. Enable ingestion controls and retention filters for cost and compliance.

# datadog.yaml (APM)
apm_config:
  analyzed_spans:
    checkout|http.request: 1

15) Log Pipelines

Parse, remap, and enrich logs. Grok/JSON processors, status remapper, date remapper, geo-IP, and URL parsers. Promote fields as facets; mask PII with redaction.

# Example pipeline grok rule (conceptual)
access_log %{ip:client} - - \[%{date("dd/MMM/yyyy:HH:mm:ss Z"):timestamp}\] "%{word:method} %{notSpace:path} HTTP/%{number}" %{integer:status}

16) SLOs & SLIs

Define SLOs as monitor-based (built on monitor uptime) or metric-based (good vs. total requests). Track burn rate with multi-window, multi-burn-rate alerts. Use tag-scoped SLOs per service/env.

# API sketch: create SLO on monitor IDs
curl -X POST https://api.datadoghq.com/api/v1/slo -d @slo.json
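
For reference, a hedged Python sketch of what slo.json might carry for a monitor-based SLO (the monitor ID, thresholds, and tags are placeholders; field names follow the public v1 SLO API):

# Sketch: monitor-based SLO payload posted to the v1 SLO API
import os, requests

payload = {
    "type": "monitor",                                   # monitor-based SLO
    "name": "checkout availability",
    "monitor_ids": [12345],                              # placeholder monitor ID
    "thresholds": [{"timeframe": "30d", "target": 99.9, "warning": 99.95}],
    "tags": ["service:checkout", "env:prod"],
}
requests.post(
    "https://api.datadoghq.com/api/v1/slo",
    headers={"DD-API-KEY": os.environ["DD_API_KEY"],
             "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"]},
    json=payload,
).raise_for_status()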

17) Security Monitoring

Detect threats from logs/cloud/configs with rules and signals. Use out-of-the-box rules for common threats and customize for your environment. Route to SIEM/SOAR.

# Example detection rule (conceptual JSON)
{ "name":"Suspicious Login", "query":"source:auth status:failed country:!IN", "type":"log_detection" }

18) Database Monitoring (DBM)

Deep visibility into DB performance: query samples, plans, wait events. Requires agent integration and permissions. Tag queries by service/env.

# Enable DBM in postgres conf
instances:
  - host: 127.0.0.1
    port: 5432
    dbm: true

19) Error Tracking & Profiling

Aggregate exceptions from APM/logs and link to code versions and deployments. Continuous Profiler captures CPU/memory/lock profiles in prod with low overhead.

# Python profiler
pip install ddtrace
DD_PROFILING_ENABLED=true ddtrace-run gunicorn app:app

20) Q&A — “Metrics vs Logs vs Traces?”

Answer: Metrics are numeric time-series for fast aggregates and alerts; logs are detailed records for forensics; traces show request flow across services. Use all three with consistent tags for rapid triage.

Section 3 — Async, Patterns & Concurrency

21) Alert Design Patterns

Prefer symptom-based alerts (user impact) over noise (CPU blips). Use anomaly/forecast monitors for seasonality, composite monitors to gate alerts, and mute windows during maintenance.

# Composite monitor idea
(A) errors_rate > 5% AND (B) p99_latency > 1s AND (C) traffic > 100 rps
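
A hedged Python sketch of the same gate as a composite monitor, using the legacy datadog client; the three IDs are placeholders for the A/B/C monitors above:

# Sketch: composite monitor over three existing monitor IDs (placeholders)
from datadog import initialize, api

initialize(api_key="<API_KEY>", app_key="<APP_KEY>")
api.Monitor.create(
    type="composite",
    query="111 && 222 && 333",  # fires only when all three component monitors alert
    name="Checkout degraded (errors AND latency AND traffic)",
    message="User-impacting degradation. @pagerduty",
)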

22) Anomaly & Forecast

Datadog models time series to detect deviations and forecast future levels, handling weekly/daily seasonality. Great for capacity planning and early warning.

# Anomaly query example
anomalies(avg:system.load.1{env:prod}, 'basic', 2, direction='above')

23) Sampling Strategies

For high-traffic services, sample traces/logs while keeping error/slow traces at 100%. Use agent-side filters and retain full fidelity during incidents via dynamic rules.

# datadog.yaml: drop noisy traces at the agent
apm_config:
  ignore_resources: ["GET /healthz"]
# Log sampling lives in index exclusion filters (which accept a sample rate), not datadog.yaml
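
On the tracer side, per-service sampling rules can be set via the documented DD_TRACE_SAMPLING_RULES variable; a minimal Python sketch (verify the schema against your ddtrace version):

# Sketch: keep 20% of healthy checkout traces; set before the tracer initializes
import json, os

os.environ["DD_TRACE_SAMPLING_RULES"] = json.dumps(
    [{"service": "checkout", "sample_rate": 0.2}]
)
import ddtrace  # tracer reads the rules at startup; the agent's error sampler still keeps error traces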

24) Incident Management

Use Monitors → Incidents with templates, roles, and timelines. Capture evidence (graphs, logs, traces). Automate Slack/PagerDuty notifications and postmortem exports.

# API: create an incident (sketch)
curl -X POST https://api.datadoghq.com/api/v2/incidents -d @incident.json

25) SLO Burn Rate Alerts

Multi-window burn alerts detect fast and slow burns. Example: pair 1h and 6h windows with different thresholds and alert only when both breach; this cuts noise while still reacting quickly to fast burns.

# SLO burn idea
burn_rate(1h) > 2 AND burn_rate(6h) > 1
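
The arithmetic behind that rule, as a tiny Python sketch (counts are made up): burn rate is the observed error rate divided by the error budget, so a burn rate of 1 spends the budget exactly on schedule.

# Sketch: burn rate = observed error rate / allowed error rate
slo_target = 0.999                 # 99.9% availability SLO
error_budget = 1 - slo_target      # 0.1% of requests may fail

def burn_rate(bad, total):
    return (bad / total) / error_budget

print(burn_rate(bad=30, total=10_000))    # 1h window -> 3.0 (breaches > 2)
print(burn_rate(bad=110, total=60_000))   # 6h window -> ~1.83 (breaches > 1)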

26) Canary & Deploy Tracking

Tag telemetry with version and deployment_id. Compare error/latency between canary and baseline. Roll back if canary degrades SLOs or error budgets.

DD_VERSION=1.12.3 DD_ENV=prod DD_SERVICE=checkout node -r dd-trace/init app.js

27) Auto-Scaling Signals

Export Datadog metrics to autoscalers (KEDA/HPA) for scale decisions based on RPS, queue depth, or p95 latency. Avoid scaling from noisy CPU alone.

# KEDA ScaledObject trigger referencing Datadog (conceptual)
triggers:
- type: datadog
  metadata: { query: "avg:requests{service:api}.rollup(avg, 120)", queryValue: "100" }

28) Synthetic CI Patterns

Run API/browser tests in CI before deploy; publish results to Datadog. Block releases if core journeys fail. Tag tests by app/version for rollout insights.

# CI step (datadog-ci)
npx @datadog/datadog-ci synthetics run-tests --config tests.synthetics.json

29) Cost Guardrails

Tiered retention, log rehydration from archives, exclusion filters for chatty logs, indexing reserved for high-value logs, and metrics-from-logs for lightweight KPIs.

# Index exclusion filter idea (conceptual syntax): drop DEBUG logs in prod
if (level == "DEBUG" && env == "prod") drop()

30) Q&A — “How to reduce alert fatigue?”

Answer: Alert on user impact with composite conditions, use anomaly/forecast monitors for seasonal metrics, apply mute windows during maintenance, and route by severity. Review weekly and prune noisy monitors.

Section 4 — Frameworks, Data & APIs

31) Cloud Integrations

Connect AWS/Azure/GCP to pull CloudWatch, Azure Monitor, and Google Cloud Monitoring metrics, logs, cost data, and resource inventory. Tag propagation from cloud resources to Datadog enables unified slices.

# Terraform sketch (AWS)
resource "datadog_integration_aws" "main" {
  account_id = "..."
  role_name  = "datadog-integration"
}

32) Serverless (Lambda, Cloud Functions)

Use Datadog forwarder/extensions for metrics, traces, and logs. Wrap functions with layers; correlate cold starts and errors to releases.

# Node handler with dd-trace (Datadog Lambda layers bundle the tracer for you)
npm i dd-trace
require('dd-trace').init({}); exports.handler = async (e) => { ... }

33) CI Visibility

Ingest test results and pipeline spans. Identify flaky tests, slow steps, and failure hotspots. Tag by repo/branch/commit to map trends over time.

# datadog-ci for JUnit uploads
npx @datadog/datadog-ci junit upload --service web --env prod reports/*.xml

34) Network & Real User Monitoring

Network Performance Monitoring (NPM) provides flow/topology maps with eBPF-based visibility. RUM shows Core Web Vitals, errors, and resources per session; link RUM sessions to backend traces for end-to-end views.

// RUM React snippet (conceptual)
DD_RUM.addAction("add_to_cart",{ sku, price })

35) Application Security (ASM)

Datadog ASM adds runtime protection (WAF/IAST) integrated with APM. Detect attacks, block patterns, and tie to services/versions for remediation.

DD_APPSEC_ENABLED=true ddtrace-run python app.py

36) Data Ingestion API

Push custom events, metrics, logs, and spans directly to Datadog HTTP APIs. Authenticate with API/app keys, respect rate limits, and batch submissions for efficiency.

curl -X POST "https://http-intake.logs.datadoghq.com/api/v2/logs" \
 -H "DD-API-KEY: $DD_API_KEY" -H "Content-Type: application/json" \
 -d '[{"message":"hello","ddsource":"custom","service":"api"}]'

37) Terraform & Config as Code

Manage monitors, dashboards, SLOs, and synthetics via IaC. Enable review/PRs, drift detection, and consistent environments.

resource "datadog_monitor" "cpu" {
  name = "High CPU"
  type = "metric alert"
  query = "avg(last_5m):avg:system.cpu.user{env:prod} > 80"
}

38) Notebooks & Dashboards-as-Docs

Use Datadog Notebooks to combine text, graphs, and queries for investigations and runbooks. Link incidents and postmortems for learning loops.

# API sketch
curl -X POST https://api.datadoghq.com/api/v1/notebooks -d @nb.json

39) Metrics from Logs

Create aggregated metrics from logs without indexing every log line. Good for reducing storage while keeping high-level KPIs available for monitors/dashboards.

# Conceptual: log-based metric counting failed orders, grouped by tags
create_metric("orders.failed", filter="service:orders status:error", group_by=["service","env"])
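
A hedged Python sketch of the real mechanism, the log-based metrics API (payload shape per the public v2 docs; the query and names are illustrative):

# Sketch: create a log-based count metric via POST /api/v2/logs/config/metrics
import os, requests

body = {"data": {
    "type": "logs_metrics",
    "id": "orders.failed",                               # metric name
    "attributes": {
        "compute": {"aggregation_type": "count"},
        "filter": {"query": "service:orders status:error"},
        "group_by": [{"path": "service"}, {"path": "env"}],
    },
}}
requests.post(
    "https://api.datadoghq.com/api/v2/logs/config/metrics",
    headers={"DD-API-KEY": os.environ["DD_API_KEY"],
             "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"]},
    json=body,
).raise_for_status()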

40) Q&A — “When to use synthetics vs RUM?”

Answer: Synthetics simulate scripted user journeys for proactive checks and SLAs; RUM reflects real users in production for actual UX, errors, and performance by device/geo. Use both for coverage.

Section 5 — Security, Testing, Deployment, Observability & Interview Q&A

41) Access Control & RBAC

Use fine-grained roles for read/write/modify monitors, dashboards, and org settings. Apply team-based ownership and approval flows for changes via IaC.

# Assign roles via API (sketch)
curl -X POST https://api.datadoghq.com/api/v2/roles/{role_id}/users -d @users.json

42) Sensitive Data & PII

Mask/redact sensitive fields at ingestion. Use log pipelines to hash or drop PII. Limit access via facets/permissions. Comply with retention and data residency policies.

# Redaction rule idea
if (field == "email") redact()
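
Pipelines can redact at ingestion, but hashing PII in the application keeps it out of transit entirely; a minimal Python sketch (the email field is illustrative):

# Sketch: hash an email before it reaches a log line
import hashlib, json, logging

def scrub(record):
    if "email" in record:  # extend to your full PII inventory
        record["email_hash"] = hashlib.sha256(record.pop("email").encode()).hexdigest()
    return record

logging.basicConfig(level=logging.INFO)
logging.info(json.dumps(scrub({"event": "signup", "email": "a@example.com"})))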

43) Testing Observability

Write unit tests for log parsers and schemas, and keep golden dashboards per service. Validate monitors in staging and gate CI on synthetic tests. Snapshot queries in notebooks for regression checks.

# Example: test a grok pattern with sample logs in CI
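
Datadog grok doesn't run locally, so a regex stand-in for the pipeline rule (here mirroring the access-log rule from card 15) can guard against parser drift in CI; a minimal pytest-style sketch:

# Sketch: CI test that sample logs still match the parser's regex equivalent
import re

ACCESS_LOG = re.compile(
    r'(?P<client>\S+) - - \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)

def test_access_log_parses():
    line = '10.0.0.1 - - [10/Oct/2025:13:55:36 +0000] "GET /api/orders HTTP/1.1" 200'
    m = ACCESS_LOG.match(line)
    assert m and m.group("status") == "200" and m.group("path") == "/api/orders"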

44) Deploy & Promote

Use Terraform/Datadog provider to promote dashboards/monitors across environments. Tag resources with env, team, and owner. Keep runbooks linked from monitors.

# Workspace variables pattern:
variable "env" {}  # dev/stage/prod

45) Observability KPIs

Track MTTA/MTTR, alert volume, SLO burn, error budget spend, and dashboard usage. Review weekly with teams to prune noise and improve runbooks.

# Usage/billing APIs (conceptual) to feed KPI dashboards
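
A hedged Python sketch pulling host usage from the v1 usage API to feed such a dashboard (endpoint per the public docs; response fields vary by account):

# Sketch: hourly host usage for the last 24h via GET /api/v1/usage/hosts
import datetime as dt
import os, requests

now = dt.datetime.now(dt.timezone.utc)
resp = requests.get(
    "https://api.datadoghq.com/api/v1/usage/hosts",
    headers={"DD-API-KEY": os.environ["DD_API_KEY"],
             "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"]},
    params={"start_hr": (now - dt.timedelta(days=1)).strftime("%Y-%m-%dT%H"),
            "end_hr": now.strftime("%Y-%m-%dT%H")},
)
resp.raise_for_status()
print(resp.json().get("usage", []))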

46) Troubleshooting Playbook

Start at service overview → latency/errors → related logs → traces → infrastructure maps → recent deploys. Use tag filters (service, env, version) and time compare (now vs last release).

# Quick tag filter set: service:api env:prod version:1.12.*

47) Cost Visibility

Break down by product (logs/APM/metrics), index, and team tags. Alert on ingestion spikes; auto-open incidents on runaway logs (looping errors) with rate caps.

# Monitor idea: logs intake rate anomaly for service:api
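
One hedged way to express that as a log monitor in Python (legacy datadog client; the threshold is a placeholder, and an anomaly variant would instead wrap a metric generated from these logs):

# Sketch: alert when service:api log volume spikes
from datadog import initialize, api

initialize(api_key="<API_KEY>", app_key="<APP_KEY>")
api.Monitor.create(
    type="log alert",
    query='logs("service:api").index("*").rollup("count").last("5m") > 50000',  # placeholder threshold
    name="Log intake spike: service:api",
    message="Possible looping errors or runaway logging. @slack-oncall",
)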

48) Production Checklist

  • Agent healthy on all nodes/containers
  • Golden dashboards per service with templates
  • Critical SLOs + burn alerts
  • Error tracking + profiling enabled
  • Parsing pipelines + PII redaction
  • Terraform-managed monitors & synthetics

49) Common Pitfalls

Inconsistent tags, unbounded log ingestion, no sampling, alerting on infrastructure-only symptoms, ignoring seasonality, and missing linkage between logs/traces/metrics.

50) Interview Q&A — 20 Practical Questions (Expanded)

1) Why tags? They enable slicing/correlation across metrics, logs, and traces—critical for multi-service views.

2) Control log cost? Index selectively, archive, create metrics-from-logs, and drop chatty DEBUG in prod.

3) APM vs logs for errors? Use APM for rates/latency and service maps; logs for stack traces and context.

4) What is burn rate? Speed at which an SLO’s error budget is consumed; alert on multi-window burns.

5) When anomaly vs threshold? Anomaly for seasonal metrics; thresholds for hard constraints (e.g., 5xx > 1%).

6) Sampling best practice? Always keep errors and slow traces; sample healthy traffic.

7) Dashboards vs notebooks? Dashboards for real-time ops; notebooks for investigations/runbooks.

8) What are facets? Indexed log attributes to filter/aggregate quickly.

9) Reduce toil? Composite monitors, mute windows, and auto-remediation runbooks.

10) Trace-log correlation? Inject trace IDs into logs and configure pipelines to parse them (see the sketch after this list).

11) CI Visibility value? Detect flaky tests, slow stages, and correlate to prod regressions.

12) DBM when? For DB bottlenecks: lock waits, slow queries, and plan analysis.

13) Network monitoring? NPM/eBPF for flows, latency, and dependencies across hosts/pods.

14) RUM KPIs? LCP, FID/INP, CLS; segment by device/geo and release.

15) ASM benefit? Block threats at runtime with APM-integrated context.

16) Golden signals? Latency, traffic, errors, saturation—dashboards + alerts per service.

17) On-call hygiene? Rotations, escalation, actionable alerts, and postmortems.

18) Multi-tenant tagging? Add team and owner tags; restrict via RBAC.

19) Infra drift? Use agent inventory and cloud integrations to detect missing coverage.

20) When not Datadog? Very small apps or tight budgets where built-in cloud monitors suffice.
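
For Q10, a minimal Python sketch of manual trace-ID injection (ddtrace can automate this with DD_LOGS_INJECTION=true; the attribute names mirror the dd.trace_id/dd.span_id convention that log pipelines expect):

# Sketch: stamp active trace/span IDs onto every log line for correlation
import logging
from ddtrace import tracer

class TraceContextFilter(logging.Filter):
    def filter(self, record):
        span = tracer.current_span()
        record.trace_id = span.trace_id if span else 0
        record.span_id = span.span_id if span else 0
        return True

logging.basicConfig(
    format="%(asctime)s %(levelname)s [dd.trace_id=%(trace_id)s dd.span_id=%(span_id)s] %(message)s"
)
log = logging.getLogger("checkout")
log.addFilter(TraceContextFilter())

with tracer.trace("checkout.process"):
    log.warning("payment retried")  # carries IDs a pipeline can parse into facets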