DevOps — Interview Questions Booklet (50 Q&A)
Culture & DORA • Git & Branching • CI/CD & Releases • IaC • Containers & K8s • Observability & SRE • DevSecOps • Cloud & FinOps • Automation • Real-world Scenarios
1) What is DevOps, and what primary outcomes should it achieve?
Answer: DevOps is the union of people, process, and tooling to deliver software faster and safer. Outcomes: shorter lead time, higher deployment frequency, lower change-failure rate, and faster MTTR.
2) How does DevOps differ from Site Reliability Engineering (SRE)?
Answer: DevOps is cultural/practice-focused across dev and ops; SRE operationalizes reliability via SLOs, error budgets, and automation. They complement each other.
3) What does the CALMS framework represent in DevOps?
Answer: Culture, Automation, Lean, Measurement, and Sharing: a lens for assessing maturity and planning improvements across people, process, and tooling.
4) Why is value stream mapping important in DevOps transformations?
Answer: It visualizes idea-to-prod flow, exposes waste/bottlenecks (handoffs, queues), and prioritizes improvements with measurable impact.
5) What are the DORA metrics, and how are they used?
Answer: Deployment frequency, lead time for changes, change failure rate, MTTR. Track them to benchmark, detect regressions, and guide investments.
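Two of the four DORA metrics are simple ratios over deployment records. A minimal sketch, assuming hypothetical deployment data (the record shape and dates are illustrative, not from any real tool):

```python
from datetime import datetime

# Hypothetical deployment records: (deployed_at, failed, recovered_at)
deploys = [
    (datetime(2024, 5, 1, 9, 0),  False, None),
    (datetime(2024, 5, 2, 14, 0), True,  datetime(2024, 5, 2, 14, 45)),
    (datetime(2024, 5, 3, 11, 0), False, None),
    (datetime(2024, 5, 4, 16, 0), True,  datetime(2024, 5, 4, 16, 30)),
]

# Change failure rate: failed deploys / total deploys
failures = [d for d in deploys if d[1]]
cfr = len(failures) / len(deploys)

# MTTR: mean time from a failed deploy to its recovery
mttr = sum((rec - dep).total_seconds() for dep, _, rec in failures) / len(failures)

print(f"change failure rate: {cfr:.0%}")   # 50%
print(f"MTTR: {mttr / 60:.1f} minutes")    # 37.5 minutes
```

Deployment frequency and lead time follow the same pattern: count deploys per window, and average (deployed_at − commit_at) respectively.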
6) Why is trunk-based development often preferred over long-lived branches?
Answer: It reduces merge debt, encourages small changes, and enables continuous integration with feature flags for unfinished work.
7) When would you choose GitFlow instead of GitHub Flow?
Answer: GitFlow suits scheduled releases/hotfix streams; GitHub Flow suits continuous delivery with a single main branch and short-lived PRs.
8) What are code review best practices that improve delivery speed and quality?
Answer: Keep PRs small, automate checks, use checklists, focus on design/risks not nits, and timebox reviews to avoid queueing delays.
9) What are the trade-offs between a monorepo and multiple repositories (polyrepo)?
Answer: Monorepo aids shared tooling and atomic changes; polyrepo isolates blast radius and access. Choose based on coupling and team autonomy.
10) What is GitOps, and how does it change operations?
Answer: Git is the single source of truth for desired state; controllers reconcile cluster/cloud state. Ops becomes pull-based, auditable, and rollback-friendly.
11) What core stages belong in a modern CI/CD pipeline?
Answer: Checkout → build → unit/lint → security scans → package → integration/e2e tests → artifact publish → deploy (staged) → post-deploy checks.
12) How do you ensure build reproducibility across environments?
Answer: Pin dependencies, use lockfiles/containers, hermetic builds, artifact repositories, and immutable base images.
13) How should artifacts be versioned and promoted through environments?
Answer: Use immutable, semantically versioned artifacts promoted from Dev→QA→Prod; never rebuild for higher envs; record provenance/metadata.
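The "never rebuild for higher envs" rule can be made concrete: promotion copies the same artifact digest forward. A minimal in-memory sketch (the class and digests are hypothetical):

```python
# Minimal promotion model: versions map to immutable digests, and
# promotion carries the same digest between environments.
class ArtifactRegistry:
    def __init__(self):
        self._versions = {}                      # version -> digest, write-once
        self._envs = {"dev": {}, "qa": {}, "prod": {}}

    def publish(self, version, digest):
        if version in self._versions:
            raise ValueError(f"{version} already published; artifacts are immutable")
        self._versions[version] = digest
        self._envs["dev"][version] = digest

    def promote(self, version, src, dst):
        digest = self._envs[src].get(version)
        if digest is None:
            raise ValueError(f"{version} was never deployed to {src}")
        # Never rebuild: the digest moves forward unchanged.
        self._envs[dst][version] = digest
        return digest

reg = ArtifactRegistry()
reg.publish("1.4.2", "sha256:abc123")
reg.promote("1.4.2", "dev", "qa")
assert reg.promote("1.4.2", "qa", "prod") == "sha256:abc123"
```

Real registries (Artifactory, OCI registries) implement the same invariant via immutable tags or digest pinning.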
14) When should teams use blue-green versus canary deployments?
Answer: Blue-green swaps entire fleets for fast rollback; canary shifts traffic gradually to detect issues with minimal blast radius.
15) How do feature flags support continuous delivery without long-lived branches?
Answer: They decouple deploy from release, enable progressive rollout/kill-switches, and allow trunk-based development safely.
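A common flag-evaluation technique is a deterministic percentage rollout: hash the user into a stable bucket so the same user always gets the same decision. A sketch of that idea (function and flag names are illustrative):

```python
import hashlib

def is_enabled(flag: str, user_id: str, rollout_pct: int) -> bool:
    """Deterministic percentage rollout: the same (flag, user) pair
    always lands in the same bucket, so the rollout is sticky."""
    h = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(h, 16) % 100          # stable bucket in [0, 100)
    return bucket < rollout_pct

# 0% is a kill-switch; 100% is fully released.
assert not is_enabled("new-checkout", "user-42", 0)
assert is_enabled("new-checkout", "user-42", 100)
```

Raising rollout_pct from 5 to 100 over time is the progressive release; dropping it to 0 is the kill-switch, with no deploy in either case.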
16) Why is Infrastructure as Code a foundational DevOps practice?
Answer: It codifies infra for repeatability, review, testing, and versioned rollbacks; it reduces drift and speeds provisioning.
17) How do you manage Terraform state safely at scale?
Answer: Use remote backends with locking, state encryption, workspaces per env, least-privileged credentials, and pipeline-driven applies.
18) What is the difference between declarative and imperative IaC approaches?
Answer: Declarative (Terraform/K8s) defines desired state; engines reconcile. Imperative (scripts) executes stepwise commands; harder to audit/rollback.
19) How do immutable servers differ from mutable ones, and why choose immutable?
Answer: Immutable servers are rebuilt for changes (images), reducing drift and snowflakes; mutable servers are patched in place and prone to config skew.
20) What is policy-as-code, and where does it fit in pipelines?
Answer: Policy-as-code expresses security/compliance rules as executable, version-controlled policies; tools like OPA evaluate IaC and cluster changes pre-merge and pre-deploy to enforce them automatically.
21) Why use containers instead of virtual machines for application delivery?
Answer: Containers package app + deps with fast start and dense utilization, enabling consistent runs across laptops, CI, and prod.
22) Which Kubernetes objects are essential for stateless services?
Answer: Deployment for pods/rollouts, Service for stable access, ConfigMap/Secret for config, and Ingress/Gateway for traffic.
23) How do Kubernetes rolling updates work, and when would you choose a recreate strategy?
Answer: Rolling updates gradually replace pods respecting surge/unavailable. Recreate stops all first — used for incompatible state or schema changes.
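The surge/unavailable bounds determine how many pods exist mid-rollout. A small sketch of the arithmetic, following the rounding Kubernetes documents (maxSurge rounds percentages up, maxUnavailable rounds down):

```python
import math

def rolling_update_bounds(replicas, max_surge, max_unavailable):
    """Pod-count bounds during a Deployment rolling update.
    Returns (minimum available pods, maximum total pods)."""
    def resolve(value, round_up):
        if isinstance(value, str) and value.endswith("%"):
            frac = int(value[:-1]) / 100 * replicas
            return math.ceil(frac) if round_up else math.floor(frac)
        return int(value)                       # absolute pod count
    surge = resolve(max_surge, round_up=True)
    unavailable = resolve(max_unavailable, round_up=False)
    return replicas - unavailable, replicas + surge

# Defaults are 25%/25%: with 10 replicas, 8-13 pods exist mid-rollout.
print(rolling_update_bounds(10, "25%", "25%"))  # (8, 13)
```

Setting maxUnavailable to 0 guarantees full capacity throughout, at the cost of needing surge headroom; Recreate is effectively (0, replicas) with a gap in between.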
24) How are stateful workloads handled on Kubernetes?
Answer: Use StatefulSet for stable IDs and PVCs, readiness checks, and backup/restore plans with storage classes and snapshots.
25) What benefits and costs come with adopting a service mesh?
Answer: Benefits: mTLS, traffic shaping, retries/timeouts, telemetry. Costs: complexity, resource overhead, and ops learning curve.
26) How do monitoring and observability differ in practice?
Answer: Monitoring tracks known signals (dashboards/alerts); observability enables answering unknowns via logs, metrics, traces, and rich context.
27) What are SLIs, SLOs, and error budgets, and how do they guide release pace?
Answer: SLIs measure user-visible health; SLOs set targets; the error budget (1−SLO) caps risk — burn rate drives release/throttle decisions.
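Burn rate is the ratio of observed error rate to the budgeted rate. A minimal sketch (14.4 over one hour is a commonly cited fast-burn page threshold for a 30-day window, per SRE practice):

```python
def burn_rate(slo: float, error_rate: float) -> float:
    """How fast the error budget is being consumed relative to plan.
    Burn rate 1.0 means the budget lasts exactly the SLO window."""
    budget = 1 - slo               # e.g. 99.9% SLO -> 0.1% budget
    return error_rate / budget

# A 99.9% SLO with a 1% observed error rate burns budget ~10x too fast,
# exhausting a 30-day budget in about 3 days.
print(round(burn_rate(0.999, 0.01), 2))
```

When burn rate stays above an agreed threshold, releases pause until the budget recovers; below it, teams are free to ship.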
28) What alerting principles avoid pager fatigue while catching real issues?
Answer: Alert on SLO symptoms, not every cause; use multi-window burn rates, deduplicate, add runbooks, and silence during planned work.
29) How does distributed tracing help, and how do you instrument it?
Answer: Tracing reveals cross-service latency and bottlenecks. Instrument via OpenTelemetry SDKs/agents and propagate trace headers end-to-end.
30) What is the difference between HPA and VPA in Kubernetes autoscaling?
Answer: HPA scales replicas based on metrics (CPU/RAM/custom); VPA adjusts pod requests/limits. Often combine with cluster autoscaler.
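The HPA replica calculation is documented in Kubernetes as a simple ratio; a sketch of that formula:

```python
import math

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float) -> int:
    """The documented HPA formula:
    desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 replicas at 90% CPU against a 60% target scale out to 6.
print(hpa_desired_replicas(4, 90, 60))  # 6
```

The real controller adds a tolerance band, stabilization windows, and min/max replica clamps around this core ratio, which is why small metric wobbles do not cause constant rescaling.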
31) What does “shift-left security” mean for CI/CD pipelines?
Answer: Security checks (SAST/SCA, IaC scans) run pre-merge and in CI so issues are fixed early, cheaper, and before prod exposure.
32) How should secrets be managed in cloud-native systems?
Answer: Store in secret managers (Vault/KMS/Secrets Manager), encrypt at rest/in transit, use short-lived tokens, and rotate automatically.
33) How do SBOMs and image signing protect the software supply chain?
Answer: SBOM lists dependencies for CVE checks; signing (e.g., cosign) attests provenance so only trusted images run in clusters.
34) How do SAST, DAST, and IAST differ, and where do they fit?
Answer: SAST scans source during CI; DAST tests running apps; IAST observes at runtime. Use all for layered coverage.
35) How do you enforce least privilege in Kubernetes and cloud IAM?
Answer: Use RBAC roles bound narrowly, namespace isolation, network policies, and scoped cloud IAM roles with periodic access reviews.
36) What factors drive the choice between single-cloud and multi-cloud strategies?
Answer: Skills, vendor features, data gravity, portability needs, compliance, and cost. Multi-cloud adds resilience but increases complexity.
37) What FinOps practices keep cloud spend predictable and efficient?
Answer: Tagging/chargeback, rightsizing, autoscaling, spot/Savings Plans, turning off idle resources, and cost anomaly alerts tied to owners.
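Cost anomaly alerts often reduce to flagging spend that deviates sharply from a recent baseline. One simple approach is a z-score check; a sketch under that assumption (the numbers are illustrative):

```python
from statistics import mean, stdev

def is_cost_anomaly(daily_costs, today, threshold=3.0):
    """Flag today's spend if it sits more than `threshold` standard
    deviations above the recent baseline (a simple z-score check)."""
    mu, sigma = mean(daily_costs), stdev(daily_costs)
    if sigma == 0:
        return today > mu
    return (today - mu) / sigma > threshold

baseline = [100, 102, 98, 101, 99, 100, 103]   # last week's daily spend ($)
print(is_cost_anomaly(baseline, 250))  # True: e.g. an idle-resource leak
print(is_cost_anomaly(baseline, 104))  # False: within normal variation
```

Routing each alert to the owner identified by resource tags is what closes the loop from detection to action.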
38) How do you design for resiliency across AZs and regions?
Answer: Deploy multi-AZ by default, replicate state (DB/read replicas), test failover, and use active-active or pilot-light across regions.
39) What distinguishes backup, restore, and disaster recovery testing?
Answer: Backups capture data; restores verify recoverability/MTTR; DR tests validate end-to-end failover against RPO/RTO objectives.
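An RPO check is just arithmetic on timestamps: data at risk is the gap between the last good backup and the failure. A small sketch with hypothetical times:

```python
from datetime import datetime, timedelta

def meets_rpo(last_backup: datetime, failure_time: datetime,
              rpo: timedelta) -> bool:
    """Data lost is the gap between the last good backup and the
    failure; recovery meets the RPO if that gap fits inside it."""
    return failure_time - last_backup <= rpo

last_backup = datetime(2024, 5, 1, 3, 0)
failure = datetime(2024, 5, 1, 11, 30)       # 8.5 hours after backup
print(meets_rpo(last_backup, failure, timedelta(hours=12)))  # True
print(meets_rpo(last_backup, failure, timedelta(hours=4)))   # False
```

RTO is validated the same way against the measured restore duration, which is why timed restore drills, not just backup success logs, are the real test.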
40) When should you choose serverless functions over containers?
Answer: For event-driven, spiky, short-lived workloads with minimal ops. Containers fit steady, long-running, or custom runtime needs.
41) What criteria guide the selection of a CI system for an organization?
Answer: Ecosystem support, scalability, caching speed, secrets integration, policy controls, self-hosted vs SaaS, and total cost.
42) How do Makefiles, task runners, and pipelines complement each other?
Answer: Makefiles/Taskfiles encode local dev tasks; pipelines orchestrate CI/CD. Share the same commands to avoid drift.
43) What is ChatOps, and how does it accelerate incident response?
Answer: Operating infra via chat bots/commands centralizes context, automates runbooks, and reduces mean time to coordinate and act.
44) How do you detect and remediate infrastructure drift?
Answer: Use IaC plan/policy checks, configuration scans, and GitOps reconcilers; alert on drift and auto-reconcile or open PRs.
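At its core, drift detection is a diff between desired state (in Git/IaC) and observed state. A minimal sketch (the resource attributes are hypothetical):

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Compare desired state against observed state and report each
    drifted key as a (desired, actual) pair."""
    keys = desired.keys() | actual.keys()
    return {
        k: (desired.get(k), actual.get(k))
        for k in keys
        if desired.get(k) != actual.get(k)
    }

desired = {"instance_type": "t3.medium", "min_size": 2, "max_size": 6}
actual = {"instance_type": "t3.large", "min_size": 2, "max_size": 6}  # hand-edited in console
print(detect_drift(desired, actual))
# {'instance_type': ('t3.medium', 't3.large')}
```

Terraform plan and GitOps reconcilers (Argo CD, Flux) perform this comparison continuously against real provider APIs; the remediation policy, auto-revert versus open a PR, is the team's choice.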
45) When is configuration management (Ansible/Chef/Puppet) still the right choice?
Answer: For in-place OS config, legacy servers, and app installs where full immutability is impractical or cost-prohibitive.
46) A production deployment spiked latency; what immediate steps do you take?
Answer: Halt rollout, route small canary traffic back, check SLO burn, compare metrics/traces to baseline, feature-flag suspect code, and rollback if user impact persists.
47) Containers are CrashLoopBackOff after release; how do you triage?
Answer: Inspect kubectl describe/logs, check readiness/liveness probes, config/env/secret changes, image tag, resource limits/OOM kills, and dependent service availability.
48) A secret was accidentally committed to a public repo; what is your response plan?
Answer: Revoke/rotate immediately, purge from history with tooling, add detection (pre-commit/CI), and monitor for misuse; never rely on “security by obscurity.”
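The detection step usually means scanning diffs for known credential shapes before they land. A pre-commit-style sketch; the patterns below are illustrative of well-known prefixes (AWS AKIA keys, GitHub ghp_ tokens, PEM private keys), not an exhaustive ruleset:

```python
import re

# Hypothetical pre-commit check: a few common credential shapes.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                        # AWS access key ID
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),  # PEM private key
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                     # GitHub token
]

def find_secrets(text: str):
    """Return every substring in `text` matching a known secret shape."""
    return [m.group(0) for p in SECRET_PATTERNS for m in p.finditer(text)]

diff = 'aws_key = "AKIAABCDEFGHIJKLMNOP"\n'
assert find_secrets(diff)                       # would block the commit
assert not find_secrets("nothing to see here")  # clean diff passes
```

Production tools (gitleaks, trufflehog) add entropy analysis and hundreds of rules, but rotation remains the only real fix once a secret is public, since clones and caches keep history.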
49) What must a blameless post-mortem include to be effective?
Answer: Timeline, impact/SLO breach, contributing factors, where defenses failed, concrete actions (owners/dates), and learnings to prevent recurrence.
50) A blue/green rollback did not reduce errors; what do you investigate next?
Answer: Shared dependencies (DB/schema, cache, feature flags), traffic routing/DNS/TTL, config drift, and capacity regressions introduced outside the app.