DevOps — Interview Questions Booklet (50 Q&A)

Culture & DORA • Git & Branching • CI/CD & Releases • IaC • Containers & K8s • Observability & SRE • DevSecOps • Cloud & FinOps • Automation • Real-world Scenarios

Section 1 — DevOps Fundamentals

1) What is DevOps, and what primary outcomes should it achieve?

Answer: DevOps is the union of people, process, and tooling to deliver software faster and safer. Outcomes: shorter lead time, higher deployment frequency, lower change-failure rate, and faster recovery (lower MTTR).

2) How does DevOps differ from Site Reliability Engineering (SRE)?

Answer: DevOps is cultural/practice-focused across dev and ops; SRE operationalizes reliability via SLOs, error budgets, and automation. They complement each other.

3) What does the CALMS framework represent in DevOps?

Answer: Culture, Automation, Lean, Measurement, and Sharing: a lens for assessing maturity and planning improvements across people, process, and tooling.

4) Why is value stream mapping important in DevOps transformations?

Answer: It visualizes idea-to-prod flow, exposes waste/bottlenecks (handoffs, queues), and prioritizes improvements with measurable impact.

5) What are the DORA metrics, and how are they used?

Answer: Deployment frequency, lead time for changes, change failure rate, and mean time to restore (MTTR). Track them to benchmark, detect regressions, and guide improvement investments.
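
As a worked sketch, the four metrics can be computed from a deploy/incident log. The records and field names below are illustrative, not taken from any particular tool:

```python
from datetime import datetime, timedelta

# Hypothetical 30-day sample: each deploy records commit time, deploy time,
# and whether it caused a failure; incidents record start/resolve times.
deploys = [
    {"committed": datetime(2024, 5, 1, 9),  "deployed": datetime(2024, 5, 1, 15), "failed": False},
    {"committed": datetime(2024, 5, 2, 10), "deployed": datetime(2024, 5, 2, 12), "failed": True},
    {"committed": datetime(2024, 5, 3, 8),  "deployed": datetime(2024, 5, 3, 9),  "failed": False},
    {"committed": datetime(2024, 5, 4, 14), "deployed": datetime(2024, 5, 4, 20), "failed": False},
]
incidents = [
    {"started": datetime(2024, 5, 2, 12), "resolved": datetime(2024, 5, 2, 13, 30)},
]

window_days = 30
deployment_frequency = len(deploys) / window_days  # deploys per day
lead_time = sum((d["deployed"] - d["committed"] for d in deploys), timedelta()) / len(deploys)
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)
mttr = sum((i["resolved"] - i["started"] for i in incidents), timedelta()) / len(incidents)

print(deployment_frequency, lead_time, change_failure_rate, mttr)
```

In practice these numbers come from CI/CD and incident tooling; the point is that all four are simple aggregates once deploys and incidents are recorded with timestamps.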

Section 2 — Source Control & Branching

6) Why is trunk-based development often preferred over long-lived branches?

Answer: It reduces merge debt, encourages small changes, and enables continuous integration with feature flags for unfinished work.

7) When would you choose GitFlow instead of GitHub Flow?

Answer: GitFlow suits scheduled releases/hotfix streams; GitHub Flow suits continuous delivery with a single main branch and short-lived PRs.

8) What are code review best practices that improve delivery speed and quality?

Answer: Keep PRs small, automate checks, use checklists, focus on design/risks not nits, and timebox reviews to avoid queueing delays.

9) What are the trade-offs between a monorepo and multiple repositories (polyrepo)?

Answer: Monorepo aids shared tooling and atomic changes; polyrepo isolates blast radius and access. Choose based on coupling and team autonomy.

10) What is GitOps, and how does it change operations?

Answer: Git is the single source of truth for desired state; controllers reconcile cluster/cloud state. Ops becomes pull-based, auditable, and rollback-friendly.

Section 3 — CI/CD & Release Engineering

11) What core stages belong in a modern CI/CD pipeline?

Answer: Checkout → build → lint/unit tests → security scans → package → integration/e2e tests → publish artifact → staged deploy → post-deploy checks.
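
The ordering matters because each stage gates the next. A minimal fail-fast runner, with stage names and steps as illustrative placeholders rather than any real CI system's API:

```python
from typing import Callable

def run_pipeline(stages: list[tuple[str, Callable[[], bool]]]) -> str:
    """Run stages in order; stop at the first failure (fail fast)."""
    for name, step in stages:
        if not step():
            return f"failed at: {name}"
    return "success"

stages = [
    ("checkout", lambda: True),
    ("build", lambda: True),
    ("unit-tests", lambda: True),
    ("security-scan", lambda: False),  # simulate a failing scan
    ("deploy", lambda: True),
]
print(run_pipeline(stages))  # stops before deploy ever runs
```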

12) How do you ensure build reproducibility across environments?

Answer: Pin dependencies, use lockfiles/containers, hermetic builds, artifact repositories, and immutable base images.

13) How should artifacts be versioned and promoted through environments?

Answer: Use immutable, semantically versioned artifacts promoted from Dev→QA→Prod; never rebuild for higher envs; record provenance/metadata.
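The "promote, never rebuild" rule can be enforced as a simple gate: an artifact is identified by an immutable digest, and promotion is only allowed one environment at a time for the exact digest already verified downstream. Environment names and digests here are illustrative:

```python
PROMOTION_ORDER = ["dev", "qa", "prod"]

def can_promote(artifact_digest: str, current_env: str, target_env: str,
                deployed: dict[str, str]) -> bool:
    """Allow promotion only to the next environment in order, and only for
    the digest already running in the current environment (never rebuilt)."""
    cur = PROMOTION_ORDER.index(current_env)
    tgt = PROMOTION_ORDER.index(target_env)
    return tgt == cur + 1 and deployed.get(current_env) == artifact_digest

deployed = {"dev": "sha256:abc123", "qa": "sha256:abc123"}
print(can_promote("sha256:abc123", "qa", "prod", deployed))   # allowed
print(can_promote("sha256:def456", "qa", "prod", deployed))   # blocked: rebuilt artifact
print(can_promote("sha256:abc123", "dev", "prod", deployed))  # blocked: skips qa
```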

14) When should teams use blue-green versus canary deployments?

Answer: Blue-green swaps entire fleets for fast rollback; canary shifts traffic gradually to detect issues with minimal blast radius.

15) How do feature flags support continuous delivery without long-lived branches?

Answer: They decouple deploy from release, enable progressive rollout/kill-switches, and allow trunk-based development safely.
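Progressive rollout is typically implemented with stable hash-based bucketing, so a user's flag state does not flip between requests as the percentage grows. A minimal sketch (flag and user names are hypothetical):

```python
import hashlib

def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into 0-99; the same user always lands
    in the same bucket, so the rollout is sticky as the percentage increases."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# A 0% rollout is off for everyone; 100% is on for everyone.
print(flag_enabled("new-checkout", "user-42", 0))    # False
print(flag_enabled("new-checkout", "user-42", 100))  # True
```

Setting the percentage to 0 acts as the kill switch: no redeploy is needed to turn the feature off.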

Section 4 — Infrastructure as Code (IaC)

16) Why is Infrastructure as Code a foundational DevOps practice?

Answer: It codifies infra for repeatability, review, testing, and versioned rollbacks; it reduces drift and speeds provisioning.

17) How do you manage Terraform state safely at scale?

Answer: Use remote backends with locking, state encryption, workspaces per env, least-privileged credentials, and pipeline-driven applies.

18) What is the difference between declarative and imperative IaC approaches?

Answer: Declarative (Terraform/K8s) defines desired state; engines reconcile. Imperative (scripts) executes stepwise commands; harder to audit/rollback.

19) How do immutable servers differ from mutable ones, and why choose immutable?

Answer: Immutable servers are rebuilt for changes (images), reducing drift and snowflakes; mutable servers are patched in place and prone to config skew.

20) What is policy-as-code, and where does it fit in pipelines?

Answer: Policy-as-code expresses security/compliance rules as versioned, testable code; tools like OPA evaluate IaC and cluster policies pre-merge and pre-deploy to enforce them automatically.

Section 5 — Containers & Orchestration

21) Why use containers instead of virtual machines for application delivery?

Answer: Containers package app + deps with fast start and dense utilization, enabling consistent runs across laptops, CI, and prod.

22) Which Kubernetes objects are essential for stateless services?

Answer: Deployment for pods/rollouts, Service for stable access, ConfigMap/Secret for config, and Ingress/Gateway for traffic.

23) How do Kubernetes rolling updates work, and when would you choose a recreate strategy?

Answer: Rolling updates gradually replace pods respecting surge/unavailable. Recreate stops all first — used for incompatible state or schema changes.
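The surge/unavailable arithmetic is worth knowing: when given as percentages, Kubernetes rounds maxSurge up and maxUnavailable down, which bounds total and available pods during a rollout. A worked sketch:

```python
import math

def rolling_update_bounds(replicas: int, max_surge_pct: int,
                          max_unavailable_pct: int) -> tuple[int, int]:
    """Kubernetes rounds maxSurge up and maxUnavailable down (percentages);
    returns (max total pods, min available pods) during the rollout."""
    surge = math.ceil(replicas * max_surge_pct / 100)
    unavailable = math.floor(replicas * max_unavailable_pct / 100)
    return replicas + surge, replicas - unavailable

# Default strategy (25%/25%) on a 10-replica Deployment:
print(rolling_update_bounds(10, 25, 25))  # (13, 8)
```

So with the defaults, a 10-replica service briefly runs up to 13 pods while never dropping below 8 available.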

24) How are stateful workloads handled on Kubernetes?

Answer: Use StatefulSet for stable IDs and PVCs, readiness checks, and backup/restore plans with storage classes and snapshots.

25) What benefits and costs come with adopting a service mesh?

Answer: Benefits: mTLS, traffic shaping, retries/timeouts, telemetry. Costs: complexity, resource overhead, and ops learning curve.

Section 6 — Observability & Reliability (SRE)

26) How do monitoring and observability differ in practice?

Answer: Monitoring tracks known signals (dashboards/alerts); observability enables answering unknowns via logs, metrics, traces, and rich context.

27) What are SLIs, SLOs, and error budgets, and how do they guide release pace?

Answer: SLIs measure user-visible health; SLOs set targets; the error budget (1−SLO) caps risk — burn rate drives release/throttle decisions.
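As a worked example of the (1−SLO) arithmetic, a 99.9% availability SLO over a 30-day window leaves roughly 43 minutes of allowed unavailability:

```python
# The error budget is (1 - SLO) of the measurement window.
slo = 0.999
window_minutes = 30 * 24 * 60             # 30-day window = 43,200 minutes
budget_minutes = (1 - slo) * window_minutes
print(round(budget_minutes, 1))           # ~43.2 minutes of budget
```

If incidents have already consumed most of those minutes, releases slow down; if the budget is largely intact, the team can ship more aggressively.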

28) What alerting principles avoid pager fatigue while catching real issues?

Answer: Alert on SLO symptoms, not every cause; use multi-window burn rates, deduplicate, add runbooks, and silence during planned work.
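A multi-window burn-rate check can be sketched as follows. The 14.4x threshold is the commonly cited page-level figure for a 99.9% SLO (it spends about 2% of a 30-day budget per hour); the error-rate inputs are illustrative:

```python
def should_page(error_rate_long: float, error_rate_short: float,
                slo: float, burn_threshold: float) -> bool:
    """Multi-window burn-rate alert: the long window proves the burn is
    sustained, the short window proves it is still happening right now."""
    budget = 1 - slo
    return (error_rate_long / budget > burn_threshold and
            error_rate_short / budget > burn_threshold)

print(should_page(0.02, 0.02, 0.999, 14.4))    # True: fast, sustained burn
print(should_page(0.02, 0.0005, 0.999, 14.4))  # False: already recovering
```

Requiring both windows is what suppresses pages for blips that have already self-healed.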

29) How does distributed tracing help, and how do you instrument it?

Answer: Tracing reveals cross-service latency and bottlenecks. Instrument via OpenTelemetry SDKs/agents and propagate trace headers end-to-end.

30) What is the difference between HPA and VPA in Kubernetes autoscaling?

Answer: HPA scales replica count based on metrics (CPU/memory/custom); VPA adjusts pod requests/limits. Combine either with the cluster autoscaler, but avoid running HPA and VPA on the same metric for the same workload.
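The HPA's core calculation is worth quoting: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A worked sketch (ignoring tolerances and stabilization windows the real controller applies):

```python
import math

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float) -> int:
    """HPA core formula: desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 pods averaging 90% CPU against a 60% target scale out to 6.
print(hpa_desired_replicas(4, 90, 60))  # 6
# 4 pods at 20% against a 60% target scale in to 2.
print(hpa_desired_replicas(4, 20, 60))  # 2
```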

Section 7 — Security & DevSecOps

31) What does “shift-left security” mean for CI/CD pipelines?

Answer: Security checks (SAST/SCA, IaC scans) run pre-merge and in CI so issues are fixed early, cheaper, and before prod exposure.

32) How should secrets be managed in cloud-native systems?

Answer: Store in secret managers (Vault/KMS/Secrets Manager), encrypt at rest/in transit, use short-lived tokens, and rotate automatically.

33) How do SBOMs and image signing protect the software supply chain?

Answer: SBOM lists dependencies for CVE checks; signing (e.g., cosign) attests provenance so only trusted images run in clusters.

34) How do SAST, DAST, and IAST differ, and where do they fit?

Answer: SAST scans source during CI; DAST tests running apps; IAST observes at runtime. Use all for layered coverage.

35) How do you enforce least privilege in Kubernetes and cloud IAM?

Answer: Use RBAC roles bound narrowly, namespace isolation, network policies, and scoped cloud IAM roles with periodic access reviews.

Section 8 — Cloud Architecture & Cost (FinOps)

36) What factors drive the choice between single-cloud and multi-cloud strategies?

Answer: Skills, vendor features, data gravity, portability needs, compliance, and cost. Multi-cloud adds resilience but increases complexity.

37) What FinOps practices keep cloud spend predictable and efficient?

Answer: Tagging/chargeback, rightsizing, autoscaling, spot/Savings Plans, turning off idle resources, and cost anomaly alerts tied to owners.

38) How do you design for resiliency across AZs and regions?

Answer: Deploy multi-AZ by default, replicate state (DB/read replicas), test failover, and use active-active or pilot-light across regions.

39) What distinguishes backup, restore, and disaster recovery testing?

Answer: Backups capture data; restores verify recoverability/MTTR; DR tests validate end-to-end failover against RPO/RTO objectives.
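A DR test only "passes" if the measured outcome meets the stated objectives, which can be expressed as a simple check. The objective values below are illustrative:

```python
from datetime import timedelta

def dr_test_passes(data_loss: timedelta, recovery_time: timedelta,
                   rpo: timedelta, rto: timedelta) -> bool:
    """A DR drill passes only if measured data loss is within the RPO
    and measured end-to-end recovery time is within the RTO."""
    return data_loss <= rpo and recovery_time <= rto

# Objectives: lose at most 15 minutes of data, recover within 1 hour.
rpo, rto = timedelta(minutes=15), timedelta(hours=1)
print(dr_test_passes(timedelta(minutes=5), timedelta(minutes=40), rpo, rto))   # True
print(dr_test_passes(timedelta(minutes=30), timedelta(minutes=40), rpo, rto))  # False: RPO missed
```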

40) When should you choose serverless functions over containers?

Answer: For event-driven, spiky, short-lived workloads with minimal ops. Containers fit steady, long-running, or custom runtime needs.

Section 9 — Automation & Platform Enablement

41) What criteria guide the selection of a CI system for an organization?

Answer: Ecosystem support, scalability, caching speed, secrets integration, policy controls, self-hosted vs SaaS, and total cost.

42) How do Makefiles, task runners, and pipelines complement each other?

Answer: Makefiles/Taskfiles encode local dev tasks; pipelines orchestrate CI/CD. Share the same commands to avoid drift.

43) What is ChatOps, and how does it accelerate incident response?

Answer: Operating infra via chat bots/commands centralizes context, automates runbooks, and reduces mean time to coordinate and act.

44) How do you detect and remediate infrastructure drift?

Answer: Use IaC plan/policy checks, configuration scans, and GitOps reconcilers; alert on drift and auto-reconcile or open PRs.

45) When is configuration management (Ansible/Chef/Puppet) still the right choice?

Answer: For in-place OS config, legacy servers, and app installs where full immutability is impractical or cost-prohibitive.

Section 10 — Real-World Scenarios & Troubleshooting

46) A production deployment spiked latency; what immediate steps do you take?

Answer: Halt rollout, route small canary traffic back, check SLO burn, compare metrics/traces to baseline, feature-flag suspect code, and rollback if user impact persists.

47) Containers are CrashLoopBackOff after release; how do you triage?

Answer: Inspect kubectl describe and pod logs; check readiness/liveness probes, config/env/secret changes, the image tag, resource limits/OOM kills, and dependent-service availability.

48) A secret was accidentally committed to a public repo; what is your response plan?

Answer: Revoke/rotate immediately, purge from history with tooling, add detection (pre-commit/CI), and monitor for misuse; never rely on “security by obscurity.”

49) What must a blameless post-mortem include to be effective?

Answer: Timeline, impact/SLO breach, contributing factors, where defenses failed, concrete actions (owners/dates), and learnings to prevent recurrence.

50) A blue/green rollback did not reduce errors; what do you investigate next?

Answer: Shared dependencies (DB/schema, cache, feature flags), traffic routing/DNS/TTL, config drift, and capacity regressions introduced outside the app.