Amazon ECS Pocket Book — Uplatz

Concise fundamentals • Architecture & workflow • Commands & code • Deploy, scale, observe

Section 1 — Fundamentals

1) What is Amazon ECS?

Amazon Elastic Container Service (ECS) is AWS’s fully managed container orchestration service. You describe containers in a Task Definition, run them as a Service (long-running) or as standalone Tasks (one-off/batch), and host them on Fargate (serverless) or EC2 (you manage the instances). ECS integrates closely with IAM, CloudWatch, ALB/NLB, and ECR.

# Create a cluster
aws ecs create-cluster --cluster-name app-cluster

2) Why ECS? Strengths & Tradeoffs

Strengths: No control plane to manage, tight AWS integrations, simple mental model, Fargate = zero servers to patch.

Tradeoffs: AWS lock-in; fewer third-party plugins vs Kubernetes; ECS concepts (Task/Service) to learn.

3) Key Concepts

  • Cluster: logical grouping of compute (Fargate/EC2).
  • Task Definition: JSON spec of container(s), CPU/Memory, Env, IAM role, logging.
  • Service: desired count + autoscaling + rolling deployments.
  • Capacity Provider: how tasks get capacity (Fargate/EC2/Spot).

# Register a task definition (snippet)
aws ecs register-task-definition --family web \
  --network-mode awsvpc --requires-compatibilities FARGATE \
  --cpu "256" --memory "512" \
  --execution-role-arn arn:aws:iam::ACCOUNT:role/ecsTaskExec \
  --container-definitions '[{"name":"web","image":"ACCOUNT.dkr.ecr.REGION.amazonaws.com/app:latest","portMappings":[{"containerPort":80}]}]'
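
The inline JSON string above gets hard to review as the definition grows. One option is to generate the whole task definition as a file. The following is a minimal Python sketch, assuming the same placeholder image, account, and role names (none are real values); the keys follow the ECS API field names.

```python
import json


def build_task_definition(image, execution_role_arn, cpu="256", memory="512"):
    """Return a Fargate task definition as a dict using ECS API key names."""
    return {
        "family": "web",
        "networkMode": "awsvpc",
        "requiresCompatibilities": ["FARGATE"],
        "cpu": cpu,          # task-level CPU units, passed as a string
        "memory": memory,    # task-level memory in MiB, passed as a string
        "executionRoleArn": execution_role_arn,
        "containerDefinitions": [
            {
                "name": "web",
                "image": image,
                "essential": True,
                "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
            }
        ],
    }


# Placeholder values only; substitute your own account, region, and role.
task_def = build_task_definition(
    image="ACCOUNT.dkr.ecr.REGION.amazonaws.com/app:latest",
    execution_role_arn="arn:aws:iam::ACCOUNT:role/ecsTaskExec",
)
task_def_json = json.dumps(task_def, indent=2)
print(task_def_json)
```

Saving the output as taskdef.json lets you register it with `aws ecs register-task-definition --cli-input-json file://taskdef.json`.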

4) Fargate vs EC2 Launch Types

Fargate: serverless, per-task billing, no AMIs or patching. A good fit for spiky workloads or small teams. EC2: lower cost at steady scale, per-instance daemon tasks, custom AMIs; you manage capacity and scaling.
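
Fargate only accepts certain CPU/memory pairs at the task level, which is a common source of registration errors. The sketch below checks a pair against a small table. The table reflects the commonly documented combinations up to 4 vCPU as I understand them and is an assumption to verify against the current AWS documentation; larger sizes are omitted.

```python
# CPU units -> allowed task memory values in MiB.
# Assumption: taken from the commonly documented Fargate sizes; verify before relying on it.
FARGATE_SIZES = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def is_valid_fargate_size(cpu_units, memory_mib):
    """Return True if the CPU/memory pair appears in the table above."""
    return memory_mib in FARGATE_SIZES.get(cpu_units, [])


print(is_valid_fargate_size(256, 512))    # True
print(is_valid_fargate_size(256, 4096))   # False
```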

5) Networking (awsvpc)

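With awsvpc, each task gets its own ENI and private IP in your VPC subnets, so security groups apply per task and ALB/NLB target groups register task IPs (target type ip). Use public subnets + IGW for internet-facing tasks and private subnets + NAT for egress-only. Fargate tasks in a public subnet need assignPublicIp=ENABLED to pull images unless VPC endpoints are in place.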
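
Because every awsvpc task consumes one IP address, subnet size caps how many tasks can run. A rough upper bound can be computed as follows; the CIDR blocks are hypothetical, and the only AWS-specific fact used is that five addresses per subnet are reserved.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one


def max_task_ips(cidr):
    """Upper bound on task ENI addresses available in one subnet."""
    subnet = ipaddress.ip_network(cidr)
    return max(subnet.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


print(max_task_ips("10.0.1.0/24"))  # 251
print(max_task_ips("10.0.2.0/28"))  # 11
```

Load balancer nodes, NAT gateways, and VPC endpoints in the same subnet also consume addresses, so the real headroom is smaller.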

Section 2 — Architecture & Workflow

6) Typical Production Blueprint

ECR (images) → ECS Service (tasks) behind an ALB (HTTP/HTTPS) or NLB (TCP/UDP) → RDS/ElastiCache. Observability via CloudWatch Logs, metrics, and X-Ray; IaC with CDK/Terraform.

# Push image to ECR (high-level)
aws ecr create-repository --repository-name app
aws ecr get-login-password | docker login --username AWS --password-stdin ACCOUNT.dkr.ecr.REGION.amazonaws.com
docker build -t app .
docker tag app:latest ACCOUNT.dkr.ecr.REGION.amazonaws.com/app:latest
docker push ACCOUNT.dkr.ecr.REGION.amazonaws.com/app:latest

7) Create a Service (Fargate)

aws ecs create-service \
  --cluster app-cluster \
  --service-name web-svc \
  --task-definition web \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-1,subnet-2],securityGroups=[sg-1],assignPublicIp=DISABLED}" \
  --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:...,containerName=web,containerPort=80"

Tip: attach an Application Auto Scaling target-tracking policy on CPU utilization or ALB request count per target to scale the task count.
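
As an illustration of the tip, the sketch below generates a target-tracking configuration for average service CPU. The 60% target and the cooldown values are illustrative assumptions, not recommendations; the resource ID reuses the cluster and service names from the command above.

```python
import json


def cpu_target_tracking(target_percent=60.0, scale_out_cooldown=60, scale_in_cooldown=300):
    """Build a target-tracking configuration for average ECS service CPU."""
    return {
        "TargetValue": target_percent,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": scale_out_cooldown,  # seconds
        "ScaleInCooldown": scale_in_cooldown,    # seconds
    }


# Application Auto Scaling identifies an ECS service as service/<cluster>/<service>.
RESOURCE_ID = "service/app-cluster/web-svc"

policy_json = json.dumps(cpu_target_tracking(), indent=2)
print(RESOURCE_ID)
print(policy_json)
```

With the output saved as policy.json, the service is first registered via `aws application-autoscaling register-scalable-target` (namespace `ecs`, dimension `ecs:service:DesiredCount`, plus min and max capacity), then the policy is attached with `aws application-autoscaling put-scaling-policy --policy-type TargetTrackingScaling --target-tracking-scaling-policy-configuration file://policy.json`.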

8) Blue/Green & Rolling

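Default is a rolling update: ECS replaces tasks gradually, bounded by the service's minimumHealthyPercent and maximumPercent. For controlled traffic shifting and quick rollback, use CodeDeploy Blue/Green, which moves ALB traffic between two target groups (all-at-once, linear, or canary).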
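
How gradual a rolling update is depends on two service settings, minimumHealthyPercent and maximumPercent (100 and 200 by default for replica services). A small sketch of the resulting task-count bounds, assuming the documented rounding (lower bound rounded up, upper bound rounded down):

```python
import math


def rolling_bounds(desired_count, minimum_healthy_percent=100, maximum_percent=200):
    """Return (min_running, max_running) task counts during a rolling update."""
    min_running = math.ceil(desired_count * minimum_healthy_percent / 100)
    max_running = math.floor(desired_count * maximum_percent / 100)
    return min_running, max_running


print(rolling_bounds(2))            # (2, 4)
print(rolling_bounds(4, 50, 150))   # (2, 6)
```

With the defaults and a desired count of 2, ECS can start two new tasks before stopping the old ones, so capacity never drops during the deploy.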

9) Secrets & Config

Store secrets in SSM Parameter Store or Secrets Manager and reference them in the Task Definition. Pass non-secret config via env vars. Use a separate task execution role (pulls images, writes logs, fetches referenced secrets) and task role (permissions for the app itself).
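
In a container definition, plain settings go under `environment` and secret references under `secrets`, where `valueFrom` holds the parameter or secret ARN. A hedged sketch follows; the variable names and ARNs are hypothetical placeholders.

```python
import json


def container_with_secrets(name, image, env, secret_arns):
    """Build a container definition with plain env vars and secret references."""
    return {
        "name": name,
        "image": image,
        "environment": [{"name": k, "value": v} for k, v in sorted(env.items())],
        "secrets": [{"name": k, "valueFrom": arn} for k, arn in sorted(secret_arns.items())],
    }


# Hypothetical names and ARNs for illustration only.
container = container_with_secrets(
    name="web",
    image="ACCOUNT.dkr.ecr.REGION.amazonaws.com/app:latest",
    env={"LOG_LEVEL": "info"},
    secret_arns={
        "DB_PASSWORD": "arn:aws:secretsmanager:REGION:ACCOUNT:secret:app/db-password",
        "API_KEY": "arn:aws:ssm:REGION:ACCOUNT:parameter/app/api-key",
    },
)
print(json.dumps(container, indent=2))
```

ECS resolves these values at task start using the task execution role, so that role needs read access to the referenced secrets and parameters.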

10) Logs, Metrics, Traces

Send container stdout/stderr to CloudWatch Logs with the awslogs log driver. Track CPU and memory utilization in CloudWatch; enable Container Insights for network and task-level metrics. For tracing, add an AWS Distro for OpenTelemetry sidecar and export to X-Ray or OTLP backends.
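
The awslogs driver is configured per container through `logConfiguration`. A minimal sketch of that fragment; the log group name, region placeholder, and prefix are assumptions for illustration.

```python
import json


def awslogs_config(group, region, stream_prefix):
    """Build the logConfiguration block for the awslogs driver."""
    return {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": group,
            "awslogs-region": region,
            "awslogs-stream-prefix": stream_prefix,
        },
    }


# Hypothetical log group and prefix.
log_config = awslogs_config("/ecs/web", "REGION", "web")
print(json.dumps({"logConfiguration": log_config}, indent=2))
```

The log group has to exist before the task starts unless the `awslogs-create-group` option is set to "true" and the execution role may create log groups. Streams are named prefix/container-name/task-id.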

Section 3 — Operations & Cost

11) Health, Draining, Graceful Shutdown

Expose /health & /ready endpoints; set ALB health checks. On deploy/scale-in, ECS sends SIGTERM and, after stopTimeout (30 seconds by default), SIGKILL; apps should stop accepting new requests, finish in-flight work, then exit 0.
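
The shutdown sequence can be sketched for a simple worker process. This is an illustrative pattern rather than an ECS API: the handler only sets a flag, and the loop finishes the current item before returning. The names `run_worker` and `process_one` are hypothetical.

```python
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    # Only set a flag here; the main loop decides when it is safe to stop.
    shutdown_requested.set()


def install_sigterm_handler():
    # signal.signal must be called from the main thread.
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(process_one, idle_seconds=0.1):
    """Call process_one() until shutdown is requested; return items handled.

    process_one returns True if it handled an item, False if there was nothing to do.
    """
    handled = 0
    while not shutdown_requested.is_set():
        if process_one():
            handled += 1
        else:
            # Sleep briefly, but wake immediately if shutdown is requested.
            shutdown_requested.wait(idle_seconds)
    return handled


def main():
    install_sigterm_handler()
    handled = run_worker(lambda: False)  # replace the lambda with real work
    print(f"drained after {handled} items")
    return 0
```

A container entry point would call `sys.exit(main())` so the process exits with code 0 once in-flight work has drained. HTTP servers need the equivalent step of closing the listener before draining.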

12) Autoscaling Strategies

Scale on average CPU/Memory utilization or on ALB metrics such as request count per target or TargetResponseTime. For queue workers, scale on backlog per task. Set minimum and maximum task counts and use cooldowns to avoid flapping.
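
For queue-driven services, the backlog rule reduces to simple arithmetic: divide the backlog by what one task can clear, then clamp to the floor and ceiling. A sketch with made-up numbers (the per-task capacity of 100 messages is an assumption):

```python
import math


def desired_tasks(backlog, per_task_capacity, min_tasks, max_tasks):
    """Tasks needed to clear the backlog, clamped to [min_tasks, max_tasks]."""
    if per_task_capacity <= 0:
        raise ValueError("per_task_capacity must be positive")
    needed = math.ceil(backlog / per_task_capacity)
    return max(min_tasks, min(max_tasks, needed))


print(desired_tasks(0, 100, 1, 10))     # 1
print(desired_tasks(450, 100, 1, 10))   # 5
print(desired_tasks(5000, 100, 1, 10))  # 10
```

In practice this calculation usually lives in a custom CloudWatch metric (backlog per task) that a target-tracking policy follows, rather than in application code.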

13) Cost Optimisation

  • Right-size CPU/Memory per task; avoid over-provisioning.
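  • Prefer Fargate Spot for interruption-tolerant work such as batch jobs; tasks get a two-minute warning before capacity is reclaimed.
  • On EC2, mix Savings Plans/Reserved Instances with Spot through capacity providers.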
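
On Fargate, the on-demand/Spot mix is expressed as a capacity provider strategy with a `base` and relative `weight` values. The sketch below builds one and reports the share of tasks beyond the base that each provider receives; the 1:3 split is an illustrative assumption.

```python
def spot_strategy(on_demand_base=1, on_demand_weight=1, spot_weight=3):
    """Capacity provider strategy: a guaranteed on-demand base, the rest weighted."""
    return [
        {"capacityProvider": "FARGATE", "base": on_demand_base, "weight": on_demand_weight},
        {"capacityProvider": "FARGATE_SPOT", "weight": spot_weight},
    ]


def weight_share(strategy, provider):
    """Fraction of tasks beyond the base that land on the given provider."""
    total = sum(item.get("weight", 0) for item in strategy)
    if total == 0:
        return 0.0
    chosen = sum(
        item.get("weight", 0) for item in strategy if item["capacityProvider"] == provider
    )
    return chosen / total


strategy = spot_strategy()
print(weight_share(strategy, "FARGATE_SPOT"))  # 0.75
```

The CLI form is `--capacity-provider-strategy capacityProvider=FARGATE,base=1,weight=1 capacityProvider=FARGATE_SPOT,weight=3`, used instead of `--launch-type`. Both providers must be associated with the cluster.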

14) Image Hygiene

Small base images, multi-stage builds, pin versions, run as non-root, regular vulnerability scans (ECR scanning or Trivy), and immutable tags per release (e.g., Git SHA).

# Dockerfile (snippet)
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
USER node
CMD ["node","server.js"]
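
Immutable per-release tags, as recommended in this section, can be derived from the commit being built. A small sketch, assuming a full or abbreviated lowercase Git SHA is available in the build environment; the account, region, and SHA are placeholders.

```python
import re


def release_image_uri(account, region, repository, git_sha):
    """Return an ECR image URI tagged with the first 12 characters of a Git SHA."""
    if not re.fullmatch(r"[0-9a-f]{7,40}", git_sha):
        raise ValueError("expected a 7-40 character lowercase hex Git SHA")
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{git_sha[:12]}"


# Placeholder SHA for illustration.
print(release_image_uri("ACCOUNT", "REGION", "app", "3f2a9c1d5b7e8f0a1b2c3d4e5f60718293a4b5c6"))
# ACCOUNT.dkr.ecr.REGION.amazonaws.com/app:3f2a9c1d5b7e
```

ECR can enforce this with `aws ecr put-image-tag-mutability --repository-name app --image-tag-mutability IMMUTABLE`, which rejects pushes that would overwrite an existing tag.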

15) Common Pitfalls

  • Missing task execution role → image pull/logs fail.
  • Wrong subnets/SGs → tasks can’t reach DB or internet.
  • No health checks → ALB serves failing tasks.
  • Oversized tasks → wasted cost, poorer bin-packing on EC2.

Section 4 — Quick Q&A

16) Interview Q&A — 8 Quick Ones

1) ECS vs EKS? ECS is simpler and AWS-native; EKS offers Kubernetes portability and ecosystem at the cost of more moving parts.

2) Fargate when? Small teams, spiky traffic, minimal ops overhead; billed per second (one-minute minimum) for each task's vCPU and memory.

3) Service vs Task? Service keeps desired count running; Task is ad-hoc/batch.

4) Attach to ALB? Use target group + container name/port in service definition.

5) Secrets? SSM/Secrets Manager refs in task definition; least-privilege task role.

6) Zero-downtime deploys? Rolling or Blue/Green via CodeDeploy; health checks mandatory.

7) Scale policy? CPU or RPS with sane cooldown; set min/max tasks.

8) Troubleshoot failed tasks? Check service events, the task's stoppedReason, CloudWatch Logs, execution and task roles, subnets/SGs, and image pull permissions.
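
For the last question, the stop reason is usually the fastest clue. The sketch below summarizes the JSON returned by `aws ecs describe-tasks`; the sample document is hand-written for illustration, not real output, and only uses fields that the API returns (`lastStatus`, `stopCode`, `stoppedReason`, `containers`).

```python
import json


def summarize_stopped_tasks(describe_tasks_output):
    """Return one summary line per stopped task, plus one per container."""
    lines = []
    for task in describe_tasks_output.get("tasks", []):
        if task.get("lastStatus") != "STOPPED":
            continue
        task_id = task.get("taskArn", "unknown").rsplit("/", 1)[-1]
        stop_code = task.get("stopCode", "n/a")
        reason = task.get("stoppedReason", "n/a")
        lines.append(f"{task_id}: {stop_code} - {reason}")
        for container in task.get("containers", []):
            lines.append(
                f"  {container.get('name')}: exit={container.get('exitCode')} "
                f"reason={container.get('reason', 'n/a')}"
            )
    return lines


# Hand-written sample with placeholder identifiers.
sample = json.loads("""
{
  "tasks": [
    {
      "taskArn": "arn:aws:ecs:REGION:ACCOUNT:task/app-cluster/0123456789abcdef",
      "lastStatus": "STOPPED",
      "stopCode": "EssentialContainerExited",
      "stoppedReason": "Essential container in task exited",
      "containers": [{"name": "web", "exitCode": 1}]
    }
  ],
  "failures": []
}
""")

summary = summarize_stopped_tasks(sample)
for line in summary:
    print(line)
```

Stopped tasks are only retrievable for a limited time, so it helps to capture this output soon after a failure.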