Amazon ECS Pocket Book — Uplatz
Concise fundamentals • Architecture & workflow • Commands & code • Deploy, scale, observe
1) What is Amazon ECS?
Amazon Elastic Container Service (ECS) is AWS’s fully managed container orchestration service. You define Task Definitions (container specs), run them in a Service (long-running) or as Tasks (one-off/batch), and host on Fargate (serverless) or EC2 (you manage instances). It integrates deeply with IAM, CloudWatch, ALB/NLB, and ECR.
# Create a cluster
aws ecs create-cluster --cluster-name app-cluster
2) Why ECS? Strengths & Tradeoffs
Strengths: No control plane to manage, tight AWS integrations, simple mental model, Fargate = zero servers to patch.
Tradeoffs: AWS lock-in; fewer third-party plugins vs Kubernetes; ECS concepts (Task/Service) to learn.
3) Key Concepts
- Cluster: logical grouping of compute (Fargate/EC2).
- Task Definition: JSON spec of container(s), CPU/Memory, Env, IAM role, logging.
- Service: desired count + autoscaling + rolling deployments.
- Capacity Provider: how tasks get capacity (Fargate/EC2/Spot).
# Register a task definition (snippet)
aws ecs register-task-definition --family web \
--network-mode awsvpc --requires-compatibilities FARGATE \
--cpu "256" --memory "512" \
--execution-role-arn arn:aws:iam::123:role/ecsTaskExec \
--container-definitions '[{"name":"web","image":"ACCOUNT.dkr.ecr.REGION.amazonaws.com/app:latest","portMappings":[{"containerPort":80}]}]'
4) Fargate vs EC2 Launch Types
Fargate: serverless, per-task billing, no AMIs/patching. Great for spiky workloads or small teams. EC2: lower cost at scale, daemon scheduling, custom AMIs; you manage capacity and scaling.
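A one-off Task (e.g. a migration or batch job) can be launched directly on Fargate; a minimal sketch reusing the web task definition, with placeholder subnet/SG IDs:
# Run a one-off task on Fargate
aws ecs run-task --cluster app-cluster --task-definition web \
--launch-type FARGATE --count 1 \
--network-configuration "awsvpcConfiguration={subnets=[subnet-1],securityGroups=[sg-1],assignPublicIp=ENABLED}"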
5) Networking (awsvpc)
With the awsvpc network mode (required on Fargate), each task gets its own ENI and private IP in your VPC subnets, which makes security grouping and ALB/NLB attachment straightforward. Use public subnets + an IGW for internet-facing tasks, private subnets + NAT for egress-only.
6) Typical Production Blueprint
ECR (images) → ECS Service (tasks) → ALB (HTTP) / NLB (TCP) → RDS/ElastiCache. Observability via CloudWatch Logs, metrics, and X-Ray; IaC with CDK/Terraform.
# Push image to ECR (high-level)
aws ecr create-repository --repository-name app
aws ecr get-login-password | docker login --username AWS --password-stdin ACCOUNT.dkr.ecr.REGION.amazonaws.com
docker build -t app .
docker tag app:latest ACCOUNT.dkr.ecr.REGION.amazonaws.com/app:latest
docker push ACCOUNT.dkr.ecr.REGION.amazonaws.com/app:latest
7) Create a Service (Fargate)
aws ecs create-service \
--cluster app-cluster \
--service-name web-svc \
--task-definition web \
--desired-count 2 \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[subnet-1,subnet-2],securityGroups=[sg-1],assignPublicIp=DISABLED}" \
--load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:...,containerName=web,containerPort=80"
Tip: attach an Application Auto Scaling policy on CPU or ALB request count to scale the task count.
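A minimal sketch of that tip (cluster/service names, capacity bounds and target value are illustrative):
# Keep 2–10 tasks, target 60% average CPU
aws application-autoscaling register-scalable-target \
--service-namespace ecs --resource-id service/app-cluster/web-svc \
--scalable-dimension ecs:service:DesiredCount \
--min-capacity 2 --max-capacity 10
aws application-autoscaling put-scaling-policy \
--service-namespace ecs --resource-id service/app-cluster/web-svc \
--scalable-dimension ecs:service:DesiredCount \
--policy-name cpu-target-60 --policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{"TargetValue":60.0,"PredefinedMetricSpecification":{"PredefinedMetricType":"ECSServiceAverageCPUUtilization"},"ScaleOutCooldown":60,"ScaleInCooldown":120}'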
8) Blue/Green & Rolling
Default is rolling (tasks replaced gradually, governed by minimumHealthyPercent/maximumPercent). For zero-downtime deploys and quick rollback, use CodeDeploy Blue/Green with ECS + ALB traffic shifting between two target groups.
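For rolling deployments, the replacement pace is tunable per service; a sketch with illustrative values (Blue/Green instead requires the service to use the CodeDeploy deployment controller):
# Tune rolling deployment limits and force a new deployment
aws ecs update-service --cluster app-cluster --service web-svc \
--deployment-configuration "maximumPercent=200,minimumHealthyPercent=100" \
--force-new-deployment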
9) Secrets & Config
Store secrets in SSM Parameter Store or Secrets Manager and reference them in the Task Definition. Bind non-secret config via env vars. Keep the task execution role (pulls images, writes logs, reads referenced secrets) separate from the task role (the app's AWS permissions).
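In the container definition, secret references live under "secrets" and plain config under "environment"; a snippet with placeholder ARNs/names:
# Task definition snippet: secrets vs plain env vars
"secrets": [
  {"name": "DB_PASSWORD", "valueFrom": "arn:aws:secretsmanager:REGION:ACCOUNT:secret:db-pass"},
  {"name": "API_KEY", "valueFrom": "arn:aws:ssm:REGION:ACCOUNT:parameter/app/api-key"}
],
"environment": [
  {"name": "LOG_LEVEL", "value": "info"}
]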
10) Logs, Metrics, Traces
Send container stdout/stderr to CloudWatch Logs via the awslogs log driver. Track CPU/Memory/Network metrics in CloudWatch. For tracing, add an AWS Distro for OpenTelemetry (ADOT) sidecar and export to X-Ray or OTLP backends.
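Log shipping is configured per container via logConfiguration; a snippet (log group and stream prefix are illustrative):
# Task definition snippet: awslogs driver
"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group": "/ecs/web",
    "awslogs-region": "REGION",
    "awslogs-stream-prefix": "web"
  }
}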
11) Health, Draining, Graceful Shutdown
Expose /health and /ready endpoints and point ALB health checks at them. On deploy or scale-in, ECS sends SIGTERM and, after stopTimeout, SIGKILL; apps should stop accepting new requests, finish in-flight work, then exit 0.
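ECS can also health-check the container itself and bound the shutdown window; a container-definition sketch (assumes curl exists in the image; values illustrative):
# Task definition snippet: container health check + stop timeout
"healthCheck": {
  "command": ["CMD-SHELL", "curl -f http://localhost/health || exit 1"],
  "interval": 30, "timeout": 5, "retries": 3, "startPeriod": 15
},
"stopTimeout": 30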
12) Autoscaling Strategies
Scale on CPU/Memory or ALB RequestCount/TargetResponseTime. For queue workers, scale by backlog size. Set min/max task counts and cooldowns to avoid flapping.
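For queue backlog, one option is a step-scaling policy driven by a CloudWatch alarm on SQS queue depth; a sketch with illustrative names/thresholds (the worker service must already be registered as a scalable target, as in the section 7 example):
# Scale out by 2 tasks while the backlog alarm is in ALARM
aws application-autoscaling put-scaling-policy \
--service-namespace ecs --resource-id service/app-cluster/worker-svc \
--scalable-dimension ecs:service:DesiredCount \
--policy-name queue-backlog-out --policy-type StepScaling \
--step-scaling-policy-configuration '{"AdjustmentType":"ChangeInCapacity","Cooldown":120,"StepAdjustments":[{"MetricIntervalLowerBound":0,"ScalingAdjustment":2}]}'
aws cloudwatch put-metric-alarm --alarm-name jobs-backlog-high \
--namespace AWS/SQS --metric-name ApproximateNumberOfMessagesVisible \
--dimensions Name=QueueName,Value=jobs-queue \
--statistic Average --period 60 --evaluation-periods 2 \
--threshold 500 --comparison-operator GreaterThanThreshold \
--alarm-actions POLICY_ARN_FROM_PUT_SCALING_POLICY_OUTPUT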
13) Cost Optimisation
- Right-size CPU/Memory per task; avoid over-provisioning.
- Prefer Fargate Spot for non-critical/batch tasks.
- On EC2, use Reserved/Spot mix and capacity providers.
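A cluster-level capacity provider strategy can mix on-demand Fargate with Fargate Spot; a sketch with illustrative weights:
# Prefer Fargate Spot 3:1 after a base of 2 on-demand tasks
aws ecs put-cluster-capacity-providers --cluster app-cluster \
--capacity-providers FARGATE FARGATE_SPOT \
--default-capacity-provider-strategy capacityProvider=FARGATE,weight=1,base=2 capacityProvider=FARGATE_SPOT,weight=3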
14) Image Hygiene
Small base images, multi-stage builds, pin versions, run as non-root, regular vulnerability scans (ECR scanning or Trivy), and immutable tags per release (e.g., Git SHA).
# Dockerfile (snippet)
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
USER node
CMD ["node","server.js"]
15) Common Pitfalls
- Missing task execution role → image pull/logs fail.
- Wrong subnets/SGs → tasks can’t reach DB or internet.
- No health checks → ALB serves failing tasks.
- Oversized tasks → wasted cost, lower bin-packing.
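First stops when debugging these (TASK_ID is a placeholder): recent service events and the stopped task's reason.
# Why did my tasks stop?
aws ecs describe-services --cluster app-cluster --services web-svc --query 'services[0].events[:5]'
aws ecs describe-tasks --cluster app-cluster --tasks TASK_ID --query 'tasks[0].[lastStatus,stoppedReason,stopCode]'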
16) Interview Q&A — 8 Quick Ones
1) ECS vs EKS? ECS is simpler and AWS-native; EKS gives Kubernetes portability and ecosystem but has more moving parts.
2) Fargate when? Small teams, spiky traffic, no ops overhead; billed per second for vCPU and memory.
3) Service vs Task? Service keeps desired count running; Task is ad-hoc/batch.
4) Attach to ALB? Use target group + container name/port in service definition.
5) Secrets? SSM/Secrets Manager refs in task definition; least-privilege task role.
6) Zero-downtime deploys? Rolling or Blue/Green via CodeDeploy; health checks mandatory.
7) Scale policy? CPU or RPS with sane cooldown; set min/max tasks.
8) Troubleshoot failed tasks? Check events, CW Logs, task role, subnets/SGs, and image pull perms.