Amazon SageMaker Pocket Book — Uplatz
Build • Train • Tune • Deploy • Monitor — end-to-end ML on AWS with practical snippets and ops tips
1) What is Amazon SageMaker?
Amazon SageMaker is a managed platform for the full ML lifecycle: data prep, training, hyperparameter tuning, deployment, batch/real-time inference, and monitoring. It includes purpose-built tools like Studio, Processing jobs, Training jobs, HPO, Pipelines, Endpoints, Batch Transform, Model Registry, and Clarify/Profiler/Monitor.
# Quick sanity: list domains (Studio)
aws sagemaker list-domains --query 'Domains[].DomainName'
2) Why SageMaker? Strengths & Tradeoffs
Strengths: Managed infra, built-in Docker images/algorithms, autoscaling endpoints, Pipelines for CI/CD of ML, Model Registry, lineage/experiments.
Tradeoffs: AWS lock-in; costs need active governance; container, image, and IAM role concepts to learn.
3) Core Building Blocks
- Studio/Notebooks: IDE & notebooks with lifecycle configs.
- Processing: run ETL/feature jobs in containers on demand (see the sketch after this list).
- Training: managed distributed training (Spot supported).
- Tuning (HPO): Bayesian/Random/Hyperband search over params.
- Endpoints: real-time inference with autoscaling/multi-model.
- Batch Transform: offline scoring for large datasets.
- Pipelines: DAGs for repeatable ML workflows.
- Model Registry: version, approve, and promote models.
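A minimal Processing-job sketch (hedged: preprocess.py, the bucket names, and the role ARN are placeholders):
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

role = "arn:aws:iam::123456789012:role/sagemaker-exec-role"  # placeholder ARN
proc = SKLearnProcessor(framework_version="1.2-1", role=role,
                        instance_type="ml.m5.xlarge", instance_count=1)
proc.run(code="preprocess.py",  # hypothetical feature-engineering script
         inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                                 destination="/opt/ml/processing/input")],
         outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                                   destination="s3://my-bucket/features/")])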
4) Data & Storage
Use S3 for artifacts/datasets; attach EFS/FSx for shared POSIX if needed. Control access via IAM roles + KMS. Keep data and compute in the same region.
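For example, staging a local file to S3 with the SDK (bucket and prefix are placeholders):
import sagemaker
sess = sagemaker.Session()
# Uploads the file and returns its S3 URI, ready to use as a training channel
train_s3 = sess.upload_data(path="data/train.csv", bucket="my-bucket", key_prefix="xgb/train")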
5) Security & Roles
Use least-privilege execution roles for Processing/Training/Endpoints. Encrypt data at rest (S3 and EBS with KMS) and in transit (TLS). For private endpoints, use VPC mode with security groups and subnets.
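A hedged sketch of VPC and KMS options on an estimator (IDs/ARNs are placeholders; estimator setup as in section 7):
from sagemaker.estimator import Estimator
xgb_secure = Estimator(
    image_uri="<training-image-uri>",  # e.g., from image_uris.retrieve (see section 7)
    role="arn:aws:iam::123456789012:role/sagemaker-exec-role",
    instance_count=1, instance_type="ml.m5.xlarge",
    subnets=["subnet-0abc1234"], security_group_ids=["sg-0abc1234"],  # VPC mode
    volume_kms_key="arn:aws:kms:us-east-1:123456789012:key/EXAMPLE",  # encrypts training volumes
    output_kms_key="arn:aws:kms:us-east-1:123456789012:key/EXAMPLE",  # encrypts S3 outputs
    output_path="s3://my-bucket/xgb/output")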
6) Typical Project Blueprint
S3 (raw) → Processing (clean/features) → S3 (features) → Training (Spot) → Tuning (optional) → Model Registry → Endpoint (real-time) or Batch Transform → Model Monitor/Clarify.
# Create an execution role (placeholder ARNs)
aws iam create-role --role-name sagemaker-exec-role --assume-role-policy-document file://trust.json
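The trust.json referenced above is the standard SageMaker service trust policy; attach permissions separately (scope the managed policy down for production):
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"Service": "sagemaker.amazonaws.com"},
    "Action": "sts:AssumeRole"
  }]
}
# Attach permissions (start broad, then restrict to specific S3 prefixes/KMS keys)
aws iam attach-role-policy --role-name sagemaker-exec-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess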
7) Training with the Python SDK (XGBoost)
# pip install sagemaker==2.*
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

sess = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/sagemaker-exec-role"

# Built-in XGBoost algorithm: retrieve the container image, no training script needed
image = image_uris.retrieve("xgboost", region=sess.boto_region_name, version="1.7-1")
xgb = Estimator(image_uri=image, role=role,
                instance_count=1, instance_type="ml.m5.xlarge",
                output_path="s3://my-bucket/xgb/output",
                sagemaker_session=sess)
xgb.set_hyperparameters(objective="binary:logistic", num_round=200)
xgb.fit({"train": TrainingInput("s3://my-bucket/xgb/train/", content_type="text/csv"),
         "validation": TrainingInput("s3://my-bucket/xgb/val/", content_type="text/csv")})
8) Hyperparameter Tuning (HPO)
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter
# Static hyperparameters; eval_metric="auc" makes the container emit validation:auc
xgb.set_hyperparameters(objective="binary:logistic", eval_metric="auc", num_round=200)
tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "max_depth": IntegerParameter(3, 10),
        "eta": ContinuousParameter(0.01, 0.3),
    },
    objective_type="Maximize",
    max_jobs=8, max_parallel_jobs=2,
)
tuner.fit({"train": "s3://my-bucket/xgb/train/", "validation": "s3://my-bucket/xgb/val/"})
9) Deploy to Real-Time Endpoint
from sagemaker.serializers import CSVSerializer
predictor = xgb.deploy(instance_type="ml.m5.large", initial_instance_count=2,
                       serializer=CSVSerializer())
# Invoke: the built-in XGBoost container expects CSV rows, not JSON
result = predictor.predict([[1.2, 0.3, 7.9]])
Enable auto scaling on the endpoint variant (Application Auto Scaling target tracking, e.g., on invocations per instance) and set min/max capacity, as sketched below.
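A hedged boto3 sketch of that setup (endpoint/variant names and the target value are examples):
import boto3

aas = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"  # hypothetical endpoint/variant
aas.register_scalable_target(
    ServiceNamespace="sagemaker", ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1, MaxCapacity=4)
aas.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker", ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # invocations per instance, per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"},
        "ScaleInCooldown": 300, "ScaleOutCooldown": 60})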
10) Batch Transform (Offline Inference)
transformer = xgb.transformer(instance_count=2, instance_type="ml.m5.xlarge",
                              output_path="s3://my-bucket/xgb/preds/")
transformer.transform(data="s3://my-bucket/xgb/test/",
                      content_type="text/csv", split_type="Line")  # one record per CSV line
transformer.wait()
11) Pipelines & Model Registry
Define steps (Processing → Training → Evaluate → RegisterModel → Deploy). Approve models in the Model Registry and promote across dev/stage/prod with CI/CD.
# Skeleton (Python SDK)
from sagemaker.workflow.pipeline import Pipeline
pipe = Pipeline(name="churn-pipeline", steps=[...])
pipe.upsert(role_arn=role); pipe.start()
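A slightly fuller sketch, assuming the image/role/estimator from section 7 and a hypothetical prep.py:
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ScriptProcessor, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

processor = ScriptProcessor(image_uri=image, command=["python3"], role=role,
                            instance_type="ml.m5.xlarge", instance_count=1)
step_prep = ProcessingStep(
    name="PrepFeatures", processor=processor, code="prep.py",  # hypothetical script
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")])
step_train = TrainingStep(
    name="TrainXGB", estimator=xgb,
    inputs={"train": TrainingInput(
        step_prep.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
        content_type="text/csv")})
pipe = Pipeline(name="churn-pipeline", steps=[step_prep, step_train])
pipe.upsert(role_arn=role); pipe.start()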
12) Monitoring & Explainability
Model Monitor watches data/model-quality drift and violations; Clarify explains predictions and checks bias; Debugger and Profiler inspect training bottlenecks. Ship logs/metrics to CloudWatch for alarms.
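A hedged Model Monitor sketch (baseline file, bucket paths, and schedule name are placeholders):
from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(role=role, instance_count=1, instance_type="ml.m5.xlarge")
monitor.suggest_baseline(                        # compute baseline statistics/constraints
    baseline_dataset="s3://my-bucket/xgb/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitor/baseline")
monitor.create_monitoring_schedule(              # hourly data-quality checks on the endpoint
    monitor_schedule_name="churn-data-quality",
    endpoint_input=predictor.endpoint_name,
    output_s3_uri="s3://my-bucket/monitor/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly())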
13) GenAI & Managed Inference
Use JumpStart for pre-built models/notebooks, or host custom LLMs on inference endpoints. To pack many models onto shared instances, use multi-model endpoints (MME); for spiky or low-traffic workloads, consider Serverless Inference.
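A hedged JumpStart sketch (the model_id is an example; browse valid IDs in Studio's JumpStart view):
from sagemaker.jumpstart.model import JumpStartModel

llm = JumpStartModel(model_id="huggingface-llm-falcon-7b-instruct-bf16")  # example ID
llm_predictor = llm.deploy()  # defaults to the model's recommended instance type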
14) Cost Controls
- Use Spot Training where interruption is tolerable; checkpoint to S3.
- Scale endpoints to zero with Serverless Inference for spiky/low-traffic use cases (see the sketch after this list).
- Right-size instances; turn off idle Studio/Notebooks.
- Compress/quantize models; prefer multi-model endpoints for many variants.
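A hedged Serverless Inference sketch (memory/concurrency values are examples; reuses the estimator from section 7):
from sagemaker.serverless import ServerlessInferenceConfig

serverless_predictor = xgb.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048, max_concurrency=5))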
15) Common Pitfalls
- Over-permissive execution roles; restrict S3 prefixes and KMS keys.
- Endpoints left running with low traffic → unnecessary cost.
- Mismatched SDK/container versions; pin images and SDK versions.
- No drift monitoring after deploy; add Model Monitor + alarms.
16) Interview Q&A — 8 Quick Ones
1) When to use Batch Transform vs Endpoint? Batch for offline/large jobs, Endpoint for low-latency real-time.
2) How to cut training cost? Spot training + smaller instances + profiling + better data sampling.
3) Multi-model endpoints (MME)? Host many models behind one endpoint; load/unload on demand.
4) CI/CD for ML? SageMaker Pipelines + Model Registry + CodePipeline/CodeBuild.
5) Secure private inference? VPC-only endpoints, KMS, SG allowlists, least-privilege roles.
6) Explainability? Use Clarify for SHAP-based insights and bias detection.
7) Drift detection? Model Monitor with baseline stats + CloudWatch alarms.
8) Choose instance type? CPU for classic ML/low TPS; GPU for deep learning; AWS Inferentia for cost-efficient DL inference.