Azure Machine Learning Pocket Book — Uplatz
55+ deep-dive flashcards • Single column • Workspaces, Data & Compute • Training & Tuning • MLflow & Registry • Deployment & MLOps • Security & Cost • Interview Q&A
Cheat-friendly explanations • Readable CLI/SDK snippets • Production-oriented tips
1) What is Azure Machine Learning?
A managed platform for building, training, and deploying ML/AI at scale. It provides workspaces, data/versioned assets, managed compute, experiment tracking (MLflow), registries, pipelines, and model deployment to online/batch endpoints.
# CLI v2 install
az extension add -n ml -y
az ml -h
2) Workspace
Top-level boundary that contains assets (data, environments, components, models), compute, registries, and endpoints. Usually one per environment (dev/stage/prod).
az ml workspace create -n mlw-prod -g rg-ml
3) MLflow-Native
AML adopts MLflow for experiment tracking and model packaging (flavors). Use the tracking URI exposed by the workspace for runs and artifacts.
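A minimal SDK v2 sketch of pointing local MLflow calls at the workspace tracking server (subscription, group, and workspace names are placeholders):
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
import mlflow

ml_client = MLClient(DefaultAzureCredential(), "<subscription-id>", "rg-ml", "mlw-dev")
# The workspace exposes its MLflow tracking URI; point the client at it
mlflow.set_tracking_uri(ml_client.workspaces.get("mlw-dev").mlflow_tracking_uri)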
4) Assets (v2)
Data (URIs to files/tables), Environment (Docker+conda), Model (artifacts+metadata), Component (reusable step), Job (a run), Pipeline (DAG of components).
5) Regions & Limits
Choose region close to data. Mind quotas (vCPU/GPU) and SKU availability for compute clusters.
6) IDE & Notebooks
Use AML studio notebooks or local VS Code with the Azure ML extension. Attach to compute instances/clusters.
7) When to Use AML?
Team collaboration, governed experiments, reproducibility, enterprise deployment, and MLOps with pipelines. For quick ad-hoc work, notebooks alone may suffice.
8) Responsible AI
Use model and data monitoring, fairness/explanations (Interpret), content safety filters when serving LLMs, and data governance via Purview.
9) High-Level Flow
Ingest & register data → define environment → author training component → submit job to cluster → log runs (MLflow) → register model → deploy to endpoint → monitor & retrain.
10) Pricing Levers
Pay mainly for compute (compute instances, clusters, endpoints), storage, and networking. Save with spot instances, autoscale, and shutting down idle compute.
11) Create Workspace & Default Storage
AML links to a storage account (datastore), container registry, Key Vault, and Application Insights.
az group create -n rg-ml -l westeurope
az ml workspace create -g rg-ml -n mlw-dev
12) Datastores vs Data Assets
Datastore = connection to storage (Blob/ADLS). Data asset = versioned reference to files/tables within a datastore.
az ml datastore create --file datastore.yml
az ml data create --file data-churn.yml
13) Data Versions
Treat training data as immutable versions (name: churn, version: 4). Improves reproducibility and lineage.
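A sketch of registering such a version with the Python SDK v2 (datastore path is illustrative; ml_client is built as in card 3):
from azure.ai.ml.entities import Data

churn_v4 = Data(
    name="churn",
    version="4",
    type="uri_folder",
    path="azureml://datastores/workspaceblobstore/paths/churn/2024-06/",  # illustrative path
)
ml_client.data.create_or_update(churn_v4)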
14) Environments
Docker image + conda environment; pin exact versions for portability.
az ml environment create --file env-sklearn.yml
15) Managed Datastores (ADLS Gen2)
Prefer ADLS Gen2 with VNET + private endpoints for governed lakes and Spark access.
16) Mount vs Download
Inputs can be mounted or downloaded to compute. Mount for large datasets; download for small/fast local access.
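A small SDK v2 sketch of choosing the mode on a job input (asset name and version are assumptions):
from azure.ai.ml import Input
from azure.ai.ml.constants import InputOutputModes

# Mount a large data asset read-only; use InputOutputModes.DOWNLOAD for small inputs
train_input = Input(type="uri_folder", path="azureml:churn:4", mode=InputOutputModes.RO_MOUNT)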
17) Compute Types
Compute instance (dev notebooks), compute cluster/AmlCompute (scalable CPU/GPU for batch and parallel training), attached compute (Databricks, Synapse Spark, Kubernetes).
az ml compute create --name cpu-cluster --size STANDARD_DS3_V2 \
--type amlcompute --min-instances 0 --max-instances 8 --idle-time-before-scale-down 600
18) Command Jobs (v2)
Containerized script execution with inputs/outputs and environment.
az ml job create --file train.yml
# train.yml uses command: "python train.py --data ${{inputs.train}} --out ${{outputs.model}}"
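An equivalent SDK v2 sketch of the same command job (environment, compute, and asset names are assumptions; ml_client as in card 3):
from azure.ai.ml import command, Input, Output

job = command(
    code="./src",
    command="python train.py --data ${{inputs.train}} --out ${{outputs.model}}",
    inputs={"train": Input(type="uri_folder", path="azureml:churn:4")},
    outputs={"model": Output(type="uri_folder")},
    environment="env-sklearn:1",
    compute="cpu-cluster",
)
ml_client.jobs.create_or_update(job, experiment_name="churn-train")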
19) Distributed Training
Use pytorch / mpi / tensorflow distributions. Define the node count and processes per node.
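A hedged sketch of a distributed PyTorch command job (cluster and environment names are assumptions):
from azure.ai.ml import command

dist_job = command(
    code="./src",
    command="python train_ddp.py",
    environment="env-pytorch:1",
    compute="gpu-cluster",
    instance_count=2,  # nodes
    distribution={"type": "pytorch", "process_count_per_instance": 4},  # processes per node
)
ml_client.jobs.create_or_update(dist_job)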
20) MLflow Logging
Log metrics/params/artifacts; AML automatically captures run context.
import mlflow
mlflow.log_metric("accuracy", 0.912)
mlflow.sklearn.log_model(model, "model")
21) AutoML
Automated training/tuning for tabular/time-series/vision/NLP. Produces best model with explainability and guardrails.
22) Datasets for AutoML
Provide target column, validation strategy, and primary metric (e.g., AUC). Export pipelines for repeatability.
23) HyperDrive / Sweep
Define search space, sampling (random/bayesian/grid), early termination (Bandit). Parallel runs on cluster.
az ml job create --file sweep.yml
# sweep.yml declares type: sweep (search space, sampling, early termination)
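An SDK v2 sketch of the same idea, assuming a base command job whose hyperparameters are exposed as inputs (names and ranges are illustrative):
from azure.ai.ml import command
from azure.ai.ml.sweep import Choice, Uniform, BanditPolicy

base = command(
    code="./src",
    command="python train.py --lr ${{inputs.lr}} --n_estimators ${{inputs.n_estimators}}",
    environment="env-sklearn:1",
    compute="cpu-cluster",
    inputs={"lr": 0.01, "n_estimators": 100},
)
# Bind a search space to the inputs, then turn the parameterized job into a sweep job
sweep_job = base(lr=Uniform(0.001, 0.1), n_estimators=Choice([100, 300, 500])).sweep(
    sampling_algorithm="bayesian",
    primary_metric="accuracy",  # must match a metric logged via MLflow
    goal="Maximize",
)
sweep_job.early_termination = BanditPolicy(evaluation_interval=2, slack_factor=0.1)
sweep_job.set_limits(max_total_trials=20, max_concurrent_trials=4)
ml_client.jobs.create_or_update(sweep_job)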
24) Reusable Components
Package steps (e.g., featurize/train/evaluate) with inputs/outputs; version them; share across teams.
az ml component create --file component-featurize.yml
25) Pipelines (DAG)
Orchestrate components; pass data/artifacts between steps; schedule with jobs.
az ml job create --file pipeline.yml
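A DSL sketch wiring two hypothetical components into a pipeline (component files and their input/output names are assumptions):
from azure.ai.ml import Input, load_component
from azure.ai.ml.dsl import pipeline

featurize = load_component(source="component-featurize.yml")
train = load_component(source="component-train.yml")

@pipeline(default_compute="cpu-cluster")
def churn_pipeline(raw_data):
    feats = featurize(input_data=raw_data)
    trained = train(training_data=feats.outputs.features)
    return {"model": trained.outputs.model}

pipeline_job = churn_pipeline(raw_data=Input(type="uri_folder", path="azureml:churn:4"))
ml_client.jobs.create_or_update(pipeline_job, experiment_name="churn-pipeline")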
26) Data Drift & Retraining
Monitor drift on key features; trigger pipeline on threshold (Logic Apps/GitHub Actions).
27) Parallel Jobs
Split work over shards (files/IDs); AML manages chunk execution and retries.
28) Caching & Reuse
Cached component outputs skip recomputation when inputs/env unchanged—huge time saver.
29) Responsible AI Toolkit
Use Interpret (SHAP), Fairlearn, and error analysis; log artifacts to MLflow and publish with model.
30) Prompt Flow / Generative AI (optional)
Build LLM flows with evaluation and tracing; deploy to managed endpoints with content safety filters.
31) Register Model
Versioned model with metadata and path to artifacts.
az ml model create --name churn-model --version 1 --path ./mlruns/..../artifacts/model
32) Model Registry & Promotion
Central registry across workspaces. Promote from dev→test→prod with approvals; track lineage.
33) Environments for Inference
Smaller images, CPU/GPU variants, health probes. Pin inference-requirements.txt.
34) Online Endpoints
Real-time HTTPS inference; blue/green traffic splits; autoscale.
az ml online-endpoint create -n churn-api
az ml online-deployment create -n blue -e churn-api -f deployment-blue.yml
# deployment-blue.yml references churn-model:1, infer-env:1, Standard_DS3_v2, instance_count: 2
az ml online-endpoint update -n churn-api --traffic "blue=100"
35) Batch Endpoints
Asynchronous scoring over large datasets; schedule via pipelines.
az ml batch-endpoint create -n churn-batch
az ml batch-deployment create -n d1 -e churn-batch -f batch-deployment.yml
# batch-deployment.yml references churn-model:1; the data input is supplied at invoke time:
az ml batch-endpoint invoke -n churn-batch --deployment-name d1 --input azureml:churn:4
36) Scoring Script
Implement init() and run(input); load the model once in init() to avoid warm-up on every request.
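A minimal scoring-script sketch for a managed online deployment (the pickle path under the model folder is an assumption):
import json, os
import joblib

def init():
    # Runs once per container start; cache the model in a global
    global model
    model_dir = os.getenv("AZUREML_MODEL_DIR")  # set by AML for the registered model
    model = joblib.load(os.path.join(model_dir, "model", "model.pkl"))  # assumed artifact layout

def run(raw_data):
    # Runs per request; expects JSON like {"data": [[...feature row...], ...]}
    rows = json.loads(raw_data)["data"]
    return model.predict(rows).tolist()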
37) Autoscale & SKUs
Scale by concurrent requests/CPU/GPU; choose compute SKUs based on model size/latency.
38) Canary & Rollback
Deploy a second slot (green), send 5–10% traffic, monitor, then ramp; instantly revert by traffic switch if metrics degrade.
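A sketch of the traffic shift (and rollback) with the SDK v2, reusing the endpoint/deployment names from card 34:
endpoint = ml_client.online_endpoints.get("churn-api")
endpoint.traffic = {"blue": 90, "green": 10}  # canary; set {"blue": 100} again to roll back
ml_client.online_endpoints.begin_create_or_update(endpoint).result()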
39) Infra as Code
Use Bicep/Terraform for workspace/networking; store YAML assets in Git; PR-based changes for review.
40) GitHub Actions & Azure DevOps
Stages: build env → submit training → register model → create endpoint/deployment → tests → promote.
41) Model Cards & Metadata
Attach dataset versions, training code hash, metrics, explanations, and risks to each model version.
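A sketch of attaching such metadata as tags at registration time (values are illustrative):
from azure.ai.ml.entities import Model

model = Model(
    name="churn-model",
    path="./artifacts/model",
    type="mlflow_model",
    tags={"data_asset": "churn:4", "git_sha": "abc1234", "auc": "0.91"},
    description="Gradient-boosted churn classifier; see the linked run for explanations.",
)
ml_client.models.create_or_update(model)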
42) Data/Model Lineage
AML tracks lineage automatically—use it for audits and reproducibility.
43) Policies & Gates
Require bias/accuracy thresholds and vulnerability scans to pass before promotion to prod.
44) Feature Stores (optional)
Centralize features; ensure offline/online view consistency; log training-serving skew.
45) Data Contracts
Schemas and quality checks at pipeline boundaries; fail fast on drift or missing columns.
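A plain-Python sketch of a fail-fast contract check at a pipeline boundary (column names and dtypes are illustrative):
import pandas as pd

REQUIRED = {"customer_id": "int64", "tenure_months": "int64", "churned": "bool"}

def validate(df: pd.DataFrame) -> pd.DataFrame:
    # Reject the batch immediately if the contract is violated
    missing = set(REQUIRED) - set(df.columns)
    if missing:
        raise ValueError(f"Contract violation, missing columns: {sorted(missing)}")
    for col, dtype in REQUIRED.items():
        if str(df[col].dtype) != dtype:
            raise TypeError(f"Contract violation, {col} is {df[col].dtype}, expected {dtype}")
    return df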
46) Private Networking
Use VNETs + private endpoints for storage/registry/Key Vault; managed online endpoints support private ingress; restrict public network access.
47) Identity
Prefer managed identities for jobs/endpoints; grant least-privilege RBAC to datastores and registries.
48) Secrets & Keys
Store in Key Vault; reference in YAML; never hardcode; rotate routinely.
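A sketch of reading a secret at runtime with a managed identity (vault and secret names are assumptions):
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

kv = SecretClient(vault_url="https://kv-ml-dev.vault.azure.net", credential=DefaultAzureCredential())
api_key = kv.get_secret("scoring-api-key").value  # never hardcode or log this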
49) Supply Chain Security
Pin base images, verify provenance (ACR content trust), scan dependencies, and sign models/artifacts where possible.
50) PII & Compliance
Mask or tokenize PII; encrypt at rest and in transit; log consent and retention; use Purview classifications.
51) Online Monitoring
Collect latency, throughput, error rate; enable data capture for shadow evaluation; watch p95/p99 and saturation.
52) Data & Performance Drift
Compare live features vs training baseline; alert on drift; trigger retraining automatically.
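One simple, framework-agnostic drift score per feature is the Population Stability Index; a sketch (the threshold is a rule of thumb, not an AML API):
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    # Bin by the training (expected) distribution, then compare proportions
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf
    e = np.histogram(expected, cuts)[0] / len(expected) + 1e-6
    a = np.histogram(actual, cuts)[0] / len(actual) + 1e-6
    return float(np.sum((a - e) * np.log(a / e)))
# Rule of thumb: PSI > 0.2 often signals drift worth investigating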
53) Cost Controls
Autoscale down to zero, use spot nodes for non-critical training, cache components, and shut down idle compute instances.
54) Reliability
Use readiness/liveness probes, circuit breakers in clients, retries with backoff, and multi-region DR for critical endpoints.
55) Pattern — Training to Serving
Train (command job) → register → deploy blue (0%) → shadow test → canary (10%) → full rollout → archive old model; automate via pipeline.
56) Pattern — Batch Scoring Window
Nightly batch endpoint scores fresh data from ADLS, writes results to curated tables, and emits metrics to MLflow.
57) Q — AML vs Databricks?
Answer: Databricks excels at collaborative Spark/Delta Lake. AML focuses on governed ML/MLOps with native registry/endpoints and integrates with Databricks as attached compute.
58) Q — Why MLflow?
Answer: Open standard for runs/models; portable across clouds; AML augments with enterprise registry, lineage, and deployment.
59) Q — Speed up training?
Answer: Use larger clusters/GPUs, spot nodes, data locality, mixed precision, distributed training, and cache heavy preprocessing.
60) Q — Typical pitfalls?
Answer: Unpinned environments, unmanaged data versions, no lineage, over-privileged identities, not monitoring drift, and endpoints without autoscale.