Azure Machine Learning Pocket Book — Uplatz

55+ deep-dive flashcards • Single column • Workspaces, Data & Compute • Training & Tuning • MLflow & Registry • Deployment & MLOps • Security & Cost • Interview Q&A

Cheat-friendly explanations • Readable CLI/SDK snippets • Production-oriented tips

Section 1 — Fundamentals

1) What is Azure Machine Learning?

A managed platform for building, training, and deploying ML/AI at scale. It provides workspaces, versioned data assets, managed compute, experiment tracking (MLflow), registries, pipelines, and model deployment to online/batch endpoints.

# CLI v2 install
az extension add -n ml -y
az ml -h

2) Workspace

Top-level boundary that contains assets (data, environments, components, models), compute, registries, and endpoints. Usually one per environment (dev/stage/prod).

az ml workspace create -n mlw-prod -g rg-ml

3) MLflow-Native

AML adopts MLflow for experiment tracking and model packaging (flavors). Use the tracking URI exposed by the workspace for runs and artifacts.
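A minimal sketch of wiring that up, assuming the azure-ai-ml and azureml-mlflow packages are installed and the placeholder IDs are replaced with yours:

# Point MLflow at the workspace tracking store (IDs below are placeholders).
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
import mlflow

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="rg-ml",
    workspace_name="mlw-dev",
)
mlflow.set_tracking_uri(ml_client.workspaces.get("mlw-dev").mlflow_tracking_uri)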

4) Assets (v2)

Data (URIs to files/tables), Environment (Docker+conda), Model (artifacts+metadata), Component (reusable step), Job (a run), Pipeline (DAG of components).

5) Regions & Limits

Choose region close to data. Mind quotas (vCPU/GPU) and SKU availability for compute clusters.

6) IDE & Notebooks

Use AML studio notebooks, or work locally in VS Code with the Azure Machine Learning extension. Attach to compute instances/clusters.

7) When to Use AML?

Team collaboration, governed experiments, reproducibility, enterprise deployment, and MLOps with pipelines. For quick ad-hoc work, notebooks alone may suffice.

8) Responsible AI

Leverage model and data monitoring, fairness and explanation tooling (InterpretML, Fairlearn), content filters when serving LLMs, and data governance via Microsoft Purview.

9) High-Level Flow

Ingest & register data → define environment → author training component → submit job to cluster → log runs (MLflow) → register model → deploy to endpoint → monitor & retrain.

10) Pricing Levers

Pay mainly for compute (compute instances, compute clusters, endpoints), storage, and networking. Save with spot instances, autoscale, and shutting down idle compute.

Section 2 — Workspace, Data & Environments

11) Create Workspace & Default Storage

AML links each workspace to a storage account (default datastores), a container registry, a Key Vault, and Application Insights.

az group create -n rg-ml -l westeurope
az ml workspace create -g rg-ml -n mlw-dev

12) Datastores vs Data Assets

Datastore = connection to storage (Blob/ADLS). Data asset = versioned reference to files/tables within a datastore.

az ml datastore create --file datastore.yml
az ml data create --file data-churn.yml

13) Data Versions

Treat training data as immutable versions (name: churn; version: 4). Improves reproducibility and lineage.
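A hedged SDK v2 sketch of registering such a version (ml_client as in card 3; the datastore path is a placeholder):

# Register an immutable data version.
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

churn_v4 = Data(
    name="churn",
    version="4",
    type=AssetTypes.URI_FOLDER,
    path="azureml://datastores/workspaceblobstore/paths/churn/v4/",  # placeholder path
)
ml_client.data.create_or_update(churn_v4)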

14) Environments

Docker image + conda environment; pin exact versions for portability.

az ml environment create --file env-sklearn.yml

15) Managed Datastores (ADLS Gen2)

Prefer ADLS Gen2 with VNET + private endpoints for governed lakes and Spark access.

16) Mount vs Download

Inputs can be mounted or downloaded to compute. Mount for large datasets; download for small/fast local access.
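A small sketch of declaring both modes on job inputs (SDK v2; asset names are placeholders):

# Mount large data read-only; download a small lookup file locally.
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

big = Input(type=AssetTypes.URI_FOLDER, path="azureml:churn:4", mode="ro_mount")
small = Input(type=AssetTypes.URI_FILE, path="azureml:lookup:1", mode="download")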

Section 3 — Compute & Training Jobs

17) Compute Types

Compute Instance (dev notebook VM), Compute Cluster/AmlCompute (elastic CPU/GPU for batch and parallel training), and attached compute (Databricks, Synapse, Kubernetes).

az ml compute create --name cpu-cluster --size STANDARD_DS3_V2 \
  --type amlcompute --min-instances 0 --max-instances 8 --idle-time-before-scale-down 600

18) Command Jobs (v2)

Containerized script execution with inputs/outputs and environment.

az ml job create --file train.yml
# train.yml uses command: "python train.py --data ${{inputs.train}} --out ${{outputs.model}}"

19) Distributed Training

Use pytorch, mpi, or tensorflow distribution types. Define the node (instance) count and processes per node, as in the sketch below.
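A sketch under SDK v2, assuming ml_client from card 3 and placeholder cluster/environment names:

# Two nodes, four processes per node, PyTorch launcher.
from azure.ai.ml import PyTorchDistribution, command

job = command(
    code="./src",
    command="python train.py --epochs 10",
    environment="env-pytorch:1",
    compute="gpu-cluster",
    instance_count=2,  # nodes
    distribution=PyTorchDistribution(process_count_per_instance=4),
)
ml_client.jobs.create_or_update(job)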

20) MLflow Logging

Log metrics/params/artifacts; AML automatically captures run context.

import mlflow
mlflow.log_metric("accuracy", 0.912)
mlflow.sklearn.log_model(model, "model")

21) AutoML

Automated training/tuning for tabular/time-series/vision/NLP. Produces best model with explainability and guardrails.

22) Datasets for AutoML

Provide target column, validation strategy, and primary metric (e.g., AUC). Export pipelines for repeatability.
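A hedged sketch of an AutoML classification job (SDK v2; MLTable asset and column names are placeholders):

from azure.ai.ml import automl, Input
from azure.ai.ml.constants import AssetTypes

clf_job = automl.classification(
    compute="cpu-cluster",
    experiment_name="churn-automl",
    training_data=Input(type=AssetTypes.MLTABLE, path="azureml:churn-mltable:1"),
    target_column_name="churned",
    primary_metric="AUC_weighted",
    n_cross_validations=5,  # validation strategy
)
clf_job.set_limits(timeout_minutes=60, max_trials=20)
ml_client.jobs.create_or_update(clf_job)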

Section 4 — Tuning, Components & Pipelines

23) HyperDrive / Sweep

Define the search space, sampling (random/Bayesian/grid), and early termination (Bandit policy). Trials run in parallel on a cluster.

az ml job create --file sweep.yml
# sweep.yml declares type: sweep with search space, sampling, and early termination
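In the SDK, a sweep wraps a command job; a sketch with placeholder names (ml_client from card 3):

# Rebind inputs with search distributions, then turn the command into a sweep.
from azure.ai.ml import command
from azure.ai.ml.sweep import BanditPolicy, Choice, Uniform

base = command(
    code="./src",
    command="python train.py --lr ${{inputs.lr}} --depth ${{inputs.depth}}",
    inputs={"lr": 0.01, "depth": 5},
    environment="env-sklearn:1",
    compute="cpu-cluster",
)
sweep_job = base(
    lr=Uniform(min_value=0.001, max_value=0.1),
    depth=Choice(values=[3, 5, 7]),
).sweep(
    sampling_algorithm="random",
    primary_metric="accuracy",  # must be logged via MLflow in train.py
    goal="Maximize",
)
sweep_job.early_termination = BanditPolicy(slack_factor=0.1, evaluation_interval=2)
sweep_job.set_limits(max_total_trials=20, max_concurrent_trials=4)
ml_client.jobs.create_or_update(sweep_job)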

24) Reusable Components

Package steps (e.g., featurize/train/evaluate) with inputs/outputs; version them; share across teams.

az ml component create --file component-featurize.yml

25) Pipelines (DAG)

Orchestrate components; pass data/artifacts between steps; schedule with jobs.

az ml job create --file pipeline.yml
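A two-step DSL sketch, assuming featurize/train components are already registered and their input names match:

from azure.ai.ml import Input
from azure.ai.ml.dsl import pipeline

featurize = ml_client.components.get(name="featurize", version="1")
train = ml_client.components.get(name="train", version="1")

@pipeline(default_compute="cpu-cluster")
def churn_pipeline(raw_data):
    feat = featurize(data=raw_data)                   # step 1
    trained = train(features=feat.outputs.features)   # step 2 consumes step 1's output
    return {"model": trained.outputs.model}

pipeline_job = churn_pipeline(raw_data=Input(type="uri_folder", path="azureml:churn:4"))
ml_client.jobs.create_or_update(pipeline_job, experiment_name="churn-pipeline")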

26) Data Drift & Retraining

Monitor drift on key features; trigger pipeline on threshold (Logic Apps/GitHub Actions).

27) Parallel Jobs

Split work over shards (files/IDs); AML manages chunk execution and retries.
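A hedged sketch using the SDK's parallel_run_function (entry script, counts, and names are assumptions):

from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.parallel import RunFunction, parallel_run_function

score_step = parallel_run_function(
    name="batch_score",
    inputs={"scoring_data": Input(type=AssetTypes.URI_FOLDER)},
    outputs={"results": Output(type=AssetTypes.URI_FOLDER)},
    input_data="${{inputs.scoring_data}}",
    instance_count=4,
    max_concurrency_per_instance=2,
    mini_batch_size="10",  # files per mini-batch
    task=RunFunction(
        code="./src",
        entry_script="score_shard.py",  # must define init() and run(mini_batch)
        environment="env-sklearn:1",
    ),
)
# Use score_step as a step inside a @pipeline definition (card 25).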

28) Caching & Reuse

Cached component outputs skip recomputation when inputs/env unchanged—huge time saver.

29) Responsible AI Toolkit

Use Interpret (SHAP), Fairlearn, and error analysis; log artifacts to MLflow and publish with model.

30) Prompt Flow / Generative AI (optional)

Build LLM flows with evaluation and tracing; deploy to managed endpoints with content safety filters.

Section 5 — Models, Registry & Deployment

31) Register Model

Versioned model with metadata and path to artifacts.

az ml model create --name churn-model --version 1 --path ./mlruns/..../artifacts/model
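The SDK equivalent, with a placeholder job name standing in for the elided run path:

from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

model = Model(
    name="churn-model",
    version="1",
    path="azureml://jobs/<job-name>/outputs/artifacts/paths/model/",  # placeholder
    type=AssetTypes.MLFLOW_MODEL,
)
ml_client.models.create_or_update(model)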

32) Model Registry & Promotion

Central registry across workspaces. Promote from dev→test→prod with approvals; track lineage.

33) Environments for Inference

Use smaller images, CPU/GPU variants, and health probes. Pin exact versions in the inference requirements file.

34) Online Endpoints

Real-time HTTPS inference; blue/green traffic splits; autoscale.

az ml online-endpoint create -n churn-api
az ml online-deployment create -n blue -e churn-api -f blue-deployment.yml
# blue-deployment.yml pins model churn-model:1, environment infer-env:1,
# instance_type Standard_DS3_v2, instance_count 2
az ml online-endpoint update -n churn-api --traffic "blue=100"

35) Batch Endpoints

Asynchronous scoring over large datasets; schedule via pipelines.

az ml batch-endpoint create -n churn-batch
az ml batch-deployment create -n d1 --endpoint-name churn-batch -f batch-deployment.yml
az ml batch-endpoint invoke -n churn-batch --deployment-name d1 --input azureml:churn:4

36) Scoring Script

Implement init() and run(); load the model once in init() to avoid per-request warm-up.
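A minimal scoring script sketch (model file name and input schema are assumptions):

import json
import os

import joblib

model = None

def init():
    # AZUREML_MODEL_DIR points at the registered model's artifact folder.
    global model
    path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model", "model.pkl")
    model = joblib.load(path)  # loaded once, reused for every request

def run(raw_data):
    # raw_data is the request body as a JSON string.
    rows = json.loads(raw_data)["data"]
    return model.predict(rows).tolist()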

37) Autoscale & SKUs

Scale by concurrent requests/CPU/GPU; choose compute SKUs based on model size/latency.

38) Canary & Rollback

Deploy a second slot (green), send 5–10% of traffic, monitor, then ramp up; revert instantly by switching traffic back if metrics degrade.
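A sketch of the traffic switch in the SDK (endpoint/deployment names match card 34):

endpoint = ml_client.online_endpoints.get("churn-api")
endpoint.traffic = {"blue": 90, "green": 10}  # canary
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

endpoint.traffic = {"blue": 100, "green": 0}  # instant rollback
ml_client.online_endpoints.begin_create_or_update(endpoint).result()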

Section 6 — MLOps, CI/CD & Governance

39) Infra as Code

Use Bicep/Terraform for workspace/networking; store YAML assets in Git; PR-based changes for review.

40) GitHub Actions & Azure DevOps

Stages: build env → submit training → register model → create endpoint/deployment → tests → promote.

41) Model Cards & Metadata

Attach dataset versions, training code hash, metrics, explanations, and risks to each model version.

42) Data/Model Lineage

AML tracks lineage automatically—use it for audits and reproducibility.

43) Policies & Gates

Require bias/accuracy thresholds and vulnerability scans to pass before promotion to prod.

44) Feature Stores (optional)

Centralize features; ensure offline/online view consistency; log training-serving skew.

45) Data Contracts

Schemas and quality checks at pipeline boundaries; fail fast on drift or missing columns.
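A minimal fail-fast check sketch (column names and dtypes are hypothetical; pandera or Great Expectations are common heavier-weight options):

import pandas as pd

EXPECTED = {"customer_id": "int64", "tenure_months": "int64", "monthly_spend": "float64"}

def enforce_contract(df: pd.DataFrame) -> pd.DataFrame:
    missing = set(EXPECTED) - set(df.columns)
    if missing:
        raise ValueError(f"contract violation: missing columns {sorted(missing)}")
    wrong = {c: str(df[c].dtype) for c in EXPECTED if str(df[c].dtype) != EXPECTED[c]}
    if wrong:
        raise TypeError(f"contract violation: unexpected dtypes {wrong}")
    return df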

Section 7 — Security, Networking & Compliance

46) Private Networking

Use VNETs + private endpoints for storage/registry/Key Vault; managed online endpoints support private ingress; restrict public network access.

47) Identity

Prefer managed identities for jobs/endpoints; grant least-privilege RBAC to datastores and registries.

48) Secrets & Keys

Store in Key Vault; reference in YAML; never hardcode; rotate routinely.

49) Supply Chain Security

Pin base images, verify provenance (ACR content trust), scan dependencies, and sign models/artifacts where possible.

50) PII & Compliance

Mask or tokenize PII; encrypt at rest and in transit; log consent and retention; use Purview classifications.

Section 8 — Monitoring, Cost & Reliability

51) Online Monitoring

Collect latency, throughput, error rate; enable data capture for shadow evaluation; watch p95/p99 and saturation.

52) Data & Performance Drift

Compare live features vs training baseline; alert on drift; trigger retraining automatically.
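A toy per-feature drift check (illustrative only, not AML's built-in monitor; the threshold is an assumption):

# Flag features whose live distribution differs from the training baseline.
from scipy.stats import ks_2samp

def drifted_features(baseline_df, live_df, features, p_threshold=0.01):
    flagged = []
    for col in features:
        stat, p_value = ks_2samp(baseline_df[col].dropna(), live_df[col].dropna())
        if p_value < p_threshold:
            flagged.append((col, round(stat, 3)))
    return flagged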

53) Cost Controls

Autoscale down to zero, use spot nodes for non-critical training, cache components, and shut down idle compute instances.

54) Reliability

Use readiness/liveness probes, circuit breakers in clients, retries with backoff, and multi-region DR for critical endpoints.

Section 9 — Patterns & Interview Q&A

55) Pattern — Training to Serving

Train (command job) → register → deploy new slot at 0% traffic → shadow test → canary (10%) → full rollout → archive the old model; automate via pipeline.

56) Pattern — Batch Scoring Window

Nightly batch endpoint scores fresh data from ADLS, writes results to curated tables, and emits metrics to MLflow.

57) Q — AML vs Databricks?

Answer: Databricks excels at collaborative Spark/Delta Lake engineering. AML focuses on governed ML/MLOps with a native registry and endpoints, and can use Databricks as attached compute.

58) Q — Why MLflow?

Answer: Open standard for runs/models; portable across clouds; AML augments with enterprise registry, lineage, and deployment.

59) Q — Speed up training?

Answer: Use larger clusters/GPUs, spot nodes, data locality, mixed precision, distributed training, and cache heavy preprocessing.

60) Q — Typical pitfalls?

Answer: Unpinned environments, unmanaged data versions, no lineage, over-privileged identities, not monitoring drift, and endpoints without autoscale.