DVC Pocket Book — Uplatz
50 deep-dive flashcards • Wide layout • Fewer scrolls • 20+ Interview Q&A • Readable code examples
1) What is DVC?
Data Version Control (DVC) brings Git-like versioning to large data, models, and ML pipelines. It tracks files via metafiles and stores content in remote object storage, enabling reproducible experiments and collaboration.
pip install dvc
dvc --version
2) Why DVC? Strengths & Tradeoffs
Strengths: Git-native workflow, reproducible pipelines, data/model sharing, cheap cloud storage, experiment tracking. Tradeoffs: extra commands to learn; remote and cache need setup discipline.
# Start in an existing Git repo
git init
dvc init
3) Repo Structure
Code in Git, big data in DVC. Metafiles (.dvc, dvc.yaml, dvc.lock) live in Git; actual blobs go to the cache/remote.
.git/
.dvc/ # cache config
data/ # large files tracked by DVC
models/ # model artifacts tracked by DVC
4) DVC vs Git LFS
Git LFS keeps blobs in Git hosting; DVC uses your object storage (S3/GS/Azure/etc.) and adds pipelines/metrics/experiments—more MLOps features.
# DVC stores pointers, not blobs, in Git
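For illustration, the pointer committed to Git is a tiny .dvc metafile like the one below (the hash, size, and filename are made up):

```yaml
# data/raw.csv.dvc — lives in Git; the blob itself lives in the cache/remote
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d   # illustrative content hash
  size: 1048576
  path: raw.csv
```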
5) Remotes (Storage)
Remotes hold actual data: S3, GCS, Azure, SSH, WebDAV, local NAS, etc. You can have multiple; set a default.
dvc remote add -d storage s3://my-bucket/proj
dvc remote modify storage profile default
6) What DVC Tracks
Datasets, models, intermediate artifacts, metrics (.json/.csv/.tsv), plots (.json/.tsv), and parameters (params.yaml).
dvc add data/raw/imagenet
7) Core Files
.dvc (single file tracking for non-pipeline artifacts), dvc.yaml (pipeline), dvc.lock (frozen state for reproducibility).
git add data/raw.dvc dvc.yaml dvc.lock
8) Install & Init
Install DVC and initialize in your Git repo. Commit the created files.
pipx install "dvc[all]" # quotes stop the shell from expanding [all]
dvc init
git commit -m "Init DVC"
9) Terminology
Stage (a pipeline step), deps (inputs), outs (outputs), params (tunable vars), cache (deduped content store).
dvc stage add -n train -d data -d code -o models python train.py
10) Q&A — “When prefer DVC over plain Git?”
Answer: When files are big or change often, when you need pipelines/experiments, or want cheap object storage rather than bloating your Git repo.
11) dvc add
Track a dataset or file without a pipeline. Produces a .dvc metafile and stores content in the cache; dvc push then uploads it to the remote.
dvc add data/raw.csv
git add data/raw.csv.dvc .gitignore
git commit -m "Track raw data"
12) dvc commit
After changing an output manually, update DVC metadata to record the new hash.
# Modify generated file, then:
dvc commit data/raw.csv.dvc
13) Push / Pull / Fetch
Sync data with the default remote. push uploads cache objects; pull downloads and checks out needed files; fetch downloads into the cache without placing files in the workspace.
dvc push
dvc pull
dvc fetch
14) Status
Compare workspace to the tracked state (lockfile). Great to know what needs re-running or pushing.
dvc status
dvc status -c # compare to cloud/remote
15) Diff
See what data/outs changed between Git commits, branches, or tags.
dvc diff HEAD~1 HEAD
16) Parameters
Store tunables in params.yaml; DVC tracks which params each stage uses and diffs changes.
# params.yaml
train:
  lr: 0.001
  epochs: 10
17) Metrics
Log metrics as JSON/CSV/TSV; DVC can show and diff them across commits/experiments.
dvc metrics show
dvc metrics diff main exp-123
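Conceptually, a metrics diff just compares two flat JSON files key by key; a minimal sketch (metric names and values are made up):

```python
import json

def metrics_diff(old: dict, new: dict) -> dict:
    """Report old/new/delta per changed metric, like a metrics diff does conceptually."""
    diff = {}
    for key in sorted(set(old) | set(new)):
        a, b = old.get(key), new.get(key)
        if a != b:
            entry = {"old": a, "new": b}
            if isinstance(a, (int, float)) and isinstance(b, (int, float)):
                entry["diff"] = b - a          # numeric metrics get a delta
            diff[key] = entry
    return diff

# Example: metrics.json contents from two revisions
main_metrics = {"auc": 0.91, "loss": 0.34}
exp_metrics = {"auc": 0.94, "loss": 0.29}
print(json.dumps(metrics_diff(main_metrics, exp_metrics), indent=2))
```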
18) Plots
Track curves (loss, ROC, PR). Generate HTML/JSON plots and diff across revisions.
dvc plots diff -t simple main HEAD
19) Data Sharing
Use dvc get to fetch a file/dir from another DVC repo; dvc import to link and update later; import-url for external URLs.
dvc get https://github.com/org/data-registry data/iris.csv
dvc import https://github.com/org/data-registry data/iris.csv
20) Q&A — “How do teammates get the data?”
Answer: They pull code with Git, then run dvc pull (with the right remote creds). DVC recreates files from the shared cache/remote.
21) Stages (Basics)
Define a stage with deps/outs/params and a command. DVC executes and tracks outputs.
dvc stage add -n preprocess \
-d data/raw.csv -o data/clean.csv \
python scripts/preprocess.py
22) dvc.yaml Structure
Stages are stored in dvc.yaml. The lockfile (dvc.lock) records exact hashes for reproducibility.
stages:
  preprocess:
    cmd: python scripts/preprocess.py
    deps: [data/raw.csv, scripts/preprocess.py]
    outs: [data/clean.csv]
23) outs, outs_no_cache, persist
outs go to the cache; no-cache outputs (--outs-no-cache on dvc stage add) stay on disk only; persist keeps an output between runs instead of deleting it before repro. In dvc.yaml these become per-output flags:
outs:
  - reports/large.html:
      cache: false
  - checkpoints/:
      persist: true
24) Cache
Content-addressed store under .dvc/cache. Dedupes identical files across commits/branches.
du -sh .dvc/cache # monitor size
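The idea behind the cache can be sketched in a few lines: content is stored once under a path derived from its hash, so identical files are never duplicated (layout simplified versus the real cache):

```python
import hashlib
import tempfile
from pathlib import Path

def cache_put(cache_dir: Path, data: bytes) -> Path:
    """Store bytes at cache/<first-2-hex-chars>/<rest-of-md5>, skipping duplicates."""
    digest = hashlib.md5(data).hexdigest()
    obj = cache_dir / digest[:2] / digest[2:]
    if not obj.exists():                       # identical content stored only once
        obj.parent.mkdir(parents=True, exist_ok=True)
        obj.write_bytes(data)
    return obj

cache = Path(tempfile.mkdtemp())
p1 = cache_put(cache, b"same bytes")
p2 = cache_put(cache, b"same bytes")           # second add dedupes to the same object
assert p1 == p2
```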
25) Reproduce
Rerun necessary stages when deps/params change. DVC determines minimal work.
dvc repro # rebuild affected stages only
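A minimal sketch of the invalidation rule behind repro: a stage reruns only when some dependency's current hash differs from what the lockfile recorded (simplified; real DVC also checks params, outs, and the command):

```python
import hashlib
from pathlib import Path

def file_md5(path: Path) -> str:
    # Hash a dependency's content (DVC records content hashes in dvc.lock)
    return hashlib.md5(path.read_bytes()).hexdigest()

def stage_is_stale(deps: list[Path], locked: dict[str, str]) -> bool:
    # Rerun only if any dep's current hash no longer matches the recorded one
    return any(file_md5(p) != locked.get(str(p)) for p in deps)
```

If nothing is stale, the stage is skipped, which is why repro only rebuilds affected stages.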
26) Experiments (exp run)
Run variations without committing to Git immediately. Track params, metrics, and artifacts.
dvc exp run -S train.lr=0.0005 -S train.epochs=20
27) Manage Experiments
List, show diffs, apply the best, or branch it.
dvc exp show
dvc exp diff
dvc exp apply <exp>
dvc exp branch <exp> best-lr
28) Queued & Parallel Exps
Queue multiple param sets and run in sequence or parallel (on CI/compute).
dvc exp run -S a=1 -S b=2 --queue
dvc exp run --run-all # newer DVC: dvc queue start
29) Checkpoints & dvclive
Log metrics per step (checkpoints) and visualize learning curves. Useful for long DL training.
pip install dvclive
from dvclive import Live

with Live() as live:
    for step in range(100):
        loss = train_one_step()  # your training step (placeholder)
        live.log_metric("loss", loss)
        live.next_step()
30) Q&A — “Pipelines vs Notebooks?”
Answer: Notebooks are great for exploration; DVC pipelines codify steps for reproducibility, automation, CI, and collaboration.
31) Configure S3 Remote
Use AWS creds/profile; set as default. Server-side encryption can be handled by S3 config/policy.
dvc remote add -d s3store s3://ml-bucket/proj
dvc remote modify s3store profile default
32) Other Remotes
Azure, GCS, SSH, WebDAV, local path (NAS). Choose based on team infra and cost.
dvc remote add azstore azure://cont/path
dvc remote add gsstore gs://bucket/path
dvc remote add nfs /mnt/mlstore
33) Credentials & Security
Keep secrets in env vars or vaults; don’t commit keys. Use IAM roles/managed identities when possible.
export AWS_PROFILE=ml-team
export AZURE_STORAGE_CONNECTION_STRING=***
34) Team Workflow
Git for code/metadata; DVC for data. Open PRs with dvc.lock changes, push artifacts to the remote; reviewers run dvc pull locally.
git push origin feature-x
dvc push
35) DVC Studio (Optional)
Visualize experiments/metrics across branches and PRs; compare models; share dashboards with stakeholders.
# Connect repo + remote; experiments auto-sync
36) CI with CML / GitHub Actions
Automate dvc pull, dvc repro, dvc push in CI runners; post results to the PR.
# .github/workflows/ml.yml (snippet)
- uses: iterative/setup-dvc@v1
- run: dvc pull
- run: dvc repro
- run: dvc push
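A fuller sketch of such a workflow (job name, runner, and secret names are placeholders; adapt to your cloud):

```yaml
# .github/workflows/ml.yml — illustrative sketch
name: ml
on: [pull_request]
jobs:
  repro:
    runs-on: ubuntu-latest
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: iterative/setup-dvc@v1
      - run: dvc pull        # fetch data for this branch
      - run: dvc repro       # rerun only affected stages
      - run: dvc push        # upload new artifacts to the remote
```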
37) Orchestration
Wrap dvc repro in Makefiles or task runners; schedule with Airflow/Argo if needed.
make train # runs: dvc repro train
38) Partial Retrieval
Grab a single file/dir from a specific revision without cloning repo history.
dvc get <repo-url> path/to/file -r v1.2.0
39) Performance Tuning
Choose cache link type (reflink/hardlink/symlink/copy), enable parallel jobs, avoid giant monolithic files; shard datasets.
dvc config cache.type symlink,hardlink,copy
dvc pull -j 8
40) Q&A — “Huge datasets (TBs)?”
Answer: Store in object storage, use import-url for external sources, shard by directories, and pull only the needed subsets for each job.
41) Repo Patterns
Single-repo (code + data pointers) or data-registry pattern (a central data repo consumed via dvc import). Pick based on team size and reuse.
# Data registry → many model repos consume
42) Reproducibility Tips
Pin package versions, set seeds, capture params, and lock data/model hashes in dvc.lock.
python -m pip freeze > requirements.txt
echo "seed: 42" >> params.yaml
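A quick Python check of why seeding matters for reproducibility (frameworks like NumPy and PyTorch have their own seed calls):

```python
import random

def sample_run(seed: int, n: int = 5) -> list[float]:
    rng = random.Random(seed)          # isolated, seeded generator
    return [rng.random() for _ in range(n)]

# Same seed -> identical numbers on every run; different seed -> different numbers
assert sample_run(42) == sample_run(42)
assert sample_run(42) != sample_run(7)
```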
43) Tracking vs MLflow
DVC focuses on data/pipelines/versioning; MLflow specializes in experiment UI/registry. Many teams use both (e.g., dvclive → MLflow).
# dvclive logs → parse in MLflow dashboard if desired
44) Common Pitfalls
Committing big blobs to Git, forgetting dvc push, breaking cache links, non-deterministic pipelines, and storing secrets in the repo.
Fix: .gitignore + env vars + pinned deps
45) Garbage Collection
Clean unused cache from disk/remote while keeping objects referenced by workspace/branches/tags.
dvc gc -w -a -T # keep workspace, all branches, tags
46) Import & Update
Link external data repos and update later as they evolve.
dvc import https://github.com/org/data-registry data/images
dvc update data/images.dvc
47) Troubleshooting
Diagnose environment and remote issues; verify creds and remote path.
dvc doctor
dvc remote list
dvc status -c
48) Production Checklist
- Remote configured + IAM/keys secure
- Cache link type optimized
- Params/metrics/plots standardized
- CI runs dvc repro + pushes artifacts
- GC policy documented
- Data registry or import strategy defined
49) Sample Pipeline
End-to-end stages from data prep to evaluation.
dvc stage add -n preprocess -d data/raw -o data/clean python prep.py
dvc stage add -n featurize -d data/clean -o data/feats.npy python feats.py
dvc stage add -n train -d data/feats.npy -p train.lr,train.epochs -o models/model.pkl python train.py
dvc stage add -n eval -d models/model.pkl -d data/feats.npy -M metrics.json python eval.py
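For reference, those four commands would generate a dvc.yaml roughly like the following (a sketch; exact formatting may differ, and -M records the metrics file uncached):

```yaml
stages:
  preprocess:
    cmd: python prep.py
    deps: [data/raw]
    outs: [data/clean]
  featurize:
    cmd: python feats.py
    deps: [data/clean]
    outs: [data/feats.npy]
  train:
    cmd: python train.py
    deps: [data/feats.npy]
    params: [train.lr, train.epochs]
    outs: [models/model.pkl]
  eval:
    cmd: python eval.py
    deps: [models/model.pkl, data/feats.npy]
    metrics:
      - metrics.json:
          cache: false
```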
50) Interview Q&A — 20 Practical Questions (Expanded)
1) Why DVC for ML? Git-native workflows with reproducible pipelines and large-file versioning.
2) How does DVC store data? Content-addressed cache and a remote (e.g., S3); Git stores only pointers.
3) DVC vs Git LFS? DVC adds pipelines/experiments/metrics and uses your own object storage.
4) What’s in dvc.yaml? Stages with cmd, deps, outs, params, metrics, plots.
5) When to use outs_no_cache? For large artifacts you don’t want deduped in cache (e.g., reports).
6) How do experiments differ from branches? Lightweight, transient by default; can be applied or branched when chosen.
7) How to share data? Push to the remote; teammates run dvc pull.
8) How to compare runs? dvc metrics diff and dvc plots diff across commits/experiments.
9) Can I run in CI? Yes—pull, repro, push; use CML/GitHub Actions.
10) How to pin params? Store in params.yaml; DVC tracks and diffs them.
11) Handle secrets? Env vars/vault, never commit credentials; rely on cloud IAM.
12) Cache too big? Run dvc gc, prune old branches/tags, shard datasets.
13) Partial download? dvc get fetches single files/dirs at a revision.
14) Repro slow? Ensure stages produce minimal outs and use smart deps; leverage parallelism.
15) Roll back a dataset? Check out an older Git commit and run dvc checkout (or dvc pull if the objects aren’t cached locally).
16) Track notebooks? Yes—treat them as deps; better yet, convert steps to scripts for pipelines.
17) Multiple remotes? Add several and push selectively (e.g., cloud + on-prem).
18) Resolve merge conflicts? Regular Git for code; re-run DVC stages and commit the updated dvc.lock.
19) What is dvc.lock? A frozen snapshot of the exact data/params/deps used in a run.
20) Auditing results? Keep metrics/plots in Git; PRs show diffs and provenance.