DVC (Data Version Control) Pocket Book — Uplatz

50 deep-dive flashcards • Wide layout • Fewer scrolls • 20+ Interview Q&A • Readable code examples

Section 1 — Fundamentals

1) What is DVC?

Data Version Control (DVC) brings Git-like versioning to large data, models, and ML pipelines. It tracks files via metafiles and stores content in remote object storage, enabling reproducible experiments and collaboration.

pip install dvc
dvc --version
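Running dvc add on a file writes a small YAML metafile next to it; Git versions this pointer while the actual blob goes to the cache/remote. A sketch of such a metafile, with illustrative hash and size values:

```yaml
# data/raw.csv.dvc (illustrative values)
outs:
- md5: a304afb96060aad90176268345e10355
  size: 37891
  path: raw.csv
```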

2) Why DVC? Strengths & Tradeoffs

Strengths: Git-native workflow, reproducible pipelines, data/model sharing, cheap cloud storage, experiment tracking. Tradeoffs: extra commands to learn; remote and cache need setup discipline.

# Start in an existing Git repo
git init
dvc init

3) Repo Structure

Code in Git, big data in DVC. Metafiles (.dvc, dvc.yaml, dvc.lock) live in Git; actual blobs go to cache/remote.

.git/
.dvc/        # cache config
data/        # large files tracked by DVC
models/      # model artifacts tracked by DVC

4) DVC vs Git LFS

Git LFS keeps blobs in Git hosting; DVC uses your object storage (S3/GS/Azure/etc.) and adds pipelines/metrics/experiments—more MLOps features.

# DVC stores pointers, not blobs, in Git

5) Remotes (Storage)

Remotes hold actual data: S3, GCS, Azure, SSH, WebDAV, local NAS, etc. You can have multiple; set a default.

dvc remote add -d storage s3://my-bucket/proj
dvc remote modify storage profile default

6) What DVC Tracks

Datasets, models, intermediate artifacts, metrics (.json/.csv/.tsv), plots (.json/.tsv), and parameters (params.yaml).

dvc add data/raw/imagenet

7) Core Files

.dvc (single file tracking for non-pipeline artifacts), dvc.yaml (pipeline), dvc.lock (frozen state for reproducibility).

git add data/raw.dvc dvc.yaml dvc.lock

8) Install & Init

Install DVC and initialize in your Git repo. Commit the created files.

pipx install 'dvc[all]'   # quote extras so the shell doesn't glob the brackets
dvc init
git commit -m "Init DVC"

9) Terminology

Stage (a pipeline step), deps (inputs), outs (outputs), params (tunable vars), cache (deduped content store).

dvc stage add -n train -d data -d code -o models python train.py

10) Q&A — “When prefer DVC over plain Git?”

Answer: When files are big or change often, when you need pipelines/experiments, or want cheap object storage rather than bloating your Git repo.

Section 2 — Core Commands & Tracking

11) dvc add

Track a dataset or file without a pipeline. Produces a .dvc metafile and pushes content to cache/remote.

dvc add data/raw.csv
git add data/raw.csv.dvc .gitignore
git commit -m "Track raw data"

12) dvc commit

After changing an output manually, update DVC metadata to record the new hash.

# Modify generated file, then:
dvc commit data/raw.csv.dvc

13) Push / Pull / Fetch

Sync data with the default remote: push uploads cache objects; pull downloads them and checks files out into the workspace; fetch downloads into the cache only, without touching the workspace.

dvc push
dvc pull
dvc fetch

14) Status

Compare workspace to the tracked state (lockfile). Great to know what needs re-running or pushing.

dvc status
dvc status -c   # compare to cloud/remote

15) Diff

See what data/outs changed between Git commits, branches, or tags.

dvc diff HEAD~1 HEAD

16) Parameters

Store tunables in params.yaml; DVC tracks usage and diffs changes.

# params.yaml
train:
  lr: 0.001
  epochs: 10
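A stage opts into specific params in dvc.yaml; DVC then reruns the stage only when those keys change. A minimal sketch:

```yaml
stages:
  train:
    cmd: python train.py
    params:
      - train.lr
      - train.epochs
```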

17) Metrics

Log metrics as JSON/CSV/TSV; DVC can show and diff them across commits/experiments.

dvc metrics show
dvc metrics diff main exp-123
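A metrics file is just JSON/CSV/TSV that your evaluation script writes; DVC only needs the path declared (e.g., via -M metrics.json). A minimal Python sketch with made-up values:

```python
import json

# Illustrative values; in practice compute these on your eval set
metrics = {"accuracy": 0.91, "loss": 0.23}

# DVC reads any JSON/CSV/TSV file declared as a metric in dvc.yaml
with open("metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
```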

18) Plots

Track curves (loss, ROC, PR). Generate HTML/JSON plots and diff across revisions.

dvc plots diff -t simple main HEAD

19) Data Sharing

Use dvc get to fetch a file/dir from another DVC repo; dvc import to link and update later; import-url for external URLs.

dvc get https://github.com/org/data-registry data/iris.csv
dvc import https://github.com/org/data-registry data/iris.csv

20) Q&A — “How do teammates get the data?”

Answer: They pull code with Git, then run dvc pull (with the right remote creds). DVC recreates files from the shared cache/remote.

Section 3 — Pipelines & Experiments

21) Stages (Basics)

Define a stage with deps/outs/params and a command. DVC executes and tracks outputs.

dvc stage add -n preprocess \
  -d data/raw.csv -d scripts/preprocess.py \
  -o data/clean.csv \
  python scripts/preprocess.py

22) dvc.yaml Structure

Stages are stored in dvc.yaml. Lockfile has exact hashes for reproducibility.

stages:
  preprocess:
    cmd: python scripts/preprocess.py
    deps: [data/raw.csv, scripts/preprocess.py]
    outs: [data/clean.csv]

23) outs, outs_no_cache, persist

outs are copied into the cache and tracked; outs_no_cache are tracked in dvc.yaml but never cached (the file exists only in the workspace); persist: true stops DVC from deleting the output before each rerun, enabling incremental outputs.

outs_no_cache:
  - reports/large.html

24) Cache

Content-addressed store under .dvc/cache. Dedupes identical files across commits/branches.

du -sh .dvc/cache   # monitor size
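The cache is content-addressed: a file's hash determines its path, so identical content is stored exactly once regardless of how many commits reference it. A rough sketch of the DVC 3.x file layout (first two hex chars of the MD5 become a subdirectory; this simplifies the real implementation):

```python
import hashlib

def cache_path(data: bytes) -> str:
    # Hash the content, then split the hex digest into dir + filename
    md5 = hashlib.md5(data).hexdigest()
    return f".dvc/cache/files/md5/{md5[:2]}/{md5[2:]}"

# Identical bytes always map to the same cache entry (dedup for free)
example = cache_path(b"hello")
```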

25) Reproduce

Rerun necessary stages when deps/params change. DVC determines minimal work.

dvc repro   # rebuild affected stages only

26) Experiments (exp run)

Run variations without committing to Git immediately. Track params, metrics, and artifacts.

dvc exp run -S train.lr=0.0005 -S train.epochs=20

27) Manage Experiments

List, show diffs, apply the best, or branch it.

dvc exp show
dvc exp diff
dvc exp apply <exp>
dvc exp branch <exp> best-lr

28) Queued & Parallel Exps

Queue multiple param sets and run in sequence or parallel (on CI/compute).

dvc exp run -S train.lr=0.01 --queue
dvc exp run -S train.lr=0.001 --queue
dvc exp run --run-all -j 2   # execute the queue, 2 at a time

29) Checkpoints & dvclive

Log metrics per step (checkpoints) and visualize learning curves. Useful for long DL training.

pip install dvclive

# train.py
from dvclive import Live

with Live() as live:
    for step in range(100):
        loss = train_one_step()   # hypothetical: your training step here
        live.log_metric("loss", loss)
        live.next_step()

30) Q&A — “Pipelines vs Notebooks?”

Answer: Notebooks are great for exploration; DVC pipelines codify steps for reproducibility, automation, CI, and collaboration.

Section 4 — Remotes, CI/CD, Performance & Collaboration

31) Configure S3 Remote

Use AWS creds/profile; set as default. Server-side encryption can be handled by S3 config/policy.

dvc remote add -d s3store s3://ml-bucket/proj
dvc remote modify s3store profile default

32) Other Remotes

Azure, GCS, SSH, WebDAV, local path (NAS). Choose based on team infra and cost.

dvc remote add azstore azure://cont/path
dvc remote add gsstore gs://bucket/path
dvc remote add nfs /mnt/mlstore

33) Credentials & Security

Keep secrets in env vars or vaults; don’t commit keys. Use IAM roles/managed identities when possible.

export AWS_PROFILE=ml-team
export AZURE_STORAGE_CONNECTION_STRING=***

34) Team Workflow

Git for code/metadata; DVC for data. Open PRs with dvc.lock changes, push artifacts to remote, reviewers run dvc pull locally.

git push origin feature-x
dvc push

35) DVC Studio (Optional)

Visualize experiments/metrics across branches and PRs; compare models; share dashboards with stakeholders.

# Connect repo + remote; experiments auto-sync

36) CI with CML / GitHub Actions

Automate dvc pull, dvc repro, dvc push in CI runners; post results to PR.

# .github/workflows/ml.yml (snippet)
- uses: iterative/setup-dvc@v1
- run: dvc pull
- run: dvc repro
- run: dvc push

37) Orchestration

Wrap dvc repro in Makefiles or task runners; schedule with Airflow/Argo if needed.

make train   # runs: dvc repro train

38) Partial Retrieval

Grab a single file/dir from a specific revision without cloning repo history.

dvc get <repo-url> path/to/file --rev v1.2.0

39) Performance Tuning

Choose cache link type (reflink/hardlink/symlink/copy), enable parallel jobs, avoid giant monolithic files; shard datasets.

dvc config cache.type reflink,hardlink,symlink,copy
dvc pull -j 8

40) Q&A — “Huge datasets (TBs)?”

Answer: Store in object storage, use import-url for external sources, shard by directories, and pull only needed subsets for each job.

Section 5 — MLOps, Maintenance, Troubleshooting & Interview Q&A

41) Repo Patterns

Single-repo (code+data pointers) or data-registry pattern (central data repo consumed via dvc import). Pick based on team size and reuse.

# Data registry → many model repos consume

42) Reproducibility Tips

Pin package versions, set seeds, capture params, and lock data/model hashes in dvc.lock.

python -m pip freeze > requirements.txt
echo "seed: 42" >> params.yaml
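Seeding can be wrapped in one helper called at the top of every stage script; a stdlib-only sketch (extend it with numpy/torch seeding if you use those libraries):

```python
import os
import random

def set_seed(seed: int) -> None:
    # Seed Python's RNG and pin hash randomization for subprocesses
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

set_seed(42)
first = random.random()
set_seed(42)
assert first == random.random()  # same seed, same draw
```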

43) Tracking vs MLflow

DVC focuses on data/pipelines/versioning; MLflow specializes in experiment UI/registry. Many teams use both (e.g., dvclive → MLflow).

# dvclive logs → parse in MLflow dashboard if desired

44) Common Pitfalls

Committing big blobs to Git, forgetting dvc push, breaking cache links, non-deterministic pipelines, and storing secrets in repo.

Fix: .gitignore + env vars + pinned deps

45) Garbage Collection

Clean unused cache from disk/remote while keeping objects referenced by workspace/branches/tags.

dvc gc -w -a -T   # keep workspace, all branches, tags

46) Import & Update

Link external data repos and update later as they evolve.

dvc import https://github.com/org/data-registry data/images
dvc update data/images.dvc

47) Troubleshooting

Diagnose environment and remote issues; verify creds and remote path.

dvc doctor
dvc remote list
dvc status -c

48) Production Checklist

  • Remote configured + IAM/keys secure
  • Cache link type optimized
  • Params/metrics/plots standardized
  • CI runs dvc repro + pushes artifacts
  • GC policy documented
  • Data registry or import strategy defined

49) Sample Pipeline

End-to-end stages from data prep to evaluation.

dvc stage add -n preprocess -d data/raw -o data/clean python prep.py
dvc stage add -n featurize -d data/clean -o data/feats.npy python feats.py
dvc stage add -n train -d data/feats.npy -p train.lr,train.epochs -o models/model.pkl python train.py
dvc stage add -n eval -d models/model.pkl -d data/feats.npy -M metrics.json python eval.py
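The commands above generate a dvc.yaml roughly like this (abridged to three stages; -M becomes a metric with cache: false):

```yaml
stages:
  preprocess:
    cmd: python prep.py
    deps: [data/raw]
    outs: [data/clean]
  train:
    cmd: python train.py
    deps: [data/feats.npy]
    params: [train.lr, train.epochs]
    outs: [models/model.pkl]
  eval:
    cmd: python eval.py
    deps: [models/model.pkl, data/feats.npy]
    metrics:
      - metrics.json:
          cache: false
```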

50) Interview Q&A — 20 Practical Questions (Expanded)

1) Why DVC for ML? Git-native workflows with reproducible pipelines and large-file versioning.

2) How does DVC store data? Content-addressed cache and a remote (e.g., S3); Git stores only pointers.

3) DVC vs Git LFS? DVC adds pipelines/experiments/metrics and uses your own object storage.

4) What’s in dvc.yaml? Stages with cmd, deps, outs, params, metrics, plots.

5) When to use outs_no_cache? For artifacts you don’t want copied into the cache, e.g., large reports you can cheaply regenerate.

6) How do experiments differ from branches? Lightweight, transient by default; can be applied or branched when chosen.

7) How to share data? Push to remote; teammates run dvc pull.

8) How to compare runs? dvc metrics diff and dvc plots diff across commits/experiments.

9) Can I run in CI? Yes—pull, repro, push; use CML/GitHub Actions.

10) How to pin params? Store in params.yaml; DVC tracks and diffs.

11) Handle secrets? Env vars/vault, never commit credentials; rely on cloud IAM.

12) Cache too big? Run dvc gc, prune old branches/tags, shard datasets.

13) Partial download? dvc get for single files/dirs at a revision.

14) Repro slow? Ensure stages produce minimal outs and use smart deps; leverage parallelism.

15) Roll back a dataset? Checkout older Git commit and dvc pull.

16) Track notebooks? Yes—treat them as deps; better yet, convert steps to scripts for pipelines.

17) Multiple remotes? Add several and push selectively (e.g., cloud + on-prem).

18) Resolve merge conflicts? Regular Git for code; re-run DVC stages and commit updated dvc.lock.

19) What is dvc.lock? Frozen snapshot of exact data/params/deps used in a run.

20) Auditing results? Keep metrics/plots in Git; PRs show diffs and provenance.
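For Q19, a dvc.lock entry pins the exact command, dependency hashes, param values, and output hashes of a run; a sketch with illustrative hashes and sizes:

```yaml
schema: '2.0'
stages:
  train:
    cmd: python train.py
    deps:
    - path: data/feats.npy
      md5: 1a2b3c4d5e6f7890aabbccddeeff0011
      size: 204800
    params:
      params.yaml:
        train.lr: 0.001
        train.epochs: 10
    outs:
    - path: models/model.pkl
      md5: ffeeddccbbaa00998877665544332211
      size: 512000
```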