FAISS Pocket Book — Uplatz

50 deep-dive flashcards • Wide layout • Fewer scrolls • 20+ Interview Q&A • Readable Python/C++ snippets

Section 1 — Fundamentals

1) What is FAISS?

Facebook AI Similarity Search (FAISS) is a C++/Python library for efficient vector similarity search and clustering. It offers exact and approximate indexes, CPU/GPU backends, and composable building blocks (quantizers, IVF, PQ, HNSW) to scale k‑NN search to billions of embeddings.

pip install faiss-cpu   # or faiss-gpu

2) Metrics: L2 vs Inner Product vs Cosine

FAISS natively supports L2 (Euclidean) and inner‑product metrics. For cosine similarity, L2‑normalize base and query vectors, then use inner product; on unit vectors ‖a−b‖² = 2 − 2⟨a,b⟩, so L2 gives the same ranking.

import faiss, numpy as np
x = np.random.randn(1000, 768).astype('float32')
faiss.normalize_L2(x)
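
A minimal end‑to‑end cosine sketch (the 768‑dim setup above; q is a hypothetical query batch): normalize both sides once, then search an inner‑product index.

q = np.random.randn(5, 768).astype('float32')
faiss.normalize_L2(q)                     # queries must be normalized the same way
index = faiss.IndexFlatIP(768)
index.add(x)
D, I = index.search(q, 10)                # D holds cosine similarities on unit vectors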

3) Exact Search (Flat)

IndexFlatL2/IndexFlatIP perform brute‑force exact k‑NN; simplest and strong baseline for recall evaluation.

index = faiss.IndexFlatL2(d)
index.add(x)  # add base vectors
D, I = index.search(q, k)

4) ID Management

Use IndexIDMap/IndexIDMap2 to attach your own 64‑bit integer IDs to vectors; otherwise FAISS assigns sequential IDs 0..n−1. IndexIDMap2 additionally supports reconstruct() by custom ID.

index = faiss.IndexIDMap(faiss.IndexFlatIP(d))
index.add_with_ids(x, ids)

5) Training vs Adding

IVF and PQ indexes require training on representative samples via index.train(x_train) before add(); plain graph indexes such as HNSW‑Flat do not need a training step.

index = faiss.index_factory(d, 'IVF4096,PQ64')
index.train(x_train)
index.add(x)

6) IndexFactory

Build complex indexes from strings (e.g., “IVF4096,PQ64”, “HNSW32,Flat”) for rapid prototyping.

index = faiss.index_factory(d, 'IVF2048,PQ32', faiss.METRIC_L2)

7) IVFFlat

Inverted File (IVF) partitions the space into nlist Voronoi cells via k‑means; probes a subset (nprobe) at query time.

nlist = 4096                       # number of Voronoi cells
quant = faiss.IndexFlatL2(d)       # coarse quantizer over the centroids
index = faiss.IndexIVFFlat(quant, d, nlist)
index.train(x_train)
index.add(x)
index.nprobe = 16                  # cells scanned per query

8) Product Quantization (PQ)

Compress each vector into m subquantizer codes of nbits each (e.g., m=64 at 8 bits ⇒ 64 bytes/vector). Huge memory savings with modest recall loss.

index = faiss.IndexIVFPQ(quant, d, nlist, 64, 8)  # m=64 subquantizers, nbits=8 (positional args)

9) OPQ (Rotation)

Optimized PQ learns a rotation to reduce quantization error before PQ.

index = faiss.index_factory(d, 'OPQ64,IVF4096,PQ64')

10) Q&A — “Cosine or IP?”

Answer: Cosine similarity over unit‑norm vectors equals inner product. Normalize once at ingest and query for best performance.

Section 2 — Advanced Indexes, HNSW & GPU

11) HNSW Graph

Hierarchical Navigable Small World graphs deliver strong recall/latency trade‑offs; parameters: M (degree), efConstruction, efSearch.

index = faiss.IndexHNSWFlat(d, 32)  # M=32 neighbors per node (positional arg)
index.hnsw.efSearch = 128

12) IVF‑HNSW‑PQ

Use an HNSW graph as the IVF coarse quantizer (fast assignment across many lists) plus PQ codes for large‑scale ANN under limited RAM.

index = faiss.index_factory(d, 'IVF4096_HNSW32,PQ64')

13) Scalar Quantization (SQ8)

Per‑component quantization to 8 bits; smaller than float32 while enabling fast distance computation.

index = faiss.index_factory(d, 'IVF4096,SQ8')

14) GPU Basics

Use StandardGpuResources and clone from CPU to GPU. Multiple GPUs support sharding/replication.

res = faiss.StandardGpuResources()
cpu = faiss.IndexFlatIP(d)
gpu = faiss.index_cpu_to_gpu(res, 0, cpu)

15) GPU IVFPQ

Train on CPU, then move to GPU or train directly on GPU for speed with large samples.

cpu = faiss.index_factory(d, 'IVF65536,PQ64')
cpu.train(x_train)
gpu = faiss.index_cpu_to_gpu(res, 0, cpu)

16) Sharding vs Replication

Shard to scale capacity (split vectors across GPUs); replicate to scale throughput (same vectors on many GPUs). Use IndexShards/IndexReplicas.

res0, res1 = faiss.StandardGpuResources(), faiss.StandardGpuResources()  # one per GPU
rep = faiss.IndexReplicas()
rep.addIndex(faiss.index_cpu_to_gpu(res0, 0, cpu))
rep.addIndex(faiss.index_cpu_to_gpu(res1, 1, cpu))

17) IVF Hyperparameters

nlist ~ sqrt(N) as a starting point; tune nprobe for recall/latency. More nlist increases training and memory; higher nprobe increases query cost.

index.nprobe = 8  # try 1, 4, 8, 16, 32…

18) PQ Hyperparameters

Choose m (subquantizers) and nbits per code to fit memory/speed targets. Typical: m∈{32,64}, nbits∈{6,8}.

index = faiss.IndexIVFPQ(quant, d, 4096, 64, 8)

19) HNSW Tuning

Larger efSearch ⇒ higher recall but slower; larger M increases memory and build time but can improve quality.

index.hnsw.efConstruction = 200
index.hnsw.efSearch = 128

20) Q&A — “GPU or CPU?”

Answer: Use GPU for very large indexes/training and high QPS; CPU is simpler and often sufficient up to tens of millions. Benchmark with your vectors and latency SLOs.

Section 3 — Ingestion, Search, Persistence & Filtering

21) Add & Search

Typical loop: train (if needed) → add vectors → search; batch operations for throughput.

index.add(x)
D, I = index.search(q, k)

22) Range Search

Retrieve neighbors within a distance threshold; useful for dedup/near‑duplicates.

lims, D, I = index.range_search(q, radius)

23) Removing Vectors

Some indexes support remove_ids(); otherwise mark deleted and rebuild periodically.

sel = faiss.IDSelectorBatch(np.array([42, 77], 'int64'))
index.remove_ids(sel)

24) Bitset / IDSelector Filtering

Filter search results to a subset of IDs by passing an IDSelector through SearchParameters at query time (supported on most index types).

mask = faiss.IDSelectorRange(1000, 2000)
params = faiss.SearchParametersIVF(sel=mask)  # use the SearchParameters subclass matching your index
D, I = index.search(q, k, params=params)

25) Save/Load Index

Persist to disk with write_index/read_index; remember to store your own metadata elsewhere.

faiss.write_index(index, 'vectors.faiss')
index = faiss.read_index('vectors.faiss')

26) On‑Disk Inverted Lists

Use on‑disk IVF for massive datasets; keep coarse quantizer in RAM. Tune OS cache and I/O scheduler.

# created via faiss.contrib.ondisk helpers (when available)
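
A minimal sketch using the faiss.contrib.ondisk helper (filenames are placeholders): merge per‑shard IVF indexes so their inverted lists live in one mmapped file.

from faiss.contrib.ondisk import merge_ondisk
index = faiss.read_index('trained.faiss')                # trained, still-empty IVF index
merge_ondisk(index, ['shard0.faiss', 'shard1.faiss'],    # populated shard indexes
             'merged.ivfdata')                           # inverted lists stored on disk here
faiss.write_index(index, 'populated.faiss')              # small index file + big .ivfdata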

27) PreTransform

Compose transforms (e.g., OPQ, PCA, L2‑norm) before an index using IndexPreTransform.

opq = faiss.OPQMatrix(d, 64)                     # learns a rotation; must be trained
index = faiss.IndexPreTransform(opq, faiss.IndexFlatIP(d))
index.train(x)                                   # trains OPQ, then the wrapped index

28) PCA for Dim Reduction

Reduce d to d′ to speed up search and cut memory; retrain any downstream PQ on the reduced space.

pca = faiss.PCAMatrix(d, 256)    # train on a representative sample first
index = faiss.IndexPreTransform(pca, faiss.IndexFlatL2(256))

29) IVF Cache & Prefetch

Warm up the coarse quantizer and hot inverted lists with representative queries, and tune nprobe; cache hot lists when possible.

index.nprobe = 32  # higher for recall, lower for latency

30) Q&A — “Where do I store metadata?”

Answer: FAISS stores only vectors and integer IDs. Keep metadata (text, JSON) in your DB; join on IDs after search.

Section 4 — Evaluation, Clustering & Production Patterns

31) Recall & Latency

Measure recall@k vs a Flat baseline and track P50/P95/P99 latencies. Optimize nprobe/efSearch for your SLOs.

# recall@k for the true nearest neighbor: is I0[:,0] among the ANN top-k?
recall = (I == I0[:, :1]).any(axis=1).mean()
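
For a fuller intersection‑based recall@k (fraction of the true top‑k recovered per query), a minimal sketch:

def recall_at_k(I_ann, I_gt, k):
    # average fraction of ground-truth top-k IDs found by the ANN index
    hits = sum(len(set(a[:k]) & set(g[:k])) for a, g in zip(I_ann, I_gt))
    return hits / (len(I_gt) * k)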

32) Ground‑Truth with Flat

Use IndexFlat on a sample to estimate upper‑bound recall and validate preprocessing consistency.

gt = faiss.IndexFlatIP(d); gt.add(x)
D0, I0 = gt.search(q, k)

33) K‑means Clustering

FAISS provides fast k‑means for large datasets (CPU/GPU). Useful for centroids and analytics.

km = faiss.Kmeans(d, k=1024, niter=20)
km.train(x)
centroids = km.centroids
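
To assign vectors to clusters after training, search the centroid index that Kmeans builds internally:

_, assign = km.index.search(x, 1)   # nearest-centroid ID per vector, shape (n, 1)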

34) Add‑Only vs Mutable

Some indexes are add‑only; for frequent deletes/updates, consider periodic rebuilds or maintain a tombstone set.

# keep a deleted-id set and filter in post-processing
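
A minimal tombstone sketch (deleted is your own bookkeeping set, not a FAISS structure): over‑fetch, then drop deleted IDs.

deleted = {42, 77}                                   # maintained by your service
D, I = index.search(q, 2 * k)                        # over-fetch to survive filtering
keep = [[i for i in row if i not in deleted][:k] for row in I]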

35) Hybrid Reranking

Retrieve top‑K with ANN, then rerank with a heavier scorer (cross‑encoder, re‑normalized cosine) for quality.

D, I = index.search(q, 100)
# fetch docs by I, rerank outside FAISS
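
A minimal rerank sketch, assuming you kept the raw base matrix xb: rescore the shortlist for query 0 with exact cosine.

cand = xb[I[0]].copy()                   # raw vectors of the 100 candidates
faiss.normalize_L2(cand)
qn = q[:1].copy(); faiss.normalize_L2(qn)
scores = cand @ qn[0]                    # exact cosine over the shortlist
top10 = I[0][np.argsort(-scores)[:10]]   # final IDs after reranking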

36) Deduplication

Use range search with a small radius on unit vectors to find near‑duplicates; keep one per cluster.

lims, D, I = index.range_search(x, 1e-4)
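
One way to keep a single representative per duplicate group (a sketch assuming sequential IDs 0..n−1):

dup = set()
for qi in range(len(lims) - 1):
    for j in I[lims[qi]:lims[qi + 1]]:
        if j > qi:                 # drop only the later member of each pair
            dup.add(int(j))
keep = [i for i in range(index.ntotal) if i not in dup]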

37) Batch Size & BLAS

Larger batch sizes improve throughput; ensure FAISS links against optimized BLAS (MKL/OpenBLAS) for CPU.

D, I = index.search(q, k)   # batch many queries per call to amortize BLAS overhead

38) Quantizer Training Set

Use millions of samples for large nlist/PQ; the training set should be random, representative, and preprocessed exactly like production data.

x_train = x[np.random.choice(len(x), 1_000_000, replace=False)]

39) Memory Planning

Estimate bytes per vector (PQ code + overhead) × N, plus IVF lists and quantizer. Validate with index.sa_code_size() when available.

# rule of thumb: PQ64@8bits ≈ 64B/vector
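
Worked example with illustrative numbers: 100M vectors as PQ64 codes plus an 8‑byte ID per entry.

N, code_bytes, id_bytes = 100_000_000, 64, 8
total_gb = N * (code_bytes + id_bytes) / 1e9    # ≈ 7.2 GB before quantizer/list overhead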

40) Q&A — “How big should nlist be?”

Answer: Start near √N (e.g., N=100M ⇒ nlist≈10K). Tune alongside nprobe to reach target recall/latency; avoid extreme values without enough training data.

Section 5 — Ops, Safety, Troubleshooting & Interview Q&A

41) Persistence & Backups

Snapshot the index file alongside your metadata DB snapshots; verify checksums and restore procedures regularly.

faiss.write_index(index, '/mnt/snapshots/idx_2025-08-26.faiss')

42) Monitoring

Track QPS, tail latency, memory, GPU utilization, training time, and recall against a canary set; alert on drops.

# expose custom metrics from your service layer

43) Concurrency & Threading

Use FAISS’s internal threading settings judiciously; also parallelize at the service layer for multi‑core CPUs.

faiss.omp_set_num_threads(8)

44) Common Errors

Mismatched dims, untrained IVF/PQ, forgetting L2 normalization for cosine, tiny nprobe, or missing ID maps. Validate pipeline end‑to‑end.

assert x.shape[1] == d
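
Two more cheap checks (index assumed in scope) catch untrained and empty indexes:

assert index.is_trained, 'call train() on IVF/PQ indexes before add()'
assert index.ntotal > 0, 'search on an empty index returns -1 IDs'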

45) Multi‑Tenant Isolation

Use ID ranges or separate indexes per tenant; filter by IDSelector or pre‑route queries.

sel = faiss.IDSelectorRange(lo, hi)
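
Combined with the SearchParameters pattern from card 24 (lo/hi are the tenant's ID bounds), a per‑query tenant filter looks like:

params = faiss.SearchParametersIVF(sel=faiss.IDSelectorRange(lo, hi))
D, I = index.search(q, k, params=params)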

46) Production Checklist

  • Vector preprocessing is identical in train/ingest/query
  • Ground‑truth Flat baseline & recall dashboard
  • Appropriate nlist/nprobe (or efSearch) tuning
  • Backups & deterministic rebuild scripts
  • Resource sizing for RAM/GPU/IO; canary tests
  • Safe rollout with shadow queries & AB tests

47) Serving Architecture

Put a stateless API in front of FAISS; cache hot queries, pool memory, and colocate with feature store/DB for low latency.

# e.g., FastAPI/Express service wrapping FAISS index

48) Evaluation Harness

Maintain fixed query sets and measure recall/latency across builds; store curves for regressions.

# log recall@k and P95/P99 per version
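
A minimal sweep sketch (q, k, and the Flat baseline I0 assumed in scope):

import time
for nprobe in (1, 4, 8, 16, 32):
    index.nprobe = nprobe
    t0 = time.perf_counter(); D, I = index.search(q, k); dt = time.perf_counter() - t0
    r = (I == I0[:, :1]).any(axis=1).mean()
    print(f'nprobe={nprobe:3d}  recall@{k}={r:.3f}  {dt / len(q) * 1e3:.2f} ms/query')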

49) Reproducibility

Fix random seeds for k‑means/PQ; record training sample hashes and index factory strings.

km = faiss.Kmeans(d, 1024, seed=1234)   # seed is forwarded to ClusteringParameters

50) Interview Q&A — 20 Practical Questions (FAISS‑focused)

1) Why IVF? Sublinear search by probing a subset of inverted lists; trades recall for speed.

2) Why PQ? Compress vectors to fit RAM/GPU and accelerate distance evals with acceptable recall loss.

3) How to use cosine? L2‑normalize vectors and use inner product.

4) nlist/nprobe intuition? nlist partitions, nprobe probes at query time; higher nprobe ⇒ better recall, slower.

5) OPQ vs PQ? OPQ rotates space to minimize quantization error, often improving recall at the same code size.

6) HNSW vs IVF? HNSW is graph‑based with efSearch; IVF is centroid‑based. Try both; hybrids exist.

7) When Flat? Small N or need exact results; also for evaluation.

8) How to delete? remove_ids where supported or rebuild; maintain tombstones.

9) How to store metadata? External DB keyed by IDs; FAISS is vectors+IDs only.

10) GPU sharding vs replication? Shard for capacity; replicate for QPS.

11) What is efSearch? HNSW parameter controlling breadth of search; higher improves recall.

12) Typical PQ code sizes? 16–64 bytes common; 64–128 bytes for higher quality.

13) What is IndexFactory? String grammar to compose indexes quickly.

14) Range search use cases? Near‑duplicate detection, clustering.

15) Why PreTransform? Compose PCA/OPQ/normalization before the index for cleaner pipelines.

16) Recall measurement? Compare ANN neighbors to Flat; compute recall@k.

17) Memory planning? Estimate code size × N; include IVF/HNSW overhead.

18) Hot parameters to tune? nprobe, m/nbits, efSearch, batch size.

19) How to ensure consistency? Apply the same tokenization/normalization/projection pipeline across train/ingest/query.

20) How to pick an index? Start with Flat baseline → try IVFFlat → IVFPQ/OPQ → HNSW/Hybrids; choose by SLOs & cost.