FAISS Pocket Book — Uplatz
50 deep-dive flashcards • Wide layout • Fewer scrolls • 20+ Interview Q&A • Readable Python/C++ snippets
1) What is FAISS?
Facebook AI Similarity Search (FAISS) is a C++/Python library for efficient vector similarity search and clustering. It offers exact and approximate indexes, CPU/GPU backends, and composable building blocks (quantizers, IVF, PQ, HNSW) to scale k‑NN search to billions of embeddings.
pip install faiss-cpu # or faiss-gpu
2) Metrics: L2 vs Inner Product vs Cosine
FAISS natively supports L2 (Euclidean) and inner product. For cosine similarity, L2‑normalize vectors and use inner product: on unit vectors, cosine equals inner product, and L2 distance ranks results identically.
import faiss, numpy as np
x = np.random.randn(1000, 768).astype('float32')
faiss.normalize_L2(x)
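A minimal end‑to‑end sketch building on the snippet above (the query batch q is illustrative): normalize both sides, then search with inner product.
index = faiss.IndexFlatIP(768)
index.add(x)                          # x was normalized above
q = np.random.randn(5, 768).astype('float32')
faiss.normalize_L2(q)                 # queries must be normalized too
D, I = index.search(q, 10)            # D now holds cosine similarities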
3) Exact Search (Flat)
IndexFlatL2/IndexFlatIP perform brute‑force exact k‑NN: the simplest index and the standard baseline for recall evaluation.
index = faiss.IndexFlatL2(d)
index.add(x) # add base vectors
D, I = index.search(q, k)
4) ID Management
Use IndexIDMap/IndexIDMap2 to attach your own integer IDs to vectors; otherwise FAISS uses 0..n-1.
index = faiss.IndexIDMap(faiss.IndexFlatIP(d))
index.add_with_ids(x, ids)
5) Training vs Adding
IVF and PQ‑based indexes require training on representative samples via index.train(x_train) before add(); plain HNSWFlat needs no training.
index = faiss.index_factory(d, 'IVF4096,PQ64')
index.train(x_train)
index.add(x)
6) IndexFactory
Build complex indexes from strings (e.g., “IVF4096,PQ64”, “HNSW32,Flat”) for rapid prototyping.
index = faiss.index_factory(d, 'IVF2048,PQ32', faiss.METRIC_L2)
7) IVFFlat
Inverted File (IVF) partitions the space into nlist Voronoi cells via k‑means; probes a subset (nprobe) at query time.
quant = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quant, d, nlist)
index.train(x_train)
index.nprobe = 16
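Completing the lifecycle (a sketch reusing x, q, k from earlier cards): add only after training, then search.
index.add(x)               # add is only valid after train()
D, I = index.search(q, k)  # raise nprobe for recall, lower for latency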
8) Product Quantization (PQ)
Split each vector into m subvectors and quantize each with nbits (e.g., m=64 at 8 bits ≈ 64 bytes/vector). Huge memory savings with small recall loss.
index = faiss.IndexIVFPQ(quant, d, nlist, 64, 8)  # m=64 subquantizers, 8 bits each
9) OPQ (Rotation)
Optimized PQ learns a rotation to reduce quantization error before PQ.
index = faiss.index_factory(d, 'OPQ64,IVF4096,PQ64')
10) Q&A — “Cosine or IP?”
Answer: Cosine similarity over unit‑norm vectors equals inner product. Normalize once at ingest and query for best performance.
11) HNSW Graph
Hierarchical Navigable Small World graphs deliver strong recall/latency trade‑offs; parameters: M (degree), efConstruction, efSearch.
index = faiss.IndexHNSWFlat(d, 32)  # M=32 neighbors per node
index.hnsw.efSearch = 128
12) IVF‑HNSW‑PQ
Use an HNSW graph as the IVF coarse quantizer (fast cell assignment at large nlist) together with PQ codes, for large‑scale ANN under limited RAM.
index = faiss.index_factory(d, 'IVF4096_HNSW32,PQ64')
13) Scalar Quantization (SQ8)
Per‑component quantization to 8 bits; smaller than float32 while enabling fast distance computation.
index = faiss.index_factory(d, 'IVF4096,SQ8')
14) GPU Basics
Use StandardGpuResources and clone from CPU to GPU. Multiple GPUs support sharding/replication.
res = faiss.StandardGpuResources()
cpu = faiss.IndexFlatIP(d)
gpu = faiss.index_cpu_to_gpu(res, 0, cpu)
15) GPU IVFPQ
Train on CPU, then move to GPU or train directly on GPU for speed with large samples.
cpu = faiss.index_factory(d, 'IVF65536,PQ64')
cpu.train(x_train)
gpu = faiss.index_cpu_to_gpu(res, 0, cpu)
16) Sharding vs Replication
Shard to scale capacity (split vectors across GPUs); replicate to scale throughput (same vectors on many GPUs). Use IndexShards/IndexReplicas.
rep = faiss.IndexReplicas()
rep.addIndex(faiss.index_cpu_to_gpu(res, 0, cpu))
rep.addIndex(faiss.index_cpu_to_gpu(res, 1, cpu))
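The sharded counterpart looks similar (a sketch; part0/part1 are hypothetical CPU indexes over disjoint halves of the data):
sh = faiss.IndexShards(d)
sh.addIndex(faiss.index_cpu_to_gpu(res, 0, part0))  # shard 0 on GPU 0
sh.addIndex(faiss.index_cpu_to_gpu(res, 1, part1))  # shard 1 on GPU 1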
17) IVF Hyperparameters
nlist ~ sqrt(N) as a starting point; tune nprobe for recall/latency. More nlist increases training and memory; higher nprobe increases query cost.
index.nprobe = 8 # try 1, 4, 8, 16, 32…
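A hedged tuning loop, assuming Flat ground truth I0 from card 32:
for nprobe in (1, 4, 8, 16, 32):
    index.nprobe = nprobe
    D, I = index.search(q, k)
    print(nprobe, (I == I0[:, :1]).any(axis=1).mean())  # recall vs latency knob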
18) PQ Hyperparameters
Choose m (subquantizers) and nbits per code to fit memory/speed targets. Typical: m∈{32,64}, nbits∈{6,8}.
index = faiss.IndexIVFPQ(quant, d, 4096, 64, 8)
19) HNSW Tuning
Larger efSearch ⇒ higher recall but slower; larger M increases memory and build time but can improve quality.
index.hnsw.efConstruction = 200
index.hnsw.efSearch = 128
20) Q&A — “GPU or CPU?”
Answer: Use GPU for very large indexes/training and high QPS; CPU is simpler and often sufficient up to tens of millions. Benchmark with your vectors and latency SLOs.
21) Add & Search
Typical loop: train (if needed) → add vectors → search; batch operations for throughput.
index.add(x)
D, I = index.search(q, k)
22) Range Search
Retrieve neighbors within a distance threshold; useful for dedup/near‑duplicates.
lims, D, I = index.range_search(q, radius)
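Results come back flattened; lims[i]:lims[i+1] delimits query i's hits. A small parsing sketch:
for i in range(q.shape[0]):
    ids = I[lims[i]:lims[i + 1]]    # neighbor ids for query i
    dist = D[lims[i]:lims[i + 1]]   # matching distances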
23) Removing Vectors
Some indexes support remove_ids(); otherwise mark deleted and rebuild periodically.
sel = faiss.IDSelectorBatch(np.array([42, 77], 'int64'))
index.remove_ids(sel)
24) Bitset / IDSelector Filtering
Restrict search to a subset of IDs by passing an IDSelector through SearchParameters (per‑family classes, e.g. SearchParametersIVF for IVF indexes).
sel = faiss.IDSelectorRange(1000, 2000)
params = faiss.SearchParametersIVF(sel=sel)  # plain SearchParameters for Flat
D, I = index.search(q, k, params=params)
25) Save/Load Index
Persist to disk with write_index/read_index; remember to store your own metadata elsewhere.
faiss.write_index(index, 'vectors.faiss')
index = faiss.read_index('vectors.faiss')
26) On‑Disk Inverted Lists
Use on‑disk IVF for massive datasets; keep coarse quantizer in RAM. Tune OS cache and I/O scheduler.
# created via faiss.contrib.ondisk helpers (when available)
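A sketch using faiss.contrib.ondisk (file names are illustrative; assumes a trained empty IVF index plus populated shard indexes already written to disk):
from faiss.contrib.ondisk import merge_ondisk
index = faiss.read_index('trained_empty.faiss')
merge_ondisk(index, ['shard0.faiss', 'shard1.faiss'], 'merged.ivfdata')
faiss.write_index(index, 'populated.faiss')   # inverted lists stay on disk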
27) PreTransform
Compose transforms (e.g., OPQ, PCA, L2‑norm) before an index using IndexPreTransform.
opq = faiss.OPQMatrix(d, 64)
index = faiss.IndexPreTransform(opq, faiss.IndexFlatIP(d))
28) PCA for Dim Reduction
Reduce d to d’ to speed up search and reduce memory; retrain PQ on reduced space.
pca = faiss.PCAMatrix(d, 256)
index = faiss.IndexPreTransform(pca, faiss.IndexFlatL2(256))
29) IVF Cache & Prefetch
Warm the index with representative queries so hot inverted lists stay in cache; then tune nprobe.
index.nprobe = 32 # higher for recall, lower for latency
30) Q&A — “Where do I store metadata?”
Answer: FAISS stores only vectors and integer IDs. Keep metadata (text, JSON) in your DB; join on IDs after search.
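A minimal join sketch (meta stands in for your real document store, keyed by the IDs passed to add_with_ids):
meta = {42: {'title': 'doc A'}, 77: {'title': 'doc B'}}
D, I = index.search(q, k)
hits = [meta.get(int(i)) for i in I[0] if i != -1]  # -1 pads empty results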
31) Recall & Latency
Measure recall@k vs a Flat baseline and track P50/P95/P99 latencies. Optimize nprobe/efSearch for your SLOs.
# recall@k: is the true nearest neighbor among the ANN top-k?
recall = (I == I0[:, :1]).any(axis=1).mean()
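For set‑based recall@k (fraction of the true top‑k recovered per query), compare sets instead of positions:
inter = [len(set(I[i]) & set(I0[i])) for i in range(len(I))]
recall_at_k = float(np.mean(inter)) / k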
32) Ground‑Truth with Flat
Use IndexFlat on a sample to estimate upper‑bound recall and validate preprocessing consistency.
gt = faiss.IndexFlatIP(d); gt.add(x)
D0, I0 = gt.search(q, k)
33) K‑means Clustering
FAISS provides fast k‑means for large datasets (CPU/GPU). Useful for centroids and analytics.
km = faiss.Kmeans(d, k=1024, niter=20)
km.train(x)
centroids = km.centroids
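km.index holds the trained centroids, so assigning points to clusters is just a 1‑NN search:
D, I = km.index.search(x, 1)   # I[:, 0] is each point's cluster id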
34) Add‑Only vs Mutable
Some indexes are add‑only; for frequent deletes/updates, consider periodic rebuilds or maintain a tombstone set.
# keep a deleted-id set and filter in post-processing
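A post‑filter sketch (deleted is your tombstone set; over‑fetch so k survivors remain):
deleted = {42, 77}
D, I = index.search(q, 2 * k)                          # over-fetch
keep = [int(i) for i in I[0] if i not in deleted][:k]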
35) Hybrid Reranking
Retrieve top‑K with ANN, then rerank with a heavier scorer (cross‑encoder, re‑normalized cosine) for quality.
D, I = index.search(q, 100)
# fetch docs by I, rerank outside FAISS
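A second‑stage sketch; docs, score_fn, and query_text are hypothetical stand‑ins for your store and cross‑encoder:
cand = [int(i) for i in I[0] if i != -1]
cand.sort(key=lambda i: score_fn(query_text, docs[i]), reverse=True)
top = cand[:10]   # heavier scorer only touches the 100 candidates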
36) Deduplication
Use range search with a small radius on unit vectors to find near‑duplicates; keep one per cluster.
lims, D, I = index.range_search(x, 1e-4)
37) Batch Size & BLAS
Larger batch sizes improve throughput; ensure FAISS links against optimized BLAS (MKL/OpenBLAS) for CPU.
faiss.omp_get_max_threads()  # CPU threads available to FAISS
D, I = index.search(q, k)    # pass many queries per call for throughput
38) Quantizer Training Set
Use millions of samples for large nlist/PQ; random, representative, and preprocessed exactly like production.
x_train = x[np.random.choice(len(x), 1_000_000, replace=False)]
39) Memory Planning
Estimate bytes per vector (PQ code + overhead) × N, plus IVF lists and quantizer. Validate with index.sa_code_size() when available.
# rule of thumb: PQ64@8bits ≈ 64B/vector
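A back‑of‑envelope sketch (real overhead varies by index type and FAISS version):
N, code_bytes, id_bytes = 100_000_000, 64, 8     # PQ64@8 bits + int64 id
approx_gb = N * (code_bytes + id_bytes) / 1e9    # ≈ 7.2 GB before IVF overhead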
40) Q&A — “How big should nlist be?”
Answer: Start near √N (e.g., N=100M ⇒ nlist≈10K). Tune alongside nprobe to reach target recall/latency; avoid extreme values without enough training data.
41) Persistence & Backups
Snapshot the index file alongside your metadata DB snapshots; verify checksums and restore procedures regularly.
faiss.write_index(index, '/mnt/snapshots/idx_2025-08-26.faiss')
42) Monitoring
Track QPS, tail latency, memory, GPU utilization, training time, and recall against a canary set; alert on drops.
# expose custom metrics from your service layer
43) Concurrency & Threading
Use FAISS’s internal threading settings judiciously; also parallelize at the service layer for multi‑core CPUs.
faiss.omp_set_num_threads(8)
44) Common Errors
Mismatched dims, untrained IVF/PQ, forgetting L2 normalization for cosine, tiny nprobe, or missing ID maps. Validate pipeline end‑to‑end.
assert x.shape[1] == d
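A few cheap pre‑flight checks (a sketch) catch most of these:
assert x.dtype == np.float32        # FAISS expects float32
assert index.is_trained             # IVF/PQ must be trained before add
assert index.ntotal > 0             # vectors were actually added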
45) Multi‑Tenant Isolation
Use ID ranges or separate indexes per tenant; filter by IDSelector or pre‑route queries.
sel = faiss.IDSelectorRange(lo, hi)
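One common ID‑range scheme (an illustrative bit split, not a FAISS feature): pack the tenant into the high bits of the 63‑bit ID space.
def make_id(tenant, local):           # hypothetical helper
    return (tenant << 40) | local     # ~2^23 tenants x 2^40 vectors each
sel = faiss.IDSelectorRange(make_id(7, 0), make_id(8, 0))  # all of tenant 7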
46) Production Checklist
- Vector preprocessing is identical in train/ingest/query
- Ground‑truth Flat baseline & recall dashboard
- Appropriate nlist/nprobe (or efSearch) tuning
- Backups & deterministic rebuild scripts
- Resource sizing for RAM/GPU/IO; canary tests
- Safe rollout with shadow queries & AB tests
47) Serving Architecture
Put a stateless API in front of FAISS; cache hot queries, pool memory, and colocate with feature store/DB for low latency.
# e.g., FastAPI/Express service wrapping FAISS index
48) Evaluation Harness
Maintain fixed query sets and measure recall/latency across builds; store curves for regressions.
# log recall@k and P95/P99 per version
49) Reproducibility
Fix random seeds for k‑means/PQ; record training sample hashes and index factory strings.
km = faiss.Kmeans(d, 1024, niter=20, seed=1234)  # seed via ClusteringParameters
50) Interview Q&A — 20 Practical Questions (FAISS‑focused)
1) Why IVF? Sublinear search by probing a subset of inverted lists; trades recall for speed.
2) Why PQ? Compress vectors to fit RAM/GPU and accelerate distance evals with acceptable recall loss.
3) How to use cosine? L2‑normalize vectors and use inner product.
4) nlist/nprobe intuition? nlist partitions, nprobe probes at query time; higher nprobe ⇒ better recall, slower.
5) OPQ vs PQ? OPQ rotates space to minimize quantization error, often improving recall at the same code size.
6) HNSW vs IVF? HNSW is graph‑based with efSearch; IVF is centroid‑based. Try both; hybrids exist.
7) When Flat? Small N or need exact results; also for evaluation.
8) How to delete? remove_ids where supported or rebuild; maintain tombstones.
9) How to store metadata? External DB keyed by IDs; FAISS is vectors+IDs only.
10) GPU sharding vs replication? Shard for capacity; replicate for QPS.
11) What is efSearch? HNSW parameter controlling breadth of search; higher improves recall.
12) Typical PQ code sizes? 16–64 bytes common; 64–128 bytes for higher quality.
13) What is IndexFactory? String grammar to compose indexes quickly.
14) Range search use cases? Near‑duplicate detection, clustering.
15) Why PreTransform? Compose PCA/OPQ/normalization before the index for cleaner pipelines.
16) Recall measurement? Compare ANN neighbors to Flat; compute recall@k.
17) Memory planning? Estimate code size × N; include IVF/HNSW overhead.
18) Hot parameters to tune? nprobe, m/nbits, efSearch, batch size.
19) How to ensure consistency? Same tokenize/normalize/projection across train/ingest/query.
20) How to pick an index? Start with Flat baseline → try IVFFlat → IVFPQ/OPQ → HNSW/Hybrids; choose by SLOs & cost.