ChromaDB Pocket Book — Uplatz
50 deep-dive flashcards • Wide layout • Fewer scrolls • 20+ Interview Q&A • Readable code examples
1) What is ChromaDB?
ChromaDB is an open-source vector database built for AI-native applications. It enables storage and retrieval of embeddings generated by machine learning models, supporting semantic search, recommendation engines, and generative AI workflows. Its core strength lies in performance, scalability, and developer-friendly APIs across Python, JavaScript, and REST.
pip install chromadb
import chromadb
2) Why ChromaDB? Core Strengths & Tradeoffs
Strengths: Simple developer APIs, tight integration with LLM workflows, fast approximate nearest neighbor (ANN) search, and open-source flexibility. Tradeoffs: Early-stage compared to established DBs; operational maturity and ecosystem integrations are evolving.
# Python client
client = chromadb.Client()
collection = client.create_collection("docs")
3) Vector Database: Mental Model
Think of a vector DB as a specialized search engine. Instead of keyword lookups, it matches high-dimensional embeddings to find semantically similar items. Each row is an embedding vector with metadata and an ID. ANN algorithms index these for millisecond queries at scale.
collection.add(
    documents=["hello world"],
    embeddings=[[0.1, 0.2, 0.3, ...]],
    ids=["doc1"])
4) Core Components
Key parts: Collections (namespaces for embeddings), Documents (text/data), Embeddings (vectors), Metadata (filters), and the Indexer (ANN engine). ChromaDB stores and serves embeddings with optional persistence backends.
results = collection.query(
    query_texts=["hi world"],
    n_results=3)
5) ChromaDB vs Other Vector DBs
Compared to Pinecone (hosted SaaS) or Weaviate (feature-rich, semantic graph), ChromaDB emphasizes developer simplicity and local-first usage. It integrates easily with LangChain and LlamaIndex, making it popular in open-source LLM projects.
# Example with LangChain
from langchain.vectorstores import Chroma
6) Storage Backends
ChromaDB persists to SQLite by default (earlier releases used DuckDB plus Parquet). For larger workloads, run it in client/server mode instead of in-process; for ephemeral experiments, it can run fully in-memory. Choose the mode based on data size, durability, and integration needs.
client = chromadb.PersistentClient(path="./chroma")
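The same API works across all three modes; a sketch, assuming a recent chromadb release where EphemeralClient, PersistentClient, and HttpClient are all available:
import chromadb

mem_client = chromadb.EphemeralClient()                   # in-memory, lost on exit
disk_client = chromadb.PersistentClient(path="./chroma")  # survives restarts
# Client/server mode: start a server first (e.g. `chroma run`), then:
# remote_client = chromadb.HttpClient(host="localhost", port=8000)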
7) Data Model
Each collection stores documents with IDs, embeddings, and optional metadata. Queries support semantic search, metadata filtering, and hybrid modes (vector + keyword). Ensure embeddings are consistent in dimension with the model used.
collection.add(ids=["1"], documents=["AI rocks"],
               embeddings=[[0.23, 0.11, ...]],
               metadatas=[{"topic": "ml"}])
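A sketch of a hybrid query combining vector similarity, a metadata filter, and a keyword filter on document text (the where_document $contains operator in recent versions):
results = collection.query(
    query_texts=["machine learning"],
    n_results=5,
    where={"topic": "ml"},               # metadata filter
    where_document={"$contains": "AI"})  # keyword filter on raw document text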
8) Releases & Versions
ChromaDB evolves rapidly. Pin versions in production for stability. Validate breaking changes before upgrades. Track changelogs for new ANN algorithms and API improvements.
pip install chromadb==0.x.y
9) Authentication & Multi-Tenancy
When deployed as a service, enforce API key authentication and namespace collections per tenant. For local development, auth is optional. Secure the backend if exposed beyond localhost.
# Example pseudo-config
auth:
  enabled: true
  keys: ["abc123"]
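Client-side, credentials usually travel as HTTP headers. A minimal sketch, assuming a server configured for token auth; the host and header name here are hypothetical and depend on how the server's auth provider is set up:
import chromadb

client = chromadb.HttpClient(
    host="chroma.internal.example.com",          # hypothetical host
    port=8000,
    headers={"Authorization": "Bearer abc123"})  # header name depends on auth config
tenant_docs = client.get_or_create_collection("tenant_acme_docs")  # one namespace per tenant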
10) Q&A — “How does ChromaDB scale?”
Answer: By sharding collections, using ANN indexes, and running in client/server mode with a scalable storage layer. At larger scale, deploy distributed workers behind an API and tune index parameters for the recall/latency tradeoff.
11) Python API Basics
The most common client. Create a client and collections, then perform add/query operations. Tight integration with ML/AI workflows in Python ecosystems.
import chromadb
client = chromadb.Client()
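A minimal end-to-end sketch of the lifecycle: create a collection, add documents (auto-embedded by the collection's default embedding function), then query:
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("docs")
collection.add(
    ids=["1", "2"],
    documents=["ChromaDB stores embeddings", "Bananas are yellow"])
results = collection.query(query_texts=["vector database"], n_results=1)
print(results["documents"])  # the ChromaDB sentence should rank first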
12) JavaScript API
Use ChromaDB from Node.js or the browser via the JavaScript client, which talks to a Chroma server over REST. Ideal for AI-enabled web apps; the JS client wraps the same API concepts as Python.
import { ChromaClient } from "chromadb";
const client = new ChromaClient();
13) REST API
All functionality exposed over HTTP. Build polyglot integrations without native SDKs. REST enables cloud-native scaling, multi-language support, and auth control.
POST /collections
{"name":"mydocs"}
14) Adding Documents
Documents can be text with optional metadata. If you add documents without embeddings, Chroma applies the collection's embedding function (a local MiniLM model by default); alternatively, supply embeddings yourself from models like OpenAI, Hugging Face, or local encoders.
collection.add(ids=["2"], documents=["ChromaDB guide"],
embeddings=[[0.3,0.2,0.9,...]])
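A sketch of both ingestion paths, assuming a client from earlier cards: raw text embedded by the collection's default embedding function, versus precomputed vectors (all vectors in one collection must share a dimension):
auto = client.get_or_create_collection("auto_embedded")
auto.add(ids=["1"], documents=["ChromaDB guide, auto-embedded"])

manual = client.get_or_create_collection("manual_embedded")
manual.add(
    ids=["1"],
    documents=["ChromaDB guide, external embedding"],
    embeddings=[[0.3, 0.2, 0.9]])  # toy 3-d vector; real models emit hundreds of dims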
15) Querying
Queries accept either query_texts (embedded automatically via the collection's embedding function) or explicit query_embeddings. Results include document IDs, documents, metadata, and distances.
results = collection.query(query_texts=["AI"], n_results=5)
16) Metadata Filtering
Attach metadata at insert, then filter during queries. Common filters: topic, source, user, timestamp. Combine vector similarity with boolean filters.
results = collection.query(query_texts=["AI"],
                           where={"topic": "ml"})
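Filters also support a Mongo-style operator syntax ($eq, $gt, $gte, $in, $and, $or and friends in recent versions); a sketch that assumes a numeric year metadata field:
results = collection.query(
    query_texts=["AI"],
    n_results=5,
    where={"$and": [
        {"topic": {"$eq": "ml"}},
        {"year": {"$gte": 2023}}]})  # "year" is a hypothetical metadata field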
17) Deleting & Updating
Remove items by IDs or update their metadata/embeddings. Useful for evolving knowledge bases and chatbots that need freshness.
collection.delete(ids=["2"])
collection.update(ids=["1"], documents=["AI is powerful"])
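Recent versions also provide upsert, which inserts new IDs and overwrites existing ones in a single call; a sketch:
collection.upsert(
    ids=["1"],  # replaced if present, created if not
    documents=["AI is powerful and evolving"],
    metadatas=[{"topic": "ml"}])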
18) Persistence
Use PersistentClient with a path to store embeddings across restarts. Back up the storage directory regularly, or run a dedicated Chroma server for durability and scaling.
client = chromadb.PersistentClient(path="./store")
19) Index Tuning
ChromaDB's ANN search is powered by an HNSW index (via the hnswlib library). Parameters such as graph connectivity, recall vs latency, and memory footprint can be tuned depending on workload.
collection.query(query_embeddings=[...], n_results=10)
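In Chroma, HNSW knobs are set per collection at creation time through reserved metadata keys; a sketch using key names documented for recent versions:
collection = client.create_collection(
    name="tuned_docs",
    metadata={
        "hnsw:space": "cosine",        # distance metric: cosine, l2, or ip
        "hnsw:construction_ef": 200,   # build-time quality vs speed
        "hnsw:search_ef": 100,         # query-time recall vs latency
        "hnsw:M": 16})                 # graph connectivity vs memory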
20) Q&A — “Does ChromaDB embed text itself?”
Answer: Yes, by default. Collections ship with a built-in embedding function (a local all-MiniLM-L6-v2 model) that embeds documents and query texts automatically. You can also plug in external models (e.g., OpenAI, SentenceTransformers) or supply precomputed embeddings yourself; either way, ChromaDB stores and indexes the vectors efficiently.
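To override the default, pass an embedding function when creating the collection; a sketch using Chroma's built-in SentenceTransformer wrapper (requires the sentence-transformers package):
from chromadb.utils import embedding_functions

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2")
collection = client.create_collection("docs_st", embedding_function=ef)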
21) LangChain
ChromaDB is a default vectorstore in LangChain. Use it for document retrieval, RAG pipelines, and agent memory. Developers can connect LangChain loaders and retrievers directly to ChromaDB.
from langchain.vectorstores import Chroma
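A sketch of a retrieval pipeline over Chroma (classic import paths; newer releases move Chroma to langchain_community.vectorstores or the langchain-chroma package, and OpenAIEmbeddings here stands in for any embedding model):
from langchain.embeddings import OpenAIEmbeddings  # needs OPENAI_API_KEY set

vectorstore = Chroma.from_texts(
    texts=["ChromaDB is a vector database", "LangChain builds LLM pipelines"],
    embedding=OpenAIEmbeddings(),
    collection_name="rag_docs")
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
docs = retriever.get_relevant_documents("what stores vectors?")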
22) LlamaIndex
Integrates seamlessly with LlamaIndex for building retrieval-augmented LLM apps. Use Chroma as a storage backend for nodes and queries in pipelines.
from llama_index.core import VectorStoreIndex  # llama-index >= 0.10 import paths
from llama_index.vector_stores.chroma import ChromaVectorStore
vector_store = ChromaVectorStore(chroma_collection=collection)
index = VectorStoreIndex.from_vector_store(vector_store)