ChromaDB Pocket Book — Uplatz

50 deep-dive flashcards • Wide layout • Fewer scrolls • 20+ Interview Q&A • Readable code examples

Section 1 — Fundamentals

1) What is ChromaDB?

ChromaDB is an open-source vector database built for AI-native applications. It enables storage and retrieval of embeddings generated by machine learning models, supporting semantic search, recommendation engines, and generative AI workflows. Its core strength lies in performance, scalability, and developer-friendly APIs across Python, JavaScript, and REST.

pip install chromadb   # shell: install the package
import chromadb        # Python: import the client

2) Why ChromaDB? Core Strengths & Tradeoffs

Strengths: Simple developer APIs, tight integration with LLM workflows, fast approximate nearest neighbor (ANN) search, and open-source flexibility. Tradeoffs: Early-stage compared to established DBs; operational maturity and ecosystem integrations are evolving.

# Python client
client = chromadb.Client()
collection = client.create_collection("docs")

3) Vector Database: Mental Model

Think of a vector DB as a specialized search engine. Instead of keyword lookups, it matches high-dimensional embeddings to find semantically similar items. Each row is an embedding vector with metadata and an ID. ANN algorithms index these for millisecond queries at scale.

collection.add(
  documents=["hello world"],
  embeddings=[[0.1, 0.2, 0.3]],  # toy vector; real ones match the model's full dimension
  ids=["doc1"])

4) Core Components

Key parts: Collections (namespaces for embeddings), Documents (text/data), Embeddings (vectors), Metadata (filters), and the Indexer (ANN engine). ChromaDB stores and serves embeddings with optional persistence backends.

results = collection.query(
  query_texts=["hi world"],
  n_results=3)

5) ChromaDB vs Other Vector DBs

Compared to Pinecone (hosted SaaS) or Weaviate (feature-rich, semantic graph), ChromaDB emphasizes developer simplicity and local-first usage. It integrates easily with LangChain and LlamaIndex, making it popular in open-source LLM projects.

# Example with LangChain
from langchain.vectorstores import Chroma

6) Storage Backends

ChromaDB persists to SQLite by default (earlier releases used DuckDB with Parquet files). For ephemeral experiments it can run fully in-memory; for larger workloads, run it in client/server mode. Choose a setup based on data size, durability, and integration needs.

client = chromadb.PersistentClient(path="./chroma")
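For the in-memory mode mentioned above, a minimal sketch (assumes a chromadb 0.4+ client; the collection name is illustrative):

import chromadb

# Nothing touches disk; data disappears when the process exits.
client = chromadb.EphemeralClient()
scratch = client.create_collection("scratch")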

7) Data Model

Each collection stores documents with IDs, embeddings, and optional metadata. Queries support semantic search, metadata filtering, and hybrid modes (vector + keyword). Ensure embeddings are consistent in dimension with the model used.

collection.add(ids=["1"], documents=["AI rocks"],
  embeddings=[[0.23, 0.11, 0.87]],  # toy vector; keep dimensions consistent per collection
  metadatas=[{"topic":"ml"}])

8) Releases & Versions

ChromaDB evolves rapidly. Pin versions in production for stability. Validate breaking changes before upgrades. Track changelogs for new ANN algorithms and API improvements.

pip install chromadb==0.x.y

9) Authentication & Multi-Tenancy

When deployed as a service, enforce API key authentication and namespace collections per tenant. For local development, auth is optional. Secure the backend if exposed beyond localhost.

# Example pseudo-config
auth:
  enabled: true
  keys: ["abc123"]
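Client-side, a remote server is reached with HttpClient. The header below is an illustrative bearer token, not a fixed Chroma scheme; match it to whatever auth your deployment enforces:

import chromadb

# Connect to a remote Chroma server (assumed to run on localhost:8000).
client = chromadb.HttpClient(
    host="localhost",
    port=8000,
    headers={"Authorization": "Bearer abc123"},  # illustrative credential
)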

10) Q&A — “How does ChromaDB scale?”

Answer: By sharding collections, using ANN indexes, and running in client/server mode with a durable persistence layer. For large scale, deploy distributed workers behind an API, and tune index parameters for recall/latency tradeoffs.

Section 2 — APIs & Operations

11) Python API Basics

Most common client. Create a client, collections, and perform add/query operations. Tight integration with ML/AI workflows in Python ecosystems.

import chromadb
client = chromadb.Client()
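A minimal end-to-end flow (collection name and texts are illustrative; documents added without explicit vectors are embedded by the collection's default embedding function):

import chromadb

client = chromadb.Client()
collection = client.create_collection("notes")

# Chroma auto-embeds these documents with its default embedding function.
collection.add(
    ids=["n1", "n2"],
    documents=["vector databases store embeddings",
               "bread recipes need yeast"],
)

# Semantic query: the first document should rank closest.
results = collection.query(query_texts=["embedding storage"], n_results=1)
print(results["documents"])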

12) JavaScript API

Use ChromaDB from Node.js or the browser via a REST-backed client. Ideal for AI-enabled web apps. The JS client wraps the same API concepts as the Python one.

import { ChromaClient } from "chromadb";
const client = new ChromaClient();

13) REST API

All functionality is exposed over HTTP, so you can build polyglot integrations without native SDKs. REST enables cloud-native scaling, multi-language support, and auth control.

POST /collections
{"name":"mydocs"}

14) Adding Documents

Documents can be raw text or carry precomputed embeddings. If you add documents without vectors, ChromaDB embeds them with the collection's embedding function (a bundled MiniLM model by default); you can also supply embeddings from models like OpenAI, Hugging Face, or local encoders.

collection.add(ids=["2"], documents=["ChromaDB guide"],
  embeddings=[[0.3, 0.2, 0.9]])  # optional when an embedding function is set
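To make the embedding source explicit, attach an embedding function at collection creation. A sketch using Chroma's bundled default (a MiniLM model); swap in an OpenAI or SentenceTransformers function as needed:

from chromadb.utils import embedding_functions

# Documents added without explicit vectors are embedded by this function.
ef = embedding_functions.DefaultEmbeddingFunction()
collection = client.create_collection("guides", embedding_function=ef)
collection.add(ids=["2"], documents=["ChromaDB guide"])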

15) Querying

Queries accept either query_texts (embedded automatically with the collection's embedding function) or explicit query_embeddings. Results include IDs, documents, metadata, and distances.

results = collection.query(query_texts=["AI"], n_results=5)
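Results come back as parallel lists, one inner list per query. A sketch of reading the top hit:

top_doc = results["documents"][0][0]        # best-matching document text
top_distance = results["distances"][0][0]   # smaller distance = more similar
print(top_doc, top_distance)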

16) Metadata Filtering

Attach metadata at insert, then filter during queries. Common filters: topic, source, user, timestamp. Combine vector similarity with boolean filters.

results = collection.query(query_texts=["AI"],
  where={"topic":"ml"})
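Filters compose with operators such as $and, $eq, and $gte; in this sketch, year is an illustrative metadata field:

results = collection.query(
    query_texts=["AI"],
    where={"$and": [
        {"topic": {"$eq": "ml"}},
        {"year": {"$gte": 2020}},
    ]},
    n_results=5)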

17) Deleting & Updating

Remove items by IDs or update their metadata/embeddings. Useful for evolving knowledge bases and chatbots that need freshness.

collection.delete(ids=["2"])
collection.update(ids=["1"], documents=["AI is powerful"])
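For freshness workflows, upsert combines add and update in one call, and delete also accepts a metadata filter. A brief sketch (the "deprecated" topic is illustrative):

# Insert if the ID is new, overwrite if it already exists.
collection.upsert(ids=["1"], documents=["AI is powerful and practical"])

# Bulk-remove everything tagged with an obsolete topic.
collection.delete(where={"topic": "deprecated"})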

18) Persistence

Use PersistentClient with a path to store embeddings across restarts. Back up the database directory, or run a dedicated Chroma server for durability and scaling.

client = chromadb.PersistentClient(path="./store")

19) Index Tuning

An HNSW index (via the hnswlib library) powers ChromaDB's ANN search. Parameters such as graph connectivity (M), construction and search ef, and the distance metric can be tuned per workload, trading recall against latency and memory footprint.

collection.query(query_embeddings=[...], n_results=10)
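In Chroma, HNSW behavior is set per collection through metadata keys. A sketch; key names vary by release, so check your version's docs:

collection = client.create_collection(
    "tuned_docs",
    metadata={
        "hnsw:space": "cosine",        # distance metric
        "hnsw:construction_ef": 200,   # build-time quality/speed tradeoff
        "hnsw:search_ef": 100,         # query-time recall/latency tradeoff
    })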

20) Q&A — “Does ChromaDB embed text itself?”

Answer: It can. Each collection carries an embedding function (a bundled MiniLM sentence-transformer by default) that auto-embeds documents and query_texts. For specific quality or domain needs, supply embeddings from external models (e.g., OpenAI, SentenceTransformers); ChromaDB stores and indexes them efficiently either way.

Section 3 — AI & LLM Integration

21) LangChain

ChromaDB is a default vectorstore in LangChain. Use it for document retrieval, RAG pipelines, and agent memory. Developers can connect LangChain loaders and retrievers directly to ChromaDB.

from langchain.vectorstores import Chroma
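A sketch of a full retriever setup, assuming an older langchain release where these import paths are valid and sentence-transformers is installed; the texts and k value are illustrative:

from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

# Build a Chroma-backed vectorstore from raw texts, then expose a retriever.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma.from_texts(
    ["ChromaDB stores embeddings", "LangChain orchestrates LLM calls"],
    embedding=embeddings)
retriever = store.as_retriever(search_kwargs={"k": 2})
docs = retriever.get_relevant_documents("what stores embeddings?")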

22) LlamaIndex

Integrates seamlessly with LlamaIndex for building retrieval-augmented LLM apps. Use Chroma as a storage backend for nodes and queries in pipelines.

from llama_index import VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore
vector_store = ChromaVectorStore(chroma_collection=collection)  # wraps an existing chromadb collection
index = VectorStoreIndex.from_vector_store(vector_store)

23) RAG (Retrieval-Augmented Generation)

ChromaDB stores knowledge chunks for retrieval during prompts. Relevant chunks are fetched by embedding similarity, optionally reranked, and injected into prompts to ground LLM answers in factual context.

docs = collection.query(query_texts=["climate change"], n_results=3)
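A sketch of the injection step, continuing from the query above (the prompt wording is illustrative; send the result to any LLM client):

# Flatten the retrieved chunks into a context block for grounding.
context = "\n".join(docs["documents"][0])
prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    "Question: What drives climate change?")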