Cohere Pocket Book — Uplatz
50 deep-dive flashcards • Wide layout • Fewer scrolls • 20+ Interview Q&A • Readable code examples
1) What is Cohere?
Cohere provides enterprise-grade NLP/GenAI services: text generation, embeddings, reranking, and retrieval for secure, private deployments. It emphasizes data control, safety, latency, and integration with existing stacks (RAG, search, analytics).
# Python
pip install cohere
2) Core Building Blocks
Three pillars: Generate (LLMs for writing/agents), Embed (semantic vectors for search/RAG), Rerank (relevance boosts for search results). Compose them for robust retrieval pipelines.
# JS (Node)
npm i cohere-ai
3) Typical Use Cases
RAG chatbots, semantic search, document Q&A, support copilots, content routing, deduplication, similarity clustering, and ranking quality improvements (e-commerce, knowledge bases, analytics).
# Examples
- RAG over PDFs
- FAQ answerer
- Codebase semantic search
4) Cohere vs Generic LLM APIs
Cohere focuses on secure deployment, retrieval quality (embeddings+rerank), predictable costs, and enterprise controls (data retention options, region choices). Pairs well with vector DBs and existing search.
# Deciding
- Need strong search relevance? → Use Embed + Rerank
- Need private inference? → Enterprise options
5) Key Terms
Embedding: numeric vector for meaning; RAG: retrieve then generate; Rerank: reorder candidates by relevance; Context window: input tokens available to the model.
# Mental model:
Query → Embed → Vector Search → Top-k → Rerank → LLM
6) Authentication & Regions
Use an API key scoped to environment. For enterprises, choose region/data-control options. Rotate keys and store in secret managers.
export COHERE_API_KEY="***"
7) Pricing & Cost Basics
Costs come from tokens (generate) and vector ops (embed) plus rerank calls. Control usage with caching, truncation, and top-k limits. Log token counts.
# Pseudocode
max_tokens=512; top_k=20; rerank_top_n=5
8) Latency Considerations
Minimize round-trips: batch embeddings; cache frequent queries; reduce top-k; run rerank only on candidates. Prefer nearby regions.
# Batch embed
embed(texts=[...100 docs...])
9) Data Governance
Use retention flags per policy. Avoid sending PII; mask or tokenize sensitive values. Keep an audit trail of prompts and retrieved docs.
# Pseudocode
redact(user_input) → prompt
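A minimal redaction sketch in Python; the regex patterns and placeholder tags are illustrative, not a complete PII solution.
import re

def redact(text: str) -> str:
    # Mask emails and long digit runs (card-like numbers) before they reach a prompt.
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[EMAIL]", text)
    text = re.sub(r"\b\d{12,19}\b", "[NUMBER]", text)
    return text

user_input = "Card 4111111111111111, reach me at jane@example.com"
prompt = redact(user_input)  # "Card [NUMBER], reach me at [EMAIL]"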
10) Q&A — “Why Cohere for search-heavy apps?”
Answer: Rerank + high-quality embeddings materially improve relevance. That means fewer hallucinations in RAG, better top answers, and measurable gains on click-through/deflection metrics.
11) SDK Setup (Python)
Initialize a client and test a simple call to ensure keys and firewalls are configured correctly.
import cohere, os
co = cohere.Client(os.getenv("COHERE_API_KEY"))
12) Generate: Prompting
Provide system instructions, add few-shot examples, and constrain length/temperature for consistent style. Stream responses when building chat UIs.
resp = co.generate(prompt="Write a short FAQ...")
print(resp.generations[0].text)
13) Generate: Parameters
Common knobs: max_tokens, temperature, k/p sampling, stop sequences, and safety filters. Start deterministic, add creativity later.
co.generate(prompt=p, max_tokens=300, temperature=0.3)
14) Embeddings (Python)
Convert text to dense vectors for similarity search and clustering. Normalize vectors to cosine space if your DB expects it.
emb = co.embed(texts=["Doc A","Doc B"])
vecs = emb.embeddings
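Continuing the snippet above: if your vector DB expects unit-length vectors for cosine similarity, normalize before upserting (a sketch, assuming numpy).
import numpy as np

vecs = np.array(emb.embeddings, dtype="float32")
# L2-normalize so that dot product equals cosine similarity.
norms = np.linalg.norm(vecs, axis=1, keepdims=True)
unit_vecs = vecs / np.clip(norms, 1e-12, None)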
15) Embeddings (JS)
Node apps can embed documents at ingest time and store vectors alongside metadata (source, URI, permissions).
import { CohereClient } from "cohere-ai";
const co = new CohereClient({ token: process.env.COHERE_API_KEY });
const { embeddings } = await co.embed({ texts: docs });
16) Rerank API
Rerank improves ordering by scoring the relevance of each candidate to the query. Use it after a fast vector or keyword retrieval.
results = co.rerank(query="reset password", documents=candidates)
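A slightly fuller sketch; the results/relevance_score fields follow the Python SDK's rerank response, but check your SDK version (newer releases also expect a model argument).
candidates = ["How to reset your password", "Billing FAQ", "VPN setup guide"]
resp = co.rerank(query="reset password", documents=candidates, top_n=2)
for r in resp.results:
    # Each result points back into `candidates` and carries a relevance score.
    print(r.index, round(r.relevance_score, 3), candidates[r.index])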
17) Chat/Command-style Interfaces
Maintain conversation state (system, user, assistant). For RAG, inject retrieved snippets as “context” messages and cite sources in the final answer.
history=[{"role":"system","content":"You are helpful."}]
18) Tokenization & Limits
Keep prompts under context limits. Truncate long docs with a splitter and summarize irreducible sections. Track token usage per request.
# Splitter sizes ~500-1000 tokens
19) Batch Ops
Batch embeddings for throughput; queue large jobs (ingest pipelines). Respect rate limits and use retries with jitter.
for chunk in chunks(docs, 128): co.embed(texts=chunk)
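A batching sketch with retries and jitter; the batch size, retry policy, and broad exception handling are illustrative (co and docs come from the earlier cards).
import random, time

def embed_with_retry(batch, retries=3):
    for attempt in range(retries):
        try:
            return co.embed(texts=batch)
        except Exception:
            if attempt == retries - 1:
                raise
            # Exponential backoff with jitter before retrying.
            time.sleep(2 ** attempt + random.random())

vectors = []
for i in range(0, len(docs), 128):
    vectors.extend(embed_with_retry(docs[i:i + 128]).embeddings)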
20) Q&A — “When to Rerank?”
Answer: Use it when the first-pass retrieval (BM25 or vector top-k) is noisy. Rerank a smaller candidate set (e.g., top 50 → top 5) for latency-efficient quality gains.
21) RAG Blueprint
Ingest → chunk → embed → store → at query: embed query → top-k retrieve → rerank → assemble context → generate answer + citations.
# Pseudocode RAG
ctx = retrieve(query)
ans = generate(context=ctx, prompt=query)
22) Chunking Strategy
Use semantic or fixed-size chunks; overlap slightly (10–20%) to preserve meaning across boundaries. Keep metadata (doc id, section).
chunk_size=800; overlap=120
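A fixed-size, overlapping chunker sketch; word counts stand in for tokens here, so swap in a real tokenizer for production (doc_text is assumed to be your document string).
def chunk_words(text, chunk_size=800, overlap=120):
    words = text.split()
    step = chunk_size - overlap
    # Slide a window, sharing `overlap` words of context between neighbouring chunks.
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words) or 1, step)]

chunks = chunk_words(doc_text)  # keep doc id/section metadata alongside each chunk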
23) Vector DB Choices
Works with Pinecone, Weaviate, Milvus, Qdrant, PGVector, Redis. Choose based on scale, filtering needs, and ops maturity.
# Store
upsert(id, vector, metadata)
24) Hybrid Retrieval
Combine keyword (BM25) and vector retrieval for both lexical and semantic recall. Merge with a union or reciprocal rank fusion, then rerank the merged set for relevance (see the sketch below).
cands = bm25 ∪ vector_topk; rerank(cands)
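A reciprocal rank fusion sketch for merging the two result lists; k=60 is a commonly used constant, and bm25_ids / vector_ids are assumed to be ranked lists of doc ids.
def rrf(result_lists, k=60):
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            # Earlier ranks contribute larger reciprocal scores.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf([bm25_ids, vector_ids])  # then rerank(merged[:50])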
25) Filters & Permissions
Add metadata filters (department, language, date) to retrieval queries. Enforce ACLs both at retrieval and in the UI.
where = { team:"support", region:"EU" }
26) Context Assembly
Concatenate top snippets with titles and sources. Deduplicate overlapping text; compress with map-reduce summarization if too long.
context = "\n\n".join(top_snippets)
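A small assembly sketch that dedupes near-identical snippets and tags each with its title and source; the snippet dict shape is assumed.
def assemble(snippets, max_chars=6000):
    seen, parts = set(), []
    for s in snippets:  # each s: {"title": ..., "source": ..., "text": ...}
        key = s["text"][:200]  # cheap near-duplicate check on the prefix
        if key in seen:
            continue
        seen.add(key)
        parts.append(f"[{s['title']}] ({s['source']})\n{s['text']}")
    return "\n\n".join(parts)[:max_chars]

context = assemble(top_snippets)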
27) Grounded Generation
Instruct the model to only answer from provided context; otherwise say “not found.” Cite sources to build trust.
system: "Answer only with given context."
28) Evals: Relevance
Track NDCG, Recall@k, MRR on a labeled set. For quick loops, use LLM-as-judge but confirm with human review for critical paths.
metrics = { ndcg:0.61, recall5:0.78 }
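Recall@k and MRR fit in a few lines; `ranked` is a model-ordered list of doc ids and `relevant` the labeled set for one query (sketch).
def recall_at_k(ranked, relevant, k=5):
    return len(set(ranked[:k]) & set(relevant)) / max(len(relevant), 1)

def mrr(ranked, relevant):
    for i, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / i  # reciprocal rank of the first relevant hit
    return 0.0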
29) Evals: Answer Quality
Measure faithfulness (no hallucination), completeness, and citation accuracy. Use adversarial queries during testing.
# Judge rubric: faithfulness, sources, style
30) Q&A — “Why do my answers drift off-topic?”
Answer: Context too long/noisy, weak retrieval, or temperature too high. Fix by better chunking, rerank, stricter instructions, and lower sampling entropy.
31) API Patterns
Build a thin API layer: /embed, /search, /rerank, /chat. Centralize auth, quotas, logging, and retries. Make each call idempotent.
POST /search { query, filters }
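A minimal route sketch, assuming FastAPI; the retrieve() helper stands in for your retrieval layer.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SearchRequest(BaseModel):
    query: str
    filters: dict = {}

@app.post("/search")
def search(req: SearchRequest):
    # Centralize auth, quota checks, logging, and retries around this call.
    ids = retrieve(req.query, req.filters)  # hypothetical retrieval helper
    return {"ids": ids}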
32) Streaming UIs
Stream tokens for chat responsiveness. Show retrieved sources first, then the answer as it streams. Handle cancel/abort cleanly.
const controller = new AbortController();
controller.abort();  // cancel an in-flight streamed response
33) Observability
Log prompt, token counts, latency, top-k size, rerank scores, selected sources. Correlate with request IDs. Build dashboards per route.
log.info({ route:"/chat", ttfb_ms })
34) Safety & Guardrails
Classify inputs, filter unsafe content, and set policy refusals. Mask or drop PII. Add allowlists for tools and connectors.
if (isUnsafe(text)) return policy_refusal()
35) Prompt Engineering
Use role instructions, style guides, and few-shot exemplars. Keep prompts modular; version them; A/B test changes.
system: "You are a helpful support agent."
36) Retrieval Caching
Cache query→doc ids and doc id→text. Invalidate on re-ingest. Memoize embeddings of identical texts across tenants.
cache.set(hash(query), top_ids)
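A caching sketch with a stable key: Python's built-in hash() changes between processes, so hash the normalized query with hashlib instead (cache and retrieve are stand-ins for your store and retrieval call).
import hashlib

def query_key(query: str) -> str:
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

query = "How do I reset my password?"
key = query_key(query)
top_ids = cache.get(key)
if top_ids is None:
    top_ids = retrieve(query)  # stand-in retrieval call
    cache.set(key, top_ids)    # invalidate these keys on re-ingest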
37) Multilingual
Use multilingual embeddings and detect language automatically. Store lang metadata and prefer same-language results first.
metadata:{ lang:"fr" }
38) Evaluation Loops
Nightly jobs compute retrieval and answer quality metrics. Fail builds on metric regressions; promote configs via flags.
if ndcg < 0.55: fail_ci()
39) Finetuning & Adapters
When domain language is niche, consider lightweight adaptation or instruction tuning. Run evals to confirm uplift over prompting alone.
# Track: overfit risk, data leakage
40) Q&A — “Vector DB vs Classic Search?”
Answer: Vector is semantic (great recall on paraphrases), keyword is lexical (precision on exact terms). Hybrid + rerank blends both strengths for enterprise docs.
41) Security Foundations
Store keys in secret managers, enforce TLS, restrict egress, validate inputs, and sanitize outputs. Add org/tenant scoping to every call path.
headers: { Authorization: "Bearer ***" }
42) Compliance & Retention
Align with internal retention policies. Provide user-visible notices for data usage. Offer opt-out for training where applicable.
retention_days=30; redact=true
43) Testing Strategy
Unit: prompt builders, retrievers. Integration: end-to-end RAG on a fixture corpus. Regression: snapshot expected answers/citations.
assert "Reset steps" in answer
44) Perf Tuning
Reduce top-k, compress context, use streaming, batch embeddings, and colocate services. Profile TTFB and full-render.
top_k=20 → 10; ctx_tokens=1500 → 900
45) Deployment Options
VM/containers with autoscaling; serverless for bursty chat; on-prem/private endpoints for strict data control. Add health/readiness endpoints.
GET /health → { ok:true }
46) Cost Controls
Cap tokens, cache results, pre-compute embeddings, and use rerank only when needed. Alert on anomalous token spikes.
if token_usage > budget: throttle()
47) SLOs & Runbooks
Define p95 latency targets, accuracy thresholds, and on-call runbooks (timeouts, retries, degraded modes without rerank).
SLO: p95 < 800 ms (search + rerank)
48) Production Checklist
- Secrets + rotation
- Rate limits + quotas
- Input/output filters
- Vector + keyword hybrid
- Rerank on merged set
- Dashboards & alerts
49) Common Pitfalls
Overlong context, no rerank, weak filters, missing ACL checks, no evals, and prompt drift. Fix with chunking, hybrid retrieval, tests, and policy prompts.
50) Interview Q&A — 20 Practical Questions (Expanded)
1) Why Cohere for RAG? Strong embeddings + rerank improve retrieval precision, reducing hallucinations and increasing answer utility.
2) When to use rerank? After an initial broad retrieval; rerank focuses on relevance within a manageable candidate set.
3) How to stop hallucinations? Grounded prompts, strict instructions, cite sources, and return “not found” when needed.
4) Embedding best practices? Chunk consistently, store metadata, normalize vectors if your DB needs it, and batch for throughput.
5) Hybrid vs vector-only? Hybrid improves recall on exact terms and broader semantics; rerank organizes combined results.
6) How to evaluate? Track retrieval (Recall@k, NDCG) and answer (faithfulness, completeness, citation accuracy).
7) Token cost control? Truncate inputs, summarize long context, cap max_tokens, and cache frequent answers.
8) Latency improvements? Stream outputs, reduce top-k, colocate services, and avoid unnecessary second calls.
9) Safety approaches? Classify/deny unsafe content, redact PII, and log policy decisions with rationales.
10) Multilingual tactics? Use multilingual embeddings, detect language, prioritize same-language sources.
11) Handling ACLs? Filter at retrieval, generate only from authorized snippets, and audit access.
12) Fine-tune vs prompt? Start with prompting; consider tuning when consistent domain phrasing or format is crucial.
13) Prevent prompt drift? Version prompts, add tests, and use strict system instructions.
14) Vector DB selection? Choose based on ops team skills, filtering needs, scale, and cost; benchmark recall and latency.
15) Streaming UX tips? Show sources first, then stream the answer; allow user interrupts.
16) Retry strategy? Exponential backoff with jitter; idempotency keys to avoid duplicate writes.
17) Logging essentials? Request IDs, token counts, latencies, selected docs, and rerank scores.
18) Batch ingestion? Use queues, batch embeddings, and parallel upserts; checkpoint progress.
19) On-prem considerations? Network egress control, latency tradeoffs, and compliance audits.
20) KPIs for success? Self-serve deflection, first-contact resolution, search CTR, doc coverage, and time-to-answer.