Cohere Pocket Book — Uplatz

50 deep-dive flashcards • Wide layout • Fewer scrolls • 20+ Interview Q&A • Readable code examples

Section 1 — Fundamentals

1) What is Cohere?

Cohere provides enterprise-grade NLP/GenAI services: text generation, embeddings, reranking, and retrieval for secure, private deployments. It emphasizes data control, safety, latency, and integration with existing stacks (RAG, search, analytics).

# Python
pip install cohere

2) Core Building Blocks

Three pillars: Generate (LLMs for writing/agents), Embed (semantic vectors for search/RAG), Rerank (relevance boosts for search results). Compose them for robust retrieval pipelines.

# JS (Node)
npm i cohere-ai

3) Typical Use Cases

RAG chatbots, semantic search, document Q&A, support copilots, content routing, deduplication, similarity clustering, and ranking quality improvements (e-commerce, knowledge bases, analytics).

# Examples
- RAG over PDFs
- FAQ answerer
- Codebase semantic search

4) Cohere vs Generic LLM APIs

Cohere focuses on secure deployment, retrieval quality (embeddings+rerank), predictable costs, and enterprise controls (data retention options, region choices). Pairs well with vector DBs and existing search.

# Deciding
- Need strong search relevance? → Use Embed + Rerank
- Need private inference? → Enterprise options

5) Key Terms

Embedding: numeric vector for meaning; RAG: retrieve then generate; Rerank: reorder candidates by relevance; Context window: input tokens available to the model.

# Mental model:
Query → Embed → Vector Search → Top-k → Rerank → LLM

6) Authentication & Regions

Use an API key scoped to environment. For enterprises, choose region/data-control options. Rotate keys and store in secret managers.

export COHERE_API_KEY="***"

7) Pricing & Cost Basics

Costs come from tokens (generate) and vector ops (embed) plus rerank calls. Control usage with caching, truncation, and top-k limits. Log token counts.

# Pseudocode
max_tokens=512; top_k=20; rerank_top_n=5

8) Latency Considerations

Minimize round-trips: batch embeddings; cache frequent queries; reduce top-k; run rerank only on candidates. Prefer nearby regions.

# Batch embed
embed(texts=[...100 docs...])

9) Data Governance

Use retention flags per policy. Avoid sending PII; mask or tokenize sensitive values. Keep an audit trail of prompts and retrieved docs.

# Pseudocode
redact(user_input) → prompt

10) Q&A — “Why Cohere for search-heavy apps?”

Answer: Rerank + high-quality embeddings materially improve relevance. That means fewer hallucinations in RAG, better top answers, and measurable gains on click-through/deflection metrics.

Section 2 — Core APIs & Models

11) SDK Setup (Python)

Initialize a client and make a simple test call to confirm keys and firewall rules are configured correctly.

import cohere, os
co = cohere.Client(os.getenv("COHERE_API_KEY"))
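
A minimal connectivity check continuing from the client above (a sketch; it reuses the classic generate call used later in this book, and exact parameters can vary by SDK version).

resp = co.generate(prompt="ping", max_tokens=5)
print(resp.generations[0].text)  # any text back confirms the key and network path work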

12) Generate: Prompting

Provide system instructions, add few-shot examples, and constrain length/temperature for consistent style. Stream responses when building chat UIs.

resp = co.generate(prompt="Write a short FAQ...")
print(resp.generations[0].text)
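
A sketch of composing role instructions plus one few-shot example into a single prompt with tight sampling (the support-FAQ content is illustrative).

few_shot = "Q: How do I reset my password?\nA: Settings > Security > Reset password.\n\n"
prompt = ("You are a concise support writer.\n\n" + few_shot +
          "Q: How do I enable two-factor authentication?\nA:")
resp = co.generate(prompt=prompt, max_tokens=120, temperature=0.2)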

13) Generate: Parameters

Common knobs: max_tokens, temperature, top-k/top-p sampling, stop sequences, and safety filters. Start deterministic, add creativity later.

co.generate(prompt=p, max_tokens=300, temperature=0.3)

14) Embeddings (Python)

Convert text to dense vectors for similarity search and clustering. Normalize vectors to cosine space if your DB expects it.

emb = co.embed(texts=["Doc A","Doc B"])
vecs = emb.embeddings
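
If your DB expects cosine space, a minimal L2-normalization sketch (assumes NumPy is available).

import numpy as np

vecs = np.asarray(emb.embeddings, dtype=np.float32)
vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # dot product now equals cosine similarity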

15) Embeddings (JS)

Node apps can embed documents at ingest time and store vectors alongside metadata (source, URI, permissions).

import { CohereClient } from "cohere-ai";
const co = new CohereClient({ token: process.env.COHERE_API_KEY });
const { embeddings } = await co.embed({ texts: docs });

16) Rerank API

Rerank improves ordering by scoring the relevance of each candidate to the query. Use it after a fast vector or keyword retrieval.

results = co.rerank(query="reset password", documents=candidates, top_n=5)
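
To consume the output, map results back to documents (a sketch; it assumes the response exposes results with index and relevance_score fields, as in recent SDK versions).

ranked = [candidates[r.index] for r in results.results]   # best-first candidate texts
top_score = results.results[0].relevance_score            # useful for a relevance threshold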

17) Chat/Command-style Interfaces

Maintain conversation state (system, user, assistant). For RAG, inject retrieved snippets as “context” messages and cite sources in the final answer.

history=[{"role":"system","content":"You are helpful."}]

18) Tokenization & Limits

Keep prompts under context limits. Truncate long docs with a splitter and summarize irreducible sections. Track token usage per request.

# Splitter sizes ~500-1000 tokens

19) Batch Ops

Batch embeddings for throughput; queue large jobs (ingest pipelines). Respect rate limits and use retries with jitter.

for chunk in chunks(docs, 128): co.embed(texts=chunk)
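
Expanding the one-liner above: a sketch with a chunks() helper and jittered backoff; the batch size and retry budget are illustrative.

import random, time

def chunks(seq, n):
    for i in range(0, len(seq), n):
        yield seq[i:i + n]

for batch in chunks(docs, 128):
    for attempt in range(5):
        try:
            co.embed(texts=batch)
            break                                        # success, move to next batch
        except Exception:
            time.sleep(2 ** attempt + random.random())   # exponential backoff with jitter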

20) Q&A — “When to Rerank?”

Answer: Use it when the first-pass retrieval (BM25 or vector top-k) is noisy. Rerank a smaller candidate set (e.g., top 50 → top 5) for latency-efficient quality gains.

Section 3 — Retrieval, RAG & Evaluation

21) RAG Blueprint

Ingest → chunk → embed → store → at query: embed query → top-k retrieve → rerank → assemble context → generate answer + citations.

# Pseudocode RAG
ctx = retrieve(query)
ans = generate(context=ctx, prompt=query)
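
Filling in the pseudocode as a sketch; vector_db.search and hit.text stand in for your own retrieval layer, and the rerank response shape is assumed.

q_vec = co.embed(texts=[query]).embeddings[0]                    # embed the query
hits = vector_db.search(q_vec, top_k=20)                         # fast first-pass retrieval (hypothetical helper)
rr = co.rerank(query=query, documents=[h.text for h in hits], top_n=5)
context = "\n\n".join(hits[r.index].text for r in rr.results)    # keep only the best snippets
answer = co.generate(prompt=f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")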

22) Chunking Strategy

Use semantic or fixed-size chunks; overlap slightly (10–20%) to preserve meaning across boundaries. Keep metadata (doc id, section).

chunk_size=800; overlap=120
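
A naive word-based splitter with overlap, as a sketch (production pipelines usually split on tokens or semantic boundaries).

def split(text, size=800, overlap=120):
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]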

23) Vector DB Choices

Works with Pinecone, Weaviate, Milvus, Qdrant, PGVector, Redis. Choose based on scale, filtering needs, and ops maturity.

# Store
upsert(id, vector, metadata)

24) Hybrid Retrieval

Combine keyword (BM25) and vector retrieval to get both lexical and semantic recall. Merge candidates with a simple union or reciprocal rank fusion, then rerank the merged set for relevance.

cands = bm25 ∪ vector_topk; rerank(cands)
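
A minimal reciprocal rank fusion sketch over two ranked id lists (bm25_ids and vector_ids are placeholders; k=60 is the commonly used constant).

def rrf(result_lists, k=60):
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf([bm25_ids, vector_ids])   # then rerank the merged candidates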

25) Filters & Permissions

Add metadata filters (department, language, date) to retrieval queries. Enforce ACLs both at retrieval and in the UI.

where = { team:"support", region:"EU" }

26) Context Assembly

Concatenate top snippets with titles and sources. Deduplicate overlapping text; compress with map-reduce summarization if too long.

context = "\n\n".join(top_snippets)
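
A cheap deduplication sketch before joining (prefix matching only; fuzzier similarity checks are common in practice).

seen, unique = set(), []
for snippet in top_snippets:
    key = snippet[:200].lower()      # crude fingerprint of the snippet
    if key not in seen:
        seen.add(key)
        unique.append(snippet)
context = "\n\n".join(unique)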

27) Grounded Generation

Instruct the model to only answer from provided context; otherwise say “not found.” Cite sources to build trust.

system: "Answer only with given context."

28) Evals: Relevance

Track NDCG, Recall@k, MRR on a labeled set. For quick loops, use LLM-as-judge but confirm with human review for critical paths.

metrics = { ndcg:0.61, recall5:0.78 }
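
Recall@k and MRR as minimal sketches over retrieved ids vs. labeled relevant ids.

def recall_at_k(retrieved, relevant, k=5):
    return len(set(retrieved[:k]) & set(relevant)) / max(len(relevant), 1)

def mrr(retrieved, relevant):
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0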

29) Evals: Answer Quality

Measure faithfulness (no hallucination), completeness, and citation accuracy. Use adversarial queries during testing.

# Judge rubric: faithfulness, sources, style

30) Q&A — “Why do my answers drift off-topic?”

Answer: Context too long/noisy, weak retrieval, or temperature too high. Fix by better chunking, rerank, stricter instructions, and lower sampling entropy.

Section 4 — Integration, MLOps & System Design

31) API Patterns

Build a thin API layer: /embed, /search, /rerank, /chat. Centralize auth, quotas, logging, and retries. Make each call idempotent.

POST /search { query, filters }
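
One way to sketch the thin layer (FastAPI is assumed here; search_pipeline is a hypothetical helper wrapping retrieve + rerank).

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SearchRequest(BaseModel):
    query: str
    filters: dict = {}

@app.post("/search")
def search(req: SearchRequest):
    # auth, quotas, retries, and request-id logging would live in middleware
    return {"results": search_pipeline(req.query, req.filters)}  # hypothetical retrieve + rerank helper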

32) Streaming UIs

Stream tokens for chat responsiveness. Show retrieved sources first, then the answer as it streams. Handle cancel/abort cleanly.

const controller = new AbortController();  // pass controller.signal to the request; call controller.abort() on cancel

33) Observability

Log prompt, token counts, latency, top-k size, rerank scores, selected sources. Correlate with request IDs. Build dashboards per route.

log.info({ route:"/chat", ttfb_ms })

34) Safety & Guardrails

Classify inputs, filter unsafe content, and set policy refusals. Mask or drop PII. Add allowlists for tools and connectors.

if (isUnsafe(text)) return policy_refusal()

35) Prompt Engineering

Use role instructions, style guides, and few-shot exemplars. Keep prompts modular; version them; A/B test changes.

system: "You are a helpful support agent."

36) Retrieval Caching

Cache query→doc ids and doc id→text. Invalidate on re-ingest. Memoize embedding of identical texts across tenants.

cache.set(hash(query), top_ids)
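
A minimal in-process memoization sketch keyed by text hash (swap the dict for Redis or similar in production).

import hashlib

_embed_cache = {}

def cached_embed(text):
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embed_cache:
        _embed_cache[key] = co.embed(texts=[text]).embeddings[0]
    return _embed_cache[key]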

37) Multilingual

Use multilingual embeddings and detect language automatically. Store lang metadata and prefer same-language results first.

metadata:{ lang:"fr" }

38) Evaluation Loops

Nightly jobs compute retrieval and answer quality metrics. Fail builds on metric regressions; promote configs via flags.

if ndcg < 0.55: fail_ci()

39) Finetuning & Adapters

When domain language is niche, consider lightweight adaptation or instruction tuning. Keep evals to confirm uplift over prompting-only.

# Track: overfit risk, data leakage

40) Q&A — “Vector DB vs Classic Search?”

Answer: Vector is semantic (great recall on paraphrases), keyword is lexical (precision on exact terms). Hybrid + rerank blends both strengths for enterprise docs.

Section 5 — Security, Governance, Deployment, Ops & Interview Q&A

41) Security Foundations

Store keys in secret managers, enforce TLS, restrict egress, validate inputs, and sanitize outputs. Add org/tenant scoping to every call path.

headers: { Authorization: "Bearer ***" }

42) Compliance & Retention

Align with internal retention policies. Provide user-visible notices for data usage. Offer opt-out for training where applicable.

retention_days=30; redact=true

43) Testing Strategy

Unit: prompt builders, retrievers. Integration: end-to-end RAG on a fixture corpus. Regression: snapshot expected answers/citations.

assert "Reset steps" in answer

44) Perf Tuning

Reduce top-k, compress context, use streaming, batch embeddings, and colocate services. Profile TTFB and full-render.

top_k=20 → 10; ctx_tokens=1500 → 900

45) Deployment Options

VM/containers with autoscaling; serverless for bursty chat; on-prem/private endpoints for strict data control. Add health/readiness endpoints.

GET /health → { ok:true }

46) Cost Controls

Cap tokens, cache results, pre-compute embeddings, and use rerank only when needed. Alert on anomalous token spikes.

if token_usage > budget: throttle()

47) SLOs & Runbooks

Define p95 latency targets, accuracy thresholds, and on-call runbooks (timeouts, retries, degraded modes without rerank).

SLO: p95 < 800 ms (search + rerank)

48) Production Checklist

  • Secrets + rotation
  • Rate limits + quotas
  • Input/output filters
  • Vector + keyword hybrid
  • Rerank on merged set
  • Dashboards & alerts

49) Common Pitfalls

Overlong context, no rerank, weak filters, missing ACL checks, no evals, and prompt drift. Fix with chunking, hybrid retrieval, tests, and policy prompts.

50) Interview Q&A — 20 Practical Questions (Expanded)

1) Why Cohere for RAG? Strong embeddings + rerank improve retrieval precision, reducing hallucinations and increasing answer utility.

2) When to use rerank? After an initial broad retrieval; rerank focuses on relevance within a manageable candidate set.

3) How to stop hallucinations? Grounded prompts, strict instructions, cite sources, and return “not found” when needed.

4) Embedding best practices? Chunk consistently, store metadata, normalize vectors if your DB needs it, and batch for throughput.

5) Hybrid vs vector-only? Hybrid improves recall on exact terms and broader semantics; rerank organizes combined results.

6) How to evaluate? Track retrieval (Recall@k, NDCG) and answer (faithfulness, completeness, citation accuracy).

7) Token cost control? Truncate inputs, summarize long context, cap max_tokens, and cache frequent answers.

8) Latency improvements? Stream outputs, reduce top-k, colocate services, and avoid unnecessary second calls.

9) Safety approaches? Classify/deny unsafe content, redact PII, and log policy decisions with rationales.

10) Multilingual tactics? Use multilingual embeddings, detect language, prioritize same-language sources.

11) Handling ACLs? Filter at retrieval and generate only from authorized snippets; audit access.

12) Fine-tune vs prompt? Start with prompting; consider tuning when consistent domain phrasing or format is crucial.

13) Prevent prompt drift? Version prompts, add tests, and use strict system instructions.

14) Vector DB selection? Based on ops team skills, filters, scale, and cost; benchmark recall/latency.

15) Streaming UX tips? Show sources first, then stream the answer; allow user interrupts.

16) Retry strategy? Exponential backoff with jitter; idempotency keys to avoid duplicate writes.

17) Logging essentials? Request IDs, token counts, latencies, selected docs, and rerank scores.

18) Batch ingestion? Use queues, batch embeddings, and parallel upserts; checkpoint progress.

19) On-prem considerations? Network egress control, latency tradeoffs, and compliance audits.

20) KPIs for success? Self-serve deflection, first-contact resolution, search CTR, doc coverage, and time-to-answer.