MongoDB Pocket Book

MongoDB Pocket Book — Uplatz

50 in-depth cards • Wide layout • Readable examples • 20-Q interview block included

Section 1 — Foundations

1) What is MongoDB?

MongoDB is a document-oriented NoSQL database storing JSON-like BSON documents. It favors flexible schemas, horizontal scaling via sharding, and rich secondary indexes, with a powerful aggregation framework and transactions (multi-document ACID).

# shell
mongosh "mongodb://localhost:27017"

2) Core Concepts

Hierarchy: Database → Collections → Documents. Each document has an _id primary key (ObjectId by default). Schemas are dynamic unless validated. CRUD via drivers or mongosh.

db.users.insertOne({ name:"Ava", age:29, skills:["mongo","node"] })

3) BSON & Types

BSON adds types beyond JSON: ObjectId, Date, Decimal128, Binary, Timestamp. Choose Decimal128 for precise monetary values; Date values are stored as UTC milliseconds since the Unix epoch.

db.tx.insertOne({ amount: NumberDecimal("12.34"), ts: new Date() })

4) Read & Write Concerns

Write concern controls durability (w, j, wtimeout). Read concern controls consistency (local, majority, linearizable, snapshot in txns). Tune per operation.

db.orders.insertOne(doc, { writeConcern: { w:"majority", j:true } })
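
The example above sets write concern; for the read side, a minimal mongosh sketch (assuming an orders collection) requesting majority read concern on a single query:

db.orders.find({ status:"paid" }).readConcern("majority")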

5) Replica Sets

Replica set = primary + secondaries for HA. Primary handles writes; secondaries replicate via the oplog. Elections pick a new primary on failure (a majority of voting members is required, so run at least three). Reads can target secondaries with read preferences.

rs.initiate({ _id:"rs0", members:[{ _id:0, host:"n1:27017" },{ _id:1, host:"n2:27017" },{ _id:2, host:"n3:27017" }] })

6) Sharding

Sharding distributes data across shards (each a replica set). A mongos router routes queries. Pick a good shard key (high cardinality, good distribution). Zones can pin data by range for locality/regulations.

sh.enableSharding("shop"); sh.shardCollection("shop.orders",{ region:1, _id:1 })
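
A hedged sketch of the zone idea mentioned above (the shard name "shard-eu-1" and zone "EU" are illustrative):

sh.addShardToZone("shard-eu-1", "EU")
sh.updateZoneKeyRange("shop.orders", { region:"EU", _id:MinKey() }, { region:"EU", _id:MaxKey() }, "EU")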

7) Storage Engine (WiredTiger)

WiredTiger provides document-level concurrency, compression, checkpoints, and journaling. Configure cache size (default is roughly 50% of RAM minus 1 GB). Compression reduces I/O at a slight CPU cost.

# mongod.conf setting (shown as a dotted path)
storage.wiredTiger.engineConfig.cacheSizeGB: 8

8) Index Basics

Indexes speed reads. Default _id index exists. Create compound indexes for common query patterns; order matters. Cardinality, selectivity, and sort patterns drive index design.

db.users.createIndex({ email:1 }, { unique:true })

9) Aggregation Framework

Pipeline stages ($match, $group, $project, $sort, $lookup, $facet) enable analytical queries without ETL. An initial $match/$sort can use indexes.

db.sales.aggregate([{ $match:{region:"EU"} },{ $group:{ _id:"$sku", rev:{ $sum:"$amount"} } }])

10) Q&A — “MongoDB vs RDBMS?”

Answer: MongoDB trades rigid schemas and joins for flexible documents, fast iteration, and horizontal scale. It supports ACID transactions when needed but shines with denormalized schemas optimized for app reads.

Section 2 — Data Modeling & CRUD

11) Embedding vs Referencing

Embed when data is accessed together and bounded in size (product + reviews snapshot). Reference when many-to-many, unbounded growth, or reuse across documents. Optimize for read patterns.

// Embed example
{ _id:1, name:"Post", comments:[{ by:"u1", text:"hi" }] }

12) Schema Design Rule

“Data that is accessed together should be stored together.” Model queries first. Precompute aggregates if needed. Avoid a third-normal-form (3NF) mindset; cross-collection joins are expensive.

// Orders keep buyer snapshot to avoid cross-collection joins
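
A minimal sketch of the snapshot idea (field names are illustrative):

db.orders.insertOne({ _id:101, buyer:{ userId:7, name:"Ava", email:"ava@example.com" }, total:NumberDecimal("49.90") })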

13) Document Growth & Padding

Under WiredTiger, updated documents are rewritten rather than padded in place, and a document is capped at 16 MB. Avoid unbounded arrays; use bucketing (e.g., monthly activity docs) or $push with $slice to cap array length.

db.posts.updateOne({_id:1}, { $push:{ comments:{ $each:[c], $slice:-100 } } })

14) CRUD Examples

Use operators: $set, $inc, $push, $pull, $addToSet. Upserts combine insert + update.

db.users.updateOne({ email }, { $set:{ name }, $inc:{ logins:1 } }, { upsert:true })

15) Array Operators

Use $elemMatch, positional $/$[] to update matching elements. $size, $slice for reads. Consider $addToSet to avoid duplicates.

db.c.updateOne({ _id:1, "tags":"js" }, { $set:{ "tags.$":"javascript" } })

16) Validation

JSON Schema validation at the collection level enforces document shape and ranges. Tune strictness with validationLevel (off/moderate/strict) and validationAction (warn/error).

db.createCollection("users",{ validator:{ $jsonSchema:{ bsonType:"object", required:["email"] }}})
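
Validation can be adjusted later with collMod; a sketch that relaxes enforcement during a rollout:

db.runCommand({ collMod:"users", validationLevel:"moderate", validationAction:"warn" })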

17) Bulk Writes

Batch operations with bulkWrite reduce round-trips and improve throughput. Choose ordered (stops at the first error) or unordered (continues past errors, often faster).

db.users.bulkWrite([{ insertOne:{ document:{email:"a"}} }, { updateOne:{ filter:{email:"a"}, update:{ $set:{name:"A"}}}}])

18) Transactions

Multi-document ACID transactions across a replica set or sharded cluster. Use for truly atomic multi-collection updates; keep them short.

const s = db.getMongo().startSession();
s.startTransaction({ writeConcern:{ w:"majority" } });
try { s.getDatabase("shop").orders.insertOne(o); s.commitTransaction(); }
catch (e) { s.abortTransaction(); } finally { s.endSession(); }

19) Change Streams

Subscribe to real-time data changes from the oplog: great for triggers, caching, and search indexing. Requires a replica set or sharded cluster.

db.collection.watch([{ $match:{ "operationType":{ $in:["insert","update"]}} }])

20) Q&A — “When to use transactions?”

Answer: When invariants span multiple documents/collections (e.g., money transfer, inventory + order). Otherwise prefer single-document atomic updates with operators for better performance.

Section 3 — Indexing, Aggregation & Search

21) Compound Indexes

Index order matters: a query can use a compound index if it filters on a leftmost prefix of the indexed fields, and an index matching your sort avoids in-memory sorts. Design around your most common $match + $sort.

db.orders.createIndex({ userId:1, createdAt:-1 })

22) Partial & Sparse Index

Partial indexes include only documents matching a filter expression, so they are smaller and faster. Sparse indexes include only documents where the indexed field exists; partial indexes are generally preferred.

db.users.createIndex({ "email":1 }, { partialFilterExpression:{ verified:true } })

23) TTL & Expire

TTL indexes automatically delete documents a fixed time after an indexed date field, or at the stored date itself when expireAfterSeconds is 0. Great for sessions and temp data.

db.sessions.createIndex({ "expiresAt":1 }, { expireAfterSeconds:0 })
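
The fixed time-to-live variant (here 24 hours after an assumed createdAt field):

db.logs.createIndex({ createdAt:1 }, { expireAfterSeconds: 86400 })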

24) Text & Atlas Search

Built-in text index supports stemming/scores; Atlas Search (Lucene-based) offers rich full-text, facets, and relevance tuning (managed in Atlas).

db.articles.createIndex({ content:"text", title:"text" })

25) Geospatial

2dsphere indexes for GeoJSON (Point, Polygon); queries like $near, $geoWithin. Store coords as [lng, lat].

db.places.createIndex({ loc:"2dsphere" })
db.places.find({ loc:{ $near:{ $geometry:{ type:"Point", coordinates:[-0.1,51.5] }, $maxDistance:5000 }}})

26) Aggregation Patterns

Use $match early, $project to reduce fields, and $group for rollups. $lookup for joins (beware big fan-out). $facet runs multiple pipelines in one pass.

db.sales.aggregate([
 { $match:{ ts:{ $gte:ISODate("2025-01-01") }}},
 { $group:{ _id:"$sku", qty:{ $sum:"$qty" }, rev:{ $sum:"$amount" } }},
 { $sort:{ rev:-1 } }, { $limit:10 }
])
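
A minimal $lookup sketch for the join case mentioned above (collection and field names assumed):

db.orders.aggregate([
 { $lookup:{ from:"users", localField:"userId", foreignField:"_id", as:"user" } },
 { $unwind:"$user" }
])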

27) Performance Explain

Use explain("executionStats") to inspect index use, scanned vs returned, stage tree. Fix COLLSCANs, add/adjust indexes.

db.orders.find({ userId:1 }).sort({ createdAt:-1 }).explain("executionStats")

28) Aggregation Windows

Window operators ($setWindowFields) compute moving averages, ranks, etc., over partitions—handy for analytics without exporting to Spark.

db.m.aggregate([{ $setWindowFields:{ partitionBy:"$userId", sortBy:{ts:1}, output:{ runAvg:{ $avg:"$v", window:{ documents:[-5,0] }}}}}])

29) Faceted Search

Build search pages with counts per facet using $facet to compute buckets in one pipeline, reducing extra round-trips.

db.products.aggregate([{ $match:{ inStock:true } },{ $facet:{ brands:[{ $sortByCount:"$brand"}], price:[{ $bucket:{ groupBy:"$price", boundaries:[0,50,100,200,1000], default:"other" }}] }}])

30) Q&A — “Compound index order?”

Answer: Put equality fields first, then sort/range fields. The index must support your $sort order; otherwise MongoDB sorts in memory. Design around your top N query patterns.
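
A small sketch of the rule (status/createdAt are assumed fields): equality on status first, then the sort field.

db.tickets.createIndex({ status:1, createdAt:-1 })
db.tickets.find({ status:"open" }).sort({ createdAt:-1 }) // sort satisfied by the index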

Section 4 — Performance, Operations & Security

31) Connection Pooling

Use driver pools; avoid creating clients per request. Configure timeouts and keep-alive. Monitor connection counts to primary/secondaries.

// Node.js
import { MongoClient } from "mongodb";
const client = new MongoClient(uri, { maxPoolSize: 50, maxIdleTimeMS: 60000, serverSelectionTimeoutMS: 5000 });

32) Write Patterns

Use single-document atomic ops. Avoid read-modify-write races by leveraging operators like $inc and $currentDate. Apply idempotency keys for at-least-once inputs.

db.counters.updateOne({ _id:"order" }, { $inc:{ seq:1 } }, { upsert:true })

33) Hot Partitions & Shard Keys

Avoid monotonically increasing shard keys (e.g., timestamp) that hotspot the last chunk. Use hashed shard keys or compound keys with good distribution and query targeting.

sh.shardCollection("app.events", { userId:"hashed" })

34) Backups & PITR

Use Cloud/Atlas backups or filesystem snapshots with consistent checkpoints. Oplog-based Point-in-Time Restore (PITR) enables recovering to a timestamp.

# Atlas: continuous backups + PITR window (UI/CLI)
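
For self-managed clusters, a hedged shell example using mongodump's oplog capture (paths illustrative; --oplog requires a replica set):

# shell
mongodump --uri="mongodb://localhost:27017" --oplog --out=/backups/2025-06-01
mongorestore --oplogReplay /backups/2025-06-01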

35) Monitoring

Watch opcounters, page faults, queue lengths, replication lag, WiredTiger cache pressure, and slow queries. Enable the profiler with a low slowms threshold (or sampling) to find hotspots.

db.setProfilingLevel(1, 100) // log ops slower than 100ms
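
A few mongosh helpers that surface these metrics (output fields vary by version):

db.serverStatus().opcounters              // operation mix
db.serverStatus().wiredTiger.cache        // WT cache pressure
rs.printSecondaryReplicationInfo()        // replication lag per secondary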

36) Security Basics

Enable auth, enforce TLS, restrict network, use SCRAM users with least privilege, rotate keys. Encrypt at rest; use field-level encryption for PII where needed.

# mongod.conf
security:
  authorization: enabled
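
With authorization on, create least-privilege SCRAM users; a sketch (user and role names assumed):

use admin
db.createUser({ user:"appUser", pwd: passwordPrompt(), roles:[{ role:"readWrite", db:"shop" }] })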

37) Read/Write Patterns at Scale

Batch writes, use unordered bulk for higher throughput, set appropriate write concern. For reads, use projections to return only needed fields and indexes that support sort.

db.users.find({ active:true }, { email:1, name:1 })

38) Caching & Secondary Reads

Cache hot reads at the app layer. For replicas, use readPreference secondary/secondaryPreferred for non-critical analytics; understand eventual consistency (secondaries can lag).

// Node.js
coll.find(q, { readPreference: "secondaryPreferred" })

39) Multi-Tenancy

Per-tenant databases/collections, or a shared collection with tenantId in the shard key and tenant-scoped (compound or partial) unique indexes. Enforce via the app and (in Atlas) data access controls.

db.items.createIndex({ tenantId:1, key:1 }, { unique:true, partialFilterExpression:{ tenantId:{ $exists:true }}})

40) Q&A — “Why is my query slow?”

Answer: Missing/inefficient indexes, low selectivity, in-memory sort, large projections, or sharding causing scatter-gather. Fix with targeted compound indexes, projections, and query routing to the correct shard.

Section 5 — Recipes, Pitfalls & Interview Q&A

41) Pagination

Prefer range-based (bookmark) pagination over skip/limit for large offsets. Use an indexed field (e.g., _id or createdAt).

db.posts.find({ _id:{ $gt:lastId } }).sort({ _id:1 }).limit(20)

42) Unique Constraints

Enforce uniqueness with unique indexes; for scoped uniqueness use compound or partial unique indexes.

db.users.createIndex({ tenantId:1, email:1 }, { unique:true })

43) Upsert Idempotency

Design idempotent upserts using natural keys and $setOnInsert. Prevent duplicates in retries.

db.payments.updateOne({ extId }, { $set:{status:"paid"}, $setOnInsert:{ createdAt:new Date() } }, { upsert:true })

44) Soft Deletes

Use deletedAt timestamps and partial indexes that ignore deleted docs, or move to an archive collection.

db.items.createIndex({ email:1 }, { unique:true, partialFilterExpression:{ deletedAt:{ $exists:false }}})

45) Pre-Aggregations

Materialize aggregates (daily rollups) into separate collections via scheduled jobs/change streams to speed dashboards.

// write to reports.daily_sales { day, sku, qty, rev }
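
A sketch of the rollup pipeline (MongoDB 5.0+ for $dateTrunc; collection names assumed):

db.sales.aggregate([
 { $group:{ _id:{ day:{ $dateTrunc:{ date:"$ts", unit:"day" } }, sku:"$sku" }, qty:{ $sum:"$qty" }, rev:{ $sum:"$amount" } } },
 { $merge:{ into:{ db:"reports", coll:"daily_sales" }, whenMatched:"replace", whenNotMatched:"insert" } }
])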

46) Time-Series Collections

Use time-series collections for telemetry: automatic bucketing, compressed storage, and optimized queries on time/metadata.

db.createCollection("metrics",{ timeseries:{ timeField:"ts", metaField:"host", granularity:"minutes" }})

47) Large Files (GridFS)

For files >16MB, use GridFS to chunk and store; otherwise prefer object storage (S3) and keep only metadata/URLs in MongoDB.

# GridFS via drivers (fs.files, fs.chunks)
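
A minimal Node.js sketch using the driver's GridFSBucket (uri, database, and file names are assumptions):

// Node.js (ESM)
import { createReadStream } from "fs";
import { MongoClient, GridFSBucket } from "mongodb";
const client = await new MongoClient(uri).connect();
const bucket = new GridFSBucket(client.db("media"), { bucketName:"uploads" });
createReadStream("./report.pdf").pipe(bucket.openUploadStream("report.pdf"));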

48) Common Pitfalls

Designing schemas like SQL, unbounded arrays, deep skip/limit pagination, scatter-gather queries, monotonically increasing shard keys, and missing TTLs for ephemeral data.

// Fix with embedding, bucketing, range pagination, hashed/compound shard keys

49) Mini Checklist

  • Model around queries & workloads
  • Compound indexes support filter+sort
  • Use projections, avoid large docs
  • Backups + PITR tested
  • Auth/TLS + least privilege

50) Interview Q&A — 20 Practical Questions (Expanded)

1) Embedding vs referencing? Embed when data is accessed together and bounded; reference for many-to-many/unbounded or shared entities.

2) How to pick a shard key? High cardinality, even write distribution, and query targeting. Avoid hotspots; consider hashed or compound keys.

3) Write concern vs read concern? Write concern = durability guarantees; read concern = consistency level. Tune per op/txn.

4) When use transactions? Only when invariants span multiple documents/collections; keep txns short to avoid lock contention.

5) Indexes for sort + filter? Compound with equality fields first, then sort/range fields, matching sort order.

6) Avoid skip/limit? Use range pagination on indexed fields; skip still scans and discards the skipped documents, so cost grows with the offset.

7) Handle schema changes? Use schema versioning and migration scripts; keep validators lenient during rollout.

8) Why COLLSCAN? No matching index, low selectivity, or expression not supported by index. Add/rewrite index or reshape query.

9) Partial index use? Smaller, faster indexes when you only query a subset (e.g., active:true).

10) Change streams use cases? Cache invalidation, search indexing, CDC to warehouses, reactive systems.

11) Atlas Search vs text index? Atlas Search offers Lucene features, scoring, facets, and better relevance; text index is basic.

12) Time-series benefits? Compressed storage, automatic bucketing, faster time-bounded queries.

13) Monotonic shard key problem? Hot chunk receives all new writes; fix with hashed or compound shard keys.

14) Ensure idempotency? Natural keys + $setOnInsert, dedupe tokens, and unique indexes.

15) Geospatial basics? 2dsphere index, GeoJSON types, queries like $near/$geoWithin.

16) Profiler usage? Capture slow ops to tune indexes and queries; don’t leave at level 2 in prod.

17) Field-level encryption? Encrypt sensitive fields client-side; server stores ciphertext; keys via KMS.

18) Bulk vs single writes? Bulk reduces round-trips and improves throughput; unordered is fastest but partial failures need handling.

19) Aggregation optimization? $match early, projections, index-friendly operators, avoid unwinds on huge arrays without filters.

20) Replica set lag? Network/IO pressure; tune write concern, optimize secondaries, and monitor oplog window.