BigQuery Pocket Book — Uplatz
~60 deep-dive flashcards • Single column • Storage & Tables • SQL & Arrays • Performance & Cost • Security • Ingestion • Ops • Interview Q&A
Cheat-friendly explanations • Readable SQL/CLI snippets • Production-oriented tips
1) What is BigQuery?
Serverless, highly parallel data warehouse on Google Cloud. Separate storage/compute, ANSI SQL, petabyte scale, pay per data processed or via slot capacity.
# CLI auth & default project
gcloud auth login
gcloud config set project <PROJECT_ID>
2) Core Objects
Project → Dataset (schema) → Table/View. Locations: single region or multi-region; keep data close to compute and sources.
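A quick CLI tour (project/dataset/table names are illustrative):
# List datasets, then inspect a table's schema
bq ls myproj:
bq show --schema myproj:analytics_raw.events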
3) Pricing Basics
Storage billed per GB-month; queries billed per TB scanned (on-demand) or covered by slot reservations (capacity). Use a dry run to estimate bytes before executing.
# Estimate bytes scanned before running (a dry run costs nothing)
bq query --use_legacy_sql=false --dry_run 'SELECT * FROM `proj.ds.table`'
4) SQL Dialect
Standard SQL (default). Supports arrays, structs, JSON type, window functions, geospatial, ML, and procedural scripting.
5) When to Use BigQuery?
Analytics and BI on large datasets, ELT on lake data, semi-structured JSON logs, joins across massive tables, and ML close to data.
6) Regions & Storage
Pick a single region or multi-region (e.g., US/EU). Data residency matters for compliance and latency to GCS/GA4/Dataflow.
7) Projects & Datasets
One dataset per domain or environment. Name with clarity: analytics_raw, analytics_curated, bi.
bq mk --dataset --location=US myproj:analytics_raw
8) Time Travel & Snapshots
Query previous table versions (up to the 7-day time-travel window) to recover from accidental deletes/overwrites.
SELECT * FROM `proj.ds.sales`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR);
9) Query Caching
Identical queries over same data may return from cache at no cost. Disable if you need fresh reads.
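To force a fresh read, disable the cache per query (a UI query setting, or the CLI flag below):
# Bypass the result cache
bq query --use_legacy_sql=false --nouse_cache 'SELECT COUNT(*) FROM `proj.ds.t`'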
10) Console & CLI
Use BigQuery Studio (web), the bq CLI, or client libraries. Prefer versioned SQL files in Git for repeatability.
11) Create Dataset/Table
-- DDL
CREATE SCHEMA IF NOT EXISTS `proj.analytics`;
CREATE TABLE `proj.analytics.events`(
event_time TIMESTAMP,
user_id STRING,
attrs JSON
);
12) Partitioning
Partition by DATE/TIMESTAMP column or by ingestion time. Prunes bytes scanned and speeds queries.
CREATE TABLE `proj.analytics.events_p`
PARTITION BY DATE(event_time) AS SELECT * FROM `proj.analytics.events`;
13) Clustering
Cluster by up to 4 columns (e.g., user_id, country) to improve predicate/aggregation performance.
CREATE TABLE `proj.analytics.events_c`
PARTITION BY DATE(event_time)
CLUSTER BY user_id, country AS SELECT * FROM `proj.analytics.events`;
14) Ingestion-Time Partitioning
Ideal for logs; use the _PARTITIONDATE pseudo column for pruning and partition decorators.
-- CTAS can't target _PARTITIONDATE: create the table, then insert
CREATE TABLE `proj.logs.app` (payload JSON) PARTITION BY _PARTITIONDATE; -- schema illustrative
INSERT INTO `proj.logs.app` (payload) SELECT payload FROM `proj.tmp.src`;
15) Decorators
Partition ($) and snapshot (@) decorators work with bq CLI commands and legacy SQL; in GoogleSQL, filter the pseudo column or use FOR SYSTEM_TIME AS OF.
# bq CLI (quote the $):
bq cp 'proj:ds.tbl$20250820' proj:ds.tbl_backup
-- GoogleSQL equivalent: prune via the pseudo column
SELECT COUNT(*) FROM `proj.ds.tbl` WHERE _PARTITIONDATE = '2025-08-20';
16) External Tables & BigLake
Query Parquet/CSV/JSON in GCS without loading; add governance via BigLake for consistent security across lake+warehouse.
CREATE EXTERNAL TABLE `proj.ext.web`
WITH CONNECTION `proj.us.gcs`
OPTIONS (format = 'PARQUET', uris = ['gs://bucket/curated/web/*.parquet']); -- single * wildcard only
17) Schemas & Types
Prefer typed columns over STRING; use NUMERIC/BIGNUMERIC for money; use JSON for semi-structured data with JSON functions.
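Typed DDL in that spirit (names are illustrative):
CREATE TABLE `proj.ds.orders`(
order_id STRING,
amount NUMERIC, -- exact decimal for money
attrs JSON      -- semi-structured payload
);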
18) Views & Materialized Views
Views = virtual SQL. Materialized views = precomputed and auto-refreshed for speed/cost reduction.
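A minimal materialized view sketch (source table and columns are illustrative):
CREATE MATERIALIZED VIEW `proj.bi.daily_rev` AS
SELECT DATE(event_time) AS d, SUM(amount) AS revenue
FROM `proj.analytics.events`
GROUP BY d;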
19) Table Lifecycle
Set default expiration per dataset/table to control storage cost; use labels for ownership and cleanup.
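Expiration and ownership labels via DDL (values are illustrative):
ALTER TABLE `proj.tmp.scratch` SET OPTIONS (
expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY),
labels = [('owner', 'data-eng')]
);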
20) Sharded Tables (legacy)
Prefer partitioned tables over sharded names like events_20250820. If you inherit sharded tables, migrate to partitioned for performance; until then, wildcards query shards together (see below).
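-- Wildcard table plus _TABLE_SUFFIX (prefix illustrative)
SELECT COUNT(*) FROM `proj.ds.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20250801' AND '20250820';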
21) Arrays & UNNEST
Arrays require UNNEST to flatten; re-aggregate with ARRAY_AGG.
SELECT user_id, item FROM `proj.ds.cart`,
UNNEST(items) AS item;
22) Structs
Nested records model JSON-like objects with types.
SELECT address.city, address.zip FROM `proj.ds.customers`;
23) JSON Type
Use JSON_VALUE/JSON_QUERY/JSON_EXTRACT to access fields.
SELECT JSON_VALUE(attrs,'$.device.os') AS os FROM `proj.analytics.events`;
24) Window Functions
Ranking, moving averages, percentiles.
SELECT user_id, amount,
AVG(amount) OVER (PARTITION BY user_id ORDER BY ts ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg7
FROM `proj.sales.txn`;
25) Geography
Spatial analysis with the GEOGRAPHY type and ST_* functions.
SELECT ST_DISTANCE(h.geom, a.geom) AS m FROM homes h, amenities a;
26) Time & Date
Use DATE/TIMESTAMP/DATETIME types and functions; convert time zones explicitly.
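Explicit zone handling (zone names are examples):
SELECT TIMESTAMP('2025-08-20 09:00:00', 'Europe/London') AS ts_utc,
DATETIME(CURRENT_TIMESTAMP(), 'America/New_York') AS local_dt;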
27) DDL/DML
ALTER TABLE `proj.ds.t` ADD COLUMN country STRING;
MERGE `proj.ds.t` T
USING `proj.ds.stage` S ON S.id=T.id
WHEN MATCHED THEN UPDATE SET amount=S.amount
WHEN NOT MATCHED THEN INSERT (id,amount) VALUES(S.id,S.amount);
28) Scripting
Multi-statement scripts with variables and control flow.
DECLARE d DATE DEFAULT CURRENT_DATE();
CREATE TEMP TABLE tmp AS SELECT * FROM `proj.ds.t` WHERE date = d;
29) UDFs
SQL or JavaScript UDFs for reusable logic.
CREATE TEMP FUNCTION doubleIt(x INT64) AS (x*2);
SELECT doubleIt(21);
30) Approximate Functions
APPROX_COUNT_DISTINCT and APPROX_QUANTILES reduce cost/latency for large scans.
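Example against the fact table from card 24:
SELECT APPROX_COUNT_DISTINCT(user_id) AS uniques,
APPROX_QUANTILES(amount, 100)[OFFSET(50)] AS p50_amount
FROM `proj.sales.txn`;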
31) Prune Early
Always filter on the partition column (WHERE date BETWEEN ...) and clustered columns; select only needed fields.
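A pruned query against the clustered table from card 13:
SELECT user_id, COUNT(*) AS events
FROM `proj.analytics.events_c`
WHERE DATE(event_time) BETWEEN '2025-08-01' AND '2025-08-20'
AND country = 'GB'
GROUP BY user_id;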
32) Dry Runs
Estimate bytes scanned to avoid surprises.
-- In UI: "Query settings" > "Dry run"
-- CLI: bq query --use_legacy_sql=false --dry_run --format=prettyjson "$(cat query.sql)"
33) Materialized Views & Caching
Precompute heavy joins/aggregations; BigQuery rewrites queries to use MVs automatically.
34) Result Cache vs Temp Tables
Cache = free if available; otherwise CREATE TEMP TABLE to reuse intermediate results across steps.
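Staging an intermediate once, reading it in later steps (names illustrative):
CREATE TEMP TABLE totals AS
SELECT user_id, SUM(amount) AS total FROM `proj.sales.txn` GROUP BY user_id;
SELECT COUNT(*) FROM totals WHERE total > 100;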
35) Denormalize (carefully)
Store nested, repeated fields (arrays/structs) to reduce joins; keep file/table granularity manageable.
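A nested, repeated layout (schema is illustrative):
CREATE TABLE `proj.ds.orders_n`(
order_id STRING,
items ARRAY<STRUCT<sku STRING, qty INT64, price NUMERIC>>
);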
36) Slot Reservations (Capacity)
Commit slots for predictable performance; flex/short-term options exist. Assign reservations to projects/folders.
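A CLI sketch against the Reservation API (admin project, location, slot count, and names are assumptions):
# Create a reservation, then assign a project's query jobs to it
bq mk --reservation --location=US --slots=100 prod_res
bq mk --reservation_assignment --reservation_id=admin-proj:US.prod_res --job_type=QUERY --assignee_type=PROJECT --assignee_id=myproj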
37) On-Demand Guardrails
Set maximum bytes billed per query to cap costs.
# Also a query setting in the UI; via CLI the job fails instead of overspending:
bq query --use_legacy_sql=false --maximum_bytes_billed=1000000000000 'SELECT ...'
38) Partition Evolution & Clustering
BigQuery re-clusters and optimizes partitions automatically over time; monitor with INFORMATION_SCHEMA.PARTITIONS.
39) Batch Load from GCS
Fastest/cheapest for large files.
bq load --source_format=PARQUET proj:analytics.events gs://bucket/curated/events/*.parquet
40) Storage Write API (Streaming)
High-throughput, low-latency inserts with exactly-once semantics and schema evolution.
41) Pub/Sub → Dataflow → BigQuery
Managed streaming pipeline with windowing, late data handling, and dead-letter queues.
42) Dataproc/Spark
Use Spark BigQuery connector to write Parquet/ORC results directly into BigQuery tables.
43) Dataform / Cloud Composer
SQL-first transformations (Dataform) and Airflow orchestration (Composer) for ELT DAGs with tests and dependencies.
44) MERGE & Change Data
Upsert CDC streams into partitioned/clustered tables to keep dashboards current at lower cost.
45) IAM Roles
Assign least privilege: bigquery.dataViewer, dataEditor, dataOwner, jobUser. Prefer groups over individuals.
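Dataset-level grants can also be expressed in SQL (principal is illustrative):
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `proj.analytics`
TO "group:analysts@company.com";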
46) Row-Level Security
Policies filter rows based on predicates.
CREATE ROW ACCESS POLICY region_filter
ON `proj.bi.customers`
GRANT TO ("group:emea-analysts@company.com")
FILTER USING (region = "EMEA");
47) Column-Level Security
Policy tags via Data Catalog; grant access to tags rather than columns directly (mask PII).
48) Encryption
AES-256 at rest by default; use CMEK for customer-managed keys on sensitive datasets; enforce with org policies.
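Setting a dataset default key via DDL (key path is a placeholder):
CREATE SCHEMA `proj.secure_ds`
OPTIONS (default_kms_key_name = 'projects/kms-proj/locations/us/keyRings/ring/cryptoKeys/key');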
49) Audit & Access Logs
Route Admin Activity/Data Access logs to Log Analytics or BigQuery for inspection. Monitor for exfiltration patterns.
50) Data Quality & Contracts
Validate schemas, nullability, and ranges using Dataform tests or custom INFORMATION_SCHEMA checks before publishing to BI.
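A lightweight gate using scripting's ASSERT (table and rule are illustrative):
ASSERT (SELECT COUNT(*) FROM `proj.analytics.events` WHERE user_id IS NULL) = 0
AS 'user_id must not be NULL';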
51) BI Engine
In-memory acceleration for BI tools (Looker/Looker Studio). Cache hot aggregates for sub-second dashboards.
52) BigQuery ML (BQML)
Train models in SQL (logistic/linear regression, k-means, time series, XGBoost, some deep models); predictions via ML.PREDICT.
CREATE OR REPLACE MODEL `proj.ml.churn`
OPTIONS(model_type='logistic_reg') AS
SELECT label, f1,f2,f3 FROM `proj.ds.train`;
53) Federated Queries
External connections to Cloud SQL/Spanner and others for near-real-time analytics without full ingestion (mind quotas/latency).
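EXTERNAL_QUERY pushes the inner statement down to the source database (connection id and source SQL are illustrative):
SELECT * FROM EXTERNAL_QUERY(
'proj.us.cloudsql_conn',
'SELECT id, status FROM orders WHERE updated_at > NOW() - INTERVAL 1 HOUR');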
54) Exports & Interop
Export tables/partitions to GCS (Parquet/Avro/CSV) for sharing with Spark/Dataproc/Vertex AI.
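EXPORT DATA writes query results straight to GCS (path is illustrative; the URI needs a single * wildcard):
EXPORT DATA OPTIONS (uri = 'gs://bucket/export/txn-*.parquet', format = 'PARQUET')
AS SELECT * FROM `proj.sales.txn` WHERE DATE(ts) = '2025-08-20';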
55) INFORMATION_SCHEMA
Introspect tables, columns, partitions, slots, and query history for optimization.
SELECT * FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);
56) Alerts & Quotas
Set scan/slot usage alerts; protect against runaway queries; set org policies for location, CMEK, and cross-region transfers.
57) Reliability & SLIs
Track p95 query latency, bytes scanned/user, storage growth, MV staleness, and failed jobs. Keep runbooks for hot paths.
58) Pattern — Medallion Lakehouse
Land raw (Bronze) in GCS → curate (Silver) with ELT in BQ → gold marts with MVs/partitioned tables for BI.
59) Pattern — Cost-Aware BI
Partition + cluster fact tables, MVs for heavy joins, BI Engine cache, guardrails on bytes billed, and query templates with required date filters.
60) Interview Q&A — Practical
Partition vs Cluster? Partition reduces scanned bytes by date/ingestion; clustering sorts data within partitions for predicates/aggregations.
On-demand vs Slots? On-demand = per TB scanned; Slots = reserved capacity for predictable performance.
Speed up a slow query? Add partition filters, cluster keys, avoid SELECT *, precompute with MVs, and check skew in INFORMATION_SCHEMA.