
BigQuery Pocket Book — Uplatz

~60 deep-dive flashcards • Single column • Storage & Tables • SQL & Arrays • Performance & Cost • Security • Ingestion • Ops • Interview Q&A

Cheat-friendly explanations • Readable SQL/CLI snippets • Production-oriented tips

Section 1 — Fundamentals

1) What is BigQuery?

Serverless, highly parallel data warehouse on Google Cloud. Separate storage/compute, ANSI SQL, petabyte scale, pay per data processed or via slot capacity.

# CLI auth & default project
gcloud auth login
gcloud config set project <PROJECT_ID>

2) Core Objects

Project → Dataset (schema) → Table/View. Regions: single or multi-region; keep data close to compute and sources.

3) Pricing Basics

Storage billed per GiB/month (active vs long-term rates); queries billed per TiB scanned (on-demand) or covered by slot reservations (capacity). Use a dry run to estimate bytes.

-- BigQuery has no EXPLAIN statement; estimate bytes with a dry run:
-- bq query --use_legacy_sql=false --dry_run 'SELECT * FROM `proj.ds.table`'
-- The console's query validator also shows estimated bytes as you type.

4) SQL Dialect

Standard SQL (default). Supports arrays, structs, JSON type, window functions, geospatial, ML, and procedural scripting.

5) When to Use BigQuery?

Analytics and BI on large datasets, ELT on lake data, semi-structured JSON logs, joins across massive tables, and ML close to data.

6) Regions & Storage

Pick a single region or multi-region (e.g., US/EU). Data residency matters for compliance and latency to GCS/GA4/Dataflow.

7) Projects & Datasets

One dataset per domain or environment. Name with clarity: analytics_raw, analytics_curated, bi.

bq mk --dataset --location=US myproj:analytics_raw

8) Time Travel & Snapshots

Query previous table versions (system time) for accidental deletes/overwrites.

SELECT * FROM `proj.ds.sales`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR);

9) Query Caching

Identical queries over same data may return from cache at no cost. Disable if you need fresh reads.
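Cache behavior is a job setting rather than SQL; a minimal sketch with the bq CLI (table name illustrative):

```shell
# Bypass the result cache to force a fresh read
bq query --use_legacy_sql=false --nouse_cache \
  'SELECT COUNT(*) FROM `proj.ds.sales`'
```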

10) Console & CLI

Use BigQuery Studio (web) or bq CLI and client libraries. Prefer versioned SQL files in Git for repeatability.

Section 2 — Storage, Tables & Layout

11) Create Dataset/Table

-- DDL
CREATE SCHEMA IF NOT EXISTS `proj.analytics`;
CREATE TABLE `proj.analytics.events`(
  event_time TIMESTAMP,
  user_id STRING,
  attrs JSON
);

12) Partitioning

Partition by DATE/TIMESTAMP column or by ingestion time. Prunes bytes scanned and speeds queries.

CREATE TABLE `proj.analytics.events_p`
PARTITION BY DATE(event_time) AS SELECT * FROM `proj.analytics.events`;

13) Clustering

Cluster up to 4 columns (e.g., user_id, country) to improve predicate/aggregation performance.

CREATE TABLE `proj.analytics.events_c`
PARTITION BY DATE(event_time)
CLUSTER BY user_id, country AS SELECT * FROM `proj.analytics.events`;

14) Ingestion-Time Partitioning

Ideal for logs; filter on the _PARTITIONTIME/_PARTITIONDATE pseudo-columns for cost pruning.

-- CTAS cannot populate the pseudo-column; define the schema, then load/stream
CREATE TABLE `proj.logs.app` (msg STRING, severity STRING)  -- illustrative columns
PARTITION BY _PARTITIONDATE;

15) Decorators

Partition ($) and snapshot (@) decorators work in the bq CLI and legacy SQL, not GoogleSQL; there, filter on _PARTITIONDATE or use FOR SYSTEM_TIME AS OF (card 8).

bq head -n 10 'proj:ds.tbl$20250820'               # peek at one partition
bq cp 'proj:ds.tbl@-3600000' proj:ds.tbl_restored  # copy the table as of 1 hour ago

16) External Tables & BigLake

Query Parquet/CSV/JSON in GCS without loading; add governance via BigLake for consistent security across lake+warehouse.

CREATE EXTERNAL TABLE `proj.ext.web`
WITH CONNECTION `proj.us.gcs`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://bucket/curated/web/*.parquet']  -- a single * wildcard is allowed
);

17) Schemas & Types

Prefer typed columns over STRING; use NUMERIC/BIGNUMERIC for money and the JSON type (with its accessor functions) for semi-structured data.
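A minimal sketch of these choices (table and column names illustrative):

```sql
CREATE TABLE `proj.ds.orders` (
  order_id  STRING,
  amount    NUMERIC,    -- exact decimal; safe for money
  placed_at TIMESTAMP,
  payload   JSON        -- semi-structured attributes
);
-- SAFE_CAST returns NULL instead of erroring on bad input
SELECT SAFE_CAST('12.99' AS NUMERIC) AS amount;
```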

18) Views & Materialized Views

Views = virtual SQL. Materialized views = precomputed and auto-refreshed for speed/cost reduction.
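For example, a materialized view over the events table from card 11 (aggregate chosen for illustration):

```sql
CREATE MATERIALIZED VIEW `proj.analytics.daily_events` AS
SELECT DATE(event_time) AS d, COUNT(*) AS events
FROM `proj.analytics.events`
GROUP BY d;
```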

19) Table Lifecycle

Set default expiration per dataset/table to control storage cost; use labels for ownership and cleanup.
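Both levels can be set in DDL; a sketch (names illustrative):

```sql
-- Expire one table 90 days from now
ALTER TABLE `proj.ds.staging_events`
SET OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 90 DAY));

-- Default expiration for new tables in a dataset
ALTER SCHEMA `proj.ds`
SET OPTIONS (default_table_expiration_days = 30);
```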

20) Sharded Tables (legacy)

Prefer partitioned tables over sharded names like events_20250820. If you inherit sharded, migrate to partitioned for performance.
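Until migrated, inherited shards can still be queried together with a wildcard table and _TABLE_SUFFIX pruning:

```sql
SELECT COUNT(*) AS n
FROM `proj.ds.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20250801' AND '20250820';
```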

Section 3 — Querying, Arrays/Structs & SQL Features

21) Arrays & UNNEST

Arrays require UNNEST to flatten; re-aggregate with ARRAY_AGG.

SELECT user_id, item FROM `proj.ds.cart`,
UNNEST(items) AS item;
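Re-aggregation goes the other way; a sketch assuming the same cart table:

```sql
-- Flatten, filter per element, then rebuild one row per user
SELECT user_id, ARRAY_AGG(item) AS kept_items
FROM `proj.ds.cart`, UNNEST(items) AS item
WHERE item IS NOT NULL
GROUP BY user_id;
```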

22) Structs

Nested records model JSON-like objects with types.

SELECT address.city, address.zip FROM `proj.ds.customers`;

23) JSON Type

Use JSON_VALUE/JSON_QUERY/JSON_EXTRACT to access fields.

SELECT JSON_VALUE(attrs,'$.device.os') AS os FROM `proj.analytics.events`;

24) Window Functions

Ranking, moving averages, percentiles.

SELECT user_id, amount,
AVG(amount) OVER (PARTITION BY user_id ORDER BY ts ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg7
FROM `proj.sales.txn`;

25) Geography

Spatial analysis with GEOGRAPHY and ST_* functions.

SELECT ST_DISTANCE(h.geom, a.geom) AS m FROM homes h, amenities a;

26) Time & Date

Use DATE/TIMESTAMP/DATETIME types and functions; convert time zones explicitly.
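Conversions take an explicit zone argument, for example:

```sql
SELECT
  CURRENT_TIMESTAMP()                             AS utc_now,
  DATETIME(CURRENT_TIMESTAMP(), 'Europe/London')  AS london_wall_clock,
  DATE(CURRENT_TIMESTAMP(), 'America/New_York')   AS ny_date;
```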

27) DDL/DML

ALTER TABLE `proj.ds.t` ADD COLUMN country STRING;
MERGE `proj.ds.t` T
USING `proj.ds.stage` S ON S.id=T.id
WHEN MATCHED THEN UPDATE SET amount=S.amount
WHEN NOT MATCHED THEN INSERT (id,amount) VALUES(S.id,S.amount);

28) Scripting

Multi-statement scripts with variables and control flow.

DECLARE d DATE DEFAULT CURRENT_DATE();
CREATE TEMP TABLE tmp AS SELECT * FROM `proj.ds.t` WHERE date = d;

29) UDFs

SQL or JavaScript UDFs for reusable logic.

CREATE TEMP FUNCTION doubleIt(x INT64) AS (x*2);
SELECT doubleIt(21);

30) Approximate Functions

APPROX_COUNT_DISTINCT, APPROX_QUANTILES reduce cost/latency for large scans.
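For example, over the transactions table from card 24:

```sql
SELECT
  APPROX_COUNT_DISTINCT(user_id)            AS approx_users,
  APPROX_QUANTILES(amount, 100)[OFFSET(50)] AS approx_median
FROM `proj.sales.txn`;
```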

Section 4 — Performance, Bytes Scanned & Cost Control

31) Prune Early

Always filter on partition column (WHERE date BETWEEN ...) and clustered columns; select only needed fields.
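Against the partitioned, clustered table from card 13, a pruned query looks like:

```sql
SELECT user_id, COUNT(*) AS n
FROM `proj.analytics.events_c`
WHERE DATE(event_time) BETWEEN '2025-08-01' AND '2025-08-20'  -- partition pruning
  AND country = 'DE'                                          -- cluster pruning
GROUP BY user_id;
```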

32) Dry Runs

Estimate bytes scanned to avoid surprises.

-- In UI: "Query settings" > "Dry run"
-- CLI: bq query --use_legacy_sql=false --dry_run --format=prettyjson "$(cat query.sql)"

33) Materialized Views & Caching

Precompute heavy joins/aggregations; BigQuery rewrites queries to use MVs automatically.

34) Result Cache vs Temp Tables

Cache = free if available; otherwise CREATE TEMP TABLE to reuse intermediary results across steps.

35) Denormalize (carefully)

Store nested, repeated fields (arrays/structs) to reduce joins; keep file/table granularity manageable.
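A sketch of a nested fact table (names illustrative):

```sql
-- Repeated line items live with the order: no order→items join needed
CREATE TABLE `proj.ds.orders_nested` (
  order_id  STRING,
  placed_at TIMESTAMP,
  items     ARRAY<STRUCT<sku STRING, qty INT64, price NUMERIC>>
);
```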

36) Slot Reservations (Capacity)

Commit slots for predictable performance; flex/short-term options exist. Assign reservations to projects/folders.

37) On-Demand Guardrails

Set maximum bytes billed per query to cap costs.

-- UI: set "Maximum bytes billed" in query settings, or via the CLI:
-- bq query --use_legacy_sql=false --maximum_bytes_billed=1000000000000 'SELECT ...'  # 1 TB cap

38) Partition Evolution & Clustering

BigQuery re-clusters data automatically in the background as partitions change; monitor partition sizes and row counts with INFORMATION_SCHEMA.PARTITIONS.
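The metadata query is dataset-scoped (table name from card 12):

```sql
SELECT partition_id, total_rows, total_logical_bytes
FROM `proj.analytics.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'events_p'
ORDER BY partition_id DESC;
```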

Section 5 — Ingestion, ETL/ELT & Streaming

39) Batch Load from GCS

Fastest/cheapest for large files.

bq load --source_format=PARQUET proj:analytics.events gs://bucket/curated/events/*.parquet

40) Storage Write API (Streaming)

High-throughput, low-latency inserts with exactly-once semantics and schema evolution.

41) Pub/Sub → Dataflow → BigQuery

Managed streaming pipeline with windowing, late data handling, and dead-letter queues.

42) Dataproc/Spark

Use Spark BigQuery connector to write Parquet/ORC results directly into BigQuery tables.

43) Dataform / Cloud Composer

SQL-first transformations (Dataform) and Airflow orchestration (Composer) for ELT DAGs with tests and dependencies.

44) MERGE & Change Data

Upsert CDC streams into partitioned, clustered tables with MERGE to keep dashboards current at lower cost.

Section 6 — Security, Governance & Compliance

45) IAM Roles

Assign least privilege: bigquery.dataViewer, dataEditor, dataOwner, jobUser. Prefer groups over individuals.
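Dataset-level grants can also be expressed in SQL DCL (group name illustrative):

```sql
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `proj.analytics`
TO "group:analysts@company.com";
```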

46) Row-Level Security

Policies filter rows based on predicates.

CREATE ROW ACCESS POLICY region_filter
ON `proj.bi.customers`
GRANT TO ("group:emea-analysts@company.com")
FILTER USING (region = "EMEA");

47) Column-Level Security

Policy tags via Data Catalog; grant access to tags rather than columns directly (mask PII).

48) Encryption

AES-256 at rest by default; use CMEK for customer-managed keys on sensitive datasets; enforce with org policies.

49) Audit & Access Logs

Route Admin Activity and Data Access logs to Log Analytics or a BigQuery sink for inspection; monitor for exfiltration patterns.

50) Data Quality & Contracts

Validate schemas, nullability, and ranges using Dataform tests or custom INFORMATION_SCHEMA checks before publishing to BI.
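A lightweight gate can be a plain ASSERT statement run before publishing (threshold illustrative):

```sql
-- Fails the script/job if the null rate breaches the threshold
ASSERT (
  SELECT COUNTIF(user_id IS NULL) / COUNT(*)
  FROM `proj.analytics.events`
) < 0.01
AS 'user_id null rate exceeds 1%';
```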

Section 7 — BI, ML & Ecosystem

51) BI Engine

In-memory acceleration for BI tools (Looker/Looker Studio). Cache hot aggregates for sub-second dashboards.

52) BigQuery ML (BQML)

Train models in SQL (logistic/linear reg, kmeans, time series, XGBoost, some deep models); predictions via ML.PREDICT.

CREATE OR REPLACE MODEL `proj.ml.churn`
OPTIONS(model_type='logistic_reg') AS
SELECT label, f1,f2,f3 FROM `proj.ds.train`;
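Scoring then uses ML.PREDICT (scoring table name assumed):

```sql
SELECT *
FROM ML.PREDICT(MODEL `proj.ml.churn`,
                TABLE `proj.ds.score`);
```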

53) Federated Queries

External connections to Cloud SQL/Spanner and others for near-real-time analytics without full ingestion (mind quotas/latency).
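With a configured connection (id illustrative), EXTERNAL_QUERY pushes SQL to the source database:

```sql
SELECT id, name
FROM EXTERNAL_QUERY(
  'proj.us.cloudsql_conn',
  'SELECT id, name FROM users;');
```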

54) Exports & Interop

Export tables/partitions to GCS (Parquet/Avro/CSV) for sharing with Spark/Dataproc/Vertex AI.
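The EXPORT DATA statement does this in SQL (paths illustrative; the URI must contain a wildcard):

```sql
EXPORT DATA OPTIONS (
  uri = 'gs://bucket/exports/txn/*.parquet',
  format = 'PARQUET',
  overwrite = true
) AS
SELECT * FROM `proj.sales.txn`
WHERE DATE(ts) = '2025-08-20';
```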

Section 8 — Operations, Monitoring & Reliability

55) INFORMATION_SCHEMA

Introspect tables, columns, partitions, slots, and query history for optimization.

SELECT * FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);

56) Alerts & Quotas

Set scan/slot usage alerts; protect against runaway queries; set org policies for location, CMEK, and cross-region transfers.

57) Reliability & SLIs

Track p95 query latency, bytes scanned/user, storage growth, MV staleness, and failed jobs. Keep runbooks for hot paths.
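p95 latency, for example, falls out of the jobs view from card 55:

```sql
SELECT APPROX_QUANTILES(
         TIMESTAMP_DIFF(end_time, start_time, MILLISECOND), 100
       )[OFFSET(95)] AS p95_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY' AND state = 'DONE';
```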

Section 9 — Patterns & Interview Q&A

58) Pattern — Medallion Lakehouse

Land raw (Bronze) in GCS → curate (Silver) with ELT in BQ → gold marts with MVs/partitioned tables for BI.

59) Pattern — Cost-Aware BI

Partition + cluster fact tables, MVs for heavy joins, BI Engine cache, guardrails on bytes billed, and query templates with required date filters.

60) Interview Q&A — Practical

Partition vs Cluster? Partition reduces scanned bytes by date/ingestion; clustering sorts data within partitions for predicates/aggregations.

On-demand vs Slots? On-demand = per TB scanned; Slots = reserved capacity for predictable performance.

Speed up a slow query? Add partition filters, cluster keys, avoid SELECT *, precompute with MVs, and check skew in INFORMATION_SCHEMA.