Azure Synapse Pocket Book — Uplatz

55+ deep-dive flashcards • Single column • SQL (Dedicated & Serverless) • Spark & Data Lake • Pipelines & Orchestration • Security • Performance • Interview Q&A

Cheat-friendly explanations • Readable T-SQL/Spark snippets • Production-oriented tips

Section 1 — Fundamentals

1) What is Azure Synapse Analytics?

Unified analytics: data integration, big-data (Spark), SQL analytics (serverless & dedicated pools), and a studio experience. Think “lakehouse + warehouse + pipelines”.

2) Core Building Blocks

Workspace (control plane), Linked Services (connections), Integration Runtimes (compute for pipelines), SQL pools (serverless/dedicated), Spark pools, Data Explorer (optional for logs/timeseries).

3) Serverless vs Dedicated SQL

Serverless = query files in ADLS, pay per TB processed, zero management. Dedicated = MPP warehouse with predictable performance and cost (DWU).

4) Spark in Synapse

Managed Spark pools for notebooks/jobs; Delta Lake support; integrates with lake databases and serverless external tables.

5) Data Lake (ADLS Gen2)

Synapse relies on ADLS Gen2 for storage. Organize zones (raw/curated/gold), enforce ACLs, and use Parquet/Delta for analytics.

6) Synapse Studio

Web UI with hubs: Data, Develop (notebooks/SQL scripts), Integrate (pipelines), Monitor, Manage. Great for end-to-end ops.

7) Typical Flow

Ingest → land in lake → transform (Spark/SQL) → warehouse tables → serve via Power BI/serverless endpoints → orchestrate with pipelines.

8) Cost Model

Serverless charges per TB scanned; dedicated charges per provisioned DWU; Spark charges per node/hour. Save with partition pruning, caching, pausing pools.

9) When to Use Synapse?

If you need unified lake/warehouse, governed pipelines, Spark+SQL side-by-side, and enterprise security with private networking.

10) Quick CLI Setup

# Create a dedicated SQL pool (example)
az synapse sql pool create --name dw --performance-level "DW200c" --resource-group rg --workspace-name syn-ws

Section 2 — Storage & Ingestion

11) External Tables (Serverless)

Map Parquet/CSV in ADLS to SQL tables for BI and ad-hoc queries.

CREATE EXTERNAL DATA SOURCE ds WITH ( LOCATION = 'https://<acct>.dfs.core.windows.net/lake' );
CREATE EXTERNAL FILE FORMAT ParquetFmt WITH ( FORMAT_TYPE = PARQUET );
-- Column list must match the Parquet schema (columns shown are illustrative)
CREATE EXTERNAL TABLE dbo.sales_ext
( SaleId int, CustomerId int, Region nvarchar(50), Amount decimal(18,2), SaleDate date )
WITH ( LOCATION = '/curated/sales/', DATA_SOURCE = ds, FILE_FORMAT = ParquetFmt );

12) OPENROWSET (Serverless)

Query files directly without external objects.

SELECT TOP 10 * FROM OPENROWSET(
  BULK 'https://<acct>.dfs.core.windows.net/lake/raw/iot/*.parquet',
  FORMAT='PARQUET') AS r;

13) COPY INTO (Dedicated)

Fast load from ADLS/Blob to dedicated tables, with parallelism and file patterns.

COPY INTO dbo.FactSales
FROM 'https://<acct>.dfs.core.windows.net/lake/curated/sales/'
WITH ( FILE_TYPE='PARQUET', MAXERRORS=0 );

14) CTAS & CETAS

CTAS creates and loads a warehouse table efficiently. CETAS writes query results back to the lake (Parquet/CSV).

-- CTAS
CREATE TABLE dbo.top_customers
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
AS SELECT ...;
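
A minimal CETAS sketch (serverless), reusing the ds data source and ParquetFmt file format from card 11; path and column names are illustrative:

-- CETAS: write query results back to the lake as Parquet
CREATE EXTERNAL TABLE dbo.sales_by_region_ext
WITH ( LOCATION = '/gold/sales_by_region/', DATA_SOURCE = ds, FILE_FORMAT = ParquetFmt )
AS
SELECT Region, SUM(Amount) AS TotalAmount
FROM dbo.sales_ext
GROUP BY Region;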

15) File Layout

Partition by date/region; aim for 256 MB–1 GB Parquet files; avoid millions of tiny files and compact with Spark.

16) Lake Databases

Define tables on top of files (Spark-managed); auto-exposed to serverless SQL for querying the same data.
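
A minimal PySpark sketch, assuming the workspace's default ADLS account; database, table, and path names are illustrative:

# Create a lake database and an external table over existing Parquet files
spark.sql("CREATE DATABASE IF NOT EXISTS lakedb")
spark.sql("""
  CREATE TABLE IF NOT EXISTS lakedb.sales
  USING PARQUET
  LOCATION 'abfss://lake@acct.dfs.core.windows.net/curated/sales/'
""")
# Shared metadata then exposes lakedb (dbo schema) to the serverless SQL endpoint as read-only tables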

17) Data Flows (Mapping)

Visual transformations that scale out on a Spark engine—good for ELT into lake/warehouse with schema mapping and surrogate keys.

18) Ingestion Patterns

Landing zone → validate → curate. Choose event-based (Event Grid), micro-batch (ADF copy), or streaming (Spark Structured Streaming).

Section 3 — SQL Analytics (Dedicated & Serverless)

19) Table Types

Dedicated pool supports heap, clustered index, and clustered columnstore tables (the default, best suited to large facts). Columnstore = compression + analytics speed.

20) Distribution Styles

HASH(key) to co-locate joins; ROUND_ROBIN for staging; REPLICATE for small dimensions. Choose to minimize data movement.
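
A minimal sketch of the staging and dimension patterns (table and column names are illustrative); the HASH-distributed fact appears in the partitioning example under card 23:

-- Staging: spread rows evenly, no index overhead during load
CREATE TABLE dbo.Sales_Stage ( SaleId int, CustomerId int, Amount decimal(18,2) )
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );

-- Small dimension: a full copy on every compute node avoids join data movement
CREATE TABLE dbo.DimRegion ( RegionId int NOT NULL, RegionName nvarchar(50) )
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED INDEX (RegionId) );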

21) Statistics

Keep stats up-to-date for optimal plans.

UPDATE STATISTICS dbo.FactSales WITH FULLSCAN;

22) Workload Management

Resource classes & workload groups control concurrency and memory. Assign users to classes (smallrc/mediumrc/…).

EXEC sp_addrolemember 'mediumrc', 'bi_reader';
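
A minimal workload-isolation sketch to complement resource classes (group, classifier, and login names are illustrative):

CREATE WORKLOAD GROUP wgELT
WITH ( MIN_PERCENTAGE_RESOURCE = 30,
       CAP_PERCENTAGE_RESOURCE = 60,
       REQUEST_MIN_RESOURCE_GRANT_PERCENT = 6 );

CREATE WORKLOAD CLASSIFIER wcELT
WITH ( WORKLOAD_GROUP = 'wgELT', MEMBERNAME = 'etl_loader', IMPORTANCE = HIGH );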

23) Partitioning

Range partition large fact tables by date; improves maintenance and sliding windows.
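
A hypothetical date-partitioned fact, HASH-distributed on the join key (columns and boundary values are illustrative):

CREATE TABLE dbo.FactSales
( SaleId int NOT NULL, CustomerId int NOT NULL, OrderDateKey int NOT NULL, Amount decimal(18,2) )
WITH
( DISTRIBUTION = HASH(CustomerId),
  CLUSTERED COLUMNSTORE INDEX,
  PARTITION ( OrderDateKey RANGE RIGHT FOR VALUES (20250101, 20250201, 20250301) ) );

Sliding windows then rotate old data with ALTER TABLE ... SWITCH PARTITION instead of bulk deletes.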

24) Materialized Views

Pre-compute aggregations/joins; automatic rewrite accelerates queries.
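
A hedged example over the illustrative dbo.FactSales from card 23; the optimizer can rewrite matching aggregate queries to hit the view:

CREATE MATERIALIZED VIEW dbo.mvSalesByCustomer
WITH ( DISTRIBUTION = HASH(CustomerId) )
AS
SELECT CustomerId, COUNT_BIG(*) AS SalesCount, SUM(ISNULL(Amount, 0)) AS TotalAmount
FROM dbo.FactSales
GROUP BY CustomerId;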

25) PolyBase vs COPY INTO

Prefer COPY INTO for loads; PolyBase external tables for ongoing federation. Validate file formats and column types.

26) Temp vs Staging

Use staging tables (ROUND_ROBIN) for fast loads → transform to production tables (HASH/REPLICATE) via CTAS.

27) Serverless Tips

Use OPENROWSET with an explicit WITH clause for column names & types; prune IO by filtering on partition folders via the filepath() function (e.g., WHERE r.filepath(1) = '2025'), as in the sketch below.
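
A minimal sketch combining an explicit column list with filepath() pruning (account, folder layout, and columns are illustrative):

SELECT r.filepath(1) AS sale_year, COUNT(*) AS row_count
FROM OPENROWSET(
  BULK 'https://<acct>.dfs.core.windows.net/lake/curated/sales/year=*/month=*/*.parquet',
  FORMAT = 'PARQUET'
) WITH ( SaleId int, Amount decimal(18,2) ) AS r
WHERE r.filepath(1) = '2025'   -- only folders under year=2025 are read
GROUP BY r.filepath(1);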

28) Caching & Performance (Serverless)

Serverless caches file metadata and statistics between queries; result-set caching is a dedicated-pool feature. Avoid SELECT *; project only the columns you need; store data as compressed Parquet.

29) Concurrency & Queues (Dedicated)

High concurrency may queue requests. Use workload groups to isolate ETL vs BI and scale DWU during heavy windows.

30) Query Perf Insight

Use built-in monitoring to find heavy queries, data skews, or data movement operations (DMS).

Section 4 — Spark & Lakehouse

31) Spark Pools

Configure node sizes, autoscale, and auto-pause. Use small dev pools and larger job pools. Pin runtime version for compat.

32) PySpark Example

# Read raw sales Parquet from ADLS Gen2 via the abfss:// driver
df = spark.read.parquet("abfss://lake@acct.dfs.core.windows.net/raw/sales/")
# Aggregate and write the result back to the curated zone
df.groupBy("country").count().write.mode("overwrite").parquet(".../curated/sales_by_country/")

33) Delta Lake

ACID transactions, MERGE/UPSERT, time travel, and auto-optimize. Use Delta for curated/gold layers.

from delta.tables import DeltaTable
DeltaTable.forPath(spark, "/lake/curated/orders").alias("t") \
 .merge(sourceDF.alias("s"), "t.id = s.id") \
 .whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

34) Spark ↔ Serverless SQL

Create Lake databases or use CETAS to write Parquet, then query with serverless (or external tables).

35) Optimize Small Files

Use coalesce/repartition and OPTIMIZE (Delta) to compact. Target 256 MB–1 GB per file.
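
A hedged compaction sketch, assuming a Spark runtime with Delta Lake 2.x or later; paths are illustrative:

from delta.tables import DeltaTable

# Delta: bin-pack small files in place
dt = DeltaTable.forPath(spark, "abfss://lake@acct.dfs.core.windows.net/curated/orders/")
dt.optimize().executeCompaction()

# Plain Parquet: rewrite the folder with fewer, larger files
df = spark.read.parquet("abfss://lake@acct.dfs.core.windows.net/raw/events/")
df.repartition(16).write.mode("overwrite").parquet("abfss://lake@acct.dfs.core.windows.net/curated/events/")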

36) UDFs & Libraries

Attach wheel/jar packages to Spark pools. Cache hot datasets with df.cache() for iterative notebooks.

37) Structured Streaming

Process Kafka/Event Hub streams; checkpoint to ADLS; write Delta sinks for exactly-once semantics.
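
A minimal file-source sketch with a Delta sink and ADLS checkpointing (schema and paths are illustrative; an Event Hubs/Kafka source would swap in at readStream):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

iot_schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
    StructField("eventTime", TimestampType()),
])

stream = (spark.readStream
          .schema(iot_schema)   # file sources require an explicit schema
          .parquet("abfss://lake@acct.dfs.core.windows.net/raw/iot/"))

(stream.writeStream
       .format("delta")
       .option("checkpointLocation", "abfss://lake@acct.dfs.core.windows.net/_checkpoints/iot/")
       .outputMode("append")
       .start("abfss://lake@acct.dfs.core.windows.net/curated/iot/"))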

38) ML in Synapse

Use Spark MLlib or integrate with Azure ML for tracking/serving; keep features in Delta for online/offline parity.

Section 5 — Pipelines & Orchestration

39) Integration Runtimes

AutoResolve IR for cloud; Self-hosted IR for on-prem/private networks. Monitor IR health and throughput.

40) Activities

Copy, Data Flow, Notebook, Spark job, Stored Procedure, Web, Lookups, ForEach, If, Wait—compose end-to-end DAGs.

41) Triggers

Schedule, tumbling window, event-based (blob created). Use watermarking for incremental ingestion.

42) Parameters & Datasets

Parameterize file paths/dates; pass outputs between activities; store secrets in Key Vault-backed linked services.

43) Copy Activity Perf

Use Binary/Parquet formats, tuned DIUs, and parallel file patterns. Enable staged copy for cross-cloud or large moves.

44) CI/CD for Pipelines

Use Synapse workspace deployment artifacts (ARM/Bicep) or synapse-workspace-deployment action; separate dev/test/prod workspaces.

Section 6 — Security & Networking

45) RBAC vs ACLs

Azure RBAC secures workspace/services; ADLS Gen2 POSIX-style ACLs secure files/folders. Both are needed in lakehouse designs.

46) Managed Identity

Synapse uses workspace MSI for accessing storage, SQL, Key Vault; grant least-privilege roles (e.g., Storage Blob Data Contributor).

47) Private Endpoints

Lock down workspace, SQL pools, and storage with private endpoints; disable public network access for zero-trust.

48) Row/Column Security

Implement RLS/CLS on dedicated pool for governed access; use views to mask sensitive fields.
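
A hedged RLS sketch for a dedicated pool, following the classic predicate-function pattern (table, column, and account names are illustrative):

-- Each sales rep sees only their own rows; a manager account sees everything
CREATE FUNCTION dbo.fn_securitypredicate(@SalesRep sysname)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS fn_result
       WHERE @SalesRep = USER_NAME() OR USER_NAME() = 'SalesManager';
GO
CREATE SECURITY POLICY dbo.SalesFilter
ADD FILTER PREDICATE dbo.fn_securitypredicate(SalesRep) ON dbo.Sales
WITH (STATE = ON);
GO
-- Column-level security: expose only non-sensitive columns
GRANT SELECT ON dbo.Sales (SaleId, Region, Amount) TO bi_reader;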

49) Auditing & Defender

Enable auditing to Log Analytics/Storage; use Defender for SQL alerts on anomalous activities.

50) Credential Passthrough (Spark)

Use ACLs with AAD passthrough to enforce per-user lake permissions when running Spark.

51) Key Vault

Centralize secrets; reference in linked services; rotate regularly; avoid storing secrets in pipeline parameters.
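
In notebooks, a minimal sketch that pulls a secret through a Key Vault linked service instead of hard-coding it (vault, secret, and linked-service names are illustrative):

from notebookutils import mssparkutils

# Fetch the secret at runtime via the workspace's Key Vault linked service
sas_token = mssparkutils.credentials.getSecret("kv-analytics", "adls-sas", "AzureKeyVaultLS")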

Section 7 — Monitoring, Cost & Reliability

52) Monitor Hub

Track pipeline runs, triggers, copy throughput, data flow metrics, Spark session details, and SQL queries.

53) Log Analytics + KQL

Send diagnostics to a workspace; build alerts for long-running jobs, failed copies, pool pause/resume, and excessive scans.

54) Cost Controls

Pause dedicated pools off-hours; increase DWU only during ETL; compact files to reduce serverless scans; cache hot datasets; avoid SELECT *.

55) Reliability

Retries with exponential backoff in pipelines, idempotent loads (MERGE with Delta/CTAS pattern), data quality checks before publish.

56) Troubleshooting Skew

Hash skew shows as long DMS waits; change distribution key or pre-shuffle; replicate small dimensions.
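
A quick skew check on a dedicated pool (table name is illustrative):

-- Uneven row counts across the 60 distributions indicate hash skew
DBCC PDW_SHOWSPACEUSED("dbo.FactSales");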

Section 8 — Patterns & Interview Q&A

57) Pattern — Lakehouse

Land raw in Parquet/Delta → curate with Spark → expose gold via serverless external tables for BI and ad-hoc SQL.

58) Pattern — Warehouse ETL

Stage (ROUND_ROBIN) → transform via CTAS → load final HASH-distributed columnstore → add materialized views for hot aggregates.

59) Interview — When serverless?

Answer: Ad-hoc exploration, lightweight BI on the lake, or infrequently refreshed dashboards; you pay per TB scanned and have no pool to manage.

60) Interview — Distribution key choice?

Answer: Pick a high-cardinality key used in big joins; avoid skewed keys; consider REPLICATE for small tables.

61) Interview — Speed up COPY INTO?

Answer: Parquet source, balanced file sizes with enough files to load in parallel, a larger resource class or DWU during the load window, and staging tables followed by CTAS.

62) Interview — Reduce serverless cost?

Answer: Partition folders with predicates, Parquet/Delta (columnar), select columns, compact small files, and cache results.

63) Production Checklist

Private endpoints, MSI + least privilege, DWU schedule + pause, data contracts & quality checks, lineage, CI/CD, monitoring & alerts, runbooks & DR.