Azure Synapse Pocket Book — Uplatz
55+ deep-dive flashcards • Single column • SQL (Dedicated & Serverless) • Spark & Data Lake • Pipelines & Orchestration • Security • Performance • Interview Q&A
Cheat-friendly explanations • Readable T-SQL/Spark snippets • Production-oriented tips
1) What is Azure Synapse Analytics?
Unified analytics: data integration, big-data (Spark), SQL analytics (serverless & dedicated pools), and a studio experience. Think “lakehouse + warehouse + pipelines”.
2) Core Building Blocks
Workspace (control plane), Linked Services (connections), Integration Runtimes (compute for pipelines), SQL pools (serverless/dedicated), Spark pools, Data Explorer (optional for logs/timeseries).
3) Serverless vs Dedicated SQL
Serverless = query files in ADLS, pay per TB processed, zero management. Dedicated = MPP warehouse with predictable performance and cost (DWU).
4) Spark in Synapse
Managed Spark pools for notebooks/jobs; Delta Lake support; integrates with lake databases and serverless external tables.
5) Data Lake (ADLS Gen2)
Synapse relies on ADLS Gen2 for storage. Organize zones (raw/curated/gold), enforce ACLs, and use Parquet/Delta for analytics.
6) Synapse Studio
Web UI with hubs: Data, Develop (notebooks/SQL scripts), Integrate (pipelines), Monitor, Manage. Great for end-to-end ops.
7) Typical Flow
Ingest → land in lake → transform (Spark/SQL) → warehouse tables → serve via Power BI/serverless endpoints → orchestrate with pipelines.
8) Cost Model
Serverless charges per TB scanned; dedicated charges per provisioned DWU; Spark charges per node/hour. Save with partition pruning, caching, pausing pools.
9) When to Use Synapse?
If you need unified lake/warehouse, governed pipelines, Spark+SQL side-by-side, and enterprise security with private networking.
10) Quick CLI Setup
# Create dedicated SQL pool (example)
az synapse sql pool create --name dw --performance-level "DW200c" -g rg --workspace-name syn-ws
11) External Tables (Serverless)
Map Parquet/CSV in ADLS to SQL tables for BI and ad-hoc queries.
CREATE EXTERNAL DATA SOURCE ds WITH ( LOCATION = 'https://<acct>.dfs.core.windows.net/lake' );
CREATE EXTERNAL FILE FORMAT ParquetFmt WITH ( FORMAT_TYPE = PARQUET );
CREATE EXTERNAL TABLE dbo.sales_ext
( SaleId INT, Country VARCHAR(50), Amount DECIMAL(18,2) )   -- columns are illustrative; match your Parquet schema
WITH ( LOCATION = '/curated/sales/', DATA_SOURCE = ds, FILE_FORMAT = ParquetFmt );
12) OPENROWSET (Serverless)
Query files directly without external objects.
SELECT TOP 10 * FROM OPENROWSET(
BULK 'https://<acct>.dfs.core.windows.net/lake/raw/iot/*.parquet',
FORMAT='PARQUET') AS r;
13) COPY INTO (Dedicated)
Fast load from ADLS/Blob to dedicated tables, with parallelism and file patterns.
COPY INTO dbo.FactSales
FROM 'https://<acct>.dfs.core.windows.net/lake/curated/sales/'
WITH ( FILE_TYPE='PARQUET', MAXERRORS=0 );
14) CTAS & CETAS
CTAS creates and loads a warehouse table efficiently. CETAS writes query results back to the lake (Parquet/CSV).
-- CTAS
CREATE TABLE dbo.top_customers
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
AS SELECT ...;
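A minimal CETAS sketch (serverless), reusing the data source and file format from card 11; the output folder name is illustrative and must be empty, and the data source needs a credential with write access:
-- CETAS: write query results back to the lake as Parquet
CREATE EXTERNAL TABLE dbo.sales_gold
WITH ( LOCATION = '/gold/sales/', DATA_SOURCE = ds, FILE_FORMAT = ParquetFmt )
AS
SELECT Country, SUM(Amount) AS TotalAmount
FROM dbo.sales_ext
GROUP BY Country;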
15) File Layout
Partition by date/region; aim for 256MB–1GB Parquet files; avoid millions of tiny files—compact with Spark.
16) Lake Databases
Define tables on top of files (Spark-managed); auto-exposed to serverless SQL for querying the same data.
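A minimal lake-database sketch in Spark (database, table, and path names are illustrative); Parquet/CSV/Delta tables created this way are synced and visible to serverless SQL:
spark.sql("CREATE DATABASE IF NOT EXISTS lakedb")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakedb.sales
    USING PARQUET
    LOCATION 'abfss://lake@acct.dfs.core.windows.net/curated/sales/'
""")
# query lakedb.sales from a serverless SQL endpoint once metadata sync completes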
17) Data Flows (Mapping)
Visual transformations that scale out on a Spark engine—good for ELT into lake/warehouse with schema mapping and surrogate keys.
18) Ingestion Patterns
Landing zone → validate → curate. Choose event-based (Event Grid), micro-batch (ADF copy), or streaming (Spark Structured Streaming).
19) Table Types
Dedicated pool supports heap, clustered, clustered columnstore (default for facts). Columnstore = compression + analytics speed.
20) Distribution Styles
HASH(key) to co-locate joins; ROUND_ROBIN for staging; REPLICATE for small dimensions. Choose to minimize data movement.
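A sketch of the three styles (table and column names are illustrative):
CREATE TABLE dbo.FactSales ( SaleId BIGINT, CustomerId INT, Amount DECIMAL(18,2) )
WITH ( DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX );  -- co-locate joins on CustomerId
CREATE TABLE dbo.SalesStage ( SaleId BIGINT, CustomerId INT, Amount DECIMAL(18,2) )
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );                              -- fast landing table for loads
CREATE TABLE dbo.DimCountry ( CountryId INT, CountryName VARCHAR(50) )
WITH ( DISTRIBUTION = REPLICATE );                                      -- small dimension, copied to every node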
21) Statistics
Keep stats up-to-date for optimal plans.
UPDATE STATISTICS dbo.FactSales WITH FULLSCAN;
22) Workload Management
Resource classes & workload groups control concurrency and memory. Assign users to classes (smallrc/mediumrc/…).
EXEC sp_addrolemember 'mediumrc', 'bi_reader';
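A workload-group sketch (names, percentages, and the classified login are illustrative):
CREATE WORKLOAD GROUP wgELT
WITH ( MIN_PERCENTAGE_RESOURCE = 30, CAP_PERCENTAGE_RESOURCE = 60, REQUEST_MIN_RESOURCE_GRANT_PERCENT = 3 );
CREATE WORKLOAD CLASSIFIER wcELT
WITH ( WORKLOAD_GROUP = 'wgELT', MEMBERNAME = 'etl_user', IMPORTANCE = HIGH );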
23) Partitioning
Range partition large fact tables by date; improves maintenance and sliding windows.
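A date-partitioned fact sketch (column names and boundary values are illustrative):
CREATE TABLE dbo.FactSales
( SaleId BIGINT, OrderDateKey INT, CustomerId INT, Amount DECIMAL(18,2) )
WITH
( DISTRIBUTION = HASH(CustomerId),
  CLUSTERED COLUMNSTORE INDEX,
  PARTITION ( OrderDateKey RANGE RIGHT FOR VALUES (20250101, 20250201, 20250301) ) );
-- switch partitions in/out monthly for sliding-window maintenance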
24) Materialized Views
Pre-compute aggregations/joins; automatic rewrite accelerates queries.
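A materialized-view sketch (aggregates, distribution, and column names are illustrative):
CREATE MATERIALIZED VIEW dbo.mvSalesByCustomer
WITH ( DISTRIBUTION = HASH(CustomerId) )
AS
SELECT CustomerId, SUM(Amount) AS TotalAmount, COUNT_BIG(*) AS SaleCount
FROM dbo.FactSales
GROUP BY CustomerId;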
25) PolyBase vs COPY INTO
Prefer COPY INTO for loads; PolyBase external tables for ongoing federation. Validate file formats and column types.
26) Temp vs Staging
Use staging tables (ROUND_ROBIN) for fast loads → transform to production tables (HASH/REPLICATE) via CTAS.
27) Serverless Tips
Use OPENROWSET with a WITH clause to specify columns & types, and leverage partition folders in predicates (e.g., filter on filepath()) to prune IO.
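A pruning sketch using the filepath() function (container, folder layout, and columns are illustrative):
SELECT r.deviceId, AVG(r.reading) AS avg_reading
FROM OPENROWSET(
    BULK 'https://<acct>.dfs.core.windows.net/lake/raw/iot/year=*/month=*/*.parquet',
    FORMAT = 'PARQUET'
) WITH ( deviceId VARCHAR(50), reading FLOAT ) AS r
WHERE r.filepath(1) = '2025' AND r.filepath(2) = '08'   -- only scan the 2025/08 folders
GROUP BY r.deviceId;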
28) Caching & Performance (Serverless)
Serverless caches file metadata and query results for a short time. Avoid SELECT *; project only the columns you need; store data as compressed Parquet.
29) Concurrency & Queues (Dedicated)
High concurrency may queue requests. Use workload groups to isolate ETL vs BI and scale DWU during heavy windows.
30) Query Perf Insight
Use built-in monitoring to find heavy queries, data skews, or data movement operations (DMS).
31) Spark Pools
Configure node sizes, autoscale, and auto-pause. Use small dev pools and larger job pools. Pin runtime version for compat.
32) PySpark Example
df = spark.read.parquet("abfss://lake@acct.dfs.core.windows.net/raw/sales/")
df.groupBy("country").count().write.mode("overwrite").parquet(".../curated/sales_by_country/")
33) Delta Lake
ACID transactions, MERGE/UPSERT, time travel, and auto-optimize. Use Delta for curated/gold layers.
from delta.tables import DeltaTable
DeltaTable.forPath(spark, "/lake/curated/orders").alias("t") \
.merge(sourceDF.alias("s"), "t.id = s.id") \
.whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
34) Spark ↔ Serverless SQL
Create lake databases or use CETAS to write Parquet, then query with serverless SQL (or external tables).
35) Optimize Small Files
Use coalesce/repartition and OPTIMIZE (Delta) to compact; target 256MB–1GB per file.
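A compaction sketch in Spark (paths and the partition count are illustrative; with Delta tables you can run OPTIMIZE instead):
# read the small files, rewrite them as fewer, larger files
df = spark.read.parquet("abfss://lake@acct.dfs.core.windows.net/curated/events/")
(df.repartition(16)                     # pick a count that yields ~256MB-1GB files
   .write.mode("overwrite")
   .parquet("abfss://lake@acct.dfs.core.windows.net/curated/events_compacted/"))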
36) UDFs & Libraries
Attach wheel/jar packages to Spark pools. Cache hot datasets with df.cache() for iterative notebooks.
37) Structured Streaming
Process Kafka/Event Hub streams; checkpoint to ADLS; write Delta sinks for exactly-once semantics.
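A minimal file-streaming sketch into a Delta sink (schema, paths, and checkpoint location are illustrative; Event Hubs/Kafka sources need their connector options instead):
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

schema = (StructType()
    .add("deviceId", StringType())
    .add("reading", DoubleType())
    .add("eventTime", TimestampType()))

stream = (spark.readStream
    .schema(schema)
    .parquet("abfss://lake@acct.dfs.core.windows.net/raw/iot/"))

(stream.writeStream
    .format("delta")
    .option("checkpointLocation", "abfss://lake@acct.dfs.core.windows.net/_checkpoints/iot/")
    .outputMode("append")
    .start("abfss://lake@acct.dfs.core.windows.net/curated/iot/"))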
38) ML in Synapse
Use Spark MLlib or integrate with Azure ML for tracking/serving; keep features in Delta for online/offline parity.
39) Integration Runtimes
AutoResolve IR for cloud; Self-hosted IR for on-prem/private networks. Monitor IR health and throughput.
40) Activities
Copy, Data Flow, Notebook, Spark job, Stored Procedure, Web, Lookups, ForEach, If, Wait—compose end-to-end DAGs.
41) Triggers
Schedule, tumbling window, event-based (blob created). Use watermarking for incremental ingestion.
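A watermarking sketch for incremental ingestion (table and column names are illustrative): a Lookup activity reads the watermark, the Copy activity filters the source on it, and a Stored Procedure/Script activity advances it after success.
CREATE TABLE dbo.WatermarkTable ( TableName VARCHAR(100), WatermarkValue DATETIME2 );

-- Lookup activity: fetch the last watermark
SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'SourceSales';

-- after a successful copy: advance the watermark
UPDATE dbo.WatermarkTable
SET WatermarkValue = (SELECT MAX(ModifiedDate) FROM dbo.SourceSales)
WHERE TableName = 'SourceSales';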
42) Parameters & Datasets
Parameterize file paths/dates; pass outputs between activities; store secrets in Key Vault-backed linked services.
43) Copy Activity Perf
Use Binary/Parquet formats, tuned DIUs, and parallel file patterns. Enable staged copy for cross-cloud or large moves.
44) CI/CD for Pipelines
Use Synapse workspace deployment artifacts (ARM/Bicep) or synapse-workspace-deployment action; separate dev/test/prod workspaces.
45) RBAC vs ACLs
Azure RBAC secures workspace/services; ADLS POSIX ACLs secure files/folders. Both are needed in lakehouse designs.
46) Managed Identity
Synapse uses workspace MSI for accessing storage, SQL, Key Vault; grant least-privilege roles (e.g., Storage Blob Data Contributor).
47) Private Endpoints
Lock down workspace, SQL pools, and storage with private endpoints; disable public network access for zero-trust.
48) Row/Column Security
Implement RLS/CLS on dedicated pool for governed access; use views to mask sensitive fields.
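A minimal RLS sketch on the dedicated pool (predicate logic, table, and column names are illustrative):
CREATE FUNCTION dbo.fn_SalesRegionFilter (@Region AS VARCHAR(20))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS allowed
       WHERE @Region = USER_NAME();   -- map users/groups to regions as needed

CREATE SECURITY POLICY dbo.SalesFilterPolicy
ADD FILTER PREDICATE dbo.fn_SalesRegionFilter(Region) ON dbo.FactSales
WITH (STATE = ON);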
49) Auditing & Defender
Enable auditing to Log Analytics/Storage; use Defender for SQL alerts on anomalous activities.
50) Credential Passthrough (Spark)
Use ACLs with AAD passthrough to enforce per-user lake permissions when running Spark.
51) Key Vault
Centralize secrets; reference in linked services; rotate regularly; avoid storing secrets in pipeline parameters.
52) Monitor Hub
Track pipeline runs, triggers, copy throughput, data flow metrics, Spark session details, and SQL queries.
53) Log Analytics + KQL
Send diagnostics to a workspace; build alerts for long-running jobs, failed copies, pool pause/resume, and excessive scans.
54) Cost Controls
Pause dedicated pools off-hours; increase DWU only during ETL windows; compact files to reduce serverless scans; cache hot datasets; avoid SELECT *.
55) Reliability
Retries with exponential backoff in pipelines, idempotent loads (MERGE with Delta/CTAS pattern), data quality checks before publish.
56) Troubleshooting Skew
Hash skew shows as long DMS waits; change distribution key or pre-shuffle; replicate small dimensions.
57) Pattern — Lakehouse
Land raw in Parquet/Delta → curate with Spark → expose gold via serverless external tables for BI and ad-hoc SQL.
58) Pattern — Warehouse ETL
Stage (ROUND_ROBIN) → transform via CTAS → load final HASH-distributed columnstore → add materialized views for hot aggregates.
59) Interview — When serverless?
Answer: Ad-hoc exploration, lightweight BI on the lake, or low-duty dashboards—pay per TB scanned, no pool to manage.
60) Interview — Distribution key choice?
Answer: Pick a high-cardinality key used in big joins; avoid skewed keys; consider REPLICATE for small tables.
61) Interview — Speed up COPY INTO?
Answer: Parquet source, balanced file sizes, increase parallelism, disable constraints during load, use staging & CTAS.
62) Interview — Reduce serverless cost?
Answer: Partition folders with predicates, Parquet/Delta (columnar), select columns, compact small files, and cache results.
63) Production Checklist
Private endpoints, MSI + least privilege, DWU schedule + pause, data contracts & quality checks, lineage, CI/CD, monitoring & alerts, runbooks & DR.