🔧 Data Engineering Cheat Sheet

Complete guide to data engineering tools, frameworks, architectures, and best practices

🎯 Core DE Concepts
🏗️ Data Pipeline Architecture

Extract, Transform, Load (ETL) vs Extract, Load, Transform (ELT)

ETL Pattern:
• Data transformed before loading
• Better for structured data
• Traditional on-premise approach

ELT Pattern:
• Raw data loaded first, transformed later
• Leverages cloud compute power
• Better for big data and unstructured data
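As a rough illustration of the two patterns, here is a minimal pandas/SQLAlchemy sketch (assuming SQLAlchemy 1.4+); the CSV file, warehouse connection URL, column names, and table names are all hypothetical:

import pandas as pd
from sqlalchemy import create_engine

warehouse = create_engine('postgresql://user:password@host:5432/dw')  # hypothetical target

# ETL: transform in the pipeline, then load only the cleaned result
raw = pd.read_csv('orders.csv')                        # hypothetical extract
clean = raw.dropna(subset=['order_id']).assign(
    amount_usd=lambda d: d['amount'] * d['fx_rate'])   # transform before loading
clean.to_sql('orders_clean', warehouse, if_exists='append', index=False)

# ELT: load the raw data as-is, transform later inside the warehouse
raw.to_sql('orders_raw', warehouse, if_exists='append', index=False)
with warehouse.begin() as conn:
    conn.exec_driver_sql("""
        CREATE TABLE IF NOT EXISTS orders_clean AS
        SELECT order_id, amount * fx_rate AS amount_usd
        FROM orders_raw
        WHERE order_id IS NOT NULL
    """)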

🏛️ Data Lake vs Data Warehouse

Data Lake:
• Stores raw, unstructured data
• Schema-on-read approach
• Cost-effective for big data
• Examples: S3, HDFS, Azure Data Lake

Data Warehouse:
• Structured, processed data
• Schema-on-write approach
• Optimized for analytics queries
• Examples: Snowflake, BigQuery, Redshift
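A small PySpark sketch of the schema-on-read vs schema-on-write distinction, assuming an existing `spark` session; the lake paths and table names are hypothetical:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

# Schema-on-read: land raw JSON in the lake and interpret it when you query
events = spark.read.json("s3a://lake/raw/events/")     # schema inferred at read time
events.createOrReplaceTempView("raw_events")
spark.sql("SELECT event_type, COUNT(*) FROM raw_events GROUP BY event_type").show()

# Schema-on-write: declare the schema up front before the data lands in the warehouse
orders_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("order_date", DateType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])
orders = spark.read.csv("s3a://lake/raw/orders/", header=True, schema=orders_schema)
orders.write.mode("append").saveAsTable("warehouse.orders")   # curated, typed table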

💎 Data Quality Framework

Six Dimensions of Data Quality:
Accuracy: Data reflects real-world values
Completeness: No missing critical data
Consistency: Data uniform across systems
Timeliness: Data available when needed
Validity: Data conforms to defined formats
Uniqueness: No unwanted duplicates
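The six dimensions above translate directly into programmable checks. A minimal pandas sketch, assuming a hypothetical DataFrame with `order_id`, `email`, and `order_date` columns:

import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Return a few illustrative metrics for the dimensions above."""
    return {
        # Completeness: share of non-null values per column
        "completeness": df.notna().mean().round(3).to_dict(),
        # Uniqueness: duplicate keys that should not exist
        "duplicate_order_ids": int(df["order_id"].duplicated().sum()),
        # Validity: values conforming to an expected format
        "invalid_emails": int((~df["email"].str.contains(r"@", na=False)).sum()),
        # Timeliness: age of the newest record in hours
        "hours_since_last_record": (pd.Timestamp.now() - df["order_date"].max())
                                   / pd.Timedelta(hours=1),
    }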

🛠️ Data Engineering Technology Stack

Data Processing Engines

Apache Spark: Unified analytics engine
Apache Kafka: Distributed streaming platform
Apache Flink: Stream processing framework
Apache Storm: Real-time computation system
Dask: Parallel computing in Python
Ray: Distributed computing framework
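As a quick taste of the Python-native options, a minimal Dask sketch (hypothetical file pattern and column names) that runs a pandas-style aggregation in parallel across many CSV files:

import dask.dataframe as dd

# Lazily reads every matching file as one logical DataFrame
events = dd.read_csv("events-2024-*.csv")          # hypothetical file pattern

# Same API shape as pandas; work only executes on .compute()
daily_totals = (events
                .groupby("event_date")["amount"]
                .sum()
                .compute())
print(daily_totals.head())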

🗄️ Storage Solutions

Object Storage: AWS S3, GCS, Azure Blob
Distributed File Systems: HDFS, GlusterFS
NoSQL: MongoDB, Cassandra, DynamoDB
Time Series: InfluxDB, TimescaleDB
Graph: Neo4j, Amazon Neptune
Search: Elasticsearch, Solr

🎼 Orchestration Tools

Apache Airflow: Workflow management platform
Prefect: Modern workflow orchestration
Dagster: Data orchestrator for ML/analytics
Luigi: Python workflow engine
Apache NiFi: Data flow automation
Argo Workflows: Kubernetes-native workflows
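A minimal Airflow sketch of a daily extract-transform-load DAG; the DAG id, schedule, and task callables are placeholders, and the API shown assumes Airflow 2.4+:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():      # placeholder: pull data from the source system
    ...

def transform():    # placeholder: clean and reshape the extracted data
    ...

def load():         # placeholder: write results to the warehouse
    ...

with DAG(
    dag_id="daily_orders_etl",          # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    t1 >> t2 >> t3      # run order: extract, then transform, then load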

🐍 Python for Data Engineering
🐼 Pandas Essentials

Data Loading & Basic Operations:

import pandas as pd

# Load data from various sources
df = pd.read_csv('data.csv')
df = pd.read_json('data.json')
df = pd.read_parquet('data.parquet')
df = pd.read_sql('SELECT * FROM table', connection)   # requires an open connection/engine

# Basic data exploration
df.head(10)          # First 10 rows
df.info()            # Data types and memory usage
df.describe()        # Statistical summary
df.shape             # Dimensions (rows, columns)
df.columns.tolist()  # Column names
df.dtypes            # Data types per column

# Data cleaning
df.dropna()                 # Drop rows with null values
df.ffill()                  # Forward-fill nulls (fillna(method='ffill') is deprecated)
df.drop_duplicates()        # Remove duplicate rows
df[df['column'].notna()]    # Filter non-null values

🔗 Database Connections

SQLAlchemy Examples:

from sqlalchemy import create_engine
import pandas as pd

# PostgreSQL connection
engine = create_engine(
    'postgresql://user:password@host:port/database'
)

# MySQL connection
engine = create_engine(
    'mysql+pymysql://user:password@host:port/database'
)

# Read data
df = pd.read_sql_query("""
    SELECT customer_id, order_date, total_amount
    FROM orders
    WHERE order_date >= '2024-01-01'
""", engine)

# Write data
df.to_sql('processed_orders', engine,
          if_exists='replace', index=False,
          chunksize=10000)

📊 Data Transformation

Advanced Pandas Operations:

# Group by operations
result = df.groupby('category').agg({
    'sales': ['sum', 'mean', 'count'],
    'profit': 'sum'
}).round(2)

# Window functions
df['running_total'] = df.groupby('customer_id')['amount'].cumsum()
df['row_number'] = df.groupby('category').cumcount() + 1

# Date operations
df['order_date'] = pd.to_datetime(df['order_date'])
df['year'] = df['order_date'].dt.year
df['month'] = df['order_date'].dt.month
df['weekday'] = df['order_date'].dt.day_name()

# Pivot operations
pivot_df = df.pivot_table(
    values='sales',
    index='product',
    columns='month',
    aggfunc='sum'
)

⚡ Apache Spark & PySpark
🚀 Spark Session & DataFrames

PySpark Basics:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, count

# Create Spark session
spark = SparkSession.builder \
    .appName("DataEngineering") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Read data
df = spark.read.csv("file.csv", header=True, inferSchema=True)
df = spark.read.json("file.json")
df = spark.read.parquet("file.parquet")

# Basic operations
df.show(20)           # Display first 20 rows
df.count()            # Count rows
df.columns            # Column names
df.printSchema()      # Schema information
df.describe().show()  # Statistical summary

🔄 Transformations & Actions

Common PySpark Operations:

from pyspark.sql.functions import row_number   # needed for the window example below

# Transformations (lazy evaluation)
df_filtered = df.filter(col("age") > 25)
df_selected = df.select("name", "age", "salary")
df_grouped = df.groupBy("department").agg(
    avg("salary").alias("avg_salary"),
    count("*").alias("employee_count")
)

# Joins
result = df1.join(df2, df1.id == df2.user_id, "inner")

# Window functions
from pyspark.sql.window import Window
window = Window.partitionBy("department").orderBy("salary")
df_ranked = df.withColumn("rank",
                          row_number().over(window))

# Actions (trigger computation)
df.collect()                  # Collect all rows to the driver
df.take(10)                   # Take first 10 rows
df.write.parquet("output/")   # Write to storage

⚙️ Performance Optimization

Partitioning Strategies:
• Partition by frequently filtered columns
• Use date/time for time-series data
• Avoid small files (< 128 MB)

Caching:
• Cache frequently accessed DataFrames
• Use appropriate storage levels
• Unpersist when no longer needed

Join Optimization:
• Broadcast small tables (< 10MB)
• Use bucketing for large joins
• Prefer sort-merge joins for large datasets
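These guidelines translate to a few lines of PySpark. A minimal sketch, assuming an existing `spark` session and hypothetical `facts` and `dims` DataFrames:

from pyspark import StorageLevel
from pyspark.sql.functions import broadcast

# Partitioning: write time-series data partitioned by date, with fewer, larger files
(facts
 .repartition("event_date")
 .write.mode("overwrite")
 .partitionBy("event_date")
 .parquet("s3a://lake/silver/facts/"))        # hypothetical path

# Caching: keep a frequently reused DataFrame in memory, release it when done
hot = facts.filter("event_date >= '2024-01-01'").persist(StorageLevel.MEMORY_AND_DISK)
hot.count()          # action that materializes the cache
# ... downstream jobs reuse `hot` ...
hot.unpersist()

# Join optimization: broadcast the small dimension table to avoid a shuffle
joined = facts.join(broadcast(dims), on="product_id", how="left")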

🗄️ SQL for Data Engineering
🔍 Advanced Query Patterns

Complex SQL Examples:

-- CTEs for complex logic
WITH monthly_sales AS (
    SELECT
        DATE_TRUNC('month', order_date) AS month,
        SUM(amount) AS total_sales,
        COUNT(*) AS order_count
    FROM orders
    WHERE order_date >= '2024-01-01'
    GROUP BY 1
),
growth_calc AS (
    SELECT *,
        LAG(total_sales) OVER (ORDER BY month) AS prev_month_sales,
        (total_sales - LAG(total_sales) OVER (ORDER BY month)) /
            LAG(total_sales) OVER (ORDER BY month) * 100 AS growth_rate
    FROM monthly_sales
)
SELECT * FROM growth_calc WHERE growth_rate IS NOT NULL;

-- Recursive CTE for hierarchies
WITH RECURSIVE employee_hierarchy AS (
    SELECT id, name, manager_id, 1 AS level
    FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, e.manager_id, eh.level + 1
    FROM employees e
    JOIN employee_hierarchy eh ON e.manager_id = eh.id
)
SELECT * FROM employee_hierarchy ORDER BY level, name;

📊 Window Functions

Window Function Examples:

-- Ranking functions
SELECT
    employee_name,
    department,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank,
    DENSE_RANK() OVER (ORDER BY salary DESC) AS overall_rank,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY hire_date) AS hire_order
FROM employees;

-- Aggregate window functions
SELECT
    order_date,
    daily_sales,
    SUM(daily_sales) OVER (ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7_day,
    AVG(daily_sales) OVER (ORDER BY order_date
        ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_30_day_avg,
    daily_sales - LAG(daily_sales) OVER (ORDER BY order_date) AS day_over_day
FROM daily_sales_summary;

-- Percentiles and distribution
SELECT
    product_id,
    revenue,
    NTILE(4) OVER (ORDER BY revenue) AS quartile,
    PERCENT_RANK() OVER (ORDER BY revenue) AS percentile_rank
FROM product_revenue;

🔧 Data Quality Checks

SQL Data Quality Queries:

-- Null value analysis
SELECT
    'customer_id' AS column_name,
    COUNT(*) AS total_rows,
    COUNT(customer_id) AS non_null_count,
    COUNT(*) - COUNT(customer_id) AS null_count,
    ROUND(100.0 * (COUNT(*) - COUNT(customer_id)) / COUNT(*), 2) AS null_percentage
FROM orders
UNION ALL
SELECT
    'email',
    COUNT(*),
    COUNT(email),
    COUNT(*) - COUNT(email),
    ROUND(100.0 * (COUNT(*) - COUNT(email)) / COUNT(*), 2)
FROM orders;

-- Duplicate detection
SELECT email, COUNT(*) AS duplicate_count
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;

-- Data freshness check
SELECT
    MAX(created_at) AS latest_record,
    CURRENT_TIMESTAMP - MAX(created_at) AS data_age
FROM transactions;

☁️ Cloud Data Engineering
🟡 AWS Data Services

AWS CLI Commands:

# S3 operations
aws s3 cp data.csv s3://my-bucket/raw-data/
aws s3 sync ./local-folder s3://my-bucket/batch-data/
aws s3 ls s3://my-bucket/processed/ --recursive

# Glue job execution
aws glue start-job-run --job-name my-etl-job \
    --arguments='--input_path=s3://input/,--output_path=s3://output/'

# EMR cluster management
aws emr create-cluster \
    --name "Data-Processing-Cluster" \
    --applications Name=Spark Name=Hadoop Name=Hive \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --bootstrap-actions Path=s3://my-bucket/bootstrap.sh

# Redshift data loading
aws redshift-data execute-statement \
    --cluster-identifier my-cluster \
    --database dev \
    --sql "COPY table_name FROM 's3://bucket/data/'
           IAM_ROLE 'arn:aws:iam::account:role/RedshiftRole'
           CSV IGNOREHEADER 1;"

🔵 Google Cloud Platform

GCP CLI Commands:

# BigQuery operations
bq query --use_legacy_sql=false \
    "SELECT COUNT(*) FROM \`project.dataset.table\`
     WHERE DATE(created_at) = CURRENT_DATE()"

bq load --source_format=CSV \
    --skip_leading_rows=1 \
    dataset.table gs://bucket/data.csv

# Dataflow pipeline
gcloud dataflow jobs run my-pipeline \
    --gcs-location gs://bucket/templates/pipeline \
    --region us-central1 \
    --parameters inputTopic=projects/proj/topics/input,\
outputTable=proj:dataset.table

# Cloud Storage operations
gsutil cp gs://source-bucket/* gs://dest-bucket/
gsutil -m rsync -r ./local-dir gs://bucket/folder/

# Dataproc cluster
gcloud dataproc clusters create my-cluster \
    --zone us-central1-b \
    --num-masters 1 \
    --num-workers 3 \
    --worker-machine-type n1-standard-4

🔷 Azure Data Platform

Core Services:
Azure Data Factory: ETL/ELT orchestration
Azure Synapse: Analytics service (DW + Spark)
Azure Data Lake: Scalable data storage
Azure Databricks: Apache Spark platform
Azure Stream Analytics: Real-time processing

Integration Pattern:
Data Lake → Data Factory → Synapse → Power BI

🏗️ Data Pipeline Architecture Patterns
📊 Modern Data Architecture

Bronze Layer (Raw)

• Ingest data in original format
• Minimal transformation
• Data lake storage (Parquet, JSON)
• Preserve data lineage and audit trail
• Examples: Raw logs, API responses, file uploads

Silver Layer (Refined)

• Cleaned and validated data
• Standardized schemas
• Data quality checks applied
• Deduplication and error handling
• Examples: Cleaned customer data, validated transactions

Gold Layer (Curated)

• Business-ready datasets
• Aggregated and enriched data
• Optimized for analytics
• Data mart and dimensional modeling
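
A condensed PySpark sketch of data flowing through the three layers, assuming an existing `spark` session; paths, table names, and quality rules are illustrative only:

from pyspark.sql import functions as F

# Bronze: land the raw feed as-is, adding only ingestion metadata
bronze = (spark.read.json("s3a://lake/landing/orders/")        # hypothetical source
          .withColumn("_ingested_at", F.current_timestamp()))
bronze.write.mode("append").parquet("s3a://lake/bronze/orders/")

# Silver: enforce types, deduplicate, apply basic quality rules
silver = (spark.read.parquet("s3a://lake/bronze/orders/")
          .withColumn("order_date", F.to_date("order_date"))
          .dropDuplicates(["order_id"])
          .filter(F.col("amount") > 0))
silver.write.mode("overwrite").parquet("s3a://lake/silver/orders/")

# Gold: aggregate into a business-ready, analytics-friendly table
gold = (silver.groupBy("order_date", "product_id")
        .agg(F.sum("amount").alias("revenue"),
             F.count("*").alias("order_count")))
gold.write.mode("overwrite").partitionBy("order_date").parquet("s3a://lake/gold/daily_revenue/")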