📊 Data Analyst Cheat Sheet
Complete guide to data analysis techniques, tools, statistical methods, and visualization best practices
Types of Data Analysis
Descriptive Analytics:
• What happened? Historical data summary
• Reports, dashboards, KPI tracking
Diagnostic Analytics:
• Why did it happen? Root cause analysis
• Correlation analysis, drill-down reports
Predictive Analytics:
• What will happen? Future forecasting
• Statistical modeling, machine learning
Prescriptive Analytics:
• What should we do? Optimization
• Decision support, recommendation engines
Data Types & Measurement
Quantitative Data:
• Discrete: Countable values (customers, orders)
• Continuous: Measurable values (revenue, time)
Qualitative Data:
• Nominal: Categories without order (color, brand)
• Ordinal: Categories with order (rating, size)
Measurement Scales:
• Ratio: True zero point (age, income)
• Interval: Equal intervals, no true zero (temperature)
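A quick pandas sketch of how these distinctions show up in code; the columns and values are made up for illustration:
import pandas as pd
df = pd.DataFrame({
    'brand': ['Acme', 'Globex', 'Acme'],      # nominal: categories with no order
    'size': ['S', 'L', 'M'],                  # ordinal: categories with an order
    'orders': [3, 1, 2],                      # discrete count
    'revenue': [120.50, 45.00, 88.25]         # continuous, ratio scale (true zero)
})
# Encode the ordinal column so sorting and comparisons respect the order
df['size'] = pd.Categorical(df['size'], categories=['S', 'M', 'L'], ordered=True)
print(df['size'].max())  # L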
Analytics Framework
1. Business Understanding: Define objectives
2. Data Understanding: Explore and assess data
3. Data Preparation: Clean and transform
4. Modeling: Apply analytical techniques
5. Evaluation: Assess results against objectives
6. Deployment: Implement findings
Iterative process with feedback loops between stages
Descriptive Statistics
Central Tendency:
• Mean: Average value (sensitive to outliers)
• Median: Middle value (robust to outliers)
• Mode: Most frequent value
Variability:
• Range: Max – Min
• Variance: Average squared deviation
• Standard Deviation: Square root of variance
• IQR: 75th percentile – 25th percentile
Distribution Shape:
• Skewness: Asymmetry measure
• Kurtosis: Tail heaviness measure
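A minimal pandas sketch of these measures on a made-up sales series:
import pandas as pd
sales = pd.Series([120, 95, 130, 250, 110, 105, 95, 400])
mean, median, mode = sales.mean(), sales.median(), sales.mode().iloc[0]
value_range = sales.max() - sales.min()
variance, std_dev = sales.var(), sales.std()        # sample versions (ddof=1), matching VAR.S/STDEV.S
iqr = sales.quantile(0.75) - sales.quantile(0.25)
skewness, kurt = sales.skew(), sales.kurtosis()     # kurtosis() reports excess kurtosis
print(f'mean={mean:.1f}, median={median}, IQR={iqr}, skew={skewness:.2f}')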
Inferential Statistics
Hypothesis Testing:
• Null hypothesis (H₀) vs Alternative (H₁)
• p-value: Probability of observing data if H₀ true
• Significance level (α): Threshold (usually 0.05)
• Type I error: False positive
• Type II error: False negative
Confidence Intervals:
• Range of plausible values for parameter
• 95% CI: if the analysis were repeated many times, about 95% of such intervals would contain the true value
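A small scipy sketch of a 95% confidence interval for a mean (sample values are made up):
import numpy as np
from scipy import stats
sample = np.array([102, 98, 110, 95, 105, 99, 103, 97, 108, 101])
mean = sample.mean()
sem = stats.sem(sample)                               # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f'mean={mean:.1f}, 95% CI=({ci_low:.1f}, {ci_high:.1f})')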
Common Statistical Tests
Comparing Means:
• One-sample t-test: Sample vs population
• Two-sample t-test: Compare two groups
• ANOVA: Compare multiple groups
Relationships:
• Correlation: Linear relationship strength
• Chi-square: Independence of categorical variables
• Regression: Predict dependent variable
Non-parametric Tests:
• Mann-Whitney U: Compare two independent groups (rank-based alternative to the t-test)
• Kruskal-Wallis: Multiple group comparison
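A quick scipy sketch of these non-parametric tests on made-up group data:
from scipy import stats
group_a = [12, 15, 14, 10, 18, 11]
group_b = [22, 19, 24, 17, 20, 25]
group_c = [13, 16, 12, 19, 14, 15]
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
h_stat, p_h = stats.kruskal(group_a, group_b, group_c)
print(f'Mann-Whitney U p={p_u:.4f}, Kruskal-Wallis p={p_h:.4f}')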
Essential Excel Functions
=AVERAGE(A1:A100) // Mean
=MEDIAN(A1:A100) // Median
=MODE.SNGL(A1:A100) // Mode
=STDEV.S(A1:A100) // Sample standard deviation
=VAR.S(A1:A100) // Sample variance
=QUARTILE(A1:A100, 1) // First quartile
=PERCENTILE(A1:A100, 0.95) // 95th percentile
=CORREL(A1:A100, B1:B100) // Correlation coefficient
// Conditional Statistics
=AVERAGEIF(B1:B100, ">100", A1:A100) // Conditional average
=COUNTIFS(A1:A100, ">50", B1:B100, "Yes") // Multiple criteria count
=SUMIFS(C1:C100, A1:A100, "Product A", B1:B100, ">100")
// Lookup and Reference
=VLOOKUP(E2, A1:C100, 3, FALSE) // Vertical lookup
=INDEX(C1:C100, MATCH(E2, A1:A100, 0)) // Index-match
=XLOOKUP(E2, A1:A100, C1:C100) // Modern lookup (Excel 365)
Pivot Tables & Analysis
Components:
• Rows: Categories to group by
• Columns: Additional grouping dimension
• Values: Metrics to calculate
• Filters: Data subset controls
Common Aggregations:
• Sum, Average, Count, Max, Min
• Percentage of total, running totals
• Year-over-year growth calculations
Advanced Features:
• Calculated fields and items
• Grouping dates and numbers
• Pivot charts for visualization
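For reference, a rough pandas equivalent of a pivot table with a percentage-of-total calculation (sample data is made up):
import pandas as pd
df = pd.DataFrame({
    'product': ['A', 'A', 'B', 'B', 'C'],
    'month': ['Jan', 'Feb', 'Jan', 'Feb', 'Jan'],
    'sales': [100, 120, 80, 90, 60]
})
pivot = pd.pivot_table(df, values='sales', index='product',
                       columns='month', aggfunc='sum', margins=True)  # margins adds row/column totals
pct_of_total = df.groupby('product')['sales'].sum() / df['sales'].sum() * 100
print(pivot)
print(pct_of_total.round(1))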
Data Visualization in Excel
Chart Types:
• Column/Bar: Compare categories
• Line: Show trends over time
• Scatter: Examine relationships
• Pie: Show parts of whole (limit to 5-7 slices)
• Area: Show cumulative totals
• Combo: Multiple metrics with different scales
Best Practices:
• Clear titles and axis labels
• Consistent color scheme
• Remove chart junk (unnecessary elements)
• Use data labels when helpful
Essential SQL Queries
-- Data exploration
SELECT COUNT(*) as total_records,
COUNT(DISTINCT customer_id) as unique_customers,
MIN(order_date) as earliest_order,
MAX(order_date) as latest_order
FROM orders;
-- Summary statistics
SELECT
AVG(order_value) as avg_order_value,
MEDIAN(order_value) as median_order_value, -- MEDIAN() is engine-specific (e.g., Snowflake, Oracle); use PERCENTILE_CONT(0.5) in PostgreSQL
STDDEV(order_value) as std_dev,
MIN(order_value) as min_value,
MAX(order_value) as max_value,
COUNT(*) as total_orders
FROM orders
WHERE order_date >= '2024-01-01';
-- Percentile analysis
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY order_value) as q1,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY order_value) as median,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY order_value) as q3,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY order_value) as p95
FROM orders;
Analytical SQL Functions
-- Running totals and moving averages
SELECT
order_date,
daily_revenue,
SUM(daily_revenue) OVER (ORDER BY order_date) as running_total,
AVG(daily_revenue) OVER (ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as moving_7_day_avg
FROM daily_sales;
-- Ranking and percentiles
SELECT
customer_id,
total_spent,
RANK() OVER (ORDER BY total_spent DESC) as spending_rank,
NTILE(10) OVER (ORDER BY total_spent) as decile,
PERCENT_RANK() OVER (ORDER BY total_spent) as percentile_rank
FROM customer_totals;
-- Period-over-period analysis
SELECT
month,
revenue,
LAG(revenue, 1) OVER (ORDER BY month) as prev_month_revenue,
LAG(revenue, 12) OVER (ORDER BY month) as same_month_last_year,
(revenue - LAG(revenue, 1) OVER (ORDER BY month)) /
LAG(revenue, 1) OVER (ORDER BY month) * 100 as mom_growth,
(revenue - LAG(revenue, 12) OVER (ORDER BY month)) /
LAG(revenue, 12) OVER (ORDER BY month) * 100 as yoy_growth
FROM monthly_revenue;
Advanced Analysis Patterns
WITH first_purchase AS (
SELECT customer_id,
MIN(DATE_TRUNC('month', order_date)) as cohort_month
FROM orders
GROUP BY customer_id
),
customer_activity AS (
SELECT o.customer_id,
fp.cohort_month,
DATE_TRUNC('month', o.order_date) as activity_month,
EXTRACT(EPOCH FROM (DATE_TRUNC('month', o.order_date) - fp.cohort_month)) /
(60*60*24*30) as period_number -- approximate months since first purchase
FROM orders o
JOIN first_purchase fp ON o.customer_id = fp.customer_id
)
SELECT cohort_month,
period_number,
COUNT(DISTINCT customer_id) as customers,
ROUND(100.0 * COUNT(DISTINCT customer_id) /
FIRST_VALUE(COUNT(DISTINCT customer_id))
OVER (PARTITION BY cohort_month ORDER BY period_number), 2) as retention_rate
FROM customer_activity
GROUP BY cohort_month, period_number
ORDER BY cohort_month, period_number;
Pandas Fundamentals
import pandas as pd
import numpy as np
# Data loading and exploration
df = pd.read_csv('data.csv')
df = pd.read_excel('data.xlsx', sheet_name='Sales')
df.head() # First 5 rows
df.info() # Data types and memory usage
df.describe() # Statistical summary
df.shape # Dimensions
df.columns.tolist() # Column names
# Data cleaning
df.dropna() # Remove missing values
df.fillna(df.mean(numeric_only=True)) # Fill numeric columns with their mean
df.drop_duplicates() # Remove duplicates
df['column'].unique() # Unique values
df['column'].value_counts() # Frequency counts
# Data transformation
df['new_column'] = df['col1'] + df['col2'] # Create new column
df.groupby('category').agg({ # Group by operations
    'sales': ['sum', 'mean', 'count'],
    'profit': 'sum'
})
df.pivot_table(values='sales', index='product',
               columns='month', aggfunc='sum') # Pivot table
Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
# Basic plots
plt.figure(figsize=(10, 6))
plt.plot(df['date'], df['sales']) # Line plot
plt.bar(df['category'], df['sales']) # Bar chart
plt.scatter(df['advertising'], df['sales']) # Scatter plot
plt.hist(df['sales'], bins=20) # Histogram
# Seaborn statistical plots
sns.boxplot(x='category', y='sales', data=df) # Box plot
sns.violinplot(x='category', y='sales', data=df) # Violin plot
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm') # Correlation matrix (numeric columns only)
sns.pairplot(df) # Pairwise relationships
sns.regplot(x='advertising', y='sales', data=df) # Regression plot
# Customization
plt.title('Sales Analysis', fontsize=16)
plt.xlabel('Category', fontsize=12)
plt.ylabel('Sales ($)', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Statistical Analysis
from scipy import stats
# Descriptive statistics
stats.describe(df['sales']) # Complete description
df['sales'].skew() # Skewness
df['sales'].kurtosis() # Kurtosis
# Hypothesis testing
# One-sample t-test
t_stat, p_value = stats.ttest_1samp(df['sales'], 100000)
print(f"t-statistic: {t_stat:.4f}, p-value: {p_value:.4f}")
# Two-sample t-test
group_a = df[df['group'] == 'A']['sales']
group_b = df[df['group'] == 'B']['sales']
t_stat, p_value = stats.ttest_ind(group_a, group_b)
# Chi-square test for independence
contingency_table = pd.crosstab(df['category'], df['region'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
# Correlation analysis
correlation, p_value = stats.pearsonr(df['advertising'], df['sales'])
spearman_corr, p_value = stats.spearmanr(df['advertising'], df['sales'])
# ANOVA
f_stat, p_value = stats.f_oneway(
    df[df['category'] == 'A']['sales'],
    df[df['category'] == 'B']['sales'],
    df[df['category'] == 'C']['sales']
)
Tableau Essentials
Core Concepts:
• Dimensions: Categorical data (qualitative)
• Measures: Numerical data (quantitative)
• Calculated Fields: Custom formulas
• Parameters: Interactive user inputs
Common Calculations:
• Running totals: RUNNING_SUM(SUM([Sales]))
• Percentage of total: SUM([Sales])/TOTAL(SUM([Sales]))
• Year-over-year growth: (SUM([Sales]) - LOOKUP(SUM([Sales]), -12))/LOOKUP(SUM([Sales]), -12)
• Rank: RANK(SUM([Sales]), 'desc')
Best Practices:
• Use appropriate chart types for data
• Consistent color schemes and formatting
• Clear titles and labels
Power BI Fundamentals
Core Concepts:
• Calculated Columns: Row-level calculations
• Measures: Aggregated calculations
• Tables: Virtual table functions
Common DAX Functions:
• Total Sales = SUM(Sales[Amount])
• Sales YTD = TOTALYTD([Total Sales], Calendar[Date])
• Previous Year = CALCULATE([Total Sales], SAMEPERIODLASTYEAR(Calendar[Date]))
• Growth % = DIVIDE([Total Sales] - [Previous Year], [Previous Year])
Key Features:
• Power Query for data transformation
• Interactive dashboards and reports
• Natural language Q&A
• Mobile-optimized views
Looker & Google Data Studio
Looker:
• Model layer for consistent metrics
• SQL-based modeling approach
• Reusable dimensions and measures
• Git-based version control
Google Data Studio:
• Free Google visualization tool
• Easy Google Sheets integration
• Calculated fields and parameters
• Collaborative report sharing
Use Cases:
• Marketing performance dashboards
• Executive reporting
• Self-service analytics
Experimental Design
Core Components:
• Hypothesis: Clear, testable prediction
• Control Group: Baseline/current experience
• Treatment Group: New variant being tested
• Random Assignment: Unbiased group allocation
• Sample Size: Adequate power to detect effect
Success Metrics:
• Primary: Main KPI you’re trying to impact
• Secondary: Supporting metrics
• Guardrail: Metrics that shouldn’t be harmed
Common Pitfalls:
• Insufficient sample size
• Multiple testing without correction
• Stopping tests early
• Selection bias in assignment
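A minimal sketch of unbiased random assignment (the user table and 50/50 split are illustrative):
import numpy as np
import pandas as pd
rng = np.random.default_rng(seed=42)                  # fixed seed for reproducibility
users = pd.DataFrame({'user_id': range(1, 10001)})
users['group'] = rng.choice(['control', 'treatment'], size=len(users))
print(users['group'].value_counts())                  # roughly even split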
Statistical Power & Sample Size
import math
import scipy.stats as stats
from statsmodels.stats.power import NormalIndPower, TTestIndPower
from statsmodels.stats.proportion import proportion_effectsize
# For proportions (conversion rates)
def sample_size_proportion(p1, p2, alpha=0.05, power=0.8):
    effect_size = proportion_effectsize(p1, p2)   # Cohen's h for the two rates
    n = NormalIndPower().solve_power(effect_size=effect_size, alpha=alpha,
                                     power=power, alternative='two-sided')
    return math.ceil(n)                           # required sample size per group
# Example: Current conversion 5%, want to detect 1% increase
current_rate = 0.05
target_rate = 0.06
sample_size = sample_size_proportion(current_rate, target_rate)
print(f"Required sample size per group: {sample_size}")
# For continuous variables (means)
def sample_size_ttest(mu1, mu2, sigma, alpha=0.05, power=0.8):
    effect_size = abs(mu1 - mu2) / sigma          # Cohen's d
    n = 2 * (stats.norm.ppf(1 - alpha/2) + stats.norm.ppf(power))**2 / effect_size**2
    return math.ceil(n)                           # per group
# Power analysis - what effect size can we detect with 1,000 users per group?
detectable_effect = TTestIndPower().solve_power(effect_size=None, nobs1=1000,
                                                alpha=0.05, power=0.8)
Results Analysis
# Chi-square test for conversion rates
from scipy.stats import chi2_contingency
# Test data
control_conversions = 250
control_visitors = 5000
treatment_conversions = 280
treatment_visitors = 5000
# Create contingency table
obs = [[control_conversions, control_visitors - control_conversions],
       [treatment_conversions, treatment_visitors - treatment_conversions]]
# Perform chi-square test
chi2, p_value, dof, expected = chi2_contingency(obs)
# Calculate conversion rates and confidence intervals
control_rate = control_conversions / control_visitors
treatment_rate = treatment_conversions / treatment_visitors
lift = (treatment_rate - control_rate) / control_rate * 100
print(f"Control conversion rate: {control_rate:.3f}")
print(f"Treatment conversion rate: {treatment_rate:.3f}")
print(f"Relative lift: {lift:.1f}%")
print(f"Statistical significance (p-value): {p_value:.4f}")
# Confidence interval for difference in proportions
import statsmodels.stats.proportion as smp
ci_lower, ci_upper = smp.confint_proportions_2indep(
control_conversions, control_visitors,
treatment_conversions, treatment_visitors
)
Cohort Analysis
Types of Cohorts:
• Time-based: Users who signed up in same period
• Behavioral: Users who performed specific action
• Size-based: Users grouped by spend/usage
Key Metrics:
• Retention rates by cohort month
• Revenue per cohort over time
• User lifecycle patterns
Applications:
• Product-market fit assessment
• Feature impact measurement
• Customer lifetime value prediction
• Churn analysis and prevention
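A compact pandas version of the cohort retention logic shown earlier in SQL; it assumes a hypothetical orders file with customer_id and order_date columns:
import pandas as pd
orders = pd.read_csv('orders.csv', parse_dates=['order_date'])  # hypothetical file/schema
orders['cohort_month'] = orders.groupby('customer_id')['order_date'].transform('min').dt.to_period('M')
orders['activity_month'] = orders['order_date'].dt.to_period('M')
orders['period_number'] = (orders['activity_month'] - orders['cohort_month']).apply(lambda p: p.n)
cohorts = (orders.groupby(['cohort_month', 'period_number'])['customer_id']
                 .nunique().unstack(fill_value=0))
retention = cohorts.div(cohorts[0], axis=0).round(3)   # each cohort divided by its period-0 size
print(retention.head())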
Customer Lifetime Value (CLV)
# Simple CLV calculation (assumes `months` in the observation window and `churn_rate` are already defined)
avg_order_value = df.groupby('customer_id')['order_value'].mean()
purchase_frequency = df.groupby('customer_id')['order_id'].count() / months
customer_lifespan = 1 / churn_rate # expected lifespan in months
simple_clv = avg_order_value * purchase_frequency * customer_lifespan
# Granular CLV with cohort analysis
import numpy as np
def calculate_clv_cohort(cohort_data, discount_rate=0.1):
    """
    cohort_data: DataFrame with retention rates by period (assumes a default 0..n RangeIndex)
    discount_rate: Monthly discount rate for NPV
    """
    clv = 0
    for period, retention_rate in enumerate(cohort_data['retention_rate']):
        period_revenue = cohort_data.loc[period, 'revenue_per_user']
        discounted_value = period_revenue * retention_rate / ((1 + discount_rate) ** period)
        clv += discounted_value
    return clv
# Predictive CLV using regression (customer_features, historical_clv, new_customers are assumed to exist)
from sklearn.linear_model import LinearRegression
features = ['recency', 'frequency', 'monetary', 'avg_days_between_orders']
model = LinearRegression()
model.fit(customer_features[features], historical_clv)
predicted_clv = model.predict(new_customers[features])
RFM Analysis
import pandas as pd
from datetime import datetime, timedelta
def rfm_analysis(df, customer_id='customer_id', order_date='order_date', revenue='revenue'):
    # Calculate Recency, Frequency, Monetary
    snapshot_date = df[order_date].max() + timedelta(days=1)
    rfm = df.groupby(customer_id).agg(
        Recency=(order_date, lambda x: (snapshot_date - x.max()).days),
        Frequency=(order_date, 'count'),
        Monetary=(revenue, 'sum')
    ).round(2)
    # Create RFM scores (1-5 scale); rank first to avoid duplicate bin edges
    rfm['R_Score'] = pd.qcut(rfm['Recency'].rank(method='first'), 5,
                             labels=[5, 4, 3, 2, 1]).astype(int)
    rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 5,
                             labels=[1, 2, 3, 4, 5]).astype(int)
    rfm['M_Score'] = pd.qcut(rfm['Monetary'].rank(method='first'), 5,
                             labels=[1, 2, 3, 4, 5]).astype(int)
    # Combine scores
    rfm['RFM_Score'] = rfm['R_Score'].astype(str) + \
                       rfm['F_Score'].astype(str) + \
                       rfm['M_Score'].astype(str)
    # Customer segments
    segment_map = {
        r'[4-5][4-5][4-5]': 'Champions',
        r'[3-5][2-5][3-5]': 'Loyal Customers',
        r'[3-5][1-3][1-3]': 'Potential Loyalists',
        r'[4-5][0-1][0-1]': 'New Customers',
        r'[3-4][0-1][0-1]': 'Promising',
        r'[2-3][2-3][2-3]': 'Customers Needing Attention',
        r'[2-3][0-2][0-2]': 'About to Sleep',
        r'[0-2][2-5][2-5]': 'At Risk',
        r'[0-1][4-5][4-5]': "Can't Lose Them",
        r'[1-2][1-2][1-2]': 'Hibernating',
        r'[0-2][0-2][0-2]': 'Lost'
    }
    rfm['Segment'] = rfm['RFM_Score'].replace(segment_map, regex=True)
    return rfm
Key Business Metrics
• Revenue: Total Revenue, MRR/ARR, ARPU, Revenue Growth Rate, Revenue per Customer
• Customer: CAC, CLV, Churn Rate, Retention Rate, NPS, Customer Satisfaction
• Growth: User Growth Rate, Market Share, Viral Coefficient, Organic vs Paid Growth
• Conversion: Conversion Rates, Funnel Analysis, Time to Value, Feature Adoption
• Engagement: DAU/MAU, Session Duration, Page Views, Bounce Rate, Stickiness Ratio
• Product: Feature Usage, User Flows, A/B Test Results, Product-Market Fit Score
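A sketch of computing a few of these metrics from a hypothetical orders table (customer_id, order_date, revenue):
import pandas as pd
orders = pd.read_csv('orders.csv', parse_dates=['order_date'])  # hypothetical file/schema
total_revenue = orders['revenue'].sum()
arpu = total_revenue / orders['customer_id'].nunique()          # average revenue per user
monthly = orders.groupby(orders['order_date'].dt.to_period('M'))['revenue'].sum()
revenue_growth = monthly.pct_change() * 100                     # month-over-month growth %
daily_active = orders.groupby(orders['order_date'].dt.date)['customer_id'].nunique()
monthly_active = orders.groupby(orders['order_date'].dt.to_period('M'))['customer_id'].nunique()
stickiness = daily_active.mean() / monthly_active.mean()        # DAU/MAU ratio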
Chart Selection Guide
Comparisons:
• Bar Charts: Compare categories
• Column Charts: Compare values over time
• Horizontal Bar: Long category names
Trends & Time Series:
• Line Charts: Continuous data over time
• Area Charts: Cumulative values
• Slope Charts: Before/after comparisons
Relationships:
• Scatter Plots: Correlation between variables
• Bubble Charts: 3-dimensional relationships
Composition:
• Pie Charts: Parts of whole (max 7 categories)
• Stacked Bar: Subcategory composition
• Treemap: Hierarchical data
Design Principles
Clarity:
• Clear, descriptive titles and labels
• Appropriate scale and axes
• Remove unnecessary elements (chart junk)
• Use white space effectively
Color Usage:
• Consistent color scheme
• Use color to highlight key insights
• Consider colorblind accessibility
• Limit to 6-8 colors maximum
Typography:
• Readable font sizes (minimum 10pt)
• Consistent font family
• Proper hierarchy with font weights
Interactivity:
• Tooltips for additional context
• Filters for data exploration
• Drill-down capabilities
Dashboard Design
Layout:
• Most important metrics at top-left
• Logical flow and grouping
• Consistent spacing and alignment
• Mobile-responsive design
Information Hierarchy:
• Executive summary at top
• Supporting details below
• Use size and position to show importance
Performance:
• Optimize load times
• Use aggregated data when possible
• Progressive loading for complex dashboards
User Experience:
• Intuitive navigation
• Contextual help and explanations
• Export and sharing capabilities
Tools & Technologies
• Spreadsheets: Excel, Google Sheets, Numbers – fundamental analysis tools
• Databases & Warehouses: MySQL, PostgreSQL, SQL Server, BigQuery, Snowflake
• Programming: Python, R, SAS – advanced statistical analysis
• Visualization: Tableau, Power BI, Looker, D3.js, matplotlib, ggplot2
• BI Platforms: Tableau, Power BI, Qlik Sense, Looker, Sisense
• A/B Testing: Optimizely, Google Optimize, Adobe Target, VWO
• Product & Web Analytics: Google Analytics, Adobe Analytics, Mixpanel, Amplitude
• Data Preparation: Alteryx, Dataiku, Knime, Trifacta, OpenRefine
Step-by-Step Analysis Framework
1. Define the Problem:
• Understand business context and objectives
• Define specific, measurable questions
• Identify key stakeholders and success criteria
• Set project timeline and deliverables
• Document assumptions and constraints
2. Collect & Assess Data:
• Identify relevant data sources
• Assess data quality and completeness
• Extract data using appropriate tools
• Document data lineage and definitions
• Validate data integrity and accuracy
3. Explore the Data:
• Perform initial data profiling
• Identify patterns, outliers, and anomalies
• Calculate descriptive statistics
• Create preliminary visualizations
• Formulate hypotheses based on observations
4. Clean & Prepare:
• Handle missing values appropriately
• Remove or correct erroneous data
• Standardize formats and units
• Create derived variables as needed
• Document all transformation steps
5. Analyze & Model:
• Apply appropriate analytical techniques
• Perform statistical tests and validations
• Build predictive models if required
• Validate results and check assumptions
• Interpret findings in business context
6. Develop Insights & Recommendations:
• Synthesize key findings and insights
• Develop actionable recommendations
• Quantify business impact where possible
• Address limitations and uncertainties
• Prepare compelling data story
7. Communicate & Follow Up:
• Create executive summary and detailed report
• Design effective visualizations
• Present findings to stakeholders
• Define success metrics for recommendations
• Plan follow-up analysis and monitoring
Data Quality Issues
Solutions:
• Implement data validation checks
• Create data quality scorecards
• Establish data governance processes
• Document data definitions and sources
• Use statistical methods to detect anomalies
• Collaborate with data engineering teams
• Always validate results against business logic
Unclear Requirements
Solutions:
• Ask clarifying questions upfront
• Document requirements and assumptions
• Create analysis plan with stakeholder approval
• Use iterative approach with regular check-ins
• Provide multiple scenarios and options
• Educate stakeholders on data limitations
• Set realistic expectations for deliverables
Time Constraints
Solutions:
• Prioritize high-impact analysis
• Use automated tools and templates
• Focus on key metrics and insights
• Communicate trade-offs clearly
• Build reusable analysis frameworks
• Invest in self-service analytics
• Manage stakeholder expectations proactively
Analysis Quality
• Version Control: Track changes in analysis
• Peer Review: Have colleagues check your work
• Statistical Rigor: Use appropriate tests and methods
• Validation: Cross-check results with multiple approaches
• Sensitivity Analysis: Test robustness of conclusions
• Ethical Considerations: Avoid bias and misrepresentation
Communication
• Tell a Story: Create narrative around data
• Focus on Insights: Not just data, but what it means
• Visual Design: Clear, effective charts and dashboards
• Action-Oriented: Provide specific recommendations
• Confidence Levels: Express uncertainty appropriately
• Follow Up: Track impact of recommendations
Continuous Improvement
• Industry Knowledge: Understand business domain
• Automation: Streamline repetitive tasks
• Collaboration: Work closely with stakeholders
• Feedback Loop: Learn from analysis outcomes
• Knowledge Sharing: Document and share learnings
• Experimentation: Test new approaches regularly
Technical Skills
• SQL Proficiency: Complex queries, window functions, CTEs
• Excel/Sheets: Advanced functions, pivot tables, macros
• Programming: Python/R for data analysis
• Visualization: Tableau, Power BI, or similar tools
• Database Knowledge: Understanding of data models
• A/B Testing: Experimental design and analysis
Business Acumen
• Business Metrics: KPIs and success measures
• Strategic Thinking: Connect data to business outcomes
• Process Understanding: How business operations work
• Stakeholder Management: Understanding different needs
• Project Management: Organizing and prioritizing work
• Change Management: Driving adoption of insights