Interview Questions Booklet – Machine Learning

Machine Learning — Interview Questions Booklet (50 Q&A)

Fundamentals & Math • Supervised & Unsupervised • Feature Engineering • Evaluation & Experimentation • Deep Learning • NLP/CV/Recs • MLOps & Monitoring • Ethics • Real-world Scenarios

Section 1 — ML Fundamentals

1) What is machine learning, and how does it differ from rule-based programming?

Answer: ML learns patterns from data to make predictions/decisions, whereas rule-based systems use hand-crafted logic; ML generalizes from examples and adapts with new data.

2) What are the main types of learning, and when would you use each?

Answer: Supervised learning predicts labels from labeled examples; unsupervised learning discovers structure in unlabeled data; semi-/self-supervised learning leverages large unlabeled sets alongside few labels; reinforcement learning handles sequential decisions driven by reward feedback.

3) What is the bias–variance trade-off, and why does it matter?

Answer: High bias underfits; high variance overfits. The goal is a model complex enough to capture signal but regularized to generalize to new data.

4) How do parametric and non-parametric models differ?

Answer: Parametric models assume a fixed form with finite parameters (e.g., logistic regression); non-parametric models grow with data (e.g., kNN, trees) and can fit more complex patterns.

5) How do discriminative and generative models compare?

Answer: Discriminative models learn p(y|x) directly for prediction (e.g., SVM, logistic regression); generative models learn the joint p(x,y) (e.g., Naive Bayes), which enables sampling and handling missing inputs.

Section 2 — Probability, Statistics & Linear Algebra

6) Where would you apply Bayes’ theorem in practical ML workflows?

Answer: In probabilistic classifiers, spam filtering, model calibration, posterior updates with priors, and combining signals from multiple sources.
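
As a minimal worked sketch with hypothetical numbers, the spam-filter case is a direct application of Bayes' rule in a few lines of Python:

    # Hypothetical spam example: prior and per-class likelihoods of a trigger word
    p_spam = 0.2
    p_word_given_spam = 0.6
    p_word_given_ham = 0.05

    # Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
    p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
    p_spam_given_word = p_word_given_spam * p_spam / p_word
    print(p_spam_given_word)   # 0.75, up from the 0.2 prior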

7) How do you detect and quantify overfitting during training?

Answer: Track train vs. validation error curves, use cross-validation, monitor generalization gap, and watch for performance divergence over epochs.
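
A minimal sketch of tracking the train-validation gap as model complexity grows, assuming scikit-learn and a synthetic dataset:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import validation_curve
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    depths = [2, 4, 6, 8, 12, 16]
    train_scores, val_scores = validation_curve(
        DecisionTreeClassifier(random_state=0), X, y,
        param_name="max_depth", param_range=depths, cv=5)

    for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
        # a widening gap between train and validation accuracy signals overfitting
        print(f"depth={d:2d}  train={tr:.3f}  val={va:.3f}  gap={tr - va:.3f}")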

8) How do L1 and L2 regularization differ in effect and use?

Answer: L1 induces sparsity and feature selection; L2 shrinks weights smoothly to reduce variance. Elastic Net blends both for correlated features.

9) What is PCA, and when should you use or avoid it?

Answer: PCA projects data onto orthogonal components that maximize variance; use it for compression and noise reduction; avoid it when the important structure is non-linear or when interpretability of the original features is critical.
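
A minimal scikit-learn sketch; note that PCA is scale-sensitive, so features are standardized first:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_iris(return_X_y=True)
    X_scaled = StandardScaler().fit_transform(X)

    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X_scaled)
    print(pca.explained_variance_ratio_)   # share of variance captured per component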

10) How do gradient descent variants (SGD, momentum, Adam) compare?

Answer: SGD is simple/stochastic; momentum accelerates along consistent gradients; Adam adapts per-parameter steps for faster convergence but may need warmup/decay.
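
A toy sketch of the three update rules on the one-dimensional objective f(w) = (w - 3)^2; the hyperparameters are illustrative:

    import numpy as np

    grad = lambda w: 2 * (w - 3.0)   # gradient of f(w) = (w - 3)^2

    # Plain SGD: step against the gradient
    w = 0.0
    for _ in range(100):
        w -= 0.1 * grad(w)

    # Momentum: accumulate a velocity term to accelerate consistent directions
    w_m, v = 0.0, 0.0
    for _ in range(100):
        v = 0.9 * v + grad(w_m)
        w_m -= 0.1 * v

    # Adam: per-parameter first/second moment estimates with bias correction
    w_a, m, s = 0.0, 0.0, 0.0
    b1, b2, lr, eps = 0.9, 0.999, 0.1, 1e-8
    for t in range(1, 101):
        g = grad(w_a)
        m = b1 * m + (1 - b1) * g
        s = b2 * s + (1 - b2) * g * g
        m_hat, s_hat = m / (1 - b1 ** t), s / (1 - b2 ** t)
        w_a -= lr * m_hat / (np.sqrt(s_hat) + eps)

    # SGD and momentum settle at 3.0; Adam gets close but can hover without lr decay
    print(w, w_m, w_a)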

Section 3 — Supervised Learning

11) How do ordinary least squares, Ridge, and Lasso regression differ?

Answer: OLS minimizes squared error; Ridge adds L2 penalty to reduce variance; Lasso adds L1 to zero out features, aiding interpretability.
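
A minimal scikit-learn comparison on synthetic data with only a few informative features; the penalty strengths are illustrative:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, LinearRegression, Ridge

    X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                           noise=10.0, random_state=0)

    for model in (LinearRegression(), Ridge(alpha=10.0), Lasso(alpha=5.0)):
        coef = model.fit(X, y).coef_
        # Lasso typically zeroes the uninformative features; Ridge only shrinks them
        print(type(model).__name__, "non-zero coefficients:",
              int(np.sum(np.abs(coef) > 1e-6)))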

12) When would you choose decision trees, random forests, or gradient boosting?

Answer: Trees for interpretability/baselines; forests for robust performance with minimal tuning; boosting (e.g., XGBoost) for top tabular accuracy with careful regularization.
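
A quick cross-validated comparison sketch, assuming scikit-learn and synthetic tabular data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=3000, n_features=25, n_informative=10,
                               random_state=0)
    models = {
        "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
        "forest": RandomForestClassifier(n_estimators=200, random_state=0),
        "boosting": GradientBoostingClassifier(random_state=0),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name:8s} accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")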

13) What is the kernel trick in SVMs, and why is it useful?

Answer: It implicitly maps inputs to high-dimensional spaces to separate non-linear data using kernels (RBF, polynomial) without explicit feature construction.
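
A minimal sketch contrasting a linear kernel with an RBF kernel on data a straight line cannot separate:

    from sklearn.datasets import make_moons
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
    for kernel in ("linear", "rbf"):
        acc = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
        # the RBF kernel separates the interleaved half-moons; a linear boundary cannot
        print(kernel, round(acc, 3))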

14) What are the strengths and weaknesses of k-nearest neighbors?

Answer: Simple, non-parametric, and competitive on small, low-dimensional datasets; it degrades in high dimensions, is expensive at prediction time on large datasets, and is sensitive to feature scaling and the choice of k.

15) How do you handle severe class imbalance in classification?

Answer: Use stratified CV, class weights, resampling (SMOTE/undersampling), threshold tuning, and metrics like PR-AUC, F1, and recall at K.
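
A minimal sketch combining class weights, stratified splitting, and PR-AUC, assuming scikit-learn; the 2% positive rate is illustrative:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score
    from sklearn.model_selection import train_test_split

    # roughly 2% positives to mimic severe imbalance
    X, y = make_classification(n_samples=20000, n_features=20, weights=[0.98],
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]
    print("PR-AUC:", round(average_precision_score(y_te, scores), 3))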

Section 4 — Unsupervised & Representation Learning

16) How do k-means and Gaussian Mixture Models differ, and when pick each?

Answer: k-means assumes spherical, equal-variance clusters with hard assignments; GMM models soft assignments and ellipsoidal clusters via covariances.
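
A minimal sketch on stretched (ellipsoidal) clusters where the two methods differ, assuming scikit-learn:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    # stretch the blobs so clusters are ellipsoidal rather than spherical
    X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)
    X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

    print("k-means hard labels:", km.labels_[:3])
    print("GMM soft assignments:\n", gmm.predict_proba(X)[:3].round(2))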

17) When is hierarchical clustering preferable to partitioning methods?

Answer: When the number of clusters is unknown, you need a dendrogram for multi-resolution insights, or data is small to moderate in size.

18) How do t-SNE and UMAP compare for visualization?

Answer: Both preserve local structure; t-SNE excels at local neighborhoods but struggles with global distances; UMAP often preserves more global structure and scales better.

19) What are common approaches to anomaly detection?

Answer: Statistical thresholds, Isolation Forest, One-Class SVM, autoencoders, and density-based methods like LOF, chosen by data type and label scarcity.
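
A minimal Isolation Forest sketch on synthetic data with a small fraction of injected outliers:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(1000, 2)),      # normal points
                   rng.uniform(-6, 6, size=(20, 2))])     # injected anomalies

    iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
    labels = iso.predict(X)                 # +1 for inliers, -1 for flagged anomalies
    print("flagged:", int((labels == -1).sum()))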

20) What are support, confidence, and lift in association mining?

Answer: Support is the fraction of transactions containing the itemset; confidence is the conditional probability of the consequent given the antecedent; lift is confidence divided by the consequent's support, so lift > 1 indicates a positive association.
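
A worked example with hypothetical basket counts to make the three quantities concrete:

    # hypothetical data: 1000 transactions, rule {bread} -> {butter}
    n = 1000
    n_bread, n_butter, n_both = 300, 200, 120

    support = n_both / n                    # 0.12
    confidence = n_both / n_bread           # P(butter | bread) = 0.40
    lift = confidence / (n_butter / n)      # 0.40 / 0.20 = 2.0 > 1: positive association
    print(support, confidence, lift)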

Section 5 — Feature Engineering & Data Quality

21) How do you handle missing data without biasing results?

Answer: Analyze the missingness mechanism (MCAR/MAR/MNAR), apply appropriate imputation (median, kNN, MICE), add missing-value indicator features, and run sensitivity checks to confirm conclusions are robust.
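
A minimal sketch using median imputation plus missing-value indicators, assuming scikit-learn:

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, 20.0],
                  [2.0, np.nan],
                  [np.nan, 40.0],
                  [4.0, 80.0]])

    # median imputation plus binary "was missing" indicator columns
    imputer = SimpleImputer(strategy="median", add_indicator=True)
    print(imputer.fit_transform(X))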

22) Which encoding strategies suit high-cardinality categorical features?

Answer: Target/likelihood encoding computed out-of-fold (with CV), the hashing trick, and learned embeddings for deep models, all with careful regularization to prevent target leakage.

23) When should you standardize, normalize, or use robust scaling?

Answer: Standardize (zero mean, unit variance) for scale-sensitive models such as linear models and SVMs; min–max normalize for bounded inputs and neural networks; use robust scaling (median/IQR) when outliers would distort the mean and variance.

24) How do you detect and prevent data leakage?

Answer: Audit pipelines so transforms fit only on training folds, isolate temporal features correctly, and avoid target-derived features leaking into training.
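
A minimal sketch of the standard guard against preprocessing leakage: put the transform inside a Pipeline so it is refit on each training fold only (scikit-learn assumed):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # the scaler never sees the held-out fold's statistics
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    print(cross_val_score(pipe, X, y, cv=5).mean())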

25) What special care is needed for time-series features?

Answer: Use lag/rolling stats, avoid lookahead, employ time-based CV, and handle seasonality/holidays and stationarity transformations.
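
A pandas sketch of lag and rolling features built only from past values; the daily sales series is hypothetical:

    import pandas as pd

    idx = pd.date_range("2024-01-01", periods=90, freq="D")
    df = pd.DataFrame({"sales": range(90)}, index=idx)

    # shift(1) ensures each feature uses only past values (no lookahead)
    df["lag_1"] = df["sales"].shift(1)
    df["lag_7"] = df["sales"].shift(7)
    df["rolling_mean_7"] = df["sales"].shift(1).rolling(7).mean()
    df["dayofweek"] = df.index.dayofweek
    print(df.dropna().head())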

Section 6 — Evaluation, Validation & Experimentation

26) How do cross-validation strategies differ for IID data vs. time series?

Answer: IID uses k-fold/stratified; time series uses forward-chaining or rolling windows preserving temporal order to avoid leakage.
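
A minimal sketch of forward-chaining splits with scikit-learn's TimeSeriesSplit:

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    X = np.arange(12).reshape(-1, 1)
    for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
        # training indices always precede test indices, preserving temporal order
        print("train:", train_idx, "test:", test_idx)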

27) How do you choose appropriate metrics for classification, regression, and ranking?

Answer: Classification: accuracy/F1/PR-AUC; regression: RMSE/MAE/R²; ranking/recommenders: MAP/NDCG/Hit@K—aligned to business impact.

28) When is ROC-AUC misleading, and why might PR-AUC be better?

Answer: Under heavy class imbalance, ROC-AUC can look strong because the false-positive rate is diluted by the large number of negatives; PR-AUC focuses on precision and recall of the positive class, better reflecting usefulness when positives are rare.
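
A minimal sketch contrasting the two metrics on heavily imbalanced synthetic data (scikit-learn assumed; exact values will vary):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score, roc_auc_score
    from sklearn.model_selection import train_test_split

    # roughly 1% positives: ROC-AUC can look strong while PR-AUC stays modest
    X, y = make_classification(n_samples=50000, n_features=20, weights=[0.99],
                               flip_y=0.02, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

    print("ROC-AUC:", round(roc_auc_score(y_te, scores), 3))
    print("PR-AUC :", round(average_precision_score(y_te, scores), 3))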

29) What is model calibration, and how do you improve it?

Answer: Calibration aligns predicted probabilities with reality; improve via Platt scaling, isotonic regression, or calibrated ensembles.
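
A minimal sketch wrapping an uncalibrated margin classifier with Platt scaling via CalibratedClassifierCV (scikit-learn assumed):

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.metrics import brier_score_loss
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # sigmoid = Platt scaling; method="isotonic" is the non-parametric alternative
    calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
    calibrated.fit(X_tr, y_tr)
    probs = calibrated.predict_proba(X_te)[:, 1]
    print("Brier score:", round(brier_score_loss(y_te, probs), 3))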

30) How do offline evaluation and online A/B testing complement each other?

Answer: Offline gives fast iteration and safety; A/B validates real impact, captures feedback loops and bias shifts; use both to de-risk launches.

Section 7 — Deep Learning Basics

31) Why are non-linear activation functions essential in neural networks?

Answer: Without non-linearity, stacked layers collapse to a linear map; activations allow networks to approximate complex, non-linear functions.
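
A tiny numpy sketch showing why: without a non-linearity, two stacked linear layers collapse into one:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 4))
    W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))

    # two stacked linear layers are exactly one linear layer with weights W1 @ W2
    print(np.allclose((x @ W1) @ W2, x @ (W1 @ W2)))                 # True

    # inserting a ReLU between them breaks the equivalence and adds expressive power
    print(np.allclose(np.maximum(x @ W1, 0) @ W2, x @ (W1 @ W2)))    # False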

32) What causes vanishing/exploding gradients, and how do you mitigate them?

Answer: They arise from repeatedly multiplying layer Jacobians through many layers; mitigate with ReLU-family activations, careful initialization, residual connections, normalization layers, and gradient clipping.

33) What architectural ideas make CNNs effective for images?

Answer: Local receptive fields, weight sharing, and hierarchical feature extraction via convolution/pooling capture spatial structure efficiently.

34) How do RNNs/LSTMs compare to Transformers for sequence modeling?

Answer: RNNs/LSTMs process sequentially and struggle with long dependencies; Transformers use attention for parallelism and long-range context.

35) Which regularization techniques are common in deep learning?

Answer: Dropout, weight decay, data augmentation, early stopping, batch/LayerNorm, and mixup/cutout for robustness.

Section 8 — NLP, CV & Recommender Systems

36) How do static word embeddings differ from contextual embeddings?

Answer: Static (Word2Vec/GloVe) give one vector per word; contextual (BERT) vary by sentence, capturing polysemy and richer semantics.

37) When should you fine-tune a pre-trained model versus use it as a frozen feature extractor?

Answer: Fine-tune when you have enough domain data and compute; freeze the backbone for small datasets or when you need fast training and want to avoid overfitting.

38) How does transfer learning accelerate computer vision projects?

Answer: It reuses features learned on large datasets (e.g., ImageNet), reducing data needs and training time while improving accuracy.

39) What are the main families of recommender systems?

Answer: Collaborative filtering (user–item interactions), content-based (item/user features), and hybrids combining both with contextual signals.

40) How do you handle cold-start problems in recommenders?

Answer: Use content features, popularity/recency priors, contextual bandits, onboarding questionnaires, and cross-domain transfer.

Section 9 — MLOps, Deployment & Monitoring

41) What deployment patterns exist for ML models, and when choose each?

Answer: Batch scoring for periodic workloads; online REST/gRPC for low-latency; streaming for real-time events; edge for offline/latency-critical use.

42) What is a feature store, and why is it valuable?

Answer: It centralizes feature definitions, backfills, and online/offline consistency to reduce leakage, duplication, and training–serving skew.

43) How do you detect data and concept drift in production?

Answer: Monitor feature distributions (KS tests, PSI), use proxy metrics while labels are delayed, track performance by cohort, and trigger alerts or retraining when drift thresholds are crossed.
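
A minimal sketch of a Population Stability Index check between a training reference and live data; the 0.2 threshold is a common rule of thumb, not a standard:

    import numpy as np

    def psi(expected, actual, bins=10):
        # Population Stability Index between a reference sample and a live sample
        edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
        actual = np.clip(actual, edges[0], edges[-1])   # keep live values inside the reference range
        e_frac = np.clip(np.histogram(expected, bins=edges)[0] / len(expected), 1e-6, None)
        a_frac = np.clip(np.histogram(actual, bins=edges)[0] / len(actual), 1e-6, None)
        return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

    rng = np.random.default_rng(0)
    train_feature = rng.normal(0.0, 1.0, 10000)
    live_feature = rng.normal(0.3, 1.2, 10000)          # shifted distribution in production
    print(round(psi(train_feature, live_feature), 3))   # PSI > 0.2 would typically trigger an alert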

44) How do you approach model explainability responsibly?

Answer: Use inherently interpretable models where possible; apply SHAP/LIME/feature importance carefully, validate stability, and tailor explanations to stakeholders.

45) Which governance and ethics practices are essential for ML systems?

Answer: Document datasets/models, manage consent, minimize PII, assess bias/fairness, perform model risk reviews, and enable auditability and rollback.

Section 10 — Real-World Scenarios & Troubleshooting

46) A retrained model performs worse than the previous version; how do you proceed?

Answer: Compare data slices, check label drift/leakage, reproduce training with fixed seeds, inspect hyperparameters, and roll back while you root-cause.

47) Offline metrics are strong, but online impact is weak; what could be wrong?

Answer: Training–serving skew, poor calibration, feedback loops, or metric mismatch with business KPIs; validate with shadow traffic and A/B tests.

48) You have very little labeled data; how can you build a useful model?

Answer: Use transfer learning, data augmentation, weak labeling, active learning, and semi-/self-supervised methods to leverage unlabeled data.

49) Stakeholders want interpretability over a small accuracy gain; what do you deliver?

Answer: Prefer interpretable models (GBMs with monotonic constraints, GLMs, GAMs) or deliver post-hoc explanations with governance and human-in-the-loop checks.

50) How do you communicate model value to non-technical leaders effectively?

Answer: Frame outcomes in business KPIs (lift, cost savings), show counterfactual examples, quantify uncertainty, and present an experiment-backed roadmap.