Machine Learning — Interview Questions Booklet (50 Q&A)
Fundamentals & Math • Supervised & Unsupervised • Feature Engineering • Evaluation & Experimentation • Deep Learning • NLP/CV/Recs • MLOps & Monitoring • Ethics • Real-world Scenarios
1) What is machine learning, and how does it differ from rule-based programming?
Answer: ML learns patterns from data to make predictions/decisions, whereas rule-based systems use hand-crafted logic; ML generalizes from examples and adapts with new data.
2) What are the main types of learning, and when would you use each?
Answer: Supervised (labeled prediction), unsupervised (structure discovery), self-/semi-supervised (leverage unlabeled), and reinforcement learning (sequential decisions).
3) What is the bias–variance trade-off, and why does it matter?
Answer: High bias underfits; high variance overfits. The goal is a model complex enough to capture signal but regularized to generalize to new data.
4) How do parametric and non-parametric models differ?
Answer: Parametric models assume a fixed form with finite parameters (e.g., logistic regression); non-parametric models grow with data (e.g., kNN, trees) and can fit more complex patterns.
5) How do discriminative and generative models compare?
Answer: Discriminative models learn p(y|x) directly for prediction (e.g., logistic regression, SVM); generative models learn the joint p(x, y) (e.g., Naive Bayes), which enables sampling and handling missing inputs.
6) Where would you apply Bayes’ theorem in practical ML workflows?
Answer: In probabilistic classifiers, spam filtering, model calibration, posterior updates with priors, and combining signals from multiple sources.
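Example: a minimal worked sketch, using made-up base rates and likelihoods for a hypothetical spam filter, of how the posterior follows from Bayes' theorem.

```python
# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
# All numbers below are illustrative assumptions, not real statistics.
p_spam = 0.2                 # prior probability a message is spam
p_word_given_spam = 0.6      # likelihood: word appears in spam
p_word_given_ham = 0.05      # likelihood: word appears in non-spam

# Total probability of observing the word (law of total probability)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior probability of spam after observing the word
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.75: the prior 0.2 updates sharply upward
```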
7) How do you detect and quantify overfitting during training?
Answer: Track train vs. validation error curves, use cross-validation, monitor generalization gap, and watch for performance divergence over epochs.
8) How do L1 and L2 regularization differ in effect and use?
Answer: L1 induces sparsity and feature selection; L2 shrinks weights smoothly to reduce variance. Elastic Net blends both for correlated features.
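Example: a minimal scikit-learn sketch on synthetic data (the alpha values are illustrative, not tuned) contrasting how L1, L2, and Elastic Net treat coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Synthetic regression problem with only a few informative features
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

for model in (Lasso(alpha=1.0), Ridge(alpha=1.0), ElasticNet(alpha=1.0, l1_ratio=0.5)):
    model.fit(X, y)
    n_zero = int(np.sum(np.isclose(model.coef_, 0.0)))
    print(type(model).__name__, "zeroed coefficients:", n_zero)
# Lasso (and, to a lesser degree, Elastic Net) drives many coefficients to exactly
# zero; Ridge only shrinks them toward zero.
```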
9) What is PCA, and when should you use or avoid it?
Answer: PCA projects data onto orthogonal components that maximize variance; use it for compression and noise reduction; avoid it when the important structure is non-linear or when interpretability of the original features is critical.
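Example: a short scikit-learn sketch (the Iris dataset is used purely for illustration) showing variance-based component selection after standardization.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Standardize first so no single feature dominates the variance
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # reduced dimensionality
print(pca.explained_variance_ratio_)  # variance captured per component
```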
10) How do gradient descent variants (SGD, momentum, Adam) compare?
Answer: SGD is simple/stochastic; momentum accelerates along consistent gradients; Adam adapts per-parameter steps for faster convergence but may need warmup/decay.
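Example: a NumPy sketch of the three update rules on a toy quadratic loss; the learning rate and decay constants are illustrative defaults, not recommendations.

```python
import numpy as np

def grad(w):
    # Gradient of a toy quadratic loss f(w) = 0.5 * ||w||^2 (illustration only)
    return w

w_sgd = w_mom = w_adam = np.array([5.0, -3.0])
v = np.zeros(2)                     # momentum velocity
m, s = np.zeros(2), np.zeros(2)     # Adam first/second moment estimates
lr, beta, b1, b2, eps = 0.1, 0.9, 0.9, 0.999, 1e-8

for t in range(1, 101):
    # Vanilla SGD: step directly against the gradient
    w_sgd = w_sgd - lr * grad(w_sgd)

    # Momentum: accumulate velocity along consistent gradient directions
    v = beta * v + grad(w_mom)
    w_mom = w_mom - lr * v

    # Adam: bias-corrected moments give per-parameter step sizes
    g = grad(w_adam)
    m = b1 * m + (1 - b1) * g
    s = b2 * s + (1 - b2) * g**2
    m_hat, s_hat = m / (1 - b1**t), s / (1 - b2**t)
    w_adam = w_adam - lr * m_hat / (np.sqrt(s_hat) + eps)

# All three end near the optimum at the origin; Adam may still oscillate
# within roughly one step size of it.
print(w_sgd, w_mom, w_adam)
```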
11) How do ordinary least squares, Ridge, and Lasso regression differ?
Answer: OLS minimizes squared error; Ridge adds L2 penalty to reduce variance; Lasso adds L1 to zero out features, aiding interpretability.
12) When would you choose decision trees, random forests, or gradient boosting?
Answer: Trees for interpretability/baselines; forests for robust performance with minimal tuning; boosting (e.g., XGBoost) for top tabular accuracy with careful regularization.
13) What is the kernel trick in SVMs, and why is it useful?
Answer: Kernels (RBF, polynomial) compute inner products in an implicit high-dimensional feature space, letting the SVM fit non-linear decision boundaries without explicit feature construction.
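Example: a scikit-learn sketch on the synthetic two-moons dataset comparing a linear kernel with an RBF kernel under cross-validation.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable in the input space
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))

print("linear:", cross_val_score(linear_svm, X, y, cv=5).mean())
print("rbf:   ", cross_val_score(rbf_svm, X, y, cv=5).mean())
# The RBF kernel typically scores noticeably higher here because it implicitly
# maps points into a space where the classes become separable.
```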
14) What are the strengths and weaknesses of k-nearest neighbors?
Answer: Simple, non-parametric, and competitive on small, low-dimensional datasets; weak on high-dimensional or large datasets, and sensitive to feature scaling and the choice of k.
15) How do you handle severe class imbalance in classification?
Answer: Use stratified CV, class weights, resampling (SMOTE/undersampling), threshold tuning, and metrics like PR-AUC, F1, and recall at K.
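Example: a scikit-learn sketch on a synthetic 95/5 imbalanced problem combining class weights, PR-AUC, and threshold tuning; the thresholds shown are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import train_test_split

# Imbalanced binary problem: roughly 5% positives
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights the loss by inverse class frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

print("PR-AUC:", average_precision_score(y_te, proba))

# Tune the decision threshold instead of defaulting to 0.5
for thr in (0.3, 0.5, 0.7):
    print(thr, "F1:", f1_score(y_te, (proba >= thr).astype(int)))
```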
16) How do k-means and Gaussian Mixture Models differ, and when pick each?
Answer: k-means assumes spherical, equal-variance clusters with hard assignments; GMM models soft assignments and ellipsoidal clusters via covariances.
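Example: a scikit-learn sketch on synthetic blobs contrasting k-means hard labels with GMM soft responsibilities.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=600, centers=3, cluster_std=[1.0, 2.5, 0.5],
                  random_state=0)

# Hard assignments; implicitly assumes roughly spherical, equal-variance clusters
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Soft (probabilistic) assignments with per-cluster full covariance matrices
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
gmm_proba = gmm.predict_proba(X)   # each cluster's responsibility for each point

print(kmeans_labels[:5])
print(gmm_proba[:5].round(2))
```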
17) When is hierarchical clustering preferable to partitioning methods?
Answer: When the number of clusters is unknown, you need a dendrogram for multi-resolution insights, or data is small to moderate in size.
18) How do t-SNE and UMAP compare for visualization?
Answer: Both preserve local structure; t-SNE excels at local neighborhoods but struggles with global distances; UMAP often preserves more global structure and scales better.
19) What are common approaches to anomaly detection?
Answer: Statistical thresholds, Isolation Forest, One-Class SVM, autoencoders, and density-based methods like LOF, chosen by data type and label scarcity.
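Example: a scikit-learn sketch using Isolation Forest on synthetic data; the contamination value is an assumed anomaly rate, not a learned quantity.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(500, 2))       # bulk of the data
outliers = rng.uniform(-6, 6, size=(15, 2))    # a few scattered anomalies
X = np.vstack([normal, outliers])

# contamination is the assumed fraction of anomalies (a tuning choice)
iso = IsolationForest(contamination=0.03, random_state=0).fit(X)
labels = iso.predict(X)           # +1 = inlier, -1 = anomaly
scores = iso.score_samples(X)     # lower = more anomalous

print("flagged anomalies:", int((labels == -1).sum()))
```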
20) What are support, confidence, and lift in association mining?
Answer: Support is the frequency of the itemset; confidence is the conditional probability of the consequent given the antecedent; lift is confidence divided by the consequent's baseline support, so lift > 1 indicates a positive association.
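Example: a worked calculation with hypothetical counts from 1,000 transactions for the rule bread → butter.

```python
# All counts are made up for illustration.
n = 1000
n_bread = 300           # transactions containing bread
n_butter = 200          # transactions containing butter
n_both = 90             # transactions containing both

support = n_both / n                     # 0.09
confidence = n_both / n_bread            # P(butter | bread) = 0.30
lift = confidence / (n_butter / n)       # 0.30 / 0.20 = 1.5

# lift > 1: bread buyers are more likely than average to buy butter
print(support, confidence, lift)
```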
21) How do you handle missing data without biasing results?
Answer: Analyze the missingness pattern (MCAR/MAR/MNAR), use appropriate imputation (median/kNN/MICE), add missing-value indicators, and run sensitivity checks on the results.
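Example: a scikit-learn sketch of median imputation with missingness indicators and kNN imputation on a tiny hypothetical matrix.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# Median imputation plus indicator columns flagging which values were missing
median_imp = SimpleImputer(strategy="median", add_indicator=True)
print(median_imp.fit_transform(X))

# kNN imputation fills values from the most similar rows
knn_imp = KNNImputer(n_neighbors=2)
print(knn_imp.fit_transform(X))
# MICE-style iterative imputation is another option for MAR data; compare
# downstream metrics across strategies rather than trusting any one default.
```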
22) Which encoding strategies suit high-cardinality categorical features?
Answer: Target/likelihood encoding with CV, hashing trick, embeddings for deep models, and careful regularization to prevent leakage.
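Example: a pandas sketch of out-of-fold target encoding; the column names (city, y) and the random data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical training frame with a high-cardinality categorical column
df = pd.DataFrame({
    "city": np.random.default_rng(0).choice(list("ABCDEFGH"), size=1000),
    "y": np.random.default_rng(1).integers(0, 2, size=1000),
})

global_mean = df["y"].mean()
df["city_te"] = np.nan

# Out-of-fold encoding: each row is encoded with statistics computed only
# from the other folds, which limits target leakage.
for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    fold_means = df.iloc[tr_idx].groupby("city")["y"].mean()
    df.loc[df.index[val_idx], "city_te"] = (
        df.iloc[val_idx]["city"].map(fold_means).fillna(global_mean).values
    )

print(df.head())
```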
23) When should you standardize, normalize, or use robust scaling?
Answer: Standardize (z-score) for scale-sensitive models such as linear models, SVMs, and kNN; min–max normalize for bounded inputs and neural nets; use robust scaling (median/IQR) for outlier-heavy data.
24) How do you detect and prevent data leakage?
Answer: Audit pipelines so transforms fit only on training folds, isolate temporal features correctly, and avoid target-derived features leaking into training.
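Example: a scikit-learn sketch of the pipeline pattern that keeps preprocessing fitted only on training folds during cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=0)

# Wrapping the scaler and model in one pipeline means the scaler is re-fit
# inside each training fold only, so validation folds never influence it.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```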
25) What special care is needed for time-series features?
Answer: Use lag/rolling stats, avoid lookahead, employ time-based CV, and handle seasonality/holidays and stationarity transformations.
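Example: a pandas sketch of leakage-safe lag and rolling features on a hypothetical daily sales series.

```python
import pandas as pd

# Hypothetical daily sales series
idx = pd.date_range("2024-01-01", periods=120, freq="D")
df = pd.DataFrame({"sales": range(120)}, index=idx)

# Lag and rolling features use only past values (shift before rolling),
# which avoids lookahead into the target's own future.
df["lag_7"] = df["sales"].shift(7)
df["roll_mean_28"] = df["sales"].shift(1).rolling(28).mean()
df["dayofweek"] = df.index.dayofweek   # simple seasonality signal

print(df.dropna().head())
```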
26) How do cross-validation strategies differ for IID data vs. time series?
Answer: IID uses k-fold/stratified; time series uses forward-chaining or rolling windows preserving temporal order to avoid leakage.
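Example: a scikit-learn sketch of forward-chaining splits with TimeSeriesSplit on twelve chronologically ordered observations.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # observations in chronological order

# Forward-chaining: each split trains on the past and validates on the next block
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "test:", test_idx)
```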
27) How do you choose appropriate metrics for classification, regression, and ranking?
Answer: Classification: accuracy/F1/PR-AUC; regression: RMSE/MAE/R²; ranking/recommenders: MAP/NDCG/Hit@K—aligned to business impact.
28) When is ROC-AUC misleading, and why might PR-AUC be better?
Answer: With heavy class imbalance, ROC-AUC can look strong because abundant true negatives keep the false-positive rate low; PR-AUC focuses on precision and recall for the positive class, better reflecting usefulness under imbalance.
29) What is model calibration, and how do you improve it?
Answer: Calibration aligns predicted probabilities with reality; improve via Platt scaling, isotonic regression, or calibrated ensembles.
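Example: a scikit-learn sketch comparing Brier scores before and after isotonic calibration on synthetic data; the random forest base model is an illustrative choice.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Isotonic calibration learned via internal cross-validation on the training data
calibrated = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                                    method="isotonic", cv=5).fit(X_tr, y_tr)

# Lower Brier score = better-calibrated probabilities
print("raw:       ", brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1]))
print("calibrated:", brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1]))
```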
30) How do offline evaluation and online A/B testing complement each other?
Answer: Offline gives fast iteration and safety; A/B validates real impact, captures feedback loops and bias shifts; use both to de-risk launches.
31) Why are non-linear activation functions essential in neural networks?
Answer: Without non-linearity, stacked layers collapse to a linear map; activations allow networks to approximate complex, non-linear functions.
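Example: a NumPy sketch showing that stacked linear layers collapse to a single linear map until a non-linearity is inserted.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W1, W2 = rng.normal(size=(3, 5)), rng.normal(size=(5, 2))

# Two stacked linear layers with no activation...
two_layers = x @ W1 @ W2
# ...are exactly one linear layer with weights W1 @ W2.
one_layer = x @ (W1 @ W2)
print(np.allclose(two_layers, one_layer))   # True

# Inserting a ReLU between the layers breaks the equivalence,
# which is what lets depth add representational power.
relu = lambda z: np.maximum(z, 0.0)
print(np.allclose(relu(x @ W1) @ W2, one_layer))  # False in general
```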
32) What causes vanishing/exploding gradients, and how do you mitigate them?
Answer: Repeatedly multiplying small or large gradients through many layers shrinks or blows them up; mitigate with ReLU-family activations, careful initialization, residual connections, normalization layers, and gradient clipping.
33) What architectural ideas make CNNs effective for images?
Answer: Local receptive fields, weight sharing, and hierarchical feature extraction via convolution/pooling capture spatial structure efficiently.
34) How do RNNs/LSTMs compare to Transformers for sequence modeling?
Answer: RNNs/LSTMs process sequentially and struggle with long dependencies; Transformers use attention for parallelism and long-range context.
35) Which regularization techniques are common in deep learning?
Answer: Dropout, weight decay, data augmentation, early stopping, batch/LayerNorm, and mixup/cutout for robustness.
36) How do static word embeddings differ from contextual embeddings?
Answer: Static (Word2Vec/GloVe) give one vector per word; contextual (BERT) vary by sentence, capturing polysemy and richer semantics.
37) When should you fine-tune a pre-trained model versus use it as a frozen feature extractor?
Answer: Fine-tune when you have enough domain data and compute; freeze for small datasets, or when you need speed and want to avoid overfitting.
38) How does transfer learning accelerate computer vision projects?
Answer: It reuses features learned on large datasets (e.g., ImageNet), reducing data needs and training time while improving accuracy.
39) What are the main families of recommender systems?
Answer: Collaborative filtering (user–item interactions), content-based (item/user features), and hybrids combining both with contextual signals.
40) How do you handle cold-start problems in recommenders?
Answer: Use content features, popularity/recency priors, contextual bandits, onboarding questionnaires, and cross-domain transfer.
41) What deployment patterns exist for ML models, and when choose each?
Answer: Batch scoring for periodic workloads; online REST/gRPC for low-latency; streaming for real-time events; edge for offline/latency-critical use.
42) What is a feature store, and why is it valuable?
Answer: It centralizes feature definitions, backfills, and online/offline consistency to reduce leakage, duplication, and training–serving skew.
43) How do you detect data and concept drift in production?
Answer: Monitor feature distributions (KS/PSI), label delay proxies, performance by cohort, and trigger re-training or alerts on drift thresholds.
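Example: a NumPy sketch of a simple PSI computation for a continuous feature; the 0.1/0.25 thresholds are a common rule of thumb, not a standard.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a live feature sample."""
    # Bin edges from quantiles of the reference (training-time) distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e_counts, _ = np.histogram(expected, edges)
    a_counts, _ = np.histogram(np.clip(actual, edges[0], edges[-1]), edges)
    e_frac = np.clip(e_counts / len(expected), 1e-6, None)
    a_frac = np.clip(a_counts / len(actual), 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 10_000)   # training-time distribution
live = rng.normal(0.5, 1, 10_000)      # shifted production distribution

# Rule of thumb: PSI < 0.1 stable, 0.1 to 0.25 moderate shift, > 0.25 major shift
print(round(psi(reference, live), 3))
```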
44) How do you approach model explainability responsibly?
Answer: Use inherently interpretable models where possible; apply SHAP/LIME/feature importance carefully, validate stability, and tailor explanations to stakeholders.
45) Which governance and ethics practices are essential for ML systems?
Answer: Document datasets/models, manage consent, minimize PII, assess bias/fairness, perform model risk reviews, and enable auditability and rollback.
46) A retrained model performs worse than the previous version; how do you proceed?
Answer: Compare data slices, check label drift/leakage, reproduce training with fixed seeds, inspect hyperparameters, and roll back while you root-cause.
47) Offline metrics are strong, but online impact is weak; what could be wrong?
Answer: Training–serving skew, poor calibration, feedback loops, or metric mismatch with business KPIs; validate with shadow traffic and A/B tests.
48) You have very little labeled data; how can you build a useful model?
Answer: Use transfer learning, data augmentation, weak labeling, active learning, and semi-/self-supervised methods to leverage unlabeled data.
49) Stakeholders want interpretability over a small accuracy gain; what do you deliver?
Answer: Prefer interpretable models (GBMs with monotonic constraints, GLMs, GAMs) or deliver post-hoc explanations with governance and human-in-the-loop checks.
50) How do you communicate model value to non-technical leaders effectively?
Answer: Frame outcomes in business KPIs (lift, cost savings), show counterfactual examples, quantify uncertainty, and present an experiment-backed roadmap.