LightGBM Pocket Book

LightGBM Pocket Book — Uplatz

50 in-depth cards • Wide layout • Readable examples • 20+ Interview Q&A included

Section 1 — Foundations

1) What is LightGBM?

LightGBM (by Microsoft) is a gradient boosting framework using tree-based learners optimized with histogram-based splitting, leaf-wise growth, and smart sampling. It’s known for speed, memory efficiency, and strong accuracy on tabular data. Supports regression, binary/multiclass classification, and ranking (LambdaRank).

pip install lightgbm
# Optional: GPU build requires proper toolchain & GPU libs

2) Why LightGBM vs XGBoost/CatBoost?

LightGBM’s histogram algorithm + leaf-wise growth can be very fast and accurate, especially on large, sparse datasets. XGBoost is a strong baseline with a broad feature set and robust CPU/GPU support; CatBoost excels at out-of-the-box categorical handling. Pick based on data shape, categorical complexity, and infra.

# scikit-learn API usage
import lightgbm as lgb
from lightgbm import LGBMClassifier

clf = LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=64)
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], eval_metric='auc',
        callbacks=[lgb.early_stopping(50)])  # early_stopping_rounds was removed from fit() in LightGBM 4.x

3) Core Ideas: Histograms, Leaf-wise Growth

Features are bucketed into histograms to compute split gains quickly. LightGBM grows trees leaf-wise (best-first): it expands the leaf with the highest loss reduction. This can improve accuracy but risks overfitting without constraints (e.g., num_leaves, min_data_in_leaf).

params = {"boosting_type":"gbdt","num_leaves":63,"max_depth":-1,"min_data_in_leaf":20}

4) Objectives & Metrics

Common objectives: regression, binary, multiclass, lambdarank. Metrics: rmse, mae, auc, binary_logloss, multi_logloss, ndcg, map. Set both explicitly so you track the right signal.

params = {"objective":"binary","metric":["auc","binary_logloss"]}

5) Sklearn API vs Native API

Sklearn wrappers (LGBMClassifier, LGBMRegressor, etc.) integrate nicely with pipelines and CV. Native API uses lgb.Dataset and lgb.train for fine control (multiple validation sets, callbacks, custom objectives/metrics).

import lightgbm as lgb
dtrain = lgb.Dataset(X_train, label=y_train, free_raw_data=False)
bst = lgb.train(params, dtrain, num_boost_round=2000, valid_sets=[dtrain], valid_names=["train"])

6) Handling Missing Values

LightGBM natively handles NaNs; it learns the optimal direction for missing values at each split. You generally don’t need imputation unless it benefits downstream features. If imputing, encode “missingness” separately so information isn’t lost.

# NaNs allowed in X; LightGBM handles them internally
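
A minimal sketch (assuming the X_train/y_train frames from the earlier cards): inject a NaN and fit with no imputation step.

import numpy as np
from lightgbm import LGBMClassifier

X_demo = X_train.copy()
X_demo.iloc[0, 0] = np.nan                              # introduce a missing value
LGBMClassifier(n_estimators=50).fit(X_demo, y_train)    # trains fine; NaNs get a learned default split direction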

7) Categorical Features

Provide integer-encoded categorical columns and mark them with categorical_feature. LightGBM uses optimal splits with built-in category handling (no one-hot needed). Keep categories stable across train/valid/test.

cat_cols = ["country","device","plan"]
for c in cat_cols:
    X_train[c] = X_train[c].astype("category")  # pandas category dtype is auto-detected
clf = LGBMClassifier()
clf.fit(X_train, y_train, categorical_feature=cat_cols)  # or rely on the dtype auto-detection

8) Imbalance: is_unbalance & class_weight

For skewed classes, set is_unbalance=true or class_weight={0: 1, 1: w}. Choose one (not both). Also consider a custom focal loss or proper sampling strategies.

LGBMClassifier(is_unbalance=True)  # or class_weight={0: 1, 1: 10}, keyed by the actual label values

9) Early Stopping & Validation

Use a validation set and early_stopping_rounds to stop when metric doesn’t improve, then reuse best_iteration_ for predictions. Keep validation distribution realistic (time-aware split when appropriate).

clf.fit(X_tr, y_tr, eval_set=[(X_val,y_val)], eval_metric="auc",
        callbacks=[lgb.early_stopping(100)])  # replaces early_stopping_rounds in LightGBM >= 4.0
y_pred = clf.predict_proba(X_te, num_iteration=clf.best_iteration_)[:,1]

10) Q&A — “Why does LightGBM overfit sometimes?”

Answer: Leaf-wise growth can create deep, unbalanced trees capturing noise. Control with num_leaves, min_data_in_leaf, max_depth, feature_fraction, bagging_fraction, and early stopping. Always validate on held-out data.
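
As a hedged illustration of those knobs together (values are starting points, not recommendations):

# Illustrative regularization settings for leaf-wise overfitting; tune on validation data
params.update(
    num_leaves=31,         # fewer leaves = simpler trees
    min_data_in_leaf=50,   # require more samples per leaf
    max_depth=8,           # optional hard cap on depth
    feature_fraction=0.8,  # column subsampling
    bagging_fraction=0.8,  # row subsampling
    bagging_freq=1,
)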

Section 2 — Data Preparation & Feature Engineering

11) Dataset & Freeing Memory

With native API, build lgb.Dataset for train/valid. Use free_raw_data=False if you need raw features later. Reuse constructed bins when creating validation sets to speed up.

dtrain = lgb.Dataset(X_tr, y_tr, free_raw_data=False)
dval   = lgb.Dataset(X_val, y_val, reference=dtrain)

12) Feature Binning

LightGBM discretizes features into bins (controlled by max_bin). More bins capture finer splits but increase memory/time. Typical defaults work; increase cautiously if you see underfitting on continuous features.

params["max_bin"] = 255  # try 255, 511 for more granularity

13) High Cardinality Categoricals

LightGBM can handle high-cardinality categoricals without one-hot, but performance may degrade. Consider frequency/target encoding (carefully, with CV), grouping rare categories, or hashing to reduce noise.

# Frequency encode before marking as category (optionally)
freq = X_tr['city'].value_counts()
X_tr['city_freq'] = X_tr['city'].map(freq)
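
A complementary sketch for grouping rare categories (the 'city' column and the count threshold are illustrative):

# Collapse categories seen fewer than min_count times into one bucket
min_count = 20
rare = freq[freq < min_count].index
X_tr['city_grouped'] = X_tr['city'].where(~X_tr['city'].isin(rare), 'OTHER').astype('category')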

14) Text & Dates

For text, build numeric features (TF-IDF, embeddings) then feed to LightGBM. For timestamps, extract calendar and lag features. Beware data leakage: compute statistics (means, counts) within CV folds only.

X["hour"] = pd.to_datetime(X["ts"]).dt.hour
X["dow"]  = pd.to_datetime(X["ts"]).dt.dayofweek

15) Monotonic Constraints

Enforce monotone relationships (e.g., price ↑ → risk ↑). Provide an array of -1/0/+1 with one entry per feature. Make sure each constraint's sign matches the intended direction of the relationship for that feature.

params["monotone_constraints"] = [1, 0, -1, 0, ...]  # length == n_features

16) Interaction Constraints (Practical Tip)

Constrain which features can interact (co-occur in splits) to reduce spurious rules and overfitting. Define feature groups that can split together; others cannot. Use sparingly for domain rules.

# Interaction constraints are lists of feature indices (check your LightGBM version for support)
params["interaction_constraints"] = [[0, 1], [2, 3]]  # e.g., [age, income] and [country, device]

17) Grouped / Time-Aware Splits

For time series and groups (users/sessions), use time-based splits or group-aware CV to avoid leakage. LightGBM doesn’t enforce this — you must split appropriately before fitting.

# sklearn TimeSeriesSplit or GroupKFold
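
A brief sketch of both split styles (groups is assumed to be an array of user/session ids aligned with X):

from sklearn.model_selection import TimeSeriesSplit, GroupKFold

for tr_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    pass  # fit on X.iloc[tr_idx], validate on the later X.iloc[val_idx]

for tr_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups=groups):
    pass  # no group appears in both train and validation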

18) Label Encoding vs One-Hot

Prefer integer-encoding for categorical features with categorical_feature set. One-hot can work but increases dimensionality. If using one-hot, beware of high cardinality and sparse effects.

# pandas Categorical to integers is fine
X[c] = X[c].astype("category")

19) Feature Importance

Types: gain (total split gain), split (frequency). Gain is more informative. Combine with permutation importance and SHAP for robust insights.

import numpy as np
imp = clf.booster_.feature_importance(importance_type='gain')
names = clf.booster_.feature_name()
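
To complement gain importance, a short permutation-importance sketch on a held-out set (assumes the clf and X_val/y_val from earlier cards):

from sklearn.inspection import permutation_importance

perm = permutation_importance(clf, X_val, y_val, scoring="roc_auc", n_repeats=5, random_state=42)
top = sorted(zip(names, perm.importances_mean), key=lambda t: -t[1])[:10]  # ten most important features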

20) Q&A — “Should I scale features?”

Answer: Tree models don’t need scaling or normalization; splits depend only on the ordering of values, so LightGBM is unaffected. Scale only if the same features also feed scale-sensitive models (linear, kNN, neural nets). For LightGBM itself, skip it.

Section 3 — Training, Hyperparameters & Regularization

21) num_leaves & max_depth

num_leaves controls model complexity (higher = more complex). max_depth caps tree depth. Rule of thumb: keep num_leaves below 2^(max_depth) when both are set. Start modestly (31–127) and tune.

params.update(num_leaves=63, max_depth=-1)

22) min_data_in_leaf & min_sum_hessian_in_leaf

Regularize leaves by requiring a minimum number of samples or total Hessian. Increase to reduce overfitting on small patterns/noise.

params.update(min_data_in_leaf=30, min_sum_hessian_in_leaf=1e-3)

23) Learning Rate & Estimators

Small learning_rate with larger n_estimators improves generalization but costs time. Typical: 0.03–0.1. Use early stopping to avoid guessing n_estimators.

LGBMClassifier(learning_rate=0.05, n_estimators=5000)

24) Subsampling: feature_fraction & bagging_fraction

Randomly sample features and rows per iteration for regularization and speed. Pair with bagging_freq (e.g., 1 for every iteration).

params.update(feature_fraction=0.8, bagging_fraction=0.8, bagging_freq=1)

25) L1/L2 Regularization

lambda_l1 and lambda_l2 penalize leaf scores to reduce variance. Start with small values (e.g., 0–1). Tune alongside leaves and min_data_in_leaf.

params.update(lambda_l1=0.0, lambda_l2=1.0)

26) Boosting Types: gbdt, dart, rf, goss

gbdt is standard gradient boosting. dart drops trees (like dropout) to reduce overfitting. rf builds random forests. goss (Gradient-based One-Side Sampling) keeps large-grad samples and subsamples small-grad ones to speed training.

params["boosting_type"] = "dart"  # try "gbdt", "dart", "rf", "goss"

27) DART Parameters

For boosting_type="dart", set drop_rate, skip_drop, and max_drop. DART can help generalization but may need more trees.

params.update(boosting_type="dart", drop_rate=0.1, skip_drop=0.5)

28) GOSS & EFB

GOSS speeds training by smart sampling gradients. EFB (Exclusive Feature Bundling) packs mutually exclusive features to reduce dimensionality. Both are internal speed-ups; you mainly control via boosting_type="goss" and defaults.

params["boosting_type"] = "goss"

29) Cross-Validation

Use lgb.cv or sklearn CV with early stopping to pick iterations and avoid overfitting. Remember to propagate best_iteration to final training.

cv = lgb.cv(params, dtrain, nfold=5, num_boost_round=10000, seed=42,
            callbacks=[lgb.early_stopping(200)])  # early_stopping_rounds argument was removed in LightGBM 4.x
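
To propagate the CV result, a hedged sketch (the result-dict key naming varies across LightGBM versions, hence the suffix match):

metric_key = next(k for k in cv if k.endswith("-mean"))   # e.g. "valid auc-mean"
best_rounds = len(cv[metric_key])                         # rounds reached before early stopping
final_bst = lgb.train(params, dtrain, num_boost_round=best_rounds)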

30) Q&A — “What’s a good starting grid?”

Answer: learning_rate=0.05, num_leaves=63, min_data_in_leaf=30, feature_fraction=0.8, bagging_fraction=0.8, bagging_freq=1, lambda_l2=1, with early stopping. Then adjust leaves, min_data_in_leaf, and regularization.
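
The same starting grid as a native-API params dict (objective/metric shown for a binary task, as elsewhere in this book):

params = {
    "objective": "binary", "metric": "auc",
    "learning_rate": 0.05, "num_leaves": 63, "min_data_in_leaf": 30,
    "feature_fraction": 0.8, "bagging_fraction": 0.8, "bagging_freq": 1,
    "lambda_l2": 1.0,
}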

Section 4 — Ranking, GPU, Interpretability & Ops

31) Learning-to-Rank (LambdaRank)

Set objective="lambdarank" with group/query info. Metrics: ndcg, map. Provide group sizes (queries), optional label_gain, and evaluate with NDCG@k.

params = {"objective":"lambdarank","metric":"ndcg","ndcg_eval_at":[5,10]}
# dtrain = lgb.Dataset(X, label=y, group=query_sizes)
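
A sketch of deriving query sizes from a query-id column (assumes a dataframe df whose rows are already sorted/grouped by qid and aligned with X, y):

query_sizes = df.groupby("qid", sort=False).size().to_numpy()  # one size per query, in row order
dtrain = lgb.Dataset(X, label=y, group=query_sizes)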

32) Multiclass Classification

Use objective="multiclass" with num_class. Metric: multi_logloss or multi_error. For imbalanced classes, consider class weights.

params = {"objective":"multiclass","num_class":5,"metric":"multi_logloss"}

33) GPU Training

Enable GPU with device="gpu". It accelerates histogram building and split finding; benefits vary by dataset. Ensure GPU build is installed and memory is sufficient.

params.update(device="gpu")  # or device_type in some builds

34) Calibration

Boosted probabilities can be miscalibrated. Apply Platt scaling or isotonic regression on validation predictions to calibrate probabilities for decision-making.

from sklearn.calibration import CalibratedClassifierCV
cal = CalibratedClassifierCV(clf, cv="prefit", method="isotonic").fit(X_val, y_val)

35) SHAP Values

Use SHAP to interpret feature contributions per prediction. Combine with global importance for reliable narratives. Be mindful of correlated features.

import shap
explainer = shap.TreeExplainer(clf.booster_)
shap_values = explainer.shap_values(X_sample)

36) Partial Dependence & ICE

Explore feature effects with PDP/ICE to see marginal impact. This complements SHAP for business stakeholders.

from sklearn.inspection import PartialDependenceDisplay  # plot_partial_dependence was removed in newer scikit-learn
PartialDependenceDisplay.from_estimator(clf, X_val, features=[0, 1])

37) Model Saving & Inference

Save boosters to text or binary; load later for predictions. Keep feature order/processing identical at inference.

bst.save_model("model.txt")   # by default saves up to best_iteration when early stopping was used
bst2 = lgb.Booster(model_file="model.txt")
pred = bst2.predict(X_te)

38) Onnx & Portability (Tip)

You can export to ONNX via converters for some pipelines, but many deploy LightGBM natively (Python/C++/CLI). Validate parity if converting.

# Use onnxmltools/skl2onnx with care; verify predictions match within tolerance
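
A hedged conversion sketch using onnxmltools' LightGBM converter; the input name and shape below are illustrative, so verify against your installed versions and compare predictions within tolerance:

import onnxmltools
from onnxmltools.convert.common.data_types import FloatTensorType

onnx_model = onnxmltools.convert_lightgbm(
    clf, initial_types=[("input", FloatTensorType([None, X_train.shape[1]]))]
)
onnxmltools.utils.save_model(onnx_model, "model.onnx")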

39) Logging & Reproducibility

Log params, seed, data versions, and code commit. Fix seed for reproducibility (note: parallelism can still introduce tiny non-determinism). Save best_iteration and evaluation curves.

params.update(seed=42, deterministic=True)

40) Q&A — “Why is my validation AUC unstable?”

Answer: Small data, leakage, or high variance splits. Use stratified CV, larger validation sets, group/time-aware splits, and average across folds. Fix seeds and log experiments.
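
A sketch of the fold-averaging suggestion with fixed seeds (assuming the full X, y):

from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LGBMClassifier(random_state=42), X, y, cv=skf, scoring="roc_auc")
print(f"AUC {scores.mean():.4f} +/- {scores.std():.4f}")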

Section 5 — Recipes, CLI, Checklists & Interview Q&A

41) Quick Binary Classification (sklearn)

Fast baseline with early stopping; tune leaves and regularization next.

import lightgbm as lgb
from lightgbm import LGBMClassifier

clf = LGBMClassifier(
  learning_rate=0.05, n_estimators=5000, num_leaves=63,
  min_data_in_leaf=30, feature_fraction=0.8, bagging_fraction=0.8, bagging_freq=1
)
clf.fit(X_tr, y_tr, eval_set=[(X_val,y_val)], eval_metric="auc",
        callbacks=[lgb.early_stopping(200)])  # callbacks replace early_stopping_rounds in LightGBM 4.x

42) Native API with Multiple Validations

Track several validation sets (e.g., CV folds or time slices). The early-stopping callback watches every validation set and metric by default; pass first_metric_only=True to lgb.early_stopping to stop on the first metric only.

bst = lgb.train(params, dtrain, num_boost_round=10000,
  valid_sets=[dtrain, dval1, dval2], valid_names=["train","val1","val2"],
  callbacks=[lgb.early_stopping(200), lgb.log_evaluation(100)])  # replaces early_stopping_rounds / verbose_eval

43) CLI Training

Use LightGBM CLI for reproducible training outside Python.

# train.conf
task = train
objective = binary
metric = auc
data = train.svm
valid_data = valid.svm
num_leaves = 63
learning_rate = 0.05
num_boost_round = 10000
early_stopping_round = 200

# Run
lightgbm config=train.conf

44) Class Weights via sklearn

Use class_weight mapping for imbalanced classes. Prefer weights over naive downsampling if data is scarce.

LGBMClassifier(class_weight={0:1, 1:20})

45) Threshold Selection

Optimize decision threshold on validation set by metric (F1, Youden’s J, cost). Don’t assume 0.5 is optimal for imbalanced tasks.

from sklearn.metrics import f1_score
p_val = clf.predict_proba(X_val)[:, 1]   # validation probabilities
thr = np.linspace(0,1,101)
best = max(thr, key=lambda t: f1_score(y_val, (p_val>t).astype(int)))

46) Pipeline Integration

Combine preprocessing + model in sklearn Pipeline. For categoricals, use encoders that emit integer categories and pass categorical_feature indexes to LGBM.

from sklearn.pipeline import Pipeline
pipe = Pipeline([("model", LGBMClassifier())])

47) Drift Monitoring

Track feature distributions and outcome rates. Recalibrate thresholds or retrain when drift detected. Keep a shadow model to compare.

# Log KS-statistics/PSI per feature over time
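
A minimal PSI sketch for one feature, assuming expected (training) and actual (live) numpy arrays; the 10-bin quantile scheme and the 0.2 alert level are common conventions, not LightGBM specifics:

import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between training and live samples of one feature."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e = np.clip(np.histogram(expected, edges)[0] / len(expected), 1e-6, None)
    a = np.clip(np.histogram(actual, edges)[0] / len(actual), 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

# e.g., investigate a feature when psi(train_col, live_col) > 0.2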

48) Production Checklist

  • Version data, code, params, and model
  • Fix seed, log evals & best_iteration
  • Validate on time/group splits
  • Calibrate probs if needed
  • Monitor drift, latency, errors
  • Have rollback model & thresholds

49) Common Pitfalls

Overfitting from large num_leaves, mixing one-hot + categorical flags incorrectly, leakage in target/freq encoding, using is_unbalance with class weights simultaneously, misaligned feature order at inference, ignoring group/time splits.

50) Interview Q&A — 20 Practical Questions (Expanded)

1) Why is LightGBM fast? Histogram-based splits + leaf-wise growth + efficient sampling (EFB/GOSS) reduce computation and memory.

2) Leaf-wise vs level-wise? Leaf-wise expands the highest-gain leaf first (can fit complex patterns); level-wise grows evenly. Leaf-wise risks overfitting without constraints.

3) Role of num_leaves? Controls complexity; too high overfits, too low underfits. Tune with min_data_in_leaf and regularization.

4) max_depth usage? Cap depth to limit tree size (esp. with high num_leaves) or leave at -1 and rely on other constraints.

5) Imbalanced data approach? Use is_unbalance or class weights, not both. Also tune thresholds, use AUC/PR metrics, and calibrate.

6) Why early stopping? Prevents overfitting and selects best_iteration automatically, improving generalization.

7) Categorical handling? Integer-encode + specify categorical_feature; LightGBM finds optimal category splits without one-hot.

8) When use DART? To reduce overfitting via tree dropout. Expect more iterations; validate gains.

9) GOSS benefit? Speeds training by focusing on large-gradient samples while keeping overall gradient unbiased.

10) feature_fraction vs bagging_fraction? Feature subsampling reduces feature correlation/overfit; bagging subsamples rows per iteration.

11) L1 vs L2 regularization? L1 encourages sparsity in leaf weights; L2 smooths weights. L2 is a common default.

12) Monotone constraints use case? Enforce domain monotonicity (pricing, risk). Helps trust & compliance.

13) Ranking setup? Use lambdarank with group/query sizes; evaluate ndcg@k. Ensure no leakage across groups.

14) Probability calibration? Use isotonic/Platt on validation outputs for better decision thresholds.

15) Importance types? gain (split gain sum) is preferred; split counts frequency. Use SHAP/permutation for robustness.

16) Handling leakage in encodings? Compute encodings within CV folds; never use global target stats that “peek” at validation/test.

17) GPU worth it? Helps on large, wide datasets; speedups vary. Validate memory usage and parity.

18) Time series best practice? Time-based split, lag features, rolling stats within train windows, no shuffling.

19) Reproducibility? Fix seeds, log everything, pin versions; accept small nondeterminism from parallel ops.

20) Deployment gotchas? Preserve feature order/types, same preprocessing, same categorical mapping, use best_iteration at inference.