Light Gradient-Boosting Machine (LightGBM)

Introduction

Light GBM is a high-performance gradient boosting framework for efficient tree-based machine learning, employing a leaf-wise growth strategy and histogram-based learning, ideal for large datasets and tasks where speed is paramount.
Key Characteristics:
- Gradient Boosting: It is an ensemble learning method that builds a series of weak learners (usually decision trees) to create a strong learner.
- Leaf-Wise Growth: Light GBM grows trees leaf-wise instead of level-wise, leading to faster training times.
- Histogram-Based Learning: Utilizes histograms to find the best splits during tree growth, reducing memory usage and improving computational efficiency.

Code Usage

lgb_model = LGBMClassifier()
lgb_model.fit(X_train_pre,
            y_train,
            eval_metric='auc',
            eval_set=[(X_val_pre, y_val)])

Hyper-parameter Space

num_boost_round (n_estimators) number of trees
- The more trees you have, the more stable your predictions will be
- Tip: Start small, around 100–500, and use early stopping to avoid overfitting.
- How many trees should you choose:
  - If your model needs to deliver results with low latency, you might want to limit the number of trees to around 200.
  - If your model runs once a week (e.g.: sales forecasting) and has more time to make the predictions, you could consider using up to 5,000 trees
learning_rate controls how much each tree contributes to the final prediction.
- Rule of thumb: start by fixing the number of trees and then focus on tuning the learning_rate
- Tip: Start with 0.1, then reduce to 0.01 or 0.001 for tougher problems.
- The more trees you have, the smaller the learning rate should be.
- Range: 0.001 and 0.1 (trial.suggest_float("learning_rate", 1e-3, 0.1, log=True))
num_leaves maximum number of terminal nodes (leaves) that can be present in each tree.
- Tip: A larger number improves accuracy but can lead to overfitting. Keep it close to 2^(max_depth).
- In a decision tree, a leaf represents a decision or an outcome.
- Range: powers of 2, starting from 2 and going up to 1024
- Pros: By increasing the num_leaves, you allow the tree to grow more complex, creating a higher number of distinct decision paths.
- Cons: increasing the number of leaves may also cause the model to overfit the training data, as it will have a lower amount of data per leaf.
max_depth: Limits the depth of the trees.
- Tip: Start with a range of 3-10, depending on your dataset’s complexity.
feature_fraction the fraction of features used in each tree.
- Tip: Start at 0.8 and decrease if you notice overfitting.
subsample control the amount of data used for building each tree in your model.
- Range: a fraction that ranges from 0 to 1, representing the proportion of the dataset to be randomly selected for training each tree (Recommend: 0.05 and 1)
- By using only a subset of the data for each tree, the model can benefit from the diversity and reduce the correlation between the trees, which may help combat overfitting.
bagging_freq is the frequency at which the data is sampled.
- Rule of thumb: to set bagging_freq to a positive value or LightGBM will ignore subsample.
colsample_bytree determines the proportion of features to be used for each tree.
- Range: from 0 to 1, where a value of 1 means that all features will be considered for every tree
min_data_in_leaf sets the minimum number of data points that must be present in a leaf node in each tree.
- This parameter helps control the complexity of the model and prevents overfitting.
- Range: 1 to 100
- If you have a leaf node with only 1 data point, your prediction will be the value of that single data point.
- If you have a leaf node with 30 data points, your prediction will be the average of those 30 data points.

def objective(trial):
    param = {
        "objective": "regression",
        "metric": "auc",
        "n_estimators": 1000,
        "verbosity": -1,
        "bagging_freq": 1,
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.1, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 2, 2**10),
        "subsample": trial.suggest_float("subsample", 0.05, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.05, 1.0),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 1, 100),
    }


    # Fit the model
    model = LGBMClassifier(**param)
    model.fit(X_train_pre,
                y_train,
                eval_set=[(X_val_pre, y_val)])

    # Make predictions
    y_pred_proba = model.predict_proba(X_val_pre)[:, 1]
    # Evaluate predictions
    fpr, tpr, thresholds = roc_curve(y_val, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    return roc_auc

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100, timeout=600)

print(f"Number of finished trials: {len(study.trials)}")
print("Best trial:")
trial = study.best_trial

print(f"  Value: {trial.value}")
print("  Params: ")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Light Gradient-Boosting Machine (LightGBM)

Introduction

Code Usage

Hyper-parameter Space

FilesExpand file tree

lightgbm.md

Latest commit

History

lightgbm.md

File metadata and controls

Light Gradient-Boosting Machine (LightGBM)

Introduction

Code Usage

Hyper-parameter Space