Skip to content

Latest commit

 

History

History
271 lines (227 loc) · 14 KB

File metadata and controls

271 lines (227 loc) · 14 KB

XGBoost

Introduction

  • eXtreme Gradient Boosting (XGBoost) is a scalable and improved version of the gradient boosting algorithm (terminology alert) designed for efficacy, computational speed and model performance.
  • Model Usage: regression, classification, and ranking problems.
  • Model Architecture:
    • Sequential Learning: XGBoost commonly uses decision trees as its foundational learning. Each subsequent tree is constructed on the previous one's errors, with a focus on misclassified data points. The approach use gradient descent to find the optimal weights for each tree while minimizing the loss function.
    • Ensemble: XGBoost generates an ensemble of decision trees and combines their predictions to improve overall accuracy. The final forecast is a weighted sum of all the trees' predictions and weighted based on performance.
  • Some of the advantages of using XGBoost over Gradient Boosting are:
    • Regularization: - XGBoost Incorporates both L1 (LASSO) and L2 (Ridge) regularization terms in the objective function, providing better control over model complexity.
    • Parallelization: XGBoost is optimized for parallel computing, making it more efficient and scalable. This is achieved through parallel tree construction, which is particularly beneficial for large datasets.
    • Handling Missing Values: XGBoost can handle missing values internally, reducing the need for explicit imputation.
    • Tree Pruning: XGBoost utilizes "max_depth" and "min_child_weight" parameters during tree construction to control the depth and size of trees, enabling more effective pruning.
    • Cross-validation: XGBoost has built-in cross-validation capabilities, simplifying the model selection process while cross-validation in gradient boosting needs to be implemented separately.

How XGboost works

  • XGBoost builds an ensemble of decision trees, where each tree learns from the errors of the previous trees, and together they create a powerful and accurate predictive model.
  • The process of learning from errors and building more trees continues until the model is robust and performs well on new data.
    • Step 1 - Initialization: We start with an initial guess for the predictions $F_1(X)$. This can be a simple value like the average of the target values (for regression) or the most common class (for classification).
    • Step 2 - Calculate Errors: We calculate the errors between our initial predictions and the actual target values in the training data. $r_1= \text{error metrics} (Y, F_1(X))$
    • Step 3 - Build a Tree to correct Errors: Now, we create a decision tree to correct these errors. The tree tries to find patterns in the data that help us make better predictions. $h_1(X, r_1)$
      • $h$ is a function trained to predict the residuals
    • Step 4 - Update Predictions: We use the newly created tree to update our predictions.
      • $F_2(X) = F_1(X) + \alpha h_1(X, r_1) $
      • The tree's predictions are combined with the previous predictions, giving more weight to the tree's predictions when they are more accurate.
    • Step 5 - Repeat for More Trees: We repeat Steps 2 to 4 to create more trees.
      • Each new tree focuses on correcting the errors that the previous trees couldn't handle.
    • Step 6 - Stop When Ready: we repeat this process for a certain number of rounds (boosting rounds) or until the model performs well enough. We want to avoid overfitting, so we stop when the model reaches a satisfactory level of accuracy.
    • Step 7 - Make Predictions: once the training is complete, we have a collection of trees that work together to make predictions on new, unseen data. To make a prediction, we pass the new data through each tree, and their combined predictions give us the final result.

Code Usage: XGBoost Native vs. XGBoost Sklearn API

  • Objective: The learning task parameters define the goal of optimization and the metric that will be chosen at each step.
    • reg:squarederror regression problem (xgb.XGBRegressor(objective ='reg:squarederror')
    • binary:logistic binary classification (the target contains only two classes, i.e., cat or dog)
    • multi:softmax multi-class classification (more than two classes in the target, i.e., apple/orange/banana)
  • Early Stopping Round: when given an unnecessary number of boosting rounds, XGBoost starts to overfit and memorize the dataset (say n_estimators=3000). This, in turn, leads to validation performance drop because the model is memorizing instead of generalizing. early_stopping_rounds helps to prevent that. If value of early_stopping_rounds is set to 10 then model will stop the training process if there is no major improvement in the evaluation parameters.
  • Evaluation Metric: The performance measure.
    • For example, r2 for regression models, precision for classification models. auc (Area under curve) because it performs well with the imbalanced data.
  • Evaluation set: X_val and y_val both are used for the evaluation purpose.

XGBoost Native API xgboost.train()

  • XGBoost comes with its own class for storing datasets called DMatrix.
    • It is a highly optimized class for memory and speed and has the ability to internally deal with categoricals.
    • That's why converting datasets into this format is a requirement for the native XGBoost API.
import xgboost as xgb
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2025)

# Extract categorical features
cats = X.select_dtypes(exclude=np.number).columns.tolist()
# Convert to Pandas category
for col in cats:
   X[col] = X[col].astype('category')

# Create regression matrices
dtrain_reg = xgb.DMatrix(X_train, y_train, enable_categorical=True)
dtest_reg = xgb.DMatrix(X_test, y_test, enable_categorical=True)

# Define hyperparameters
params = {
  "objective": "reg:squarederror",
  "tree_method": "gpu_hist"
}

n = 100
model = xgb.train(
   params=params,
   dtrain=dtrain_reg,
   num_boost_round=n,
)
# Evaluation
from sklearn.metrics import mean_squared_error

preds = model.predict(dtest_reg)
rmse = mean_squared_error(y_test, preds, squared=False)

print(f"RMSE of the base model: {rmse:.3f}")

XGBoost Scikit-Learn API: xgboost.XGBRegressor, XGBClassifier

  • XGBoost also offers XGBClassifier and XGBRegressor classes so that they can be integrated into the Scikit-Learn ecosystem
import xgboost as xgb

# Train a model using the scikit-learn API
clf_xgb = xgb.XGBClassifier(
            n_estimators=100, objective='binary:logistic', tree_method='hist', eta=0.1, max_depth=3, enable_categorical=True
)
# can X_train without encoding
clf_xgb.fit(X_train,
            y_train,
            verbose=False,
            early_stopping_rounds=10,
            eval_metric='auc',
            eval_set=[(X_val, y_val)])

# Convert the model to a native API model
model = clf_xgb.get_booster()

Cross-Validation

  • During cross-validation, we are asking XGBoost to watch three classification metrics which report model performance from three different angles.
# Encode y to numeric
y_encoded = OrdinalEncoder().fit_transform(y)

# Extract text features
cats = X.select_dtypes(exclude=np.number).columns.tolist()

# Convert to pd.Categorical
for col in cats:
   X[col] = X[col].astype('category')

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, random_state=1, stratify=y_encoded)

# Create classification matrices
dtrain_clf = xgb.DMatrix(X_train, y_train, enable_categorical=True)
dtest_clf = xgb.DMatrix(X_test, y_test, enable_categorical=True)

# This multi:softprob objective also requires the number of classes num_class
params = {"objective": "multi:softprob",
          "num_class": 5,
          "tree_method": "gpu_hist"
}
n = 1000

results = xgb.cv(
   params, dtrain_clf,
   num_boost_round=n,
   nfold=5,
   metrics=["mlogloss", "auc", "merror"],
)
  • To see the best AUC score, we take the maximum of test-auc-mean column:
# It has the same number of rows as the number of boosting rounds
# Each row is the average metrics of all splits (folds) for that round.
# to get best number of boosting round (before early stopping)
best_num_boost_round = results.shape[0]
best_auc = results['test-auc-mean'].max()

Hyper-parameter Space

  • Reference: XGBoost's parameters
  • booster [default=gbtree]: This parameter basically selects the type of model to run at each iteration, which gives 2 options
    • gbtree: tree-based models
    • gblinear: linear models.
  • num_boost_round (for the native XGBoost interface) (a.k.a n_estimators for the Scikit-learn interface) number of trees)
    • The more trees you have, the more reliable your predictions will be, but need to use with early_stopping_rounds during .fit() to prevent overfitting.
    • How many trees should you pick?
      • Quick result: limit the number of trees to around 200.
      • Model only runs once a week: up to 5,000 trees.
  • eta regulates how much each tree contributes to the final prediction. The more trees you have, the smaller the learning rate should be.
    • Range: between 0.001 and 0.1.
  • max_depth decides the complexity of each tree in your model & refers to the maximum depth that a tree can grow to.
    • Range: 1 to 10
  • subsample controls the amount of data used for building each tree in your model.
    • Range: fraction that ranges from 0 to 1 (recommended: 0.05 and 1)
      • representing the proportion of the dataset to be randomly selected for training each tree.
    • By using only a portion of the data for each tree, the model can benefit from diversity and reduce the correlation between the trees, which may help combat overfitting.
  • colsample_bytree proportion of features to be considered for each tree.
    • Range: from 0 to 1, where
      • a value of 1 means that all features will be considered for every tree
      • a lower value indicates that only a subset of features will be randomly chosen before building each tree.
  • min_child_weight sets the minimum sum of instance weights that must be present in a child node in each tree.
    • Range: 1 to 20
  • Note: hyper-paramters equivalent from native XGBoost interface (xgboost.train()) to the Scikit-learn interface (xgboost.XGBRegressor, xgboost.XGBClassifier)
    • etalearning_rate
    • lambda → reg_lambda
    • alphareg_alpha
    • num_boost_roundn_estimators: are the number of boosting rounds (trees) and they are functionally equivalent, but for different APIs:
      • Use n_estimators for the Scikit-learn interface (xgboost.XGBRegressor, xgboost.XGBClassifier).
      • Use num_boost_round for the native XGBoost interface (xgboost.train())
def objective(trial):
    param = {
        "verbosity": 0,
        "objective": "binary:logistic",
        # use exact for small dataset.
        #"tree_method": "exact",
        'tree_method':'gpu_hist',  # this parameter means using the GPU when training our model to speedup the training process
        # defines booster, gblinear for linear functions.
        "n_estimators": 1000,
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.1, log=True),
        "booster": trial.suggest_categorical("booster", ["gbtree", "gblinear", "dart"]),
        # L2 regularization weight within a logarithmic scale (log=True)
        "lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
        # L1 regularization weight.
        "alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
        # sampling ratio for training data.
        "subsample": trial.suggest_float("subsample", 0.05, 1.0),
        # sampling according to each tree.
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
    }

    if param["booster"] in ["gbtree", "dart"]:
        # maximum depth of the tree, signifies complexity of the tree.
        param["max_depth"] = trial.suggest_int("max_depth", 3, 9, step=2)
        # minimum child weight, larger the term more conservative the tree.
        param["min_child_weight"] = trial.suggest_int("min_child_weight", 2, 10)
        param["eta"] = trial.suggest_float("eta", 1e-8, 1.0, log=True)
        # defines how selective algorithm is.
        param["gamma"] = trial.suggest_float("gamma", 1e-8, 1.0, log=True)
        param["grow_policy"] = trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"])

    if param["booster"] == "dart":
        param["sample_type"] = trial.suggest_categorical("sample_type", ["uniform", "weighted"])
        param["normalize_type"] = trial.suggest_categorical("normalize_type", ["tree", "forest"])
        param["rate_drop"] = trial.suggest_float("rate_drop", 1e-8, 1.0, log=True)
        param["skip_drop"] = trial.suggest_float("skip_drop", 1e-8, 1.0, log=True)

    # Fit the model
    """
    clf_xgb = XGBClassifier(**param, seed=42)
    clf_xgb.fit(X_train_pre,
                y_train,
                verbose=False,
                early_stopping_rounds=10,
                eval_metric='auc',
                eval_set=[(X_val_pre, y_val)])

    # Make predictions
    y_pred_proba = clf_xgb.predict_proba(X_val_pre)[:, 1]
    # Evaluate predictions
    fpr, tpr, thresholds = roc_curve(y_val, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    """
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=2024)
    # follow the cross validation in optuna:
    # https://www.kaggle.com/code/iqbalsyahakbar/ps4e1-3rd-place-solution#CatBoost
    model = XGBClassifier(**param, seed=42)
    roc_auc = np.round(
                    np.mean(
                        cross_val_score(model, X_train_pre, y_train,
                        scoring="roc_auc", cv=kfold)
                    ), 3
                )

    return roc_auc

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100, timeout=600)

print(f"Number of finished trials: {len(study.trials)}")
print("Best trial:")
trial = study.best_trial

print(f"  Value: {trial.value}")
print("  Params: ")
for key, value in trial.params.items():
    print(f"    {key}: {value}")