
XGBoost Regression

Introduction

XGBoost is an optimized and scalable gradient boosting library. It is designed to be highly efficient, flexible and portable. XGBoost is widely used in industry for building predictive models, particularly for structured data.

In this tutorial, we will explore how to use XGBoost for regression tasks. We will cover the following topics:

  1. Introduction to regression
  2. Introduction to XGBoost regression
  3. Data preparation
  4. Training the model
  5. Evaluating the model
  6. Tuning hyperparameters
  7. Feature importance

Regression

Regression is a type of supervised learning where the goal is to predict a continuous output variable. The input variables can be either categorical or continuous. Common algorithms include linear regression and polynomial regression; logistic regression, despite its name, is used for classification rather than regression.

In linear regression, the goal is to find a linear relationship between the input variables and the output variable. The coefficients of the linear equation are learned during the training process.

Logistic regression is used when the output variable is categorical, so it is technically a classification algorithm. It models the relationship between the input variables and the probability of a particular outcome.

Polynomial regression is a form of regression in which the relationship between the input variables and the output variable is modelled as an n-th degree polynomial.
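
To make the distinction concrete, here is a minimal sketch (the synthetic data and variable names are made up for illustration) that fits a plain linear model and a degree-2 polynomial model on the same data with scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# synthetic 1-D data with a quadratic relationship plus noise
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = 0.5 * X_demo[:, 0] ** 2 + X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# linear regression: learns one coefficient per input feature
linear = LinearRegression().fit(X_demo, y_demo)

# polynomial regression: expand the features to degree 2, then fit a linear model
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_demo, y_demo)

print('linear R^2:    ', linear.score(X_demo, y_demo))
print('polynomial R^2:', poly.score(X_demo, y_demo))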

XGBoost Regression

XGBoost is an implementation of the gradient boosting algorithm. Gradient boosting builds an ensemble of decision trees sequentially: each new tree is fit to the errors (residuals) of the ensemble built so far, so the combined model gradually corrects its own mistakes.
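
To illustrate the idea (this is a conceptual sketch, not how XGBoost is implemented internally, and the function and parameter names are made up), a hand-rolled boosting loop for squared error fits each shallow tree to the current residuals:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boosting(X, y, n_rounds=100, learning_rate=0.1):
    # start from a constant prediction (the mean of the targets)
    pred = np.full(len(y), y.mean(), dtype=float)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                      # negative gradient for squared error
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred += learning_rate * tree.predict(X)   # add a shrunken correction
        trees.append(tree)
    return trees, pred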

XGBoost is well suited to regression on tabular data: it handles continuous features natively, and recent versions can also handle categorical features directly (via enable_categorical). It also has a number of features that make it stand out from other gradient boosting libraries, including:

  • Regularization: XGBoost applies L1 (reg_alpha) and L2 (reg_lambda) penalties to the leaf weights, which helps prevent overfitting.
  • Cross-validation: XGBoost provides a built-in cross-validation routine (xgb.cv) for estimating performance at each boosting round.
  • Missing value handling: XGBoost learns a default direction for missing values at each split, so it can train on data containing NaNs (see the sketch after this list).
  • Parallel processing: XGBoost can use all the CPU cores on a machine to speed up training.
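
Here is a small, hypothetical sketch of the missing-value handling and the regularization/parallelism parameters (the toy data and parameter values are made up for illustration):

import numpy as np
from xgboost import XGBRegressor

# toy data containing missing entries
X_toy = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [4.0, np.nan],
                  [5.0, 6.0]])
y_toy = np.array([1.0, 2.0, 3.0, 4.0])

# reg_alpha / reg_lambda are the L1 / L2 penalties; n_jobs controls CPU parallelism
model = XGBRegressor(n_estimators=50, reg_alpha=0.1, reg_lambda=1.0, n_jobs=-1)
model.fit(X_toy, y_toy)      # NaNs are routed to a learned default direction at each split
print(model.predict(X_toy))
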
Data Preparation

Before training our XGBoost regression model, we first need to prepare our data. Older tutorials often use the Boston housing dataset, but it was removed from scikit-learn (in version 1.2), so we will use the California housing dataset, which is included in the scikit-learn library. The goal is to predict the median house value for each block group (in units of $100,000).

# import libraries
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# load data (load_boston was removed from scikit-learn 1.2)
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
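
Before training, a quick sanity check on the split (using the variables defined above) confirms the shapes look right:

# confirm the train/test split sizes
print(X_train.shape, X_test.shape)   # for California housing: (16512, 8) and (4128, 8)
print(y_train[:5])                   # first few targets (median house value, in $100,000s)
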
Training the Model

Now that our data is prepared, we can train our XGBoost regression model. To do this, we first need to import the XGBRegressor class from the xgboost library. We can then create an instance of the class and specify the hyperparameters we want to use for the model.

# import XGBRegressor
from xgboost import XGBRegressor

# create an instance of the model with our chosen hyperparameters
xgb_reg = XGBRegressor(
    objective='reg:squarederror',
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=5,
    subsample=0.7,
    colsample_bytree=0.7,
    random_state=42,
)

# fit the model
xgb_reg.fit(X_train, y_train)
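
With n_estimators set as high as 1000, it is common to hold out a validation set and stop boosting early once the validation error stops improving. A hedged sketch follows; in recent xgboost versions early_stopping_rounds is passed to the constructor (older versions accepted it as a fit() argument), and the variable names here are just for illustration:

# hold out a validation set from the training data for early stopping
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

xgb_es = XGBRegressor(objective='reg:squarederror', n_estimators=1000,
                      learning_rate=0.05, max_depth=5, subsample=0.7,
                      colsample_bytree=0.7, random_state=42,
                      early_stopping_rounds=50)
xgb_es.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)

print('best iteration:', xgb_es.best_iteration)
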
Evaluating the Model

Once we have trained our XGBoost regression model, we need to evaluate its performance. We can do this by computing the root mean squared error (RMSE) on the test set. RMSE is a common metric for regression models: it is the square root of the average squared difference between the predicted and actual values, expressed in the same units as the target.

import numpy as np
from sklearn.metrics import mean_squared_error

# make predictions on the test set
y_pred = xgb_reg.predict(X_test)

# compute RMSE as the square root of the MSE
# (the squared=False argument is deprecated in newer scikit-learn versions)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print('RMSE:', rmse)
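
It can also be useful to report the coefficient of determination (R^2) alongside RMSE, since it is scale-free:

from sklearn.metrics import r2_score

# R^2 of 1.0 is a perfect fit; 0.0 is no better than always predicting the mean
print('R^2:', r2_score(y_test, y_pred))
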
Tuning Hyperparameters

Hyperparameters are parameters that are set before training the model, and they can have a big impact on model performance. XGBoost has a number of hyperparameters that can be tuned to improve performance.

One way to tune the hyperparameters is with cross-validation. XGBoost's native API provides a built-in cross-validation function, xgb.cv, which evaluates a fixed set of parameters across folds at every boosting round. Since we are using the scikit-learn wrapper (XGBRegressor), we can also search over several parameter combinations with scikit-learn's GridSearchCV and pick the combination that performs best.
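
Here is a minimal sketch of the native xgb.cv route, assuming the X_train and y_train variables from earlier (the parameter values are just placeholders):

import xgboost as xgb

# xgb.cv works on xgboost's own DMatrix data structure
dtrain = xgb.DMatrix(X_train, label=y_train)

cv_results = xgb.cv(
    params={'objective': 'reg:squarederror', 'learning_rate': 0.05, 'max_depth': 5,
            'subsample': 0.7, 'colsample_bytree': 0.7},
    dtrain=dtrain,
    num_boost_round=1000,
    nfold=5,
    metrics='rmse',
    early_stopping_rounds=50,
    seed=42,
)

# one row per boosting round: mean/std of train and test RMSE across the folds
print(cv_results.tail())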

# import GridSearchCV for an exhaustive search with cross-validation
from sklearn.model_selection import GridSearchCV

# set hyperparameters to tune
params = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.5, 0.7, 1.0],
    'colsample_bytree': [0.5, 0.7, 1.0]
}

# perform grid search using 5-fold cross-validation
# (3^5 = 243 parameter combinations, so this can take a while)
grid_search = GridSearchCV(xgb_reg, param_grid=params, cv=5,
                           scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)

# print best hyperparameters
print(grid_search.best_params_)
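
After the search finishes, grid_search.best_estimator_ holds a model refit on the full training set with the best parameters (GridSearchCV refits by default), which we can evaluate on the held-out test set:

# evaluate the best model found by the grid search
best_model = grid_search.best_estimator_
best_pred = best_model.predict(X_test)
print('tuned RMSE:', np.sqrt(mean_squared_error(y_test, best_pred)))
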
Feature Importance

After training our XGBoost regression model, we can also look at the feature importances to understand which features are most important for predicting the target variable. We can use the plot_importance function from the xgboost library to visualize the importance of each feature.

from xgboost import plot_importance
import matplotlib.pyplot as plt

# plot feature importances
plot_importance(xgb_reg)
plt.show()
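
If you prefer raw numbers over a plot, the fitted model also exposes a feature_importances_ array that lines up with the training columns:

# feature importances as a sorted table (same order as the columns of X_train)
importances = pd.Series(xgb_reg.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
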
Conclusion

In this tutorial, we learned how to use XGBoost for regression tasks. We covered data preparation, model training and evaluation, hyperparameter tuning, and feature importance. XGBoost is a powerful library that is widely used in industry for building predictive models. With its flexibility, efficiency and portability, XGBoost is definitely a tool that should be in every machine learning developer's toolbox.