使用 GridSearchCV 进行 SVM 超参数调优 |机器学习

机器学习模型被定义为具有许多需要从数据中学习的参数的数学模型。但是，有一些参数，称为超参数，不能直接学习。它们通常是由人类在实际训练开始之前根据一些直觉或命中和试验来选择的。这些参数通过提高模型的性能（例如其复杂性或学习率）来展示其重要性。模型可以有许多超参数，找到最佳参数组合可以视为搜索问题。
SVM也有一些超参数（比如使用什么 C 或 gamma 值），找到最佳超参数是一项非常难以解决的任务。但是可以通过尝试所有组合并查看哪些参数最有效来找到它。它背后的主要思想是创建一个超参数网格并尝试它们的所有组合（因此，这种方法称为Gridsearch ，但不用担心！我们不必手动进行，因为 Scikit-learn 有此功能内置于 GridSearchCV。
GridSearchCV采用一个字典，该字典描述了可以在模型上尝试训练它的参数。参数网格定义为字典，其中键是参数，值是要测试的设置。
本文演示了如何使用GridSearchCV搜索方法找到最佳超参数，从而提高准确性/预测结果

导入必要的库并获取数据：

我们将使用来自 Scikit Learn 的内置乳腺癌数据集。我们可以使用 load函数：

Python3

import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
 
cancer = load_breast_cancer()
 
# The data set is presented in a dictionary form:
print(cancer.keys())

Python3

df_feat = pd.DataFrame(cancer['data'],
                       columns = cancer['feature_names'])
 
# cancer column is our target
df_target = pd.DataFrame(cancer['target'],
                     columns =['Cancer'])
 
print("Feature Variables: ")
print(df_feat.info())

Python3

print("Dataframe looks like : ")
print(df_feat.head())

Python3

from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split(
                        df_feat, np.ravel(df_target),
                test_size = 0.30, random_state = 101)

Python3

# train the model on train set
model = SVC()
model.fit(X_train, y_train)
 
# print prediction results
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))

Python3

from sklearn.model_selection import GridSearchCV
 
# defining parameter range
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}
 
grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3)
 
# fitting the model for grid search
grid.fit(X_train, y_train)

Python3

# print best parameter after tuning
print(grid.best_params_)
 
# print how our model looks after hyper-parameter tuning
print(grid.best_estimator_)

Python3

grid_predictions = grid.predict(X_test)
 
# print classification report
print(classification_report(y_test, grid_predictions))

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

现在我们将把所有特征提取到新的数据帧中，并将我们的目标特征提取到单独的数据帧中。

Python3

df_feat = pd.DataFrame(cancer['data'],
                       columns = cancer['feature_names'])
 
# cancer column is our target
df_target = pd.DataFrame(cancer['target'],
                     columns =['Cancer'])
 
print("Feature Variables: ")
print(df_feat.info())

Python3

print("Dataframe looks like : ")
print(df_feat.head())

训练测试拆分

现在我们将以 70:30 的比例将数据分成训练集和测试集

Python3

from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split(
                        df_feat, np.ravel(df_target),
                test_size = 0.30, random_state = 101)

在没有超参数调整的情况下训练支持向量分类器 –

首先，我们将通过调用标准 SVC()函数而不进行超参数调整来训练我们的模型，并查看它的分类和混淆矩阵。

Python3

# train the model on train set
model = SVC()
model.fit(X_train, y_train)
 
# print prediction results
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))

我们得到了 61 % 的准确率，但您发现有什么奇怪的地方吗？
请注意，第 0 类的召回率和准确率始终为 0。这意味着分类器始终将所有内容分类为单个类，即第 1 类！这意味着我们的模型需要调整其参数。
这是 GridSearch 的用处出现的时候。我们可以使用 GridSearch 来搜索参数！