毫升 |装袋分类器

Bagging 分类器是一个集成元估计器，它将每个基分类器拟合到原始数据集的随机子集上，然后聚合它们各自的预测（通过投票或平均）以形成最终预测。这种元估计器通常可以用作减少黑盒估计器（例如，决策树）方差的一种方式，方法是将随机化引入其构建过程，然后将其集成。
每个基分类器与一个训练集并行训练，该训练集通过随机抽取原始训练数据集中的 N 个示例（或数据）进行替换，其中 N 是原始训练集的大小。每个基分类器的训练集彼此独立。许多原始数据可能会在生成的训练集中重复，而其他数据可能会被遗漏。

Bagging 通过平均或投票来减少过拟合（方差），然而，这会导致偏差的增加，但可以通过方差的减少来补偿。

Bagging 如何在训练数据集上工作？
Bagging 如何在虚构的训练数据集上工作如下所示。由于 Bagging 使用替换对原始训练数据集进行重新采样，因此某些实例（或数据）可能会出现多次，而其他实例（或数据）可能会被忽略。

Original training dataset: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Resampled training set 1: 2, 3, 3, 5, 6, 1, 8, 10, 9, 1
Resampled training set 2: 1, 1, 5, 6, 3, 8, 9, 10, 2, 7
Resampled training set 3: 1, 5, 8, 9, 2, 10, 9, 7, 5, 4

编程需要懂一点英语

Bagging 分类器的算法：

Classifier generation:

Let N be the size of the training set.
for each of t iterations:
    sample N instances with replacement from the original training set.
    apply the learning algorithm to the sample.
    store the resulting classifier.

Classification:
for each of the t classifiers:
    predict class of instance using classifier.
return class that was predicted most often.

以下是上述算法的Python实现：

from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
  
# load the data
url = "/home/debomit/Downloads/wine_data.xlsx"
dataframe = pd.read_excel(url)
arr = dataframe.values
X = arr[:, 1:14]
Y = arr[:, 0]
  
seed = 8
kfold = model_selection.KFold(n_splits = 3,
                       random_state = seed)
  
# initialize the base classifier
base_cls = DecisionTreeClassifier()
  
# no. of base classifier
num_trees = 500
  
# bagging classifier
model = BaggingClassifier(base_estimator = base_cls,
                          n_estimators = num_trees,
                          random_state = seed)
  
results = model_selection.cross_val_score(model, X, Y, cv = kfold)
print("accuracy :")
print(results.mean())

输出：

accuracy :
0.8372093023255814