Scikit-Learn - Decision Trees



In this chapter, we will learn about the learning method in Sklearn called decision trees.

Decision Trees (DTs) are among the most powerful non-parametric supervised learning methods. They can be used for both classification and regression tasks. The main goal of a DT is to create a model predicting the value of a target variable by learning simple decision rules inferred from the data features. A decision tree has two main entities: the root node, where the data is split, and the decision nodes or leaves, where the final output is obtained.
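To make the idea of the root node, decision nodes and leaves concrete, here is a minimal sketch (with made-up toy data) that fits a small tree and prints the decision rules it has learned −

from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up toy dataset: two features per sample, binary class labels
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]

clf = DecisionTreeClassifier().fit(X, y)

# export_text prints the learned rules, from the root split down to the leaves
print(export_text(clf, feature_names=['feature_0', 'feature_1']))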

Decision Tree Algorithms

The different decision tree algorithms are explained below −

ID3

It was developed by Ross Quinlan in 1986. It is also called Iterative Dichotomiser 3. The main goal of this algorithm is to find, for every node, the categorical feature that will yield the largest information gain for the categorical target.

It lets the tree grow to its maximum size and then, to improve the tree's ability on unseen data, applies a pruning step. The output of this algorithm is a multiway tree.
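To illustrate the information-gain criterion that ID3 relies on, here is a small sketch (with made-up labels) that computes the entropy of a target before and after a candidate split −

import numpy as np

def entropy(labels):
    # Shannon entropy of a sequence of class labels
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

# Made-up target labels before the split ...
parent = ['yes', 'yes', 'yes', 'no', 'no', 'no']
# ... and after splitting on some categorical feature
left, right = ['yes', 'yes', 'yes'], ['no', 'no', 'no']

# Information gain = parent entropy minus the weighted entropy of the children
n = len(parent)
gain = entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
print(gain)   # 1.0 here, i.e. a perfect split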

C4.5

It is the successor to ID3 and dynamically defines a discrete attribute that partitions the continuous attribute values into a discrete set of intervals. That is why it removes the restriction to categorical features. It converts the ID3-trained tree into sets of "IF-THEN" rules.

To determine the sequence in which these rules should be applied, the accuracy of each rule is evaluated first.

C5.0

It works in a similar way to C4.5 but uses less memory and builds smaller rule sets. It is also more accurate than C4.5.

CART

It is known as the Classification and Regression Trees algorithm. It basically generates binary splits by using the feature and threshold that yield the largest information gain at each node, measured here by the Gini index.

Homogeneity depends upon the Gini index: the lower the value of the Gini index, the higher the homogeneity. CART is quite similar to the C4.5 algorithm, but it differs in that it supports numerical target variables (regression) and does not compute rule sets.
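As a quick illustration of the Gini measure, here is a sketch with made-up label counts (not taken from the original text) −

import numpy as np

def gini(labels):
    # Gini impurity: 0.0 for a perfectly homogeneous (pure) node
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return 1.0 - np.sum(probs ** 2)

print(gini(['a', 'a', 'a', 'a']))   # 0.0 -> completely homogeneous node
print(gini(['a', 'a', 'b', 'b']))   # 0.5 -> maximally mixed binary node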

Classification with Decision Trees

In this case, the decision variables are categorical.

Sklearn Module − The Scikit-learn library provides the module name DecisionTreeClassifier for performing multiclass classification on a dataset.

Parameters

The following list consists of the parameters used by the sklearn.tree.DecisionTreeClassifier module −

1. criterion − string, optional default= "gini"

It represents the function to measure the quality of a split. Supported criteria are "gini" and "entropy". The default is "gini", which stands for Gini impurity, while "entropy" is for information gain.

2. splitter − string, optional default= "best"

It tells the model which strategy, "best" or "random", to choose the split at each node.

3. max_depth − int or None, optional default=None

This parameter decides the maximum depth of the tree. The default value is None, which means the nodes will expand until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

4. min_samples_split − int, float, optional default=2

This parameter provides the minimum number of samples required to split an internal node.

5. min_samples_leaf − int, float, optional default=1

This parameter provides the minimum number of samples required to be at a leaf node.

6. min_weight_fraction_leaf − float, optional default=0.

With this parameter, the model will get the minimum weighted fraction of the sum of weights required to be at a leaf node.

7. max_features − int, float, string or None, optional default=None

It gives the model the number of features to be considered when looking for the best split.

8. random_state − int, RandomState instance or None, optional default=None

This parameter represents the seed of the pseudo-random number generator used while shuffling the data. The options are as follows −

  • int − In this case, random_state is the seed used by the random number generator.

  • RandomState instance − In this case, random_state is the random number generator.

  • None − In this case, the random number generator is the RandomState instance used by np.random.

9. max_leaf_nodes − int or None, optional default=None

This parameter will grow a tree with max_leaf_nodes in best-first fashion. The default is None, which means there would be an unlimited number of leaf nodes.

10. min_impurity_decrease − float, optional default=0.

This value works as a criterion for a node to split, because the model will split a node if the split induces a decrease of the impurity greater than or equal to the min_impurity_decrease value.

11. min_impurity_split − float, default=1e-7

It represents the threshold for early stopping in tree growth.

12. class_weight − dict, list of dicts, "balanced" or None, default=None

It represents the weights associated with classes, in the form {class_label: weight}. If we use the default option, it means all the classes are supposed to have weight one. On the other hand, if you choose class_weight="balanced", it will use the values of y to automatically adjust weights.

13. presort − bool, optional default=False

It tells the model whether to presort the data to speed up the finding of the best splits during fitting. The default is False, but if set to True, it may slow down the training process.
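The following short sketch (with toy data made up for illustration) shows how a few of these parameters can be combined when constructing the classifier −

from sklearn.tree import DecisionTreeClassifier

# Made-up toy data
X = [[0, 0], [1, 1], [0, 1], [1, 0], [2, 2], [3, 3]]
y = [0, 1, 0, 1, 1, 1]

clf = DecisionTreeClassifier(
    criterion='entropy',      # use information gain instead of Gini impurity
    max_depth=3,              # cap the depth of the tree
    min_samples_leaf=1,       # require at least one sample per leaf
    class_weight='balanced',  # reweight classes inversely to their frequencies
    random_state=0            # make the otherwise random tie-breaking reproducible
)
clf.fit(X, y)
print(clf.predict([[0, 2]]))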

Attributes

The following list consists of the attributes used by the sklearn.tree.DecisionTreeClassifier module −

1. feature_importances_ − array of shape =[n_features]

This attribute will return the feature importances.

2. classes_ − array of shape = [n_classes] or a list of such arrays

It represents the class labels, i.e. a single array for a single-output problem, or a list of arrays of class labels for a multi-output problem.

3. max_features_ − int

It represents the deduced value of the max_features parameter.

4. n_classes_ − int or list

It represents the number of classes for a single-output problem, or a list of the number of classes for every output for a multi-output problem.

5. n_features_ − int

It gives the number of features when the fit() method is performed.

6. n_outputs_ − int

It gives the number of outputs when the fit() method is performed.
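A brief sketch (reusing made-up toy data) that inspects some of these attributes after fitting −

from sklearn.tree import DecisionTreeClassifier

# Made-up toy data
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = ['no', 'yes', 'no', 'yes']

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

print(clf.feature_importances_)   # relative importance of each feature
print(clf.classes_)               # the class labels, here ['no' 'yes']
print(clf.max_features_)          # deduced value of the max_features parameter
print(clf.n_classes_)             # 2
print(clf.n_outputs_)             # 1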

Methods

The following list consists of the methods used by the sklearn.tree.DecisionTreeClassifier module −

1. apply(self, X[, check_input])

This method will return the index of the leaf that each sample ends up in.

2. decision_path(self, X[, check_input])

As the name suggests, this method will return the decision path in the tree.

3. fit(self, X, y[, sample_weight, …])

The fit() method will build a decision tree classifier from the given training set (X, y).

4. get_depth(self)

As the name suggests, this method will return the depth of the decision tree.

5. get_n_leaves(self)

As the name suggests, this method will return the number of leaves of the decision tree.

6. get_params(self[, deep])

We can use this method to get the parameters for the estimator.

7. predict(self, X[, check_input])

It will predict the class value for X.

8. predict_log_proba(self, X)

It will predict the class log-probabilities of the input samples, X, provided by us.

9. predict_proba(self, X[, check_input])

It will predict the class probabilities of the input samples, X, provided by us.

10. score(self, X, y[, sample_weight])

As the name implies, the score() method will return the mean accuracy on the given test data and labels.

11. set_params(self, **params)

We can set the parameters of the estimator with this method.
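The sketch below (again with made-up toy data) exercises a few of these methods on a fitted classifier −

from sklearn.tree import DecisionTreeClassifier

# Made-up toy data
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

print(clf.get_depth())        # depth of the fitted tree
print(clf.get_n_leaves())     # number of leaves
print(clf.apply([[0, 0]]))    # index of the leaf this sample falls into
print(clf.score(X, y))        # mean accuracy, here measured on the training data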

Implementation Example

The Python script below will use the sklearn.tree.DecisionTreeClassifier module to construct a classifier predicting male or female from our dataset, which has 25 samples and two features, namely 'height' and 'length of hair' −

from sklearn import tree
from sklearn.model_selection import train_test_split

# 25 samples with two features each: height and length of hair
X = [[165, 19], [175, 32], [136, 35], [174, 65], [141, 28], [176, 15],
     [131, 32], [166, 6], [128, 32], [179, 10], [136, 34], [186, 2],
     [126, 25], [176, 28], [112, 38], [169, 9], [171, 36], [116, 25],
     [196, 25], [196, 38], [126, 40], [197, 20], [150, 25], [140, 32], [136, 35]]
Y = ['Man', 'Woman', 'Woman', 'Man', 'Woman', 'Man', 'Woman', 'Man',
     'Woman', 'Man', 'Woman', 'Man', 'Woman', 'Woman', 'Woman', 'Man',
     'Woman', 'Woman', 'Man', 'Woman', 'Woman', 'Man', 'Man', 'Woman', 'Woman']
data_feature_names = ['height', 'length of hair']

# Hold out 30% of the data (not used further in this example)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

DTclf = tree.DecisionTreeClassifier()
DTclf = DTclf.fit(X, Y)   # fit on the full dataset, as the prediction below assumes
prediction = DTclf.predict([[135, 29]])
print(prediction)

Output

['Woman']

We can also predict the probability of each class by using the predict_proba() method as follows −

prediction = DTclf.predict_proba([[135,29]])
print(prediction)

Output

[[0. 1.]]

Regression with Decision Trees

In this case, the decision variables are continuous.

Sklearn Module − The Scikit-learn library provides the module name DecisionTreeRegressor for applying decision trees to regression problems.

Parameters

The parameters used by DecisionTreeRegressor are almost the same as those used in the DecisionTreeClassifier module. The difference lies in the 'criterion' parameter. For the DecisionTreeRegressor module, criterion − string, optional default= "mse" − can take the following values −

  • mse − It stands for the mean squared error. It is equal to variance reduction as a feature selection criterion. It minimises the L2 loss using the mean of each terminal node.

  • friedman_mse − It also uses mean squared error, but with Friedman's improvement score.

  • mae − It stands for the mean absolute error. It minimises the L1 loss using the median of each terminal node.

Another difference is that it does not have the 'class_weight' parameter.
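As a quick sketch of the criterion options described above (note that recent scikit-learn releases have renamed these values, e.g. "mse" to "squared_error", so this assumes a version where the names above are still valid) −

from sklearn.tree import DecisionTreeRegressor

# Made-up toy regression data
X = [[1], [2], [3], [4], [5]]
y = [1.2, 1.9, 3.1, 4.2, 4.8]

# "mae" minimises the L1 loss using the per-leaf median instead of the mean
reg = DecisionTreeRegressor(criterion='mae', max_depth=2, random_state=0)
reg.fit(X, y)
print(reg.predict([[2.5]]))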

Attributes

The attributes of DecisionTreeRegressor are also the same as those of the DecisionTreeClassifier module. The difference is that it does not have the 'classes_' and 'n_classes_' attributes.

Methods

The methods of DecisionTreeRegressor are also the same as those of the DecisionTreeClassifier module. The difference is that it does not have the 'predict_log_proba()' and 'predict_proba()' methods.

Implementation Example

The fit() method in a decision tree regression model will take floating-point values of y. Let's see a simple implementation example using Sklearn.tree.DecisionTreeRegressor −

from sklearn import tree

# A tiny training set: two samples with two features each, and float targets
X = [[1, 1], [5, 5]]
y = [0.1, 1.5]
DTreg = tree.DecisionTreeRegressor()
DTreg = DTreg.fit(X, y)

Once fitted, we can use this regression model to make predictions as follows −

DTreg.predict([[4, 5]])

Output

array([1.5])