📜  Scikit Learn - Stochastic Gradient Descent

📅  Last modified: 2020-12-10 05:50:08             🧑  Author: Mango


Here, we will learn about an optimization algorithm in Sklearn termed Stochastic Gradient Descent (SGD).

Stochastic Gradient Descent (SGD) is a simple yet efficient optimization algorithm used to find the values of parameters/coefficients of functions that minimize a cost function. In other words, it is used for discriminative learning of linear classifiers under convex loss functions such as SVM and Logistic regression. It has been applied successfully to large-scale datasets because the update to the coefficients is performed for each training instance, rather than at the end of all instances.
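
To make the per-instance update concrete, the following is a minimal sketch of an SGD update loop for a linear model with squared loss and an L2 penalty. It is only an illustration under these assumptions, not Scikit-learn's internal implementation; the names (sgd_fit, learning_rate, reg_strength) are chosen here for readability −

import numpy as np
def sgd_fit(X, y, learning_rate=0.01, reg_strength=0.0001, epochs=10, seed=0):
   # Plain SGD for a linear model with squared loss and an L2 penalty.
   rng = np.random.RandomState(seed)
   n_samples, n_features = X.shape
   weights = np.zeros(n_features)
   bias = 0.0
   for _ in range(epochs):
      # Visit the samples in a fresh random order each epoch (cf. the 'shuffle' parameter).
      for i in rng.permutation(n_samples):
         error = (X[i] @ weights + bias) - y[i]
         # The coefficients are updated after every single training instance,
         # not once per full pass over the data.
         weights -= learning_rate * (error * X[i] + reg_strength * weights)
         bias -= learning_rate * error
   return weights, bias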

SGD Classifier

The Stochastic Gradient Descent (SGD) classifier basically implements a plain SGD learning routine supporting various loss functions and penalties for classification. Scikit-learn provides the SGDClassifier module to implement SGD classification.

Parameters

The following table consists of the parameters used by the SGDClassifier module; a short configuration sketch using several of them follows the table −

Sr.No   Parameter & Description

1. loss − str, default = 'hinge'

It represents the loss function to be used while implementing. The default value is 'hinge', which will give us a linear SVM. The other options which can be used are −

  • log − This loss will give us logistic regression, i.e. a probabilistic classifier.

  • modified_huber − a smooth loss that brings tolerance to outliers along with probability estimates.

  • squared_hinge − similar to 'hinge' loss but it is quadratically penalized.

  • perceptron − as the name suggests, it is a linear loss which is used by the perceptron algorithm.

2. penalty − str, 'none', 'l2', 'l1', 'elasticnet', default = 'l2'

It is the regularization term used in the model. By default, it is L2. We can use L1 or 'elasticnet' as well, but both might bring sparsity to the model, which is not achievable with L2.

3. alpha − float, default = 0.0001

Alpha, the constant that multiplies the regularization term, is the tuning parameter that decides how much we want to penalize the model. The default value is 0.0001.

4. l1_ratio − float, default = 0.15

This is called the ElasticNet mixing parameter. Its range is 0 <= l1_ratio <= 1. If l1_ratio = 1, the penalty would be an L1 penalty. If l1_ratio = 0, the penalty would be an L2 penalty.

5. fit_intercept − Boolean, default = True

This parameter specifies that a constant (bias or intercept) should be added to the decision function. If it is set to False, no intercept will be used in the calculation and the data will be assumed to be already centered.

6. tol − float or None, optional, default = 1e-3

This parameter represents the stopping criterion for iterations. Its default value is 1e-3; if it is not None, training will stop when loss > best_loss − tol for n_iter_no_change successive epochs.

7. shuffle − Boolean, optional, default = True

This parameter indicates whether we want our training data to be shuffled after each epoch or not.

8. verbose − integer, default = 0

It represents the verbosity level. Its default value is 0.

9. epsilon − float, default = 0.1

This parameter specifies the width of the insensitive region. If loss = 'epsilon_insensitive', any difference between the current prediction and the correct label that is less than this threshold would be ignored.

10. max_iter − int, optional, default = 1000

As the name suggests, it represents the maximum number of passes over the training data (i.e. epochs).

11. warm_start − bool, optional, default = False

With this parameter set to True, we can reuse the solution of the previous call to fit as initialization. If we choose the default, i.e. False, it will erase the previous solution.

12. random_state − int, RandomState instance or None, optional, default = None

This parameter represents the seed of the pseudo-random number generator used while shuffling the data. Following are the options −

  • int − In this case, random_state is the seed used by the random number generator.

  • RandomState instance − In this case, random_state is the random number generator.

  • None − In this case, the random number generator is the RandomState instance used by np.random.

13. n_jobs − int or None, optional, default = None

It represents the number of CPUs to be used in the OVA (One Versus All) computation for multi-class problems. The default value is None, which means 1.

14. learning_rate − string, optional, default = 'optimal'

  • If learning rate is 'constant', eta = eta0;

  • If learning rate is 'optimal', eta = 1.0/(alpha*(t+t0)), where t0 is chosen by Leon Bottou;

  • If learning rate is 'invscaling', eta = eta0/pow(t, power_t);

  • If learning rate is 'adaptive', eta = eta0.

15. eta0 − double, default = 0.0

It represents the initial learning rate for the above mentioned learning rate options, i.e. 'constant', 'invscaling', or 'adaptive'.

16. power_t − double, default = 0.5

It is the exponent for the 'invscaling' learning rate.

17. early_stopping − bool, default = False

This parameter represents the use of early stopping to terminate training when the validation score is not improving. Its default value is False, but when set to True, it automatically sets aside a stratified fraction of the training data as a validation set and stops training when the validation score is not improving.

18. validation_fraction − float, default = 0.1

It is only used when early_stopping is True. It represents the proportion of training data to set aside as a validation set for early termination of training.

19. n_iter_no_change − int, default = 5

It represents the number of iterations with no improvement that the algorithm should run before early stopping.

20. class_weight − dict, {class_label: weight} or "balanced", or None, optional

This parameter represents the weights associated with classes. If not provided, the classes are supposed to have weight 1.

21. average − Boolean or int, optional, default = False

When set to True, it computes the averaged SGD weights and stores the result in the coef_ attribute. If set to an integer greater than 1, averaging will begin once the total number of samples seen reaches that number.
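
As a quick illustration of how several of these parameters fit together, the following is a minimal sketch (the toy data is made up for illustration) that configures SGDClassifier with a logistic-regression loss, an elastic-net penalty, and early stopping. Note that recent Scikit-learn releases spell the 'log' loss as 'log_loss' −

from sklearn.linear_model import SGDClassifier
import numpy as np
rng = np.random.RandomState(0)
# Toy two-class data, purely illustrative
X = np.r_[rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]]
Y = np.r_[np.ones(50), np.zeros(50)]
clf = SGDClassifier(
   loss = 'log',               # logistic regression, enables predict_proba ('log_loss' on newer releases)
   penalty = 'elasticnet',     # mix of L1 and L2, controlled by l1_ratio
   l1_ratio = 0.15,
   alpha = 0.0001,             # strength of the regularization term
   max_iter = 1000, tol = 1e-3,
   early_stopping = True, validation_fraction = 0.2, n_iter_no_change = 5,
   random_state = 0
)
clf.fit(X, Y)
print(clf.predict_proba([[2., 2.]]))   # class probabilities, available with the logistic loss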

Attributes

The following table consists of the attributes used by the SGDClassifier module −

Sr.No   Attributes & Description

1. coef_ − array, shape (1, n_features) if n_classes == 2, else (n_classes, n_features)

This attribute provides the weights assigned to the features.

2. intercept_ − array, shape (1,) if n_classes == 2, else (n_classes,)

It represents the independent term in the decision function.

3. n_iter_ − int

It gives the number of iterations needed to reach the stopping criterion.

Implementation Example

Like other classifiers, Stochastic Gradient Descent (SGD) has to be fitted with the following two arrays −

  • An array X holding the training samples. It is of size [n_samples, n_features].

  • An array Y holding the target values, i.e. class labels for the training samples. It is of size [n_samples].

The following Python script uses the SGDClassifier linear model −

import numpy as np
from sklearn import linear_model
# Training samples (X) and their class labels (Y)
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])
# Fit an SGD classifier with an elastic-net penalty
SGDClf = linear_model.SGDClassifier(max_iter = 1000, tol = 1e-3, penalty = "elasticnet")
SGDClf.fit(X, Y)

Output

SGDClassifier(
   alpha = 0.0001, average = False, class_weight = None,
   early_stopping = False, epsilon = 0.1, eta0 = 0.0, fit_intercept = True,
   l1_ratio = 0.15, learning_rate = 'optimal', loss = 'hinge', max_iter = 1000,
   n_iter = None, n_iter_no_change = 5, n_jobs = None, penalty = 'elasticnet',
   power_t = 0.5, random_state = None, shuffle = True, tol = 0.001,
   validation_fraction = 0.1, verbose = 0, warm_start = False
)

Now, once fitted, the model can predict new values as follows −

SGDClf.predict([[2.,2.]])

Output

array([2])

For the above example, we can get the weight vector with the help of the following Python script −

SGDClf.coef_

Output

array([[19.54811198, 9.77200712]])

Similarly, we can get the value of the intercept with the help of the following Python script −

SGDClf.intercept_

Output

array([10.])

We can get the signed distance to the hyperplane by using SGDClassifier.decision_function, as used in the following Python script −

SGDClf.decision_function([[2., 2.]])

Output

array([68.6402382])
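
As a sanity check on the fitted model above, the signed distance returned by decision_function is simply the linear function defined by coef_ and intercept_ −

np.dot(SGDClf.coef_, [2., 2.]) + SGDClf.intercept_   # 19.54811198*2 + 9.77200712*2 + 10.0 = 68.6402382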

SGD Regressor

The Stochastic Gradient Descent (SGD) regressor basically implements a plain SGD learning routine supporting various loss functions and penalties to fit linear regression models. Scikit-learn provides the SGDRegressor module to implement SGD regression.

Parameters

The parameters used by SGDRegressor are almost the same as those used in the SGDClassifier module. The difference lies in the 'loss' parameter. For the loss parameter of the SGDRegressor module, the possible values are as follows (a short sketch follows the list) −

  • squared_loss − It refers to the ordinary least squares fit.

  • huber − It corrects for outliers by switching from squared to linear loss past a distance of epsilon. The job of 'huber' is to modify 'squared_loss' so that the algorithm focuses less on correcting outliers.

  • epsilon_insensitive − Actually, it ignores errors smaller than epsilon.

  • squared_epsilon_insensitive − Same as epsilon_insensitive. The only difference is that it becomes squared loss past a tolerance of epsilon.
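
The following is a minimal sketch of selecting these loss settings; the toy data and variable names are illustrative only −

import numpy as np
from sklearn.linear_model import SGDRegressor
rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = X @ [1.0, -2.0, 0.5] + 0.1 * rng.randn(20)
# The same estimator with different regression losses; 'epsilon' controls the
# insensitive region / the huber switch-over distance.
# Note: newer Scikit-learn releases spell 'squared_loss' as 'squared_error'.
ols_like = SGDRegressor(loss = 'squared_loss', max_iter = 1000, tol = 1e-3)
robust = SGDRegressor(loss = 'huber', epsilon = 0.1, max_iter = 1000, tol = 1e-3)
svr_like = SGDRegressor(loss = 'epsilon_insensitive', epsilon = 0.1, max_iter = 1000, tol = 1e-3)
for model in (ols_like, robust, svr_like):
   model.fit(X, y)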

Another difference is that the parameter named 'power_t' has a default value of 0.25, rather than 0.5 as in SGDClassifier. Furthermore, it doesn't have the 'class_weight' and 'n_jobs' parameters.

Attributes

The attributes of SGDRegressor are also the same as those of the SGDClassifier module. In addition, it has three extra attributes as follows −

  • average_coef_ − array, shape (n_features,)

As the name suggests, it provides the average weights assigned to the features.

  • average_intercept_ − array, shape (1,)

As the name suggests, it provides the average intercept term.

  • t_ − int

It provides the number of weight updates performed during the training phase.

Note − the attributes average_coef_ and average_intercept_ will work after enabling the parameter 'average' to True.

Implementation Example

The following Python script uses the SGDRegressor linear model −

import numpy as np
from sklearn import linear_model
# Generate a small random regression dataset
n_samples, n_features = 10, 5
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)
# Fit an SGD regressor with Huber loss, an elastic-net penalty and weight averaging
SGDReg = linear_model.SGDRegressor(
   max_iter = 1000, penalty = "elasticnet", loss = 'huber', tol = 1e-3, average = True
)
SGDReg.fit(X, y)

Output

SGDRegressor(
   alpha = 0.0001, average = True, early_stopping = False, epsilon = 0.1,
   eta0 = 0.01, fit_intercept = True, l1_ratio = 0.15,
   learning_rate = 'invscaling', loss = 'huber', max_iter = 1000,
   n_iter = None, n_iter_no_change = 5, penalty = 'elasticnet', power_t = 0.25,
   random_state = None, shuffle = True, tol = 0.001, validation_fraction = 0.1,
   verbose = 0, warm_start = False
)

Now, once fitted, we can get the weight vector with the help of the following Python script −

SGDReg.coef_

Output

array([-0.00423314, 0.00362922, -0.00380136, 0.00585455, 0.00396787])

Similarly, we can get the value of the intercept with the help of the following Python script −

SGDReg.intercept_

Output

array([...])

We can get the number of weight updates performed during the training phase (equal to n_iter_ × n_samples + 1) with the help of the following Python script −

SGDReg.t_

Output

61.0
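
Since the model above was fitted with average = True, the averaged weights and intercept mentioned earlier are also available. A small sketch (output values omitted, as they depend on the run) −

SGDReg.average_coef_        # average weights assigned to the features
SGDReg.average_intercept_   # average intercept term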

Pros and Cons of SGD

Following are the pros of SGD −

  • Stochastic Gradient Descent (SGD) is very efficient.

  • It is very easy to implement, as there are lots of opportunities for code tuning.

Following are the cons of SGD −

  • Stochastic Gradient Descent (SGD) requires several hyperparameters, such as the regularization parameters.

  • It is sensitive to feature scaling.