How to implement gradient descent in Python to find a local minimum?

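Gradient descent is an iterative algorithm that minimizes a function by searching for the optimal parameters. It can be applied to functions of any dimension, i.e. 1-D, 2-D, 3-D. In this article, we will first find the minimum of a parabolic function (a curve in the 2-D plane) and then implement gradient descent in Python to find the optimal parameters of a linear regression equation with a single input variable. Before diving into the implementation, let us identify the set of parameters the gradient descent algorithm needs. To implement it, we need a cost function to minimize, a number of iterations, a learning rate that determines the step size taken toward the minimum at each iteration, the partial derivatives with respect to the weight and bias to update the parameters at each iteration, and a prediction function.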

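So far we have seen the parameters that gradient descent requires. Now let us map these parameters onto the gradient descent algorithm and work through an example to understand it better. Consider the parabola y = 4x². By looking at the equation, we can see that the function is smallest at x = 0, i.e. at x = 0, y = 0. Therefore x = 0 is the local minimum of y = 4x² (and, because the parabola is convex, also its global minimum). Now let us look at the gradient descent algorithm and how applying it leads to the local minimum: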

Gradient descent algorithm

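To find a local minimum, take steps proportional to the negative of the gradient of the function at the current point (moving away from the gradient). Gradient ascent is the counterpart: it approaches a local maximum of a function by taking steps proportional to the positive of the gradient (moving toward the gradient).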

repeat until convergence
{
    w = w - (learning_rate * (dJ/dw))
    b = b - (learning_rate * (dJ/db))
}

Step 1: Initialize all the necessary parameters and derive the gradient function of the parabola 4x². The derivative of x² is 2x, so the derivative of 4x² is 8x.

x₀ = 3 (random initialization of x)

learning_rate = 0.01 (to determine the step size while moving towards local minima)

gradient = $\frac{dy}{dx}=\frac{d}{dx}(4*x^2) = 8*x$ (computing the gradient function)

Step 2: Let us perform 3 iterations of gradient descent:

In each iteration, keep updating the value of x according to the gradient descent formula.

Iteration 1:
    x1 = x0 - (learning_rate * gradient)
    x1 = 3 - (0.01 * (8 * 3))
    x1 = 3 - 0.24
    x1 = 2.76

Iteration 2:
    x2 = x1 - (learning_rate * gradient)
    x2 = 2.76 - (0.01 * (8 * 2.76))
    x2 = 2.76 - 0.2208
    x2 = 2.5392

Iteration 3:
    x3 = x2 - (learning_rate * gradient)
    x3 = 2.5392 - (0.01 * (8 * 2.5392))
    x3 = 2.5392 - 0.203136
    x3 = 2.336064
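
The three hand-computed iterations can be reproduced with a few lines of Python. This is only a minimal sketch: the names `gradient` and `history` are illustrative choices, not part of the article's later implementation, and the values of x0 and learning_rate are the ones from Step 1.

```python
def gradient(x):
    """Derivative of y = 4 * x**2, i.e. 8 * x."""
    return 8 * x


x = 3.0               # x0 from Step 1
learning_rate = 0.01  # step size from Step 1
history = [x]         # keeps x0, x1, x2, x3 for inspection

# Three iterations of the update rule x = x - learning_rate * gradient
for _ in range(3):
    x = x - learning_rate * gradient(x)
    history.append(x)

for i, value in enumerate(history):
    print(f"x{i} = {value:.6f}")
```

With this learning rate each update multiplies x by 1 - 0.01 * 8 = 0.92, which is why the values shrink steadily toward 0.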

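From the three iterations above, we can see that the value of x decreases with each iteration and, as gradient descent runs for more iterations, slowly converges to 0 (the local minimum). Now you may ask: for how many iterations should we run gradient descent?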

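We can set a stopping threshold: when the difference between the previous and the current value of x becomes smaller than the threshold, we stop iterating. In machine learning and deep learning, gradient descent is used to minimize the cost function of the model. Now that the inner workings of gradient descent are clear, let us look at a Python implementation in which we minimize the cost function of the linear regression algorithm and find the best-fit line. In our case, the parameters are as follows: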

Prediction function

The prediction function of the linear regression algorithm is the linear equation y = wx + b.

prediction_function (y) = (w * x) + b
Here, x is the independent variable
      y is the dependent variable
      w is the weight associated with the input variable
      b is the bias

Cost function

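The cost function computes the loss from the predictions made. In linear regression, we use the mean squared error to compute the loss. The mean squared error is the average of the squared differences between the actual and the predicted values.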

Cost function (J) = $(\frac{1}{n}){\sum_{i=1}^{n}(y_{i} - (wx_{i}+b))^{2}}$

Here, n is the number of samples.
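
To make the prediction function and the cost function concrete, here is a small sketch on three made-up data points. The helper names `predict` and `mse` and the sample values are assumptions for illustration only, and plain Python lists are used so the snippet has no dependencies.

```python
def predict(x_values, w, b):
    """Prediction function y = w * x + b applied to every input."""
    return [w * x + b for x in x_values]


def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n


# Made-up sample data (the true relationship here is y = 2x)
x_values = [1.0, 2.0, 3.0]
y_true = [2.0, 4.0, 6.0]

# A deliberately poor guess for the parameters
y_pred = predict(x_values, w=1.0, b=0.0)
cost = mse(y_true, y_pred)

print(y_pred)  # predictions for w = 1, b = 0
print(cost)    # (1 + 4 + 9) / 3
```

The errors are 1, 2 and 3, so the cost is the average of 1, 4 and 9. A better choice of w and b would bring it closer to zero.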

Partial derivatives (gradients)

Using the cost function, compute the partial derivatives with respect to the weight and the bias. We get:

$\frac{dJ}{dw}=(\frac{-2}{n}){\sum_{i=1}^{n}x_i*(y_{i} - (wx_{i}+b))}$

$\frac{dJ}{db}=(\frac{-2}{n}){\sum_{i=1}^{n}(y_{i} - (wx_{i}+b))}$
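
One way to gain confidence in the two formulas above is to compare them with a numerical approximation. The sketch below is only a sanity check under assumed sample data: the function names and the data points are hypothetical, and the numerical gradient uses a central finite difference with a small step `eps`.

```python
def cost(w, b, xs, ys):
    """Mean squared error J for the line y = w * x + b."""
    n = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n


def analytic_gradients(w, b, xs, ys):
    """dJ/dw and dJ/db computed from the formulas in the text."""
    n = len(xs)
    dJ_dw = (-2 / n) * sum(x * (y - (w * x + b)) for x, y in zip(xs, ys))
    dJ_db = (-2 / n) * sum(y - (w * x + b) for x, y in zip(xs, ys))
    return dJ_dw, dJ_db


def numeric_gradients(w, b, xs, ys, eps=1e-6):
    """Central finite-difference approximation of the same derivatives."""
    dJ_dw = (cost(w + eps, b, xs, ys) - cost(w - eps, b, xs, ys)) / (2 * eps)
    dJ_db = (cost(w, b + eps, xs, ys) - cost(w, b - eps, xs, ys)) / (2 * eps)
    return dJ_dw, dJ_db


# Made-up data and an arbitrary point (w, b) at which to compare
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
analytic = analytic_gradients(0.5, 0.1, xs, ys)
numeric = numeric_gradients(0.5, 0.1, xs, ys)

print("analytic:", analytic)
print("numeric: ", numeric)
```

If the two pairs of numbers agree to several decimal places, the derivative formulas and their implementation are very likely consistent with the cost function.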

Parameter update

The weight and bias are updated by subtracting the product of the learning rate and their respective gradients.

w = w - (learning_rate * (dJ/dw))
b = b - (learning_rate * (dJ/db))
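
As an illustration of the update rule, the following sketch performs a single update on made-up data and checks that the cost goes down. The data, the starting values w = 0 and b = 0, and the learning rate of 0.1 are all assumptions chosen to keep the arithmetic easy to follow.

```python
# Made-up data (the true relationship is y = 2x)
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
n = len(xs)

w, b = 0.0, 0.0       # assumed starting parameters
learning_rate = 0.1   # assumed step size for this toy example


def cost(w, b):
    """Mean squared error for the current parameters."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n


cost_before = cost(w, b)

# Gradients of the cost with respect to w and b
residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
dJ_dw = (-2 / n) * sum(x * r for x, r in zip(xs, residuals))
dJ_db = (-2 / n) * sum(residuals)

# One application of the update rule
w = w - learning_rate * dJ_dw
b = b - learning_rate * dJ_db

cost_after = cost(w, b)
print(f"w = {w:.4f}, b = {b:.4f}")
print(f"cost before = {cost_before:.4f}, cost after = {cost_after:.4f}")
```

Both gradients are negative at the starting point, so subtracting them increases w and b and moves the line toward the data.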

Python implementation of gradient descent

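In the implementation, we will write two functions: a cost function that takes the actual and predicted outputs as input and returns the loss, and the gradient descent function itself, which takes the independent variable and the target variable as input and finds the best-fit line using the gradient descent algorithm. The number of iterations, learning_rate, and the stopping threshold are tuning parameters of the gradient descent algorithm that the user can adjust. In the main function, we initialize linearly related data and apply gradient descent to it to find the best-fit line. The optimal weight and bias found by gradient descent are then used in the main function to plot the best-fit line. The number of iterations specifies the maximum number of parameter updates, and the stopping threshold is the minimum change in loss between two consecutive iterations; once the change falls to or below it, gradient descent stops.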

Python3

# Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
  
def mean_squared_error(y_true, y_predicted):
      
    # Calculating the loss or cost
    cost = np.sum((y_true-y_predicted)**2) / len(y_true)
    return cost
  
# Gradient Descent Function
# Here iterations, learning_rate, stopping_threshold
# are hyperparameters that can be tuned
def gradient_descent(x, y, iterations = 1000, learning_rate = 0.0001, 
                     stopping_threshold = 1e-6):
      
    # Initializing weight and bias; n is the number of samples
    current_weight = 0.1
    current_bias = 0.01
    n = float(len(x))
      
    costs = []
    weights = []
    previous_cost = None
      
    # Estimation of optimal parameters 
    for i in range(iterations):
          
        # Making predictions
        y_predicted = (current_weight * x) + current_bias
          
        # Calculating the current cost
        current_cost = mean_squared_error(y, y_predicted)
  
        # If the change in cost is less than or equal to 
        # stopping_threshold we stop the gradient descent
        if previous_cost is not None and abs(previous_cost-current_cost)<=stopping_threshold:
            break
          
        previous_cost = current_cost
  
        costs.append(current_cost)
        weights.append(current_weight)
          
        # Calculating the gradients
        weight_derivative = -(2/n) * sum(x * (y-y_predicted))
        bias_derivative = -(2/n) * sum(y-y_predicted)
          
        # Updating weights and bias
        current_weight = current_weight - (learning_rate * weight_derivative)
        current_bias = current_bias - (learning_rate * bias_derivative)
                  
        # Printing the parameters for every iteration
        print(f"Iteration {i+1}: Cost {current_cost}, "
              f"Weight {current_weight}, Bias {current_bias}")
      
      
    # Visualizing the weights and cost for all iterations
    plt.figure(figsize = (8,6))
    plt.plot(weights, costs)
    plt.scatter(weights, costs, marker='o', color='red')
    plt.title("Cost vs Weights")
    plt.ylabel("Cost")
    plt.xlabel("Weight")
    plt.show()
      
    return current_weight, current_bias
  
  
def main():
      
    # Data
    X = np.array([32.50234527, 53.42680403, 61.53035803, 47.47563963, 59.81320787,
           55.14218841, 52.21179669, 39.29956669, 48.10504169, 52.55001444,
           45.41973014, 54.35163488, 44.1640495 , 58.16847072, 56.72720806,
           48.95588857, 44.68719623, 60.29732685, 45.61864377, 38.81681754])
    Y = np.array([31.70700585, 68.77759598, 62.5623823 , 71.54663223, 87.23092513,
           78.21151827, 79.64197305, 59.17148932, 75.3312423 , 71.30087989,
           55.16567715, 82.47884676, 62.00892325, 75.39287043, 81.43619216,
           60.72360244, 82.89250373, 97.37989686, 48.84715332, 56.87721319])
  
    # Estimating weight and bias using gradient descent
    estimated_weight, estimated_bias = gradient_descent(X, Y, iterations=2000)
    print(f"Estimated Weight: {estimated_weight}\nEstimated Bias: {estimated_bias}")
  
    # Making predictions using estimated parameters
    Y_pred = estimated_weight*X + estimated_bias
  
    # Plotting the regression line
    plt.figure(figsize = (8,6))
    plt.scatter(X, Y, marker='o', color='red')
    plt.plot([min(X), max(X)], [min(Y_pred), max(Y_pred)], color='blue',markerfacecolor='red',
             markersize=10,linestyle='dashed')
    plt.xlabel("X")
    plt.ylabel("Y")
    plt.show()
  
      
if __name__=="__main__":
    main()

Output:

Iteration 1: Cost 4352.088931274409, Weight 0.7593291142562117, Bias 0.02288558130709

Iteration 2: Cost 1114.8561474350017, Weight 1.081602958862324, Bias 0.02918014748569513

Iteration 3: Cost 341.42912086804455, Weight 1.2391274084945083, Bias 0.03225308846928192

Iteration 4: Cost 156.64495290904443, Weight 1.3161239281746984, Bias 0.03375132986012604

Iteration 5: Cost 112.49704004742098, Weight 1.3537591652024805, Bias 0.034479873154934775

Iteration 6: Cost 101.9493925395456, Weight 1.3721549833978113, Bias 0.034832195392868505

Iteration 7: Cost 99.4293893333546, Weight 1.3811467575154601, Bias 0.03500062439068245

Iteration 8: Cost 98.82731958262897, Weight 1.3855419247507244, Bias 0.03507916814736111

Iteration 9: Cost 98.68347500997261, Weight 1.3876903144657764, Bias 0.035113776874486774

Iteration 10: Cost 98.64910780902792, Weight 1.3887405007983562, Bias 0.035126910596389935

Iteration 11: Cost 98.64089651459352, Weight 1.389253895811451, Bias 0.03512954755833985

Iteration 12: Cost 98.63893428729509, Weight 1.38950491235671, Bias 0.035127053821718185

Iteration 13: Cost 98.63846506273883, Weight 1.3896276808137857, Bias 0.035122052266051224

Iteration 14: Cost 98.63835254057648, Weight 1.38968776283053, Bias 0.03511582492978764

Iteration 15: Cost 98.63832524036214, Weight 1.3897172043139192, Bias 0.03510899846107016

Iteration 16: Cost 98.63831830104695, Weight 1.389731668997059, Bias 0.035101879159522745

Iteration 17: Cost 98.63831622628217, Weight 1.389738813163012, Bias 0.03509461674147458

Estimated Weight: 1.389738813163012

Estimated Bias: 0.03509461674147458

Cost function approaching the local minimum

Best-fit line obtained using gradient descent