📜  R 编程中的套索回归

📅  最后修改于: 2022-05-13 01:55:30.075000             🧑  作者: Mango

R 编程中的套索回归

Lasso 回归是一种在简单稀疏模型(即参数较少的模型)中使用收缩的分类算法。在收缩中,数据值向中心点收缩,如平均值。 Lasso 回归是一种正则化回归算法,它执行 L1 正则化,其增加的惩罚等于系数大小的绝对值。

“LASSO”代表最小绝对收缩和选择算子。 Lasso 回归适用于显示高水平多重共线性的模型,或者当您想要自动化模型选择的某些部分时,即变量选择或参数消除。 Lasso 回归解决方案是二次规划问题,可以最好地使用 RStudio、Matlab 等软件解决。它具有选择预测变量的能力。

套索回归

该算法最小化有约束的平方和。一些Beta缩小到零,从而导致回归模型。调整参数lambda控制 L1 正则化惩罚的强度。 lambda基本上是收缩量:

  • lambda = 0 时,不消除任何参数。
  • 随着lambda的增加,越来越多的系数被设置为零并消除和偏差增加。
  • lambda = infinity 时,所有系数都被消除。
  • 随着lambda减小,方差增加。

此外,如果模型中包含截距,则保持不变。现在让我们在 R 编程中实现 Lasso 回归。

R中的实现

数据集

Big Mart 数据集包含不同城市 10 家商店的 1559 种产品。每个产品和商店的某些属性已被定义。它由 12 个特征组成,即 Item_Identifier(是分配给每个不同项目的唯一产品 ID)、Item_Weight(包括产品的重量)、Item_Fat_Content(描述产品是否低脂)、Item_Visibility(提到产品的百分比)分配给特定产品的商店中所有产品的总展示面积),Item_Type(描述该项目所属的食品类别),Item_MRP(产品的最高零售价格(标价)),Outlet_Identifier(分配的唯一商店ID。它由一个长度为 6 的字母数字字符串组成,Outlet_Establishment_Year(提及商店成立的年份),Outlet_Size(根据所覆盖的地面面积告诉商店的大小),Outlet_Location_Type(讲述所在城市的大小)商店所在的位置)、Outlet_Type(说明该商店是杂货店还是某种超市)和 Item_Outlet_Sales(特定商店中产品的销售额)。

R
# Loading data
train = fread("Train_UWu5bXk.csv")
test = fread("Test_u94Q5KV.csv")
   
# Structure
str(train)


R
# Installing Packages
install.packages("data.table")
install.packages("dplyr")
install.packages("glmnet")
install.packages("ggplot2")
install.packages("caret")
install.packages("xgboost")
install.packages("e1071")
install.packages("cowplot")
 
# load packages
library(data.table) # used for reading and manipulation of data
library(dplyr)      # used for data manipulation and joining
library(glmnet)     # used for regression
library(ggplot2)    # used for ploting
library(caret)      # used for modeling
library(xgboost)    # used for building XGBoost model
library(e1071)      # used for skewness
library(cowplot)    # used for combining multiple plots
 
# Loading datasets
train = fread("Train_UWu5bXk.csv")
test = fread("Test_u94Q5KV.csv")
 
# Setting test dataset
# Combining datasets
# add Item_Outlet_Sales to test data
test[, Item_Outlet_Sales := NA]
 
combi = rbind(train, test)
 
# Missing Value Treatment
missing_index = which(is.na(combi$Item_Weight))
for(i in missing_index)
{
  item = combi$Item_Identifier[i]
  combi$Item_Weight[i] =
      mean(combi$Item_Weight[combi$Item_Identifier == item],
                                                 na.rm = T)
}
 
# Replacing 0 in Item_Visibility with mean
zero_index = which(combi$Item_Visibility == 0)
for(i in zero_index)
{
  item = combi$Item_Identifier[i]
  combi$Item_Visibility[i] =
      mean(combi$Item_Visibility[combi$Item_Identifier == item],
                                                     na.rm = T)
}
 
# Label Encoding
# To convert categorical in numerical
combi[, Outlet_Size_num := ifelse(Outlet_Size == "Small", 0,
                                 ifelse(Outlet_Size == "Medium",
                                                        1, 2))]
 
combi[, Outlet_Location_Type_num :=
         ifelse(Outlet_Location_Type == "Tier 3", 0,
                ifelse(Outlet_Location_Type == "Tier 2", 1, 2))]
 
combi[, c("Outlet_Size", "Outlet_Location_Type") := NULL]
 
# One Hot Encoding
# To convert categorical in numerical
ohe_1 = dummyVars("~.", data = combi[, -c("Item_Identifier",
                                          "Outlet_Establishment_Year",
                                          "Item_Type")], fullRank = T)
ohe_df = data.table(predict(ohe_1, combi[, -c("Item_Identifier",
                                           "Outlet_Establishment_Year",
                                           "Item_Type")]))
 
combi = cbind(combi[, "Item_Identifier"], ohe_df)
 
# Remove skewness
skewness(combi$Item_Visibility)
skewness(combi$price_per_unit_wt)
 
# log + 1 to avoid division by zero
combi[, Item_Visibility := log(Item_Visibility + 1)]
 
# Scaling and Centering data
num_vars = which(sapply(combi, is.numeric)) # index of numeric features
num_vars_names = names(num_vars)
 
combi_numeric = combi[, setdiff(num_vars_names,
                              "Item_Outlet_Sales"),
                               with = F]
 
prep_num = preProcess(combi_numeric,
                      method=c("center", "scale"))
combi_numeric_norm = predict(prep_num, combi_numeric)
 
# removing numeric independent variables
combi[, setdiff(num_vars_names,
                "Item_Outlet_Sales") := NULL]
combi = cbind(combi, combi_numeric_norm)
 
# splitting data back to train and test
train = combi[1:nrow(train)]
test = combi[(nrow(train) + 1):nrow(combi)]
 
# Removing Item_Outlet_Sales
test[, Item_Outlet_Sales := NULL]
 
# Model Building :Lasso Regression
set.seed(123)
control = trainControl(method ="cv", number = 5)
Grid_la_reg = expand.grid(alpha = 1,
              lambda = seq(0.001, 0.1, by = 0.0002))
 
# Training lasso regression model
lasso_model = train(x = train[, -c("Item_Identifier",
                               "Item_Outlet_Sales")],
                    y = train$Item_Outlet_Sales,
                    method = "glmnet",
                    trControl = control,
                    tuneGrid = Grid_reg
                    )
lasso_model
 
# mean validation score
mean(lasso_model$resample$RMSE)
 
# Plot
plot(lasso_model, main = "Lasso Regression")


输出:

输出

对数据集执行 Lasso 回归

在数据集上使用 Lasso 回归算法,该数据集包括 12 个特征和 1559 种产品,分布在不同城市的 10 家商店。

R

# Installing Packages
install.packages("data.table")
install.packages("dplyr")
install.packages("glmnet")
install.packages("ggplot2")
install.packages("caret")
install.packages("xgboost")
install.packages("e1071")
install.packages("cowplot")
 
# load packages
library(data.table) # used for reading and manipulation of data
library(dplyr)      # used for data manipulation and joining
library(glmnet)     # used for regression
library(ggplot2)    # used for ploting
library(caret)      # used for modeling
library(xgboost)    # used for building XGBoost model
library(e1071)      # used for skewness
library(cowplot)    # used for combining multiple plots
 
# Loading datasets
train = fread("Train_UWu5bXk.csv")
test = fread("Test_u94Q5KV.csv")
 
# Setting test dataset
# Combining datasets
# add Item_Outlet_Sales to test data
test[, Item_Outlet_Sales := NA]
 
combi = rbind(train, test)
 
# Missing Value Treatment
missing_index = which(is.na(combi$Item_Weight))
for(i in missing_index)
{
  item = combi$Item_Identifier[i]
  combi$Item_Weight[i] =
      mean(combi$Item_Weight[combi$Item_Identifier == item],
                                                 na.rm = T)
}
 
# Replacing 0 in Item_Visibility with mean
zero_index = which(combi$Item_Visibility == 0)
for(i in zero_index)
{
  item = combi$Item_Identifier[i]
  combi$Item_Visibility[i] =
      mean(combi$Item_Visibility[combi$Item_Identifier == item],
                                                     na.rm = T)
}
 
# Label Encoding
# To convert categorical in numerical
combi[, Outlet_Size_num := ifelse(Outlet_Size == "Small", 0,
                                 ifelse(Outlet_Size == "Medium",
                                                        1, 2))]
 
combi[, Outlet_Location_Type_num :=
         ifelse(Outlet_Location_Type == "Tier 3", 0,
                ifelse(Outlet_Location_Type == "Tier 2", 1, 2))]
 
combi[, c("Outlet_Size", "Outlet_Location_Type") := NULL]
 
# One Hot Encoding
# To convert categorical in numerical
ohe_1 = dummyVars("~.", data = combi[, -c("Item_Identifier",
                                          "Outlet_Establishment_Year",
                                          "Item_Type")], fullRank = T)
ohe_df = data.table(predict(ohe_1, combi[, -c("Item_Identifier",
                                           "Outlet_Establishment_Year",
                                           "Item_Type")]))
 
combi = cbind(combi[, "Item_Identifier"], ohe_df)
 
# Remove skewness
skewness(combi$Item_Visibility)
skewness(combi$price_per_unit_wt)
 
# log + 1 to avoid division by zero
combi[, Item_Visibility := log(Item_Visibility + 1)]
 
# Scaling and Centering data
num_vars = which(sapply(combi, is.numeric)) # index of numeric features
num_vars_names = names(num_vars)
 
combi_numeric = combi[, setdiff(num_vars_names,
                              "Item_Outlet_Sales"),
                               with = F]
 
prep_num = preProcess(combi_numeric,
                      method=c("center", "scale"))
combi_numeric_norm = predict(prep_num, combi_numeric)
 
# removing numeric independent variables
combi[, setdiff(num_vars_names,
                "Item_Outlet_Sales") := NULL]
combi = cbind(combi, combi_numeric_norm)
 
# splitting data back to train and test
train = combi[1:nrow(train)]
test = combi[(nrow(train) + 1):nrow(combi)]
 
# Removing Item_Outlet_Sales
test[, Item_Outlet_Sales := NULL]
 
# Model Building :Lasso Regression
set.seed(123)
control = trainControl(method ="cv", number = 5)
Grid_la_reg = expand.grid(alpha = 1,
              lambda = seq(0.001, 0.1, by = 0.0002))
 
# Training lasso regression model
lasso_model = train(x = train[, -c("Item_Identifier",
                               "Item_Outlet_Sales")],
                    y = train$Item_Outlet_Sales,
                    method = "glmnet",
                    trControl = control,
                    tuneGrid = Grid_reg
                    )
lasso_model
 
# mean validation score
mean(lasso_model$resample$RMSE)
 
# Plot
plot(lasso_model, main = "Lasso Regression")

输出:

  • 模型套索模型:

输出

Lasso 回归模型使用 alpha 值作为 1 和 lambda 值作为 0.1。 RMSE用于使用最小值选择最佳模型。

  • 平均验证分数:

输出

该模型的平均验证分数为 1128.869。

  • 阴谋:

输出图

正则化参数增加,RMSE 保持不变。