Spark MLlib Tutorial - Python

Last modified: 2023-12-03 15:05:14.938000 | Author: Mango

In this tutorial, we will cover the basics of using Spark's Machine Learning library (MLlib) in Python. We will start with installing Spark and setting up the environment. Then, we will cover some basic machine learning algorithms such as linear regression and logistic regression.

Installation and Setup

To get started with Spark MLlib, you first need to install Apache Spark on your machine by following the instructions on the official website. Because this tutorial uses Python, start the interactive PySpark shell (rather than spark-shell, which opens a Scala shell) from the Spark installation directory:

./bin/pyspark

Loading Data

Before we can start training our machine learning algorithms, we need to load some data. For this tutorial, we will use the iris dataset, which ships with the scikit-learn library. You can install scikit-learn using pip:

pip install -U scikit-learn

Once you have installed scikit-learn, you can load the iris dataset using the following code:

from sklearn.datasets import load_iris
iris = load_iris()

Data Preparation

Before we can start training our machine learning algorithms, we need to prepare our data. The iris dataset contains 150 samples, each with 4 features (sepal length, sepal width, petal length, petal width) and a class label from 0 to 2 for the three iris species. We will hold out 20% of the samples as a test set and train on the remaining 80%.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
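
To make the 80/20 split above more concrete, here is a minimal standard-library sketch of the idea: shuffle the sample indices with a fixed seed, then slice off the test portion. The helper name split_indices is hypothetical, and this is not scikit-learn's actual implementation, so the indices it picks will differ from those chosen by train_test_split.

```python
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Return (train_idx, test_idx): a shuffled, disjoint partition of range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # seeded so the split is reproducible
    n_test = round(n_samples * test_size)  # 150 samples * 0.2 -> 30 held out
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(150)
print(len(train_idx), len(test_idx))  # 120 30
```

Fixing the seed plays the same role as random_state=42 in the call above: running the split again yields the same partition, which keeps results comparable between runs.
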
Linear Regression

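Linear regression is a simple machine learning algorithm that models a numeric dependent variable (the label) as a weighted sum of one or more independent variables (the features) plus an intercept. The iris target is a class index (0, 1, 2), so treating it as a numeric label here serves only to demonstrate the API. To perform linear regression in Spark MLlib, we first need to create a LinearRegression object.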

from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol='features', labelCol='label', predictionCol='prediction')
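
As a rough illustration of what fitting a linear model means, the sketch below computes the ordinary least squares line for a single feature in plain Python. The function fit_line and the toy data are made up for this example; Spark's LinearRegression handles many features and distributed data, and this sketch only mirrors the unregularized single-feature case.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data generated from y = 2x + 1 (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

The slope and intercept recovered here correspond to what a fitted Spark model exposes as its coefficients and intercept.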

Next, we need to create a Spark DataFrame from our training data.

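from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
spark = SparkSession.builder.appName("LinearRegression").getOrCreate()

# MLlib expects the features column to hold vectors and the label to be numeric,
# so convert each NumPy row to a DenseVector and each label to a Python float.
train_data = spark.createDataFrame([(Vectors.dense(x), float(y)) for x, y in zip(X_train, y_train)], schema=["features", "label"])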

Once we have the DataFrame, we can fit the model using the fit() method.

model = lr.fit(train_data)

To make predictions, we need to create another DataFrame from our test data.

from pyspark.ml.linalg import Vectors
test_data = spark.createDataFrame([(Vectors.dense(x), float(y)) for x, y in zip(X_test, y_test)], schema=["features", "label"])

Then, we can use the transform() method to make predictions.

predictions = model.transform(test_data)
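
Once predictions exist, a common next step is to measure how far they are from the true labels. Spark provides RegressionEvaluator in pyspark.ml.evaluation for this; the plain-Python sketch below only shows what a root mean squared error (RMSE) computes. The label and prediction values are invented for illustration and are not output from the model above.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between true labels and predicted values."""
    squared_errors = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))

# Hypothetical labels and predictions, not taken from the iris model.
labels = [0.0, 1.0, 2.0, 2.0]
preds = [0.0, 1.0, 1.0, 3.0]
print(rmse(labels, preds))  # square root of 0.5, about 0.707
```

A lower RMSE means the predictions sit closer to the labels on average, with large errors penalized more heavily than small ones.
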
Logistic Regression

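Logistic regression is a classification algorithm that models the probability of each class. It is most often introduced for a binary outcome, but Spark's LogisticRegression also supports multinomial (softmax) regression, which is what applies to the three iris classes. To perform logistic regression in Spark MLlib, we first need to create a LogisticRegression object.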

from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(featuresCol='features', labelCol='label', predictionCol='prediction')
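
To show how such a model turns linear scores into probabilities, here is a small standard-library sketch of the sigmoid (binary case) and softmax (multiclass case) functions. The scores are made-up numbers, and the function names are chosen for this example; they are not part of the Spark API.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1) for the binary case."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Map one score per class to probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))  # 0.5

# Hypothetical scores for three classes; the highest score gets the highest probability.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
print(predicted_class)  # 0
```

The predicted class is simply the one with the highest probability, which is the kind of value a classifier writes to its prediction column.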

Next, we need to create a Spark DataFrame from our training data.

from pyspark.ml.linalg import Vectors
train_data = spark.createDataFrame([(Vectors.dense(x), float(y)) for x, y in zip(X_train, y_train)], schema=["features", "label"])

Once we have the DataFrame, we can fit the model using the fit() method.

model = lr.fit(train_data)

To make predictions, we need to create another DataFrame from our test data.

from pyspark.ml.linalg import Vectors
test_data = spark.createDataFrame([(Vectors.dense(x), float(y)) for x, y in zip(X_test, y_test)], schema=["features", "label"])

Then, we can use the transform() method to make predictions.

predictions = model.transform(test_data)
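
For a classifier, the simplest quality check is accuracy: the fraction of test samples whose predicted class matches the true label. Spark offers MulticlassClassificationEvaluator in pyspark.ml.evaluation for this; the sketch below only illustrates the computation in plain Python, using invented values rather than output from the model above.

```python
def accuracy(labels, predictions):
    """Fraction of samples whose predicted class equals the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Hypothetical labels and predicted classes, not taken from the iris model.
labels = [0.0, 1.0, 2.0, 2.0, 1.0]
preds = [0.0, 1.0, 1.0, 2.0, 1.0]
print(accuracy(labels, preds))  # 0.8
```

Four of the five hypothetical predictions match their labels, which gives an accuracy of 0.8.
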
Conclusion

In this tutorial, we covered the basics of using Spark's Machine Learning library (MLlib) in Python. We covered how to install Spark, load data, prepare data, and train and make predictions using linear regression and logistic regression algorithms. Spark MLlib provides a powerful platform for distributed machine learning and can be a valuable addition to any data scientist's toolkit.