5 Machine Learning Projects to Implement as a Beginner (1)

Last modified: 2023-12-03 15:36:24.528000             Author: Mango

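For beginners, implementing machine learning projects is one of the best ways to learn and understand machine learning concepts and applications. Here are five projects that are well suited to beginners.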

1. Handwritten digit recognition

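Handwritten digit recognition is a classic problem in machine learning. It introduces the basics of image classification and convolutional neural networks (CNNs). You can use the MNIST dataset and build a CNN model with the Keras library in Python.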

import keras
from keras.datasets import mnist
from keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from keras.models import Sequential

# Data loading and preprocessing
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255.0
x_test /= 255.0
num_classes = 10

# Model building and training
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x_train, keras.utils.to_categorical(y_train, num_classes), batch_size=128, epochs=10, verbose=1,
          validation_data=(x_test, keras.utils.to_categorical(y_test, num_classes)))
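
To make the architecture above easier to follow, the sketch below traces the tensor shapes and trainable parameter counts by hand. The helper names (window_out, conv2d_params, dense_params, trace_mnist_cnn) are hypothetical and not part of Keras; the arithmetic assumes the Keras defaults used above, namely unpadded convolutions with stride 1 and a pooling stride equal to the pool size.

```python
# Hypothetical helpers (not Keras functions) that trace the CNN's shapes by hand.

def window_out(size, window, stride=1):
    """Output side length for an unpadded ('valid') convolution or pooling window."""
    return (size - window) // stride + 1


def conv2d_params(kernel, in_channels, out_channels):
    """Kernel weights plus one bias per output channel."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


def trace_mnist_cnn():
    side = 28                              # input image is 28 x 28 x 1
    side = window_out(side, 3)             # Conv2D(32, 3x3)   -> 26 x 26 x 32
    side = window_out(side, 3)             # Conv2D(64, 3x3)   -> 24 x 24 x 64
    side = window_out(side, 2, stride=2)   # MaxPooling2D(2x2) -> 12 x 12 x 64
    flat = side * side * 64                # Flatten           -> 9216 values
    # Dropout layers have no trainable parameters, so they are skipped here
    params = (conv2d_params(3, 1, 32) + conv2d_params(3, 32, 64)
              + dense_params(flat, 128) + dense_params(128, 10))
    return side, flat, params


print(trace_mnist_cnn())  # (12, 9216, 1199882)
```

Almost all of the parameters sit in the first Dense layer, which is why the pooling step before Flatten matters. The total should agree with what model.summary() reports for the model above.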

2. Spam filter

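Classifying email as spam or non-spam (ham) is another classic machine learning problem. It can be implemented with Python and the Scikit Learn library. You can use a public spam dataset, such as Enron-Spam or the SpamAssassin public dataset. The code below expects spam and ham subdirectories under the data path.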

import os

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

path = 'path-to-data'

def read_files(path):
    # Walk the directory tree and yield (file path, message body) pairs
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            file_path = os.path.join(root, filename)
            in_body = False
            lines = []
            with open(file_path, 'r', encoding='latin1') as f:
                for line in f:
                    if in_body:
                        lines.append(line)
                    elif line == '\n':
                        # The first blank line separates the headers from the body
                        in_body = True
            # Each line keeps its trailing newline, so join without a separator
            message = ''.join(lines)
            yield file_path, message

def data_frame(path, classification):
    rows = []
    index = []
    for filename, message in read_files(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)
    return pd.DataFrame(rows, index=index)

# DataFrame.append was removed in pandas 2.0, so combine the frames with pd.concat
data = pd.concat([
    data_frame(os.path.join(path, 'spam'), 'spam'),
    data_frame(os.path.join(path, 'ham'), 'ham'),
])

# Bag of words creation
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)
targets = data['class'].values
X_train, X_test, y_train, y_test = train_test_split(counts, targets, test_size=0.1)

# Model training and testing
model = MultinomialNB()
model.fit(X_train, y_train)
print("Accuracy: ", model.score(X_test, y_test))
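
To show roughly what the CountVectorizer and MultinomialNB steps compute, here is a minimal standard-library sketch of a bag-of-words Naive Bayes classifier with Laplace smoothing. The helper names (tokenize, train_nb, predict_nb) and the four toy messages are hypothetical, and the whitespace tokenizer is a simplification of the tokenization that CountVectorizer performs.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplified tokenizer: lowercase and split on whitespace
    return text.lower().split()


def train_nb(messages, labels, alpha=1.0):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in zip(messages, labels):
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocab, alpha


def predict_nb(model, text):
    """Return the class with the highest log posterior score."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, doc_count in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)  # log prior
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Laplace-smoothed log likelihood of the word given the class
            score += math.log((word_counts[label][word] + alpha)
                              / (total_words + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy corpus, for illustration only
messages = ["win cash prize now", "cheap prize click now",
            "meeting agenda for monday", "lunch on monday"]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(messages, labels)

print(predict_nb(model, "claim your cash prize"))  # spam
print(predict_nb(model, "agenda for lunch"))       # ham
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term alpha keeps a word that never appeared in one class from forcing that class's probability to zero.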

3. Wildfire prediction

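Machine learning techniques can be applied to predicting wildfires. You can use a public wildfire dataset, such as this dataset, and implement the model with the Scikit Learn library in Python. The code below fits a linear regression that predicts the last column of forestfires.csv from the remaining columns.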

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Data loading
df = pd.read_csv("forestfires.csv")

# Data preprocessing: encode month and day names as integers
month_map = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
             'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}
day_map = {'sun': 1, 'mon': 2, 'tue': 3, 'wed': 4, 'thu': 5, 'fri': 6, 'sat': 7}
df['month'] = df['month'].map(month_map)
df['day'] = df['day'].map(day_map)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Model training
regr = LinearRegression()
regr.fit(X_train, y_train)

# Model testing
y_pred = regr.predict(X_test)
print('Coefficients:', regr.coef_)
print('Mean squared error:', mean_squared_error(y_test, y_pred))
print('R2 score:', r2_score(y_test, y_pred))
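
The two metrics printed above can be written out by hand to make clear what they measure. This is a minimal standard-library sketch of the usual definitions; the function names mse and r2 and the sample values are hypothetical, chosen only for illustration.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical targets and predictions
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(mse(y_true, y_pred))  # 0.375
print(r2(y_true, y_pred))
```

An R2 of 1 means the predictions match the targets exactly, 0 means the model does no better than always predicting the mean, and negative values mean it does worse than that.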

4. Movie recommendation system

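Building a movie recommendation system is a good way to learn about collaborative filtering. You can use a public movie dataset, such as this dataset, and again implement it with the Scikit Learn library in Python.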

import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import pairwise_distances

# Data loading and preprocessing
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
rating_count = pd.DataFrame(ratings.groupby('movieId')['rating'].count())
rating_count.columns = ['count_rating']
rating_count = rating_count.sort_values('count_rating', ascending=False)
popular_movies = rating_count.head(50).index.tolist()
movie_filter = ratings.movieId.isin(popular_movies).values
movie_user_mat = ratings[movie_filter].pivot_table(index='movieId', columns='userId', values='rating').fillna(0)

# Model training: reduce each movie's rating vector to 16 latent factors
model = TruncatedSVD(n_components=16, random_state=42)
movie_factors = model.fit_transform(movie_user_mat)  # shape: (n_movies, 16)

# Function to get similar movies
def similar_movies(movie_id, num_of_movies=10):
    # movie_id must be one of the 50 popular movies kept in movie_user_mat
    movie_index = movie_user_mat.index.get_loc(movie_id)
    movie_row = movie_factors[movie_index]
    dists = pairwise_distances(movie_factors, [movie_row]).ravel()
    # Sort by distance and drop the query movie itself
    similar_movie_indexes = [i for i in dists.argsort() if i != movie_index][:num_of_movies]
    movies_data = []
    for index in similar_movie_indexes:
        similar_id = movie_user_mat.index[index]
        movie_info = movies[movies['movieId'] == similar_id].iloc[0]
        movies_data.append({'movie_id': similar_id, 'title': movie_info['title'], 'genres': movie_info['genres']})
    return pd.DataFrame(movies_data)

# Retrieve the top 10 movie recommendations for the movie with the ID of 1
similar_movies(1, 10)
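
The idea behind item-based collaborative filtering can also be shown without any libraries. The sketch below compares movies directly by their rating vectors. Note the assumptions: it uses cosine similarity rather than the Euclidean distance that pairwise_distances computes by default, it skips the SVD step, and the movie titles and ratings are hypothetical toy data.

```python
import math

# Hypothetical toy data: each movie maps to ratings from four users (0 = not rated)
ratings_by_movie = {
    "Movie A": [5, 4, 0, 1],
    "Movie B": [4, 5, 0, 1],
    "Movie C": [0, 1, 5, 4],
    "Movie D": [1, 0, 4, 5],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(title, k=2):
    """Return the k movies whose rating vectors are closest to the given movie's."""
    target = ratings_by_movie[title]
    scores = [(cosine_similarity(target, vector), other)
              for other, vector in ratings_by_movie.items() if other != title]
    scores.sort(reverse=True)  # highest similarity first
    return [other for _, other in scores[:k]]


print(most_similar("Movie A"))  # ['Movie B', 'Movie D']
```

Movies A and B are rated highly by the same users, so they come out as neighbors. With real data the rating matrix is large and sparse, which is where a dimensionality reduction step such as TruncatedSVD becomes useful.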

5. Boston house price prediction

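Building a house price prediction model is a good way to understand regression algorithms. You can use the public Boston housing dataset, such as this dataset, and again implement it with the Scikit Learn library in Python.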

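import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Data loading
# load_boston was removed in scikit-learn 1.2; this follows the loading
# snippet from its deprecation notice and reads the original source file
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
# Each record spans two physical lines in the file, so stitch them together
X = pd.DataFrame(np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]]), columns=feature_names)
y = pd.DataFrame(raw_df.values[1::2, 2], columns=["MEDV"])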

# Data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Model training
regr = LinearRegression()
regr.fit(X_train, y_train)

# Model testing
y_pred = regr.predict(X_test)
print('Coefficients:', regr.coef_)
print('Mean squared error:', mean_squared_error(y_test, y_pred))
print('R2 score:', r2_score(y_test, y_pred))
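
For a single feature, the least-squares line that a linear regression fits has a simple closed form, which the sketch below computes by hand. The function name fit_simple_linear_regression and the room and price values are hypothetical and are not taken from the Boston dataset.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares slope and intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: average number of rooms and a price that grows linearly with it
rooms = [4.0, 5.0, 6.0, 7.0]
prices = [14.0, 18.0, 22.0, 26.0]

slope, intercept = fit_simple_linear_regression(rooms, prices)
print(slope, intercept)
```

With thirteen features, as in the code above, the same least-squares idea applies, but the coefficients are found by solving a matrix problem rather than with this one-feature formula.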

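These are five machine learning projects that beginners can implement. Each one teaches you about a particular aspect of machine learning, so that you can go deeper in your later studies.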