📜  Scikit-Learn - Classification with Naïve Bayes

📅  Last modified: 2020-12-10 05:54:41             🧑  Author: Mango


Naïve Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the strong assumption that all predictors are independent of one another, i.e. that the presence of a feature in a class is unrelated to the presence of any other feature in the same class. This is a naïve assumption, which is why these methods are called Naïve Bayes methods.

Bayes' theorem states the following relationship for finding the posterior probability of a class, i.e. the probability of a label given some observed features, $P(Y \mid \text{features})$.

$$P(Y \mid \text{features}) = \frac{P(Y)\,P(\text{features} \mid Y)}{P(\text{features})}$$

Here, $P(Y \mid \text{features})$ is the posterior probability of the class.

$P(Y)$ is the prior probability of the class.

$P(\text{features} \mid Y)$ is the likelihood, i.e. the probability of the predictors given the class.

$P(\text{features})$ is the prior probability of the predictors.
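To make the formula concrete, here is a tiny numeric sketch in Python; the prior, likelihood, and evidence values are purely hypothetical numbers chosen for illustration:

# Hypothetical values: P(Y) = 0.3, P(features | Y) = 0.8, P(features) = 0.4
prior = 0.3        # P(Y): prior probability of the class
likelihood = 0.8   # P(features | Y): probability of the predictors given the class
evidence = 0.4     # P(features): prior probability of the predictors

posterior = prior * likelihood / evidence   # P(Y | features) by Bayes' theorem
print(posterior)                            # 0.6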

Scikit-learn provides several Naïve Bayes classifier models, namely Gaussian, Multinomial, Complement, and Bernoulli. The main difference among these variants lies in the assumption each makes about the distribution of $P(\text{features} \mid Y)$, i.e. the probability of the predictors given the class; the list below summarizes them, and a small instantiation sketch follows it.

1. Gaussian Naïve Bayes

The Gaussian Naïve Bayes classifier assumes that the data from each label is drawn from a simple Gaussian distribution.

2. Multinomial Naïve Bayes

It assumes that the features are drawn from a simple multinomial distribution.

3. Bernoulli Naïve Bayes

The assumption in this model is that the features are binary (0s and 1s) in nature. One application of Bernoulli Naïve Bayes classification is text classification with the 'bag of words' model.

4. Complement Naïve Bayes

It was designed to correct the severe assumptions made by the Multinomial Naïve Bayes classifier. This kind of NB classifier is suitable for imbalanced data sets.
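As a minimal sketch of how these four variants are instantiated from sklearn.naive_bayes (the toy arrays and labels below are made up purely to match each model's expected input type):

import numpy as np
from sklearn.naive_bayes import BernoulliNB, ComplementNB, GaussianNB, MultinomialNB

rng = np.random.RandomState(0)
y = np.array([0, 0, 0, 1, 1, 1])           # hypothetical class labels
X_real = rng.randn(6, 10)                  # real-valued features suit GaussianNB
X_counts = rng.randint(5, size=(6, 10))    # non-negative counts suit Multinomial/Complement
X_binary = (X_counts > 0).astype(int)      # binary indicators suit BernoulliNB

print(GaussianNB().fit(X_real, y).predict(X_real[:2]))
print(MultinomialNB().fit(X_counts, y).predict(X_counts[:2]))
print(ComplementNB().fit(X_counts, y).predict(X_counts[:2]))
print(BernoulliNB().fit(X_binary, y).predict(X_binary[:2]))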

Building a Naïve Bayes Classifier

We can also apply a Naïve Bayes classifier to a Scikit-learn dataset. In the example below, we apply GaussianNB and fit it on the breast_cancer dataset from Scikit-learn.

# Load the breast cancer dataset that ships with Scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

# Inspect the class names and the first record
print(label_names)
print(labels[0])
print(feature_names[0])
print(features[0])

# Hold out 40% of the data as a test set
train, test, train_labels, test_labels = train_test_split(
   features, labels, test_size=0.40, random_state=42
)

# Fit a Gaussian Naïve Bayes model and predict on the test set
GNBclf = GaussianNB()
model = GNBclf.fit(train, train_labels)
preds = GNBclf.predict(test)
print(preds)

Output

[
   1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1
   1 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 
   1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 
   1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 
   1 1 0 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 
   0 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 
   1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0 1 1 0 
   1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 0 1 
   1 1 1 1 0 1 0 0 1 1 0 1
]

The above output consists of a series of 0s and 1s, which are the predicted values for the tumor classes, i.e. malignant and benign.
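To gauge how well these predictions match the held-out labels, a minimal follow-up sketch (assuming test_labels and preds from the example above are still in scope) compares them with Scikit-learn's accuracy_score:

from sklearn.metrics import accuracy_score

# Fraction of test tumors whose predicted class matches the true label
print(accuracy_score(test_labels, preds))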