📜  Scikit-Learn - Classification with Naïve Bayes

📅  Last modified: 2020-12-10 05:54:41             🧑  Author: Mango


Naïve Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the strong assumption that all predictors are independent of one another, i.e. that the presence of a feature in a class is unrelated to the presence of any other feature in the same class. This is a naïve assumption, which is why these methods are called Naïve Bayes methods.

Bayes' theorem states the following relationship for finding the posterior probability of a class, i.e. the probability of a label given some observed features, $P(Y \mid \text{features})$.

$$P(Y \mid \text{features}) = \frac{P(Y)\,P(\text{features} \mid Y)}{P(\text{features})}$$

Here, $P(Y \mid \text{features})$ is the posterior probability of the class.

$P(Y)$ is the prior probability of the class.

$P(\text{features} \mid Y)$ is the likelihood, i.e. the probability of the predictors given the class.

$P(\text{features})$ is the prior probability of the predictors.
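To make the formula concrete, here is a tiny numeric sketch in Python; the prior, likelihood, and evidence values are purely hypothetical numbers chosen for illustration:

# Hypothetical values: P(Y) = 0.3, P(features | Y) = 0.8, P(features) = 0.4
prior = 0.3        # P(Y): prior probability of the class
likelihood = 0.8   # P(features | Y): probability of the predictors given the class
evidence = 0.4     # P(features): prior probability of the predictors

posterior = prior * likelihood / evidence   # P(Y | features) by Bayes' theorem
print(posterior)                            # 0.6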

Scikit-learn provides several Naïve Bayes classifier models, namely Gaussian, Multinomial, Complement, and Bernoulli. The main difference among these variants lies in the assumption each makes about the distribution of $P(\text{features} \mid Y)$, i.e. the probability of the predictors given the class; the list below summarizes them, and a small instantiation sketch follows it.

1. Gaussian Naïve Bayes

The Gaussian Naïve Bayes classifier assumes that the data from each label is drawn from a simple Gaussian distribution.

2. Multinomial Naïve Bayes

It assumes that the features are drawn from a simple multinomial distribution.

3. Bernoulli Naïve Bayes

The assumption in this model is that the features are binary (0s and 1s) in nature. One application of Bernoulli Naïve Bayes classification is text classification with the 'bag of words' model.

4. Complement Naïve Bayes

It was designed to correct the severe assumptions made by the Multinomial Naïve Bayes classifier. This kind of NB classifier is suitable for imbalanced data sets.
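As a minimal sketch of how these four variants are instantiated from sklearn.naive_bayes (the toy arrays and labels below are made up purely to match each model's expected input type):

import numpy as np
from sklearn.naive_bayes import BernoulliNB, ComplementNB, GaussianNB, MultinomialNB

rng = np.random.RandomState(0)
y = np.array([0, 0, 0, 1, 1, 1])           # hypothetical class labels
X_real = rng.randn(6, 10)                  # real-valued features suit GaussianNB
X_counts = rng.randint(5, size=(6, 10))    # non-negative counts suit Multinomial/Complement
X_binary = (X_counts > 0).astype(int)      # binary indicators suit BernoulliNB

print(GaussianNB().fit(X_real, y).predict(X_real[:2]))
print(MultinomialNB().fit(X_counts, y).predict(X_counts[:2]))
print(ComplementNB().fit(X_counts, y).predict(X_counts[:2]))
print(BernoulliNB().fit(X_binary, y).predict(X_binary[:2]))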

Building a Naïve Bayes Classifier

We can also apply a Naïve Bayes classifier to a Scikit-learn dataset. In the example below, we apply GaussianNB and fit it on the breast_cancer dataset from Scikit-learn.

# Load the breast cancer dataset that ships with Scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

# Inspect the class names and the first record
print(label_names)
print(labels[0])
print(feature_names[0])
print(features[0])

# Hold out 40% of the data as a test set
train, test, train_labels, test_labels = train_test_split(
   features, labels, test_size=0.40, random_state=42
)

# Fit a Gaussian Naïve Bayes model and predict on the test set
GNBclf = GaussianNB()
model = GNBclf.fit(train, train_labels)
preds = GNBclf.predict(test)
print(preds)

Output

[
   1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1
   1 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 
   1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 
   1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 
   1 1 0 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 
   0 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 
   1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0 1 1 0 
   1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 0 1 
   1 1 1 1 0 1 0 0 1 1 0 1
]

The above output consists of a series of 0s and 1s, which are the predicted values for the tumor classes, i.e. malignant and benign.
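To gauge how well these predictions match the held-out labels, a minimal follow-up sketch (assuming test_labels and preds from the example above are still in scope) compares them with Scikit-learn's accuracy_score:

from sklearn.metrics import accuracy_score

# Fraction of test tumors whose predicted class matches the true label
print(accuracy_score(test_labels, preds))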