C1W2.LAB.Visualizing Naive Bayes

Theory lecture: C1W2.Sentiment Analysis with Naïve Bayes

Contents

Importing packages
Calculate the likelihoods for each tweet
Using Confidence Ellipses to interpret Naïve Bayes

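Importing packages

In this exercise we visually inspect a dataset of tweets using Naïve Bayes features. The focus is on understanding the log-likelihood ratio as a pair of numerical features that can be fed into a machine learning algorithm.

At the end, we introduce the concept of confidence ellipses as a tool for representing the Naïve Bayes model visually.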

import numpy as np # Library for linear algebra and math utils
import pandas as pd # Dataframe library
import matplotlib.pyplot as plt # Library for plots
from utils import confidence_ellipse # Function to add confidence ellipses to charts

Calculate the likelihoods for each tweet

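For each tweet, we have calculated the likelihood of the tweet being positive and the likelihood of it being negative. The log-likelihood ratio is the difference between the logarithms of its numerator and denominator, and each of those two terms is a sum of per-word log-probabilities:
$\log \frac{P(tweet|pos)}{P(tweet|neg)} = \log(P(tweet|pos)) - \log(P(tweet|neg))$
$positive = \log(P(tweet|pos)) = \sum_{i=0}^{n}{\log P(W_i|pos)}$
$negative = \log(P(tweet|neg)) = \sum_{i=0}^{n}{\log P(W_i|neg)}$
The code that implements these formulas is not required for this lab, but the results it produces are stored in the file 'bayes_features.csv'.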
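
For intuition, the two sums above can be sketched on a toy example. The word probabilities below are made-up values, and the dictionary and function names are hypothetical; this is only an illustration of the arithmetic, not the code that produced the CSV file.

```python
import math

# Hypothetical per-word conditional probabilities (made-up values for illustration only)
p_word_given_pos = {"happy": 0.10, "sad": 0.01, "today": 0.05}
p_word_given_neg = {"happy": 0.02, "sad": 0.08, "today": 0.05}

def tweet_log_likelihoods(words):
    """Return (positive, negative): the sums of the per-word log-probabilities."""
    positive = sum(math.log(p_word_given_pos[w]) for w in words)
    negative = sum(math.log(p_word_given_neg[w]) for w in words)
    return positive, negative

positive, negative = tweet_log_likelihoods(["happy", "today"])
log_ratio = positive - negative  # log(P(tweet|pos) / P(tweet|neg))
print(positive, negative, log_ratio)
```

Both features are negative because they are sums of logarithms of probabilities. A positive difference means the tweet is more likely under the positive class.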

data = pd.read_csv('./data/bayes_features.csv') # Load the data from the csv file

data.head(5) # Print the first 5 tweets features. Each row represents a tweet

Result:
[Image: first five rows of the features table]
Plot the data:

# Plot the samples using columns 1 and 2 of the matrix
fig, ax = plt.subplots(figsize = (8, 8)) # Create a new figure with a custom size

colors = ['red', 'green'] # Define a color palette
sentiments = ['negative', 'positive']

index = data.index

# Color based on sentiment
for sentiment in data.sentiment.unique():
    ix = index[data.sentiment == sentiment]
    ax.scatter(data.iloc[ix].positive, data.iloc[ix].negative, c=colors[int(sentiment)], s=0.1, marker='*', label=sentiments[int(sentiment)])

ax.legend(loc='best')

# Custom limits for this chart
plt.xlim(-250,0)
plt.ylim(-250,0)

plt.xlabel("Positive") # x-axis label
plt.ylabel("Negative") # y-axis label
plt.show()


Using Confidence Ellipses to interpret Naïve Bayes

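In this section we use confidence ellipses to analyze the results of Naïve Bayes.

A confidence ellipse is a way to visualize a 2D random variable. It is a better choice than plotting the individual points on a Cartesian plane because, on large datasets, the points overlap heavily and hide the true distribution of the data. A confidence ellipse summarizes the dataset with only four parameters:

Center: the numerical mean of the attributes.
Height and width: related to the variance of each attribute. The user must specify the number of standard deviations used to draw the ellipse.
Angle: related to the covariance between the attributes.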
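
As a rough sketch of how these four parameters relate to the data, they can be estimated from the sample mean and covariance matrix. The function name `ellipse_parameters` and the synthetic data are assumptions made for illustration; this eigen-decomposition approach is one common way to do it and is not necessarily how `confidence_ellipse` in `utils` is implemented.

```python
import numpy as np

def ellipse_parameters(x, y, n_std=2.0):
    """Estimate center, width, height and angle (degrees) of a covariance ellipse."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    center = (x.mean(), y.mean())           # Mean of each attribute
    cov = np.cov(x, y)                      # 2x2 sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # Eigenvalues in ascending order
    # Full axis lengths: 2 * n_std * standard deviation along each principal axis
    height, width = 2.0 * n_std * np.sqrt(eigvals)
    # Orientation of the major axis (eigenvector of the largest eigenvalue)
    major = eigvecs[:, 1]
    angle = np.degrees(np.arctan2(major[1], major[0]))
    return center, width, height, angle

# Synthetic, correlated 2D data (not the tweet features)
rng = np.random.default_rng(0)
x = rng.normal(0.0, 3.0, size=5000)
y = 0.5 * x + rng.normal(0.0, 1.0, size=5000)

center, width, height, angle = ellipse_parameters(x, y, n_std=2)
print(center, width, height, angle)
```

The sign of an eigenvector is arbitrary, so the returned angle may differ by 180 degrees from run to run of a different solver; both values describe the same ellipse.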

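The parameter n_std is the number of standard deviations that bounds the ellipse. Remember that for a normally distributed random variable:

About 68% of the area under the curve falls within 1 standard deviation of the mean.
About 95% of the area under the curve falls within 2 standard deviations of the mean.
About 99.7% of the area under the curve falls within 3 standard deviations of the mean.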
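
These three percentages for a one-dimensional normal distribution can be checked numerically with the error function from the Python standard library. The helper name `normal_mass_within` is hypothetical and only serves this illustration.

```python
import math

def normal_mass_within(k):
    """Probability that a 1D normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, normal_mass_within(k))
```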

The code below plots the confidence ellipses for 2 and 3 standard deviations:

# Plot the samples using columns 1 and 2 of the matrix
fig, ax = plt.subplots(figsize = (8, 8))

colors = ['red', 'green'] # Define a color palette
sentiments = ['negative', 'positive']
index = data.index

# Color based on sentiment
for sentiment in data.sentiment.unique():
    ix = index[data.sentiment == sentiment]
    ax.scatter(data.iloc[ix].positive, data.iloc[ix].negative, c=colors[int(sentiment)], s=0.1, marker='*', label=sentiments[int(sentiment)])

# Custom limits for this chart
plt.xlim(-200,40)
plt.ylim(-200,40)

plt.xlabel("Positive") # x-axis label
plt.ylabel("Negative") # y-axis label

data_pos = data[data.sentiment == 1] # Filter only the positive samples
data_neg = data[data.sentiment == 0] # Filter only the negative samples

# Print confidence ellipses of 2 std
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=2, edgecolor='black', label=r'$2\sigma$' )
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=2, edgecolor='orange')

# Print confidence ellipses of 3 std
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=3, edgecolor='black', linestyle=':', label=r'$3\sigma$')
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=3, edgecolor='orange', linestyle=':')
ax.legend(loc='lower right')

plt.show()

Next, we modify the features of the positive tweets so that they overlap with the negative ones:

data2 = data.copy() # Copy the whole data frame

# Only the entries in the data frame where sentiment == 1 are modified.
# The 2 commented-out lines below are the chained-assignment form of the update;
# the .loc version that follows is used instead.
#data2.negative[data.sentiment == 1] =  data2.negative * 1.5 + 50 # Modify the negative attribute
#data2.positive[data.sentiment == 1] =  data2.positive / 1.5 - 50 # Modify the positive attribute

# For the data points with sentiment == 1, modify the negative attribute
data2.loc[data2.sentiment == 1, 'negative'] = data2.loc[data2.sentiment == 1, 'negative'] * 1.5 + 50

# For the data points with sentiment == 1, modify the positive attribute
data2.loc[data2.sentiment == 1, 'positive'] = data2.loc[data2.sentiment == 1, 'positive'] / 1.5 - 50
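
To see what this masked update does, here is a minimal sketch on a toy data frame. The values are made up and are not taken from bayes_features.csv; the names `toy`, `shifted`, and `mask` are hypothetical.

```python
import pandas as pd

# Toy data frame with made-up values (not taken from bayes_features.csv)
toy = pd.DataFrame({
    "positive": [-30.0, -60.0, -90.0],
    "negative": [-60.0, -30.0, -45.0],
    "sentiment": [1.0, 0.0, 1.0],
})

shifted = toy.copy()             # Work on a copy so the original stays intact
mask = shifted.sentiment == 1    # Boolean mask selecting the positive rows

# Apply the same transformation as above, only to the masked rows
shifted.loc[mask, "negative"] = shifted.loc[mask, "negative"] * 1.5 + 50
shifted.loc[mask, "positive"] = shifted.loc[mask, "positive"] / 1.5 - 50

print(shifted)
```

Only the rows with sentiment equal to 1 change; the row with sentiment 0 and the original `toy` frame keep their values.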

Plot the figure again:

# Plot the samples using columns 1 and 2 of the matrix
fig, ax = plt.subplots(figsize = (8, 8))

colors = ['red', 'green'] # Define a color palette
sentiments = ['negative', 'positive']
index = data2.index

# Color based on sentiment
for sentiment in data2.sentiment.unique():
    ix = index[data2.sentiment == sentiment]
    ax.scatter(data2.iloc[ix].positive, data2.iloc[ix].negative, c=colors[int(sentiment)], s=0.1, marker='*', label=sentiments[int(sentiment)])

#ax.scatter(data2.positive, data2.negative, c=[colors[int(k)] for k in data2.sentiment], s = 0.1, marker='*')  # Plot a dot for tweet

# Custom limits for this chart
plt.xlim(-200,40)
plt.ylim(-200,40)

plt.xlabel("Positive") # x-axis label
plt.ylabel("Negative") # y-axis label

data_pos = data2[data2.sentiment == 1] # Filter only the positive samples
data_neg = data2[data2.sentiment == 0] # Filter only the negative samples (unchanged from data)

# Print confidence ellipses of 2 std
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=2, edgecolor='black', label=r'$2\sigma$' )
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=2, edgecolor='orange')

# Print confidence ellipses of 3 std
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=3, edgecolor='black', linestyle=':', label=r'$3\sigma$')
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=3, edgecolor='orange', linestyle=':')
ax.legend(loc='lower right')

plt.show()