Theory lecture: C1W2. Sentiment Analysis with Naïve Bayes
Contents
- Importing packages
- Calculate the likelihoods for each tweet
- Using Confidence Ellipses to interpret Naïve Bayes
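Importing packages
In the following exercise, we visually inspect a tweets dataset using its Naïve Bayes features. The focus is on understanding the log-likelihood ratio as a pair of numerical features that can be fed into a machine learning algorithm.
Finally, we introduce the concept of the confidence ellipse as a tool for representing a Naïve Bayes model visually.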
import numpy as np  # Library for linear algebra and math utils
import pandas as pd  # Dataframe library
import matplotlib.pyplot as plt  # Library for plots
from utils import confidence_ellipse  # Function to add confidence ellipses to charts
Calculate the likelihoods for each tweet
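For each tweet, we have calculated the likelihood of the tweet being positive and the likelihood of it being negative. The numerator and denominator of the likelihood ratio are given below.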
$$\log \frac{P(tweet|pos)}{P(tweet|neg)} = \log(P(tweet|pos)) - \log(P(tweet|neg))$$
$$positive = \log(P(tweet|pos)) = \sum_{i=0}^{n}{\log P(W_i|pos)}$$
$$negative = \log(P(tweet|neg)) = \sum_{i=0}^{n}{\log P(W_i|neg)}$$
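For reference, the two sums can be sketched in a few lines. This is only an illustration under stated assumptions: the helper `tweet_log_likelihoods` and the toy probability tables are hypothetical, every word of the tweet is assumed to appear in both tables, and the class prior is ignored.

```python
import math

def tweet_log_likelihoods(words, p_word_pos, p_word_neg):
    """Hypothetical helper: return the (positive, negative) log-likelihood sums."""
    # Sum of log P(W_i | pos) over the words of the tweet
    positive = sum(math.log(p_word_pos[w]) for w in words)
    # Sum of log P(W_i | neg) over the words of the tweet
    negative = sum(math.log(p_word_neg[w]) for w in words)
    return positive, negative

# Toy conditional probabilities, made up for illustration only
p_word_pos = {"happy": 0.20, "sad": 0.05, "day": 0.10}
p_word_neg = {"happy": 0.05, "sad": 0.20, "day": 0.10}

positive, negative = tweet_log_likelihoods(["happy", "day"], p_word_pos, p_word_neg)
log_ratio = positive - negative  # log(P(tweet|pos) / P(tweet|neg))
print(positive, negative, log_ratio)
```

Both sums are negative because each probability is below 1, which is why the features plotted later lie in the negative quadrant. A positive `log_ratio` means the tweet is more likely under the positive class.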
The code for the formulas above is not required for this lab; the precomputed results are stored in the file 'bayes_features.csv'.
data = pd.read_csv('./data/bayes_features.csv')  # Load the data from the csv file
data.head(5)  # Print the features of the first 5 tweets. Each row represents a tweet
Result:
Plot:
# Plot the samples using columns 1 and 2 of the matrix
fig, ax = plt.subplots(figsize=(8, 8))  # Create a new figure with a custom size
colors = ['red', 'green']  # Define a color palette
sentiments = ['negative', 'positive']
index = data.index
# Color based on sentiment
for sentiment in data.sentiment.unique():
    ix = index[data.sentiment == sentiment]
    ax.scatter(data.iloc[ix].positive, data.iloc[ix].negative, c=colors[int(sentiment)], s=0.1, marker='*', label=sentiments[int(sentiment)])
ax.legend(loc='best')
# Custom limits for this chart
plt.xlim(-250, 0)
plt.ylim(-250, 0)
plt.xlabel("Positive")  # x-axis label
plt.ylabel("Negative")  # y-axis label
plt.show()
Using Confidence Ellipses to interpret Naïve Bayes
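In this section we use confidence ellipses to analyze the results of Naïve Bayes.
A confidence ellipse is a way to visualize a two-dimensional random variable. It works better than plotting the points on a Cartesian plane because, on large datasets, the points overlap heavily and hide the real distribution of the data. A confidence ellipse summarizes the dataset with only four parameters:
- Center: the numerical mean of the attributes.
- Height and width: related to the variance of each attribute. The user must specify the number of standard deviations used to draw the ellipse.
- Angle: related to the covariance between the attributes.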
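The four parameters listed above can be estimated directly from the samples. The sketch below is one common way to do it, using the eigendecomposition of the covariance matrix; it is not necessarily how `confidence_ellipse` in `utils` is implemented, and the helper name `ellipse_params` is hypothetical.

```python
import numpy as np

def ellipse_params(x, y, n_std=2.0):
    """Hypothetical helper: summarize 2D samples as (center, major, minor, angle)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    center = (x.mean(), y.mean())           # Center: mean of each attribute
    cov = np.cov(x, y)                      # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # Eigenvalues in ascending order
    eigvals = np.maximum(eigvals, 0.0)      # Guard against tiny negative values
    # Full axis lengths: 2 * n_std standard deviations along each principal axis
    minor, major = 2.0 * n_std * np.sqrt(eigvals)
    # Angle of the major axis, in degrees, measured from the x-axis
    vx, vy = eigvecs[:, 1]
    angle = np.degrees(np.arctan2(vy, vx))
    return center, major, minor, angle

center, major, minor, angle = ellipse_params([0, 1, 2, 3], [0, 1, 2, 3])
print(center, major, minor, angle)
```

For this perfectly correlated toy input the ellipse degenerates into a line along the diagonal, so the minor axis is close to zero. The sign of an eigenvector is arbitrary, so the angle is only meaningful modulo 180 degrees.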
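The parameter n_std is the number of standard deviations that bound the ellipse. Remember that for a normally distributed random variable:
- About 68% of the area under the curve falls within 1 standard deviation of the mean.
- About 95% of the area under the curve falls within 2 standard deviations of the mean.
- About 99.7% of the area under the curve falls within 3 standard deviations of the mean.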
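These percentages can be checked with the error function from the standard library. A minimal sketch follows; the helper name `coverage_within` is hypothetical.

```python
import math

def coverage_within(n_std):
    """Probability mass of a 1D normal distribution within n_std standard deviations of the mean."""
    return math.erf(n_std / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"{k} std: {coverage_within(k):.4f}")
# 1 std: 0.6827
# 2 std: 0.9545
# 3 std: 0.9973
```

These figures are for a one-dimensional variable. For a two-dimensional Gaussian, the share of points inside an ellipse drawn at n_std standard deviations is somewhat smaller, so the percentages are best read as a rough guide when looking at the ellipses.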
The code below draws the confidence ellipses for n_std=2 and n_std=3.
# Plot the samples using columns 1 and 2 of the matrix
fig, ax = plt.subplots(figsize=(8, 8))
colors = ['red', 'green']  # Define a color palette
sentiments = ['negative', 'positive']
index = data.index
# Color based on sentiment
for sentiment in data.sentiment.unique():
    ix = index[data.sentiment == sentiment]
    ax.scatter(data.iloc[ix].positive, data.iloc[ix].negative, c=colors[int(sentiment)], s=0.1, marker='*', label=sentiments[int(sentiment)])
# Custom limits for this chart
plt.xlim(-200, 40)
plt.ylim(-200, 40)
plt.xlabel("Positive")  # x-axis label
plt.ylabel("Negative")  # y-axis label
data_pos = data[data.sentiment == 1]  # Filter only the positive samples
data_neg = data[data.sentiment == 0]  # Filter only the negative samples
# Draw confidence ellipses of 2 std
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=2, edgecolor='black', label=r'$2\sigma$')
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=2, edgecolor='orange')
# Draw confidence ellipses of 3 std
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=3, edgecolor='black', linestyle=':', label=r'$3\sigma$')
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=3, edgecolor='orange', linestyle=':')
ax.legend(loc='lower right')
plt.show()
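Next, we modify the features of the positive tweets so that their distribution overlaps with that of the negative tweets: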
data2 = data.copy()  # Copy the whole data frame
# The following lines only modify the entries in the data frame where sentiment == 1
# data2.negative[data.sentiment == 1] = data2.negative * 1.5 + 50  # Modify the negative attribute
# data2.positive[data.sentiment == 1] = data2.positive / 1.5 - 50  # Modify the positive attribute
# For samples with sentiment == 1, modify the negative attribute
data2.loc[data2.sentiment == 1, 'negative'] = data2.loc[data2.sentiment == 1, 'negative'] * 1.5 + 50
# For samples with sentiment == 1, modify the positive attribute
data2.loc[data2.sentiment == 1, 'positive'] = data2.loc[data2.sentiment == 1, 'positive'] / 1.5 - 50
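The effect of this masked assignment is easier to see on a tiny table. The sketch below uses a made-up three-row frame (the values are illustrative, not taken from 'bayes_features.csv') and applies the same transformation through `.loc` with a boolean mask. Chained indexing such as `df.negative[mask] = ...` may write to a temporary copy and can trigger pandas' SettingWithCopyWarning, which is a likely reason to prefer `.loc` here.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "positive": [-30.0, -60.0, -90.0],
    "negative": [-60.0, -30.0, -45.0],
    "sentiment": [1.0, 0.0, 1.0],
})

shifted = toy.copy()             # Leave the original frame untouched
mask = shifted.sentiment == 1    # Select only the positive samples
shifted.loc[mask, "negative"] = shifted.loc[mask, "negative"] * 1.5 + 50
shifted.loc[mask, "positive"] = shifted.loc[mask, "positive"] / 1.5 - 50
print(shifted)
```

Only the rows with sentiment 1 change; the row with sentiment 0 keeps its original values, and `toy` itself is not modified because the work is done on a copy.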
Redraw the plot:
# Plot the samples using columns 1 and 2 of the matrix
fig, ax = plt.subplots(figsize=(8, 8))
colors = ['red', 'green']  # Define a color palette
sentiments = ['negative', 'positive']
index = data2.index
# Color based on sentiment
for sentiment in data2.sentiment.unique():
    ix = index[data2.sentiment == sentiment]
    ax.scatter(data2.iloc[ix].positive, data2.iloc[ix].negative, c=colors[int(sentiment)], s=0.1, marker='*', label=sentiments[int(sentiment)])
# ax.scatter(data2.positive, data2.negative, c=[colors[int(k)] for k in data2.sentiment], s=0.1, marker='*')  # Plot a dot for each tweet
# Custom limits for this chart
plt.xlim(-200, 40)
plt.ylim(-200, 40)
plt.xlabel("Positive")  # x-axis label
plt.ylabel("Negative")  # y-axis label
data_pos = data2[data2.sentiment == 1]  # Filter only the positive samples
data_neg = data2[data2.sentiment == 0]  # Filter only the negative samples (unchanged by the modification)
# Draw confidence ellipses of 2 std
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=2, edgecolor='black', label=r'$2\sigma$')
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=2, edgecolor='orange')
# Draw confidence ellipses of 3 std
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=3, edgecolor='black', linestyle=':', label=r'$3\sigma$')
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=3, edgecolor='orange', linestyle=':')
ax.legend(loc='lower right')
plt.show()
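After the modification, the two distributions overlap, which makes it harder for a classifier to separate the two classes.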
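One rough way to put a number on this kind of overlap is to count how many points of one class fall inside the n_std ellipse of the other. The sketch below assumes the ellipse is the set of points whose Mahalanobis distance to the class mean is at most n_std; the helper `fraction_inside` is hypothetical, and the data is synthetic rather than the tweet features.

```python
import numpy as np

def fraction_inside(points, reference, n_std=2.0):
    """Hypothetical helper: share of `points` within n_std (Mahalanobis) of `reference`."""
    points = np.asarray(points, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mean = reference.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(reference, rowvar=False))
    diff = points - mean
    # Squared Mahalanobis distance of every point to the reference mean
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    return float(np.mean(d2 <= n_std ** 2))

# Synthetic 2D clouds, for illustration only
rng = np.random.default_rng(0)
neg = rng.normal(loc=[-100.0, -80.0], scale=10.0, size=(500, 2))
pos_far = rng.normal(loc=[-40.0, -140.0], scale=10.0, size=(500, 2))
pos_near = rng.normal(loc=[-95.0, -85.0], scale=10.0, size=(500, 2))

print(fraction_inside(pos_far, neg))   # Well separated: close to 0
print(fraction_inside(pos_near, neg))  # Overlapping: a large share
```

A value near zero suggests the classes are well separated in this feature space, while a large value indicates the kind of overlap seen in the modified plot.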