Twitter Sentiment Analysis Using Naive Bayes and N-Gram
In this article, we’ll show you how to classify a tweet as either positive or negative using two well-known machine learning algorithms: Naive Bayes and N-Gram.
First, what is sentiment analysis?
Sentiment analysis is the automated process of analyzing text data and sorting it into positive, negative, or neutral sentiment. Using sentiment analysis tools to analyze opinions in Twitter data can help companies understand how people are talking about their brand.
Now that you know what sentiment analysis is, let’s start coding.
We have divided the whole program into three parts:
- Importing the datasets
- Preprocessing of datasets
- Applying machine learning algorithms
Note: We have used Jupyter Notebook but you can use the editor of your choice.
Step 1: Importing the Datasets
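The snippets below assume that pandas, NumPy, and Matplotlib have been imported and that the tweet dataset has already been loaded into a DataFrame called data, with a 0/1 Sentiment column and a SentimentText column. The original post doesn't show this step, so here is a minimal sketch; the file name and encoding are assumptions:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Assumed file name/encoding: a CSV of ~1.6M labeled tweets with columns
# 'Sentiment' (0 = negative, 1 = positive) and 'SentimentText' (raw tweet).
data = pd.read_csv('data/tweets.csv', encoding='latin-1')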
Displaying the first ten rows of the dataset:
data.head(10)
From the dataset above, we can clearly see the presence of the following (none of which is of any use in determining the sentiment of a tweet):
- Acronyms
- Sequences of repeated characters
- Emoticons
- Spelling mistakes
- Nouns
Let’s see if our dataset is balanced around the Sentiment label:
plt.close()
fig, ax = plt.subplots()
counts, bins, patches = ax.hist(data.Sentiment.as_matrix(), edgecolor='gray')
ax.set_title("Histogram of Sentiments")
ax.set_xlabel("Sentiment")
ax.set_ylabel("Frequency")
patches[0].set_facecolor("#5d4037")
patches[0].set_label("negative")
patches[-1].set_facecolor("#ff9100")
patches[-1].set_label("positive")
plt.legend()
The dataset seems to be very balanced between negative and positive sentiment.
Now, we need to import other datasets which will help us with the preprocessing, such as:
- An emoticon dictionary regrouping 132 of the most-used Western emoticons with their sentiment, negative or positive:
emoticons = pd.read_csv('data/smileys.csv')
positive_emoticons = emoticons[emoticons.Sentiment == 1]
negative_emoticons = emoticons[emoticons.Sentiment == 0]
emoticons.head(5)
- An acronym dictionary of 5465 acronyms with their translations:
acronyms = pd.read_csv('data/acronyms.csv')
acronyms.tail(5)
- A stop word dictionary, corresponding to words that are filtered out before or after processing of natural language data because they’re not useful in our case:
stops = pd.read_csv('data/stopwords.csv')
stops.columns = ['Word']
stops.head(5)
- A positive and negative word dictionary:
positive_words = pd.read_csv('data/positive-words.csv', sep='\t')
positive_words.columns = ['Word', 'Sentiment']
negative_words = pd.read_csv('data/negative-words.csv', sep='\t')
negative_words.columns = ['Word', 'Sentiment']
positive_words.head(5)
negative_words.head(5)
Step 2: Preprocessing of Datasets
What is data preprocessing?
Data preprocessing is a technique used to convert raw data into a clean data set. In other words, whenever data is gathered from different sources it arrives in a raw format that is not feasible for analysis.
Now, let's begin with the preprocessing part.
To do this we are going to pass our data through various steps:
- Replace all emoticons with their sentiment polarity, ||pos|| or ||neg||, using the emoticon dictionary:
import re
def make_emoticon_pattern(emoticons):
    pattern = "|".join(map(re.escape, emoticons.Smiley))
    pattern = "(?<=\s)(" + pattern + ")(?=\s)"
    return pattern

def find_with_pattern(pattern, replace=False, tag=None):
    if replace and tag == None:
        raise Exception("Parameter error", "If replace=True you should add the tag by which the pattern will be replaced")
    regex = re.compile(pattern)
    if replace:
        return data.SentimentText.apply(lambda tweet: re.sub(pattern, tag, " " + tweet + " "))
    return data.SentimentText.apply(lambda tweet: re.findall(pattern, " " + tweet + " "))
pos_emoticons_found = find_with_pattern(make_emoticon_pattern(positive_emoticons))
neg_emoticons_found = find_with_pattern(make_emoticon_pattern(negative_emoticons))
nb_pos_emoticons = len(pos_emoticons_found[pos_emoticons_found.map(lambda emoticons : len(emoticons) > 0)])
nb_neg_emoticons = len(neg_emoticons_found[neg_emoticons_found.map(lambda emoticons : len(emoticons) > 0)])
print "Number of positive emoticons: " + str(nb_pos_emoticons) + " Number of negative emoticons: " + str(nb_neg_emoticons)
--------------------------------------------------------------------
data.SentimentText = find_with_pattern(make_emoticon_pattern(positive_emoticons), True, '||pos||')
data.SentimentText = find_with_pattern(make_emoticon_pattern(negative_emoticons), True, '||neg||')
data.head(10)
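For intuition, here is a minimal standalone check of the generated pattern (the two emoticons and the sample text are made up for illustration):
sample_emoticons = pd.DataFrame({'Smiley': [':)', ':(']})
sample_pattern = make_emoticon_pattern(sample_emoticons)
print re.findall(sample_pattern, " good morning :) everyone ")
# [':)']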
- Replace all URLs with a ||url|| tag:
pattern_url = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
url_found = find_with_pattern(pattern_url)
data.SentimentText = find_with_pattern(pattern_url, True, '||url||')
data[50:60]
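As a quick sanity check (the example string is made up), the pattern catches bare URLs surrounded by whitespace:
print re.sub(pattern_url, '||url||', " check out http://example.com for details ")
#  check out ||url|| for details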
- Remove Unicode characters:
def remove_unicode(string):
    try:
        string = string.decode('unicode_escape').encode('ascii', 'ignore')
    except UnicodeDecodeError:
        pass
    return string

data.SentimentText = data.SentimentText.apply(lambda tweet: remove_unicode(tweet))
data[1578592:1578602]
- Decode HTML entities:
data.SentimentText[599982]
import HTMLParser
html_parser = HTMLParser.HTMLParser()
data.SentimentText = data.SentimentText.apply(lambda tweet: html_parser.unescape(tweet))
data.SentimentText[599982]
- Reduce all letters to lowercase:
data.SentimentText = data.SentimentText.str.lower()
data.head(10)
- Replace all usernames/targets (@) with ||target||:
pattern_usernames = "@\w{1,}"
usernames_found = find_with_pattern(pattern_usernames)
data.SentimentText = find_with_pattern(pattern_usernames, True, '||target||')
data[45:55]
- Replace all acronyms with their translation:
https://gist.github.com/BetterProgramming/fdcccacf21fa02a8a4d697da24a8cd54.js
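The gist above holds the original acronym-replacement code, which is not reproduced in this post. As a rough sketch of what it needs to do (the column names of the acronyms CSV and the helper below are assumptions), it builds a lookup table, substitutes acronyms token by token, and counts the most frequent ones, producing the acronym_dictionary and top20acronyms objects used next:
from collections import Counter

# Assumed column names for the acronyms CSV.
acronym_dictionary = dict(zip(acronyms.Acronym, acronyms.Translation))
acronym_counter = Counter()

def replace_acronyms(tweet):
    # Split the tweet into tokens and expand any token found in the dictionary.
    words = []
    for word in tweet.split():
        if word in acronym_dictionary:
            acronym_counter[word] += 1
            words.extend(acronym_dictionary[word].split())
        else:
            words.append(word)
    return words

data.SentimentText = data.SentimentText.apply(lambda tweet: replace_acronyms(tweet))
top20acronyms = acronym_counter.most_common(20)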
for i, (acronym, value) in enumerate(top20acronyms):
    print str(i + 1) + ") " + acronym + " => " + acronym_dictionary[acronym] + " : " + str(value)
plt.close()
top20acronym_keys = [x[0] for x in top20acronyms]
top20acronym_values = [x[1] for x in top20acronyms]
indexes = np.arange(len(top20acronym_keys))
width = 0.7
plt.bar(indexes, top20acronym_values, width)
plt.xticks(indexes + width * 0.5, top20acronym_keys, rotation="vertical")
- Replace all negations (e.g. not, no, never) with a ||not|| tag:
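The negation dictionary (negation_words, used just below) is loaded the same way as the other lexicons. The original post does not show this step, so the file name here is an assumption:
# Assumed file name: a CSV mapping negation words (e.g. "not") to the '||not||' tag.
negation_words = pd.read_csv('data/negations.csv')
negation_words.columns = ['Negation', 'Tag']
negation_words.head(5)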
negation_dictionary = dict(zip(negation_words.Negation, negation_words.Tag))

def replace_negation(tweet):
    return [negation_dictionary[word] if negation_dictionary.has_key(word) else word for word in tweet]

data.SentimentText = data.SentimentText.apply(lambda tweet: replace_negation(tweet))
print data.SentimentText[29]
- Replace a sequence of repeated characters with two characters (e.g. “helloooo” becomes “helloo”) to keep the emphasized usage of the word:
data[1578604:]
pattern = re.compile(r'(.)\1*')

def reduce_sequence_word(word):
    return ''.join([match.group()[:2] if len(match.group()) > 2 else match.group() for match in pattern.finditer(word)])

def reduce_sequence_tweet(tweet):
    return [reduce_sequence_word(word) for word in tweet]

data.SentimentText = data.SentimentText.apply(lambda tweet: reduce_sequence_tweet(tweet))
data[1578604:]
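A quick check on the example from above:
# The run of "o"s is capped at two characters, matching the example.
print reduce_sequence_word("helloooo")
# helloo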
We’ve finished with the most important and tricky part of our Twitter sentiment analysis project; we can now apply our machine learning algorithms to the processed datasets.
Step 3: Applying Machine Learning Algorithms
What is machine learning?
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
There are three major methods used to classify a sentence into a given category, in our case positive (1) or negative (0): SVM, Naive Bayes, and N-Gram.
We have used only Naive Bayes and N-Gram, which are the most commonly used methods for determining the sentiment of tweets.
Let us start with Naive Bayes.
Naive Bayes
There are different types of Naive Bayes classifiers but we’ll be using the Multinomial Naive Bayes.
Baseline
We use Multinomial Naive Bayes with Laplace smoothing as the learning algorithm, which represents the classic way of doing text classification. Since we need to extract features from our data set of tweets, we use the bag-of-words model to represent it.
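In scikit-learn, Laplace smoothing is simply MultinomialNB’s additive-smoothing parameter: with alpha=1.0 (the default), each word’s class-conditional probability is estimated as (count + 1) / (total word count in the class + vocabulary size).
from sklearn.naive_bayes import MultinomialNB

# alpha=1.0 is additive (Laplace) smoothing and is also the default,
# so MultinomialNB() and MultinomialNB(alpha=1.0) are equivalent.
classifier = MultinomialNB(alpha=1.0)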
The bag-of-words model is a simplifying representation of a document where the document is represented as a bag of its words, without taking the grammar or word order into consideration. In text classification, the frequency of each word is used as a feature for training a classifier.
For simplicity, we use the scikit-learn library.
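To make the bag-of-words representation concrete, here is a minimal standalone sketch on two made-up sentences (get_feature_names is the method name in the older scikit-learn used here; recent releases rename it to get_feature_names_out):
from sklearn.feature_extraction.text import CountVectorizer

toy_vectorizer = CountVectorizer()
toy_features = toy_vectorizer.fit_transform(["i like chocolate", "i do not like rain"])
print toy_vectorizer.get_feature_names()
# ['chocolate', 'do', 'like', 'not', 'rain']  (single-character tokens like "i" are dropped)
print toy_features.toarray()
# [[1 0 1 0 0]
#  [0 1 1 1 1]]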
Let’s first start by dividing our data set into a training set and a test set:
def make_training_test_sets(data):
    data_shuffled = data.iloc[np.random.permutation(len(data))]
    data_shuffled = data_shuffled.reset_index(drop=True)
    data_shuffled.SentimentText = data_shuffled.SentimentText.apply(lambda tweet: " ".join(tweet))
    positive_tweets = data_shuffled[data_shuffled.Sentiment == 1]
    negative_tweets = data_shuffled[data_shuffled.Sentiment == 0]
    positive_tweets_cutoff = int(len(positive_tweets) * (3./4.))
    negative_tweets_cutoff = int(len(negative_tweets) * (3./4.))
    training_tweets = pd.concat([positive_tweets[:positive_tweets_cutoff], negative_tweets[:negative_tweets_cutoff]])
    test_tweets = pd.concat([positive_tweets[positive_tweets_cutoff:], negative_tweets[negative_tweets_cutoff:]])
    training_tweets = training_tweets.iloc[np.random.permutation(len(training_tweets))].reset_index(drop=True)
    test_tweets = test_tweets.iloc[np.random.permutation(len(test_tweets))].reset_index(drop=True)
    return training_tweets, test_tweets

training_tweets, test_tweets = make_training_test_sets(data)

print "size of training set: " + str(len(training_tweets))
print "size of test set: " + str(len(test_tweets))
- Size of training set: 1183958
- Size of test set: 394654
Once the training set and the test set are created we need a third set of data called the validation set. This is really useful because it will be used to validate our model against unseen data and tune the possible parameters of the learning algorithm to avoid underfitting and overfitting, for example.
We need this validation set because our test set should be used only to verify how well the model generalizes. If we use the test set rather than the validation set, our model could be overly optimistic and skew our results.
To make the validation set, there are two main options:
- Split the training set into two parts with a ratio of 8:2 (80% for training, 20% for validation), where each part contains an equal distribution of example types. We train the classifier with the larger part and make predictions with the smaller one to validate the model. This technique works well but has the disadvantage that our classifier does not get trained and validated on all examples in the data set (not counting the test set).
- K-fold cross-validation. We split the data set into k parts, hold out one, combine the others and train on them, then validate against the held-out portion. We repeat that process k times (each fold), holding out a different portion each time. Then we average the scores measured on each fold to get a more accurate estimation of our model’s performance.
We split the training data into ten folds and cross-validate them using scikit-learn:
from sklearn.cross_validation import KFold
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def classify(training_tweets, test_tweets, ngram=(1, 1)):
    scores = []
    k_fold = KFold(n=len(training_tweets), n_folds=10)
    count_vectorizer = CountVectorizer(ngram_range=ngram)
    confusion = np.array([[0, 0], [0, 0]])
    for training_indices, validation_indices in k_fold:
        training_features = count_vectorizer.fit_transform(training_tweets.iloc[training_indices]['SentimentText'].values)
        training_labels = training_tweets.iloc[training_indices]['Sentiment'].values
        validation_features = count_vectorizer.transform(training_tweets.iloc[validation_indices]['SentimentText'].values)
        validation_labels = training_tweets.iloc[validation_indices]['Sentiment'].values
        classifier = MultinomialNB()
        classifier.fit(training_features, training_labels)
        validation_predictions = classifier.predict(validation_features)
        confusion += confusion_matrix(validation_labels, validation_predictions)
        score = f1_score(validation_labels, validation_predictions)
        scores.append(score)
    return (sum(scores) / len(scores)), confusion

score, confusion = classify(training_tweets, test_tweets)

print 'Total tweets classified: ' + str(len(training_tweets))
print 'Score: ' + str(score)
print 'Confusion matrix:'
print(confusion)
Total tweets classified: 1183958
Score: 0.77653600187
Confusion matrix:
[[465021 126305]
 [136321 456311]]
We get an F1 score of about 0.77 with our baseline.
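As a sanity check, the aggregated confusion matrix is consistent with that score: computing accuracy and F1 directly from its four cells gives roughly the same value (small differences are expected because the reported score is the per-fold average):
tn, fp, fn, tp = 465021, 126305, 136321, 456311

accuracy = float(tp + tn) / (tp + tn + fp + fn)
precision = float(tp) / (tp + fp)
recall = float(tp) / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print accuracy   # ~0.778
print f1         # ~0.7765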
N-Gram (Language Models)
Note: n-gram classifiers are in fact a generalization of Naive Bayes. A unigram classifier with Laplace smoothing corresponds exactly to the traditional Naive Bayes classifier.
Since we use the bag-of-words model, meaning we translate the sentence “I don’t like chocolate” into “I”, “don’t”, “like”, “chocolate”, we could try a bigram model to take care of negation with “don’t like” in this example. We still use Laplace smoothing, but we pass the ngram_range parameter to CountVectorizer to add the bigram features.
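Here is a tiny standalone sketch of what ngram_range=(1, 2) adds for that example sentence (note that CountVectorizer’s default tokenizer drops punctuation and single-character tokens, and that in our actual pipeline negations have already been replaced by the ||not|| tag during preprocessing):
from sklearn.feature_extraction.text import CountVectorizer

bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizer.fit(["i don't like chocolate"])
print bigram_vectorizer.get_feature_names()
# ['chocolate', 'don', 'don like', 'like', 'like chocolate']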
score, confusion = classify(training_tweets, test_tweets, (2, 2))

print 'Total tweets classified: ' + str(len(training_tweets))
print 'Score: ' + str(score)
print 'Confusion matrix:'
print(confusion)
Using only bigram features, we slightly improve our score, by about 0.01. Based on that, we might expect that combining unigram and bigram features would increase the score even more.
score, confusion = classify(training_tweets, test_tweets, (1, 2))

print 'Total tweets classified: ' + str(len(training_tweets))
print 'Score: ' + str(score)
print 'Confusion matrix:'
print(confusion)
Indeed, the score improves by about 0.02 compared to the baseline.
Conclusion
In this project, we tried to show a basic way of classifying tweets into positive or negative categories using Naive Bayes as a baseline. We also tried to show how language models are related to Naive Bayes and can produce better results.
This was our group’s final year project. We faced a lot of challenges digging into the details and selecting the right algorithm for the task. I hope you guys don’t have to go through the same process!
Since you have come this far, I am sharing the code link with you (do give the repository a star if you find it helpful). This is an open initiative to help those in need.
Thanks for reading this article. I hope it’s helpful to you all!
Translated from: https://medium.com/better-programming/twitter-sentiment-analysis-using-naive-bayes-and-n-gram-5df42ae4bfc6