Twitter Data Cleaning and Preprocessing for Data Science

In the past decade, new forms of communication, such as microblogging and text messaging, have emerged and become ubiquitous. While there is no limit to the range of information conveyed by tweets and texts, these short messages are often used to share opinions and sentiments that people have about what is going on in the world around them.

Opinion mining (also known as sentiment analysis or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses, online and social media, and healthcare materials, for applications that range from marketing to customer service to clinical medicine.

Both lexicon-based and machine-learning-based approaches will be used for emoticon-based sentiment analysis. We start with machine-learning-based clustering. In the machine-learning approach we use supervised and unsupervised learning methods. The collected Twitter data is given as input to the system. The system classifies each tweet as positive, negative or neutral, and also outputs the number of positive, negative and neutral tweets for each emoticon separately. In addition, the overall polarity of each tweet is determined.

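As a minimal sketch of the lexicon-based side, TextBlob (which also appears in the pre-processing code later in this post) exposes a polarity score that can be thresholded into the three classes; the zero thresholds here are an illustrative assumption, not values from the original study.

# Minimal lexicon-based sketch using TextBlob; the thresholds are illustrative assumptions
from textblob import TextBlob

def classify_tweet(text):
    polarity = TextBlob(text).sentiment.polarity  # float in [-1.0, 1.0]
    if polarity > 0:
        return "Positive"
    elif polarity < 0:
        return "Negative"
    return "Neutral"

print(classify_tweet("I love this phone, the camera is great!"))  # Positive
print(classify_tweet("Worst customer service I have ever had."))  # Negative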

Collection of Data

To collect the Twitter data, we have to carry out a data-mining process. In that process, we created our own application with the help of the Twitter API, and used it to collect a large dataset. To do this, we have to create a developer account and register our app. We then receive a consumer key and a consumer secret, which are used in the application settings; from the configuration page of the app we also need an access token and an access token secret, which give the application access to Twitter on behalf of the account. The process is divided into two sub-processes, discussed in the next subsection.

Accessing Twitter Data and Streaming

To build the application and interact with Twitter services, we use the REST API provided by Twitter, through a set of Python-based clients. The API object is our entry point for most of the operations we can perform with Twitter, and it provides features to access different types of data. In this way, we can easily collect tweets (and more) and store them in the system. By default the data is in JSON format; we change it to txt format for easier access.

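A small sketch of this step, assuming the Tweepy search endpoint used later in this post: each returned status carries its raw JSON payload, which can be kept as JSON lines while the tweet text is also written to a plain txt file (the helper and file names are hypothetical).

# Sketch: store collected tweets as JSON lines and as plain text.
# `api` is an authenticated tweepy.API instance (see the script below); file names are hypothetical.
import json

def save_tweets(api, query, count=100):
    statuses = api.search(q=query + " -rt", lang="en", count=count)
    with open("tweets.json", "w") as jf, open("tweets.txt", "w") as tf:
        for status in statuses:
            jf.write(json.dumps(status._json) + "\n")        # raw JSON, one tweet per line
            tf.write(status.text.replace("\n", " ") + "\n")  # plain text for later cleaning
    return len(statuses)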

In case we want to “keep the connection open” and gather all the upcoming tweets about a particular event, the streaming API is what we need. By extending and customising the stream-listener process, we process the incoming data. In this way we gather a lot of tweets, which is especially useful for live events with worldwide coverage.

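For the streaming case, a minimal sketch of a customised stream listener with Tweepy (the listener class, track keywords and output file are assumptions for illustration; the post does not show its own listener):

# Sketch: keep the connection open and gather incoming tweets with Tweepy's streaming API.
import json
import tweepy

class TweetListener(tweepy.StreamListener):
    def on_status(self, status):
        with open("stream_tweets.json", "a") as f:
            f.write(json.dumps(status._json) + "\n")  # append each incoming tweet as a JSON line

    def on_error(self, status_code):
        if status_code == 420:                        # rate limited: returning False disconnects
            return False

# `auth` is the tweepy.OAuthHandler built from the credentials in the script below
# stream = tweepy.Stream(auth=auth, listener=TweetListener())
# stream.filter(track=["#worldcup"], languages=["en"])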

# Twitter Sentiment Analysis
import sys
import csv
import tweepy
import matplotlib.pyplot as plt
from collections import Counter
from aylienapiclient import textapi  # missing in the original snippet; needed for textapi.Client below

if sys.version_info[0] < 3:
    input = raw_input

## Twitter credentials
consumer_key = "------------"
consumer_secret = "------------"
access_token = "----------"
access_token_secret = "-----------"

## AYLIEN credentials (left undefined in the original snippet; fill in your own)
application_id = "------------"
application_key = "------------"

## set up an instance of Tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

## set up an instance of the AYLIEN Text API
client = textapi.Client(application_id, application_key)

## search Twitter for something that interests you
query = input("What subject do you want to analyze for this example? \n")
number = input("How many Tweets do you want to analyze? \n")

results = api.search(
    lang="en",
    q=query + " -rt",
    count=number,
    result_type="recent"
)
print("--- Gathered Tweets \n")

## open a csv file to store the Tweets and their sentiment
file_name = 'Sentiment_Analysis_of_{}_Tweets_About_{}.csv'.format(number, query)
with open(file_name, 'w', newline='') as csvfile:
    csv_writer = csv.DictWriter(f=csvfile, fieldnames=["Tweet", "Sentiment"])
    csv_writer.writeheader()
    print("--- Opened a CSV file to store the results of your sentiment analysis... \n")

    ## tidy up the Tweets and send each to the AYLIEN Text API
    for c, result in enumerate(results, start=1):
        tweet = result.text
        tidy_tweet = tweet.strip().encode('ascii', 'ignore')
        if len(tweet) == 0:
            print('Empty Tweet')
            continue
        response = client.Sentiment({'text': tidy_tweet})
        csv_writer.writerow({'Tweet': response['text'], 'Sentiment': response['polarity']})
        print("Analyzed Tweet {}".format(c))

Data Pre-Processing and Cleaning

The data pre-processing step performs the necessary pre-processing and cleaning on the collected dataset. The previously collected dataset has some key attributes: text (the text of the tweet itself), created_at (the date of creation), favorite_count and retweet_count (the numbers of favourites and retweets), and favorited and retweeted (booleans stating whether the authenticated user has favourited or retweeted this tweet), among others. We have applied an extensive set of pre-processing steps to decrease the size of the feature set and make it suitable for learning algorithms. The cleaning method is based on dictionary methods.

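A small sketch of keeping only these key attributes from each collected tweet (the helper name is hypothetical; the field names follow the Twitter JSON listed above):

# Sketch: reduce each collected tweet to the key attributes mentioned above.
# `statuses` is an iterable of tweepy Status objects or their raw JSON dicts.
def extract_fields(statuses):
    rows = []
    for status in statuses:
        tweet = status._json if hasattr(status, "_json") else status
        rows.append({
            "text": tweet.get("text", ""),
            "created_at": tweet.get("created_at"),
            "favorite_count": tweet.get("favorite_count", 0),
            "retweet_count": tweet.get("retweet_count", 0),
            "favorited": tweet.get("favorited", False),
            "retweeted": tweet.get("retweeted", False),
        })
    return rows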

Data obtained from Twitter usually contains a lot of HTML entities such as &lt; &gt; &amp; that get embedded in the original data, so it is necessary to get rid of these entities. One approach is to remove them directly with specific regular expressions. Here, we are using the HTML parser module of Python, which can convert these entities back to standard characters: for example, &lt; is converted to “<” and &amp; is converted to “&”. After this, we remove the remaining special HTML characters and links. Decoding data is the process of transforming information from complex symbols into simple, easier-to-understand characters. The collected data uses different encodings such as Latin-1 and UTF-8.

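A sketch of this step, assuming Python 3's built-in html module (on Python 2 the HTMLParser class plays the same role); the regular expression used for links is an illustrative choice:

# Sketch: decode bytes, unescape HTML entities and strip links from a raw tweet.
import html
import re

def clean_entities(raw_tweet):
    if isinstance(raw_tweet, bytes):
        try:
            raw_tweet = raw_tweet.decode("utf-8")      # most tweets are UTF-8 encoded
        except UnicodeDecodeError:
            raw_tweet = raw_tweet.decode("latin-1")    # fall back to Latin-1
    text = html.unescape(raw_tweet)                    # "&lt;" -> "<", "&amp;" -> "&"
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # remove links
    return text.strip()

print(clean_entities("Check this &lt;3 https://t.co/abc &amp; enjoy"))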

In the Twitter datasets there is also other information such as retweets, hashtags, usernames and modified tweets. All of this is ignored and removed from the dataset, as shown in the sketch below.

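A minimal sketch of that removal with regular expressions (the exact patterns are an assumption; the original post does not show them):

# Sketch: strip retweet markers, @usernames and #hashtags from a tweet.
import re

def strip_twitter_markup(text):
    text = re.sub(r"^RT\s+", "", text)        # leading retweet marker
    text = re.sub(r"@\w+:?", "", text)        # @username mentions
    text = re.sub(r"#\w+", "", text)          # hashtags
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

print(strip_twitter_markup("RT @user: loving the #WorldCup tonight!"))  # "loving the tonight!"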

from nltk import word_tokenize
from nltk.corpus import wordnet
from nltk.corpus import words
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag, pos_tag_sents
# import for bag of words
import numpy as np
# for the regular expressions
import re
# TextBlob dependency
from textblob import TextBlob
from textblob import Word
# set to string
from ast import literal_eval
# from src dependency
from sentencecounter import no_sentences, getline, gettempwords
import os

def getsysets(word):
    syns = wordnet.synsets(word)   # wordnet from nltk.corpus will not work with textblob
    # print(syns[0].name())
    # print(syns[0].lemmas()[0].name())   # get synset names
    # print(syns[0].definition())         # definition
    # print(syns[0].examples())           # example

# getsysets("good")

def getsynonyms(word):
    synonyms = []
    # antonyms = []
    for syn in wordnet.synsets(word):
        for l in syn.lemmas():
            synonyms.append(l.name())
            # if l.antonyms():
            #     antonyms.append(l.antonyms()[0].name())
    # print(set(synonyms))
    # print(set(antonyms))
    return set(synonyms)

# getsynonyms_and_antonyms("good")

def extract_words(sentence):
    ignore_words = ['a']
    words = re.sub("[^\w]", " ", sentence).split()   # nltk.word_tokenize(sentence)
    words_cleaned = [w.lower() for w in words if w not in ignore_words]
    return words_cleaned

def tokenize_sentences(sentences):
    words = []
    for sentence in sentences:
        w = extract_words(sentence)
        words.extend(w)
    words = sorted(list(set(words)))
    return words

def bagofwords(sentence, words):
    sentence_words = extract_words(sentence)
    # frequency word count
    bag = np.zeros(len(words))
    for sw in sentence_words:
        for i, word in enumerate(words):
            if word == sw:
                bag[i] += 1
    return np.array(bag)

def tokenizer(sentences):
    token = word_tokenize(sentences)
    print("#" * 100)
    print(sent_tokenize(sentences))
    print(token)
    print("#" * 100)
    return token

# sentences = "Machine learning is great", "Natural Language Processing is a complex field", "Natural Language Processing is used in machine learning"
# vocabulary = tokenize_sentences(sentences)
# print(vocabulary)
# tokenizer(sentences)

def createposfile(filename, word):
    # filename = input("Enter destination file name in string format :")
    f = open(filename, 'w')
    f.writelines(word + '\n')

def createnegfile(filename, word):
    # filename = input("Enter destination file name in string format :")
    f = open(filename, 'w')
    f.writelines(word)

def getsortedsynonyms(word):
    sortedsynonyms = sorted(getsynonyms(word))
    return sortedsynonyms

def getlengthofarray(word):
    return getsortedsynonyms(word).__len__()

def readposfile():
    f = open('list of positive words.txt')
    return f

# def searchword(word, sourcename):
#     if word in open('list of negative words.txt').read():
#         createnegfile('destinationposfile.txt', word)
#     elif word in open('list of positive words.txt').read():
#         createposfile('destinationnegfile.txt', word)
#     else:
#         for i in range(0, getlengthofarray(word)):
#             searchword(getsortedsynonyms(word)[i], sourcename)

# recursively look a word (and its WordNet synonyms) up in the positive-word list
def searchword(word, srcfile):
    # if word in open('list of negative words.txt').read():
    #     createnegfile('destinationposfile.txt', word)
    if word in open('list of positive words.txt').read():
        createposfile('destinationnegfile.txt', word)
    else:
        for i in range(0, getlengthofarray(word)):
            searchword(sorted(getsynonyms(word))[i], srcfile)
        f = open(srcfile, 'w')
        f.writelines(word)

# quick manual checks
print('#' * 50)
# searchword('lol', 'a.txt')
print(readposfile())
# tokenizer(sentences)
# getsynonyms('good')
# print(sorted(getsynonyms('good'))[2])   # pick one item from the sorted synonym list (here the 3rd)
print('#' * 50)
# print(getsortedsynonyms('bad').__len__())
# createposfile('created.txt', 'lol')
# for word in word_tokenize(getline()):
#     searchword(word, 'a.txt')

Stop words are generally thought of as a “single set of words” that carry little meaning, and we do not want them taking up space in our database, so we remove them using NLTK and a “stop word dictionary”. Punctuation marks should be dealt with according to their priority: for example, “.”, “,” and “?” are important punctuation marks that should be retained, while others need to be removed. In the Twitter datasets there is also other information such as retweets, hashtags, usernames and modified tweets; all of this is ignored and removed from the dataset. We should also remove duplicates, which we have already done. Sometimes it is better to remove duplicate data based on a set of unique identifiers: for example, the chances of two transactions happening at the same time, with the same square footage, the same price, and the same build year are close to zero.

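A sketch of the stop-word step with NLTK's built-in English list, keeping the punctuation marks mentioned above (the kept set is an illustrative choice):

# Sketch: drop NLTK stop words while keeping selected punctuation.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))    # requires nltk.download("stopwords")
KEPT_PUNCTUATION = {".", ",", "?"}

def remove_stopwords(text):
    tokens = word_tokenize(text)                # requires nltk.download("punkt")
    kept = [t for t in tokens
            if t.lower() not in STOP_WORDS
            and (t.isalnum() or t in KEPT_PUNCTUATION)]
    return " ".join(kept)

print(remove_stopwords("This is not the best phone, but it works."))  # "best phone , works ."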

Thank you for reading.

I hope you found this data cleaning guide helpful. Please leave any comments to let us know your thoughts.

To read the previous part of the series:

https://medium.com/@sayanmondal2098/sentimental-analysis-of-twitter-emoji-64432793b76f

Translated from: https://medium.com/swlh/twitter-data-cleaning-and-preprocessing-for-data-science-3ca0ea80e5cd
