Exploring the Link Between COVID-19 and Depression Using Neural Networks

The drastic changes in our lifestyles, coupled with the restrictions, quarantines, and social distancing measures introduced to combat the coronavirus outbreak, have led to an alarming rise in mental health issues all over the world. Social media is a powerful indicator of the mental state of people at a given location and time. In order to study the link between the coronavirus pandemic and the accelerating pace of depression and anxiety in the general population, I decided to explore tweets related to the coronavirus.

How is this blog organized?

In this blog post, I will first use Keras to train a neural network to recognize depressive tweets. For this, I will use a data set of 10,314 tweets divided into depressive tweets (labelled 1) and non-depressive tweets (labelled 0). This data set was created by Viridiana Romero Martinez. Here is the link to her GitHub profile: https://github.com/viritaromero

Once I have the network trained, I will use it to test tweets scraped from Twitter. To establish the link between COVID-19 and depression, I will obtain two different data sets. The first data set will consist of tweets with coronavirus-related keywords such as ‘COVID-19’, ‘quarantine’, ‘pandemic’, and ‘virus’. The second data set will consist of random tweets found using neutral keywords such as ‘and’, ‘I’, ‘the’, etc., and will serve as a control showing the percentage of depressive tweets in a random sample. This will allow us to measure the difference in the percentage of depressive tweets between a random sample and a sample of COVID-19-specific tweets.

Preprocessing the data

Image source: https://xaltius.tech/why-is-data-cleaning-important/

Before we can get started with training the neural networks, we need to collect and clean the data.


Importing the libraries

To get started with the project, we will first need to import all the necessary libraries and modules.


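The embedded gist does not survive in this copy. As a sketch, a plausible minimal set of imports for the preprocessing steps below might look like this (Keras itself is imported later, together with the models):

```python
import io          # inline CSV examples
import random      # shuffling the data
import string      # stripping punctuation

import numpy as np   # numerical arrays and the embeddings matrix
import pandas as pd  # reading the tweet CSV into a DataFrame
```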

Once we have all the libraries in place, we need to get the data and pre-process it. You can download the data set from this link: https://github.com/viritaromero/Detecting-Depression-in-Tweets/blob/master/sentiment_tweets3.csv


Quick examination of the data

We can quickly check the structure of the data set by reading it into a pandas data frame.



Now we will store the text of the tweets in an array called text. The corresponding labels of the tweets will be stored in a separate array called labels. The code is as follows:

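The gist with this step is also missing from this copy. Here is a sketch of the idea, using an inline stand-in for sentiment_tweets3.csv; the real file's column names may differ, so check them with df.head() first:

```python
import io
import pandas as pd

# Inline stand-in for sentiment_tweets3.csv; the real file is read with
# pd.read_csv("sentiment_tweets3.csv") and may use different column names.
csv_data = io.StringIO(
    "message,label\n"
    "feeling so hopeless and alone today,1\n"
    "great run in the park this morning,0\n"
    "depression is exhausting,1\n"
)

df = pd.read_csv(csv_data)
print(df.shape)   # (3, 2)
print(df.head())

# Store the tweet text and the corresponding labels in separate arrays
text = df["message"].values
labels = df["label"].values
```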

Apologies for printing out a rather large data set, but I did so in order to quickly examine its overall structure. The first thing I notice is that in the labels array there are many more zeroes than ones: the data set contains roughly 3.5 times more non-depressive tweets than depressive ones. Ideally, I would train my neural network on a data set with an equal number of depressive and non-depressive tweets. However, obtaining equal numbers would mean substantially truncating my data. I think a larger, imbalanced data set is better than a very small, balanced one, so I am going to go ahead and use the data set in its original state.

Cleaning the data

The second thing you’ll notice is that the tweets contain a lot of so-called ‘stopwords’ such as ‘a’, ‘the’, and ‘and’. These words are not important for classifying a tweet as depressive or non-depressive, so we will remove them. We also need to remove punctuation, as it is likewise unnecessary and would only degrade the performance of our neural network.

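A minimal sketch of the cleaning step described above. The stopword list here is a tiny hand-rolled stand-in; the original likely used a full list such as NLTK's stopwords corpus:

```python
import string

# Tiny illustrative stopword list; a real run would use a full corpus,
# e.g. nltk.corpus.stopwords.words("english").
STOPWORDS = {"a", "an", "the", "and", "or", "i", "am", "is", "to", "of", "in", "so", "this"}

def clean_tweet(tweet):
    # Lowercase, strip punctuation, then drop stopwords
    tweet = tweet.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in tweet.split() if w not in STOPWORDS)

print(clean_tweet("I am SO tired of this... and the rain!"))  # tired rain
```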

I decided to do a quick visualization of the cleaned data using the WordCloud library, and the result is shown below. Quite unsurprisingly, the most common word in depressive tweets is ‘depression’.

Visualization of tweets using WordCloud

Tokenization of the data

What on earth is tokenization?

Basically, neural networks do not understand raw text the way we humans do. Therefore, to make the text more palatable to our neural network, we convert it into a series of ones and zeroes.

Image source: inboundhow.com

To tokenize text in Keras, we import the Tokenizer class. This class builds a dictionary lookup over a set number of unique words in our overall text. Using that lookup, Keras then lets us create vectors in which each word is replaced by its index value in the dictionary.

We also go ahead and pad the shorter tweets and truncate the longer ones, so that each vector has a fixed length of 100.

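Conceptually, what Keras's Tokenizer and pad_sequences do can be sketched in plain Python; the real code would call fit_on_texts, texts_to_sequences, and pad_sequences(maxlen=100):

```python
# Build a word -> index lookup over the corpus (index 0 is reserved for padding)
corpus = ["feeling sad and alone", "great day at the park", "sad sad day"]

vocab = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1  # indices start at 1

def to_sequence(sentence, maxlen=6):
    # Replace each known word with its dictionary index
    seq = [vocab[w] for w in sentence.split() if w in vocab]
    # Truncate long sequences, left-pad short ones with zeros
    seq = seq[:maxlen]
    return [0] * (maxlen - len(seq)) + seq

print(to_sequence("sad day at the park"))  # [0, 2, 6, 7, 8, 9]
```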

You might be wondering: ‘huh, we only converted words to numbers, not ones and zeroes!’ You are right. There are two ways we can take care of that: we can either convert the numbers into one-hot-encoded vectors or create an embeddings matrix. One-hot-encoded vectors are usually very high-dimensional and sparse, whereas embedding matrices are lower-dimensional and dense. If you are interested, you can read more about this in the ‘Deep Learning with Python’ book by Francois Chollet. In this blog, I will be using embedding matrices, but before we initialize them, we will need to take care of a few other things first.
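To make the sparse-versus-dense contrast concrete, here is a tiny sketch comparing the two representations for a single word (the sizes are illustrative, not taken from the post):

```python
import numpy as np

vocab_size = 10000
word_index = 42

# One-hot: a 10,000-dimensional vector with a single 1 in it
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Embedding: the same word as one dense, low-dimensional row of a matrix
embedding_dim = 100
embedding_matrix = np.random.rand(vocab_size, embedding_dim)
dense_vector = embedding_matrix[word_index]

print(one_hot.size, dense_vector.size)  # 10000 100
```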

Shuffling the data

Photo by Sergi Viladesau on Unsplash

Another issue with the data, which you might have spotted earlier, is that the text array contains all the non-depressive tweets first, followed by all the depressive ones. We therefore need to shuffle the data so that random samples of tweets go into the training, validation, and test sets.

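Shuffling text and labels together can be done by permuting a shared list of indices, so each tweet stays paired with its label. A stdlib sketch under a fixed seed (the original may well have used NumPy instead):

```python
import random

text = ["t0", "t1", "t2", "t3"]   # stand-in tweets
labels = [0, 0, 1, 1]             # non-depressive first, then depressive

# Shuffle one list of indices so text[i] stays paired with labels[i]
indices = list(range(len(text)))
random.seed(42)
random.shuffle(indices)

text = [text[i] for i in indices]
labels = [labels[i] for i in indices]
print(text, labels)
```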

Splitting the data

Now we need to split the data into the training, validation, and test sets.


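The exact split proportions are not shown in this copy; a common 80/10/10 split over the 10,314 shuffled samples would look like:

```python
# Assume the 10,314 samples have already been shuffled; an 80/10/10
# train/validation/test split is a common (assumed) choice.
n = 10314
train_end = int(0.8 * n)
val_end = int(0.9 * n)

data = list(range(n))  # stand-in for the shuffled data arrays
train = data[:train_end]
val = data[train_end:val_end]
test = data[val_end:]
print(len(train), len(val), len(test))  # 8251 1031 1032
```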

Phew! Finally done with all the data munging!

Making a neural network

Image source: extremetech.com

Now we can start making the model architecture.


I will be trying two different models: one with a pre-trained word embeddings layer and one with a trainable word embeddings layer.


In order to define the neural network architecture, you need to understand how word embeddings work. There is a wealth of information about word embeddings online; this blog post is one of my favorites:

https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa

Now that you hopefully have an idea of the function of the embeddings layer, I will go ahead and create it in code.


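Building the pre-trained embeddings layer boils down to filling a matrix with one GloVe vector per word in the tokenizer's index. A sketch with a tiny fake embedding dictionary; the real code would parse vectors out of a downloaded GloVe file (e.g. glove.6B.100d.txt):

```python
import numpy as np

embedding_dim = 4  # GloVe vectors are typically 50-300 dimensional
word_index = {"sad": 1, "happy": 2, "virus": 3}  # from the tokenizer

# Stand-in for vectors parsed out of a GloVe file
glove = {
    "sad": np.array([0.1, 0.2, 0.3, 0.4]),
    "happy": np.array([0.5, 0.6, 0.7, 0.8]),
}

# Row i holds the vector for the word with index i; words without a
# pre-trained vector (and the padding index 0) stay all-zero.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    if word in glove:
        embedding_matrix[i] = glove[word]

print(embedding_matrix.shape)  # (4, 4)
```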

First model

For the first model, the architecture consists of a pre-trained word embeddings layer followed by two dense layers. The code for training the model is as follows:


Figure: Accuracy and loss for training and validation sets in model 1

Here we can see that the model performs very well, with an accuracy of 98% on the test set. Overfitting does not appear to be an issue, because the validation accuracy and loss are almost the same as the training accuracy and loss.

The second model

For the second model, I decided to exclude the pre-trained embeddings layer. The code is as follows.



Figure: Accuracy and loss for training and validation sets in model 2

The accuracies of both models on the test set are equally good. However, since the second model is less complex, I will use it to predict whether a tweet is depressive or not.

Obtaining data from Twitter for COVID-19 related tweets

In order to obtain my data sets of tweets, I used Twint, which is an amazing web-scraping tool for Twitter. I prepared two different data sets of 1,000 tweets each. The first one consisted of tweets containing coronavirus-related keywords such as ‘COVID-19’, ‘quarantine’, and ‘pandemic’.

To get a control sample to compare against, I searched for tweets containing neutral keywords such as ‘the’, ‘a’, and ‘and’. Using 1,000 tweets from this search, I assembled the second, control data set.

WordCloud of COVID-related tweets

I cleaned these data sets using a procedure similar to the one I used for the training set. After cleaning the data, I fed it to my neural network to predict the percentage of depressive tweets. The results I obtained were surprising.

One run of the code is shown below. I repeated it with different batches of data obtained using the same procedure as described above and calculated the average results.


On average, my model predicted 35% depressive and 65% non-depressive tweets in the data set obtained using neutral keywords. A 35% rate of depressive tweets in a randomly obtained sample is alarmingly high. However, the share of depressive tweets among those with COVID-related keywords was even higher: 55% depressive versus 45% non-depressive. That is a 57% increase in depressive tweets!
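The 57% figure is the relative increase of the COVID-sample rate over the control rate:

```python
control_rate = 0.35  # depressive share among neutral-keyword tweets
covid_rate = 0.55    # depressive share among COVID-keyword tweets

# Relative (not absolute) increase: (55 - 35) / 35 ≈ 57%
relative_increase = (covid_rate - control_rate) / control_rate
print(f"{relative_increase:.0%}")  # 57%
```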

This leads to the conclusion that there is indeed a correlation between COVID-19 and depressive sentiments in tweets on Twitter.


Conclusion

I hope this post helped you learn a bit more about sentiment analysis using machine learning and I hope you will try out a similar project yourself as well.


Happy coding!



Translated from: https://towardsdatascience.com/exploring-the-link-between-covid-19-and-depression-using-neural-networks-469030112d3d

