Exploring the Link Between COVID-19 and Depression Using Neural Networks

The drastic changes in our lifestyles, coupled with the restrictions, quarantines, and social distancing measures introduced to combat the coronavirus outbreak, have led to an alarming rise in mental health issues all over the world. Social media is a powerful indicator of the mental state of people at a given location and time. In order to study the link between the coronavirus pandemic and the rising rates of depression and anxiety in the general population, I decided to explore tweets related to the coronavirus.

How is this blog organized?

In this blog post, I will first use Keras to train a neural network to recognize depressive tweets. For this, I will use a data set of 10,314 tweets divided into depressive tweets (labelled 1) and non-depressive tweets (labelled 0). This data set was compiled by Viridiana Romero Martinez. Here is the link to her GitHub profile: https://github.com/viritaromero

Once the network is trained, I will use it to test tweets scraped from Twitter. To establish the link between COVID-19 and depression, I will obtain two different data sets. The first will be composed of tweets with coronavirus-related keywords such as ‘COVID-19’, ‘quarantine’, ‘pandemic’, and ‘virus’. The second will be composed of random tweets found with neutral keywords such as ‘and’, ‘I’, and ‘the’. The second data set will serve as a control, giving the percentage of depressive tweets in a random sample. This will allow us to measure the difference in the percentage of depressive tweets between a random sample and a sample of COVID-19-specific tweets.

Preprocessing the data

Image source: https://xaltius.tech/why-is-data-cleaning-important/

Before we can get started with training the neural networks, we need to collect and clean the data.

Importing the libraries

To get started with the project, we will first need to import all the necessary libraries and modules.

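The embedded gist is not reproduced here; a plausible set of imports for this workflow (assuming TensorFlow's bundled Keras, plus pandas and NumPy for data handling) might look like this:

```python
import string

import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
```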
Once we have all the libraries in place, we need to get the data and pre-process it. You can download the data set from this link: https://github.com/viritaromero/Detecting-Depression-in-Tweets/blob/master/sentiment_tweets3.csv

Quick examination of the data

We can quickly check the structure of the data set by reading it into a pandas data frame.

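The original gist is missing here, so below is a sketch of this step. Since the CSV is not bundled with this post, a tiny stand-in frame with the same two-column layout (the real column names may differ) is used for illustration:

```python
import pandas as pd

# In the real workflow, read the CSV downloaded from the link above:
# df = pd.read_csv("sentiment_tweets3.csv")
# A tiny stand-in frame illustrates the expected structure: one text column
# and one 0/1 label column.
df = pd.DataFrame({
    "message": ["had a wonderful walk in the park today",
                "i cant shake this depression"],
    "label": [0, 1],
})
print(df.head())
print(df.shape)
```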
Now we will store the text of the tweets in an array called text, and the corresponding labels in a separate array called labels. The code is as follows:

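A sketch of this step, again using the stand-in frame (the column names are assumptions):

```python
import numpy as np
import pandas as pd

# Stand-in for the loaded data frame; the real column names may differ.
df = pd.DataFrame({
    "message": ["had a wonderful walk in the park today",
                "i cant shake this depression"],
    "label": [0, 1],
})

text = df["message"].to_numpy()    # the tweet texts
labels = df["label"].to_numpy()    # 1 = depressive, 0 = non-depressive
print(text)
print(labels)
```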
Apologies for printing out a rather huge data set, but I did it so that we can quickly examine the overall structure. The first thing I notice is that the labels array contains many more zeroes than ones: the data set has roughly 3.5 times more non-depressive tweets than depressive ones. Ideally, I would like to train my neural network on a data set with equal numbers of depressive and non-depressive tweets. However, obtaining equal numbers would require substantially truncating the data. I think a larger, imbalanced data set is better than a very small, balanced one, so I am going to go ahead and use the data set in its original state.

Cleaning the data

The second thing you’ll notice is that the tweets contain a lot of so-called ‘stopwords’ such as ‘a’, ‘the’, and ‘and’. These words are not important for classifying a tweet as depressive or non-depressive, so we will remove them. We also need to remove punctuation, since it is likewise unnecessary and would only degrade the performance of our neural network.

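The cleaning gist is not shown here; a minimal sketch of the idea follows. It uses a small inline stopword set for illustration, whereas the real pipeline would use a full list such as NLTK's English stopwords:

```python
import string

# A handful of stopwords for illustration; the real pipeline would use a
# full list such as NLTK's English stopwords.
STOPWORDS = {"a", "an", "the", "and", "i", "is", "to", "of"}

def clean_tweet(tweet: str) -> str:
    tweet = tweet.lower()
    # Strip all punctuation characters.
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))
    # Drop stopwords.
    words = [w for w in tweet.split() if w not in STOPWORDS]
    return " ".join(words)

print(clean_tweet("I can't escape the depression..."))  # cant escape depression
```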
I decided to do a quick visualization of the cleaned data using the amazing WordCloud library; the result is shown below. Quite unsurprisingly, the most common word in depressive tweets is ‘depression’.

Figure: Visualization of tweets using WordCloud

Tokenization of the data

What on earth is tokenization?

Basically, neural networks do not understand raw text the way we humans do. Therefore, to make the text more palatable to our neural network, we convert it into a series of ones and zeroes.

Image source: inboundhow.com

To tokenize text in Keras, we import the Tokenizer class. This class builds a dictionary lookup for a set number of unique words in our overall text. Using this lookup, Keras lets us turn each tweet into a vector by replacing each word with its index value in the dictionary.

We also pad the shorter tweets and truncate the longer ones so that every vector has a length of exactly 100.

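The gists for this step did not survive; a sketch using the Keras Tokenizer and pad_sequences (the vocabulary size of 10,000 is an assumption, while the length of 100 comes from the text above):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_words = 10000   # size of the dictionary lookup (an assumed value)
maxlen = 100        # pad shorter tweets / truncate longer ones to this length

# Stand-in for the cleaned tweets from the previous step.
text = ["cant escape depression", "wonderful walk park today"]

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(text)                    # build the dictionary lookup
sequences = tokenizer.texts_to_sequences(text)  # words -> index values
data = pad_sequences(sequences, maxlen=maxlen)
print(data.shape)   # (2, 100)
```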
You might be wondering, ‘huh, we only converted words to numbers, not ones and zeroes!’ You are right. There are two ways we can take care of that: we can either convert the numbers into one-hot-encoded vectors or create an embedding matrix. One-hot-encoded vectors are usually very high-dimensional and sparse, whereas embedding matrices are lower-dimensional and dense. If you are interested, you can read more about this in the book ‘Deep Learning with Python’ by François Chollet. In this blog, I will be using embedding matrices, but before we initialize them, we will need to take care of a few other things first.

Shuffling the data

Photo by Sergi Viladesau on Unsplash

Another issue with the data that you might have identified earlier is that the text array contains all the non-depressive tweets first, followed by all the depressive ones. We therefore need to shuffle the data so that random samples of tweets go into the training, validation, and test sets.

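A minimal sketch of the shuffling step, using a NumPy permutation over stand-in arrays:

```python
import numpy as np

# Stand-in arrays; in the real pipeline these come from the tokenization step.
data = np.arange(10).reshape(5, 2)
labels = np.array([0, 0, 0, 1, 1])

indices = np.arange(data.shape[0])
np.random.shuffle(indices)     # random permutation of the row indices
data = data[indices]           # apply the same permutation to both arrays
labels = labels[indices]
```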
Splitting the data

Now we need to split the data into the training, validation, and test sets.

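A sketch of the split; the 70/15/15 fractions are an assumption, since the original gist is not shown:

```python
import numpy as np

# Stand-in for the shuffled data and labels.
data = np.arange(200).reshape(100, 2)
labels = np.random.randint(0, 2, size=100)

# Hypothetical 70/15/15 train/validation/test split.
n = data.shape[0]
train_end = int(0.70 * n)
val_end = int(0.85 * n)

x_train, y_train = data[:train_end], labels[:train_end]
x_val, y_val = data[train_end:val_end], labels[train_end:val_end]
x_test, y_test = data[val_end:], labels[val_end:]
print(len(x_train), len(x_val), len(x_test))   # 70 15 15
```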
Phew! Finally done with all the data munging!

Making a neural network

Image source: extremetech.com

Now we can start building the model architecture.

I will be trying two different models: one with a pre-trained word embeddings layer and one with a trainable word embeddings layer.

In order to define the neural network architecture, you need to understand how word embeddings work. There is a wealth of information online about word embeddings. This blog post is one of my favorites:

https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa

Now that you hopefully have an idea of the function of the embeddings layer, I will go ahead and create it in code.

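The gist for building the embedding layer's weights is not shown; the sketch below fills an embedding matrix from a word-vector lookup. The `embeddings_index` here is a tiny stand-in for a dictionary parsed from a pre-trained file such as GloVe's glove.6B.100d.txt:

```python
import numpy as np

embedding_dim = 100   # dimensionality of each word vector
max_words = 10000     # must match the tokenizer's vocabulary size

# Tiny stand-in for a word -> vector dictionary parsed from a pre-trained
# embeddings file (e.g. GloVe's glove.6B.100d.txt).
embeddings_index = {
    "depression": np.ones(embedding_dim),
    "happy": np.full(embedding_dim, 0.5),
}
word_index = {"depression": 1, "happy": 2}   # from tokenizer.word_index

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        vector = embeddings_index.get(word)
        if vector is not None:        # words missing from the file stay zero
            embedding_matrix[i] = vector
print(embedding_matrix.shape)   # (10000, 100)
```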
First model

For the first model, the architecture consists of a pre-trained word-embeddings layer followed by two dense layers. The code for training the model is as follows:

Figure: Accuracy and loss for training and validation sets in model 1

Here we can see that the model performs very well, with an accuracy of 98% on the test set. Overfitting is unlikely to be an issue because the validation accuracy and loss are almost the same as the training accuracy and loss.

The second model

For the second model, I decided to drop the pre-trained embeddings and let the embedding layer learn its weights during training. The code is as follows.

Figure: Accuracy and loss for training and validation sets in model 2

Both models perform equally well on the test set. However, since the second model is less complex, I will be using it for predicting whether a tweet is depressive or not.

Obtaining data from Twitter for COVID-19-related tweets

In order to obtain my data sets of tweets, I used Twint, an amazing web-scraping tool for Twitter. I prepared two data sets of 1,000 tweets each. The first consisted of tweets containing coronavirus-related keywords such as ‘COVID-19’, ‘quarantine’, and ‘pandemic’.

To get a control sample to compare against, I searched for tweets containing neutral keywords such as ‘the’, ‘a’, ‘and’, etc. Using 1,000 tweets from this sample, I assembled the second, control data set.

Figure: WordCloud of COVID-related tweets

I cleaned the data sets using a procedure similar to the one I used for cleaning the training set. After cleaning the data, I fed it to my neural network to predict the percentage of depressive tweets. The results I obtained were surprising.

One run of the code is shown below. I repeated it with different batches of data obtained using the same procedure as described above and averaged the results.

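The prediction gists are not shown; the percentage calculation can be sketched as follows, with stand-in sigmoid scores in place of the real model.predict output on the scraped tweets:

```python
import numpy as np

# Stand-in for model.predict output on the scraped tweets: sigmoid scores
# between 0 and 1, where values above 0.5 are treated as depressive.
preds = np.array([0.91, 0.12, 0.78, 0.33, 0.64, 0.05])

depressive = int((preds > 0.5).sum())
pct_depressive = 100 * depressive / len(preds)
print(f"{pct_depressive:.0f}% depressive, "
      f"{100 - pct_depressive:.0f}% non-depressive")
```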
On average, my model predicted 35% depressive and 65% non-depressive tweets in the data set obtained using neutral keywords. A 35% rate of depressive tweets in a randomly obtained sample is alarmingly high. However, the proportion of depressive tweets among those with COVID-related keywords was even higher: 55% depressive vs. 45% non-depressive. That is a 57% increase in depressive tweets!

This leads to the conclusion that there is indeed a correlation between COVID-19 and depressive sentiments in tweets on Twitter.

Conclusion

I hope this post helped you learn a bit more about sentiment analysis using machine learning and I hope you will try out a similar project yourself as well.

Happy coding!

Translated from: https://towardsdatascience.com/exploring-the-link-between-covid-19-and-depression-using-neural-networks-469030112d3d
