Building a Suicidal Tweet Classifier Using NLP
Over the years, suicide has been one of the major causes of death worldwide. According to Wikipedia, suicide resulted in 828,000 global deaths in 2015, an increase from 712,000 deaths in 1990, making it the 10th leading cause of death worldwide. There is also increasing evidence that the Internet and social media can influence suicide-related behaviour. Using Natural Language Processing (NLP), a field in Machine Learning, I built a very simple suicidal ideation classifier that predicts whether a piece of text is likely to be suicidal or not.
Data
I used a Twitter crawler I found on GitHub and made a few changes to the code so that it removes hashtags, links, URLs and symbols whenever it crawls data from Twitter. The data were crawled based on query parameters containing words and phrases like:
Depressed, hopeless, promise to take care of, I don't belong here, nobody deserves me, I want to die, etc.
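The crawler's cleanup step isn't shown here, but it boils down to stripping Twitter-specific noise with regular expressions before the tweets are saved. A rough sketch of the idea (the function and patterns below are my own illustration, not the crawler's actual code):

import re

def strip_twitter_noise(raw):
    # Remove links and URLs
    raw = re.sub(r'https?://\S+|www\.\S+', '', raw)
    # Remove hashtags
    raw = re.sub(r'#\w+', '', raw)
    # Remove leftover symbols, keeping basic punctuation
    raw = re.sub(r'[^\w\s\'.,!?]', '', raw)
    # Collapse repeated whitespace
    return ' '.join(raw.split())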
Although some of the text was in no way related to suicide at all, I had to manually label the data, which came to about 8,200 rows of tweets. I also sourced more Twitter data and concatenated it with what I previously had, which was enough for me to train on.
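Merging the two labeled datasets is straightforward with pandas (the file names here are hypothetical):

import pandas as pd

# Stack the original and the newly sourced labeled tweets into one dataframe.
df = pd.concat([pd.read_csv('tweets_old.csv'), pd.read_csv('tweets_new.csv')],
               ignore_index=True)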
Building the Model
Data Preprocessing
I imported the following libraries:
import pickle
import re
import numpy as np
import pandas as pd
from tqdm import tqdm
import nltk
nltk.download('stopwords')
I then wrote a function to clean the text data: it removes any HTML markup, keeps emoticon characters, strips non-word characters, and converts everything to lowercase.
def preprocess_tweet(text):
    # Strip HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Capture emoticons such as :), :-( and ;D before they are stripped out
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Replace non-word characters with spaces, lowercase, then re-append the emoticons
    lowercase_text = re.sub(r'[\W]+', ' ', text.lower())
    text = lowercase_text + ' '.join(emoticons).replace('-', '')
    return text
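As a quick sanity check (my own example), the function strips the markup, lowercases the text, and appends the captured emoticons at the end:

preprocess_tweet('</a>This :) is :( a test :-)!')

Output:

'this is a test :) :( :)'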
After that, I applied the preprocess_tweet function to the tweet dataset to clean the data.
tqdm.pandas()
df = pd.read_csv('data.csv')
df['tweet'] = df['tweet'].progress_apply(preprocess_tweet)
Then I converted the text to tokens using the .split() method and applied word stemming to reduce each word to its root form.
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

def tokenizer_porter(text):
    # Split on whitespace and stem each token to its root form
    return [porter.stem(word) for word in text.split()]
Then I imported the stopwords corpus to remove stop words from the text.
from nltk.corpus import stopwords
stop = stopwords.words('english')
Testing the function on a single text:
[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]
Output:
['runner', 'like', 'run', 'run', 'lot']
Vectorizer
For this project, I used the Hashing Vectorizer because it is data-independent: it has a very low memory footprint, scales to large datasets, and doesn't store a vocabulary dictionary in memory. I then created a tokenizer function for the Hashing Vectorizer.
def tokenizer(text):
    # Same cleanup as preprocess_tweet, followed by stemming and stop-word removal
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower())
    text += ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in tokenizer_porter(text) if w not in stop]
    return tokenized
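Testing it (my own example) shows stemming, stop-word removal, and emoticon preservation working together:

tokenizer('runners like running :)')

Output:

['runner', 'like', 'run', ':)']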
Then I created the Hashing Vectorizer object.
from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)
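Since hashing requires no fitted vocabulary, the vectorizer can transform text immediately; a quick check (my own example) produces a sparse row with 2**21 hashed feature slots:

vect.transform(['I feel so alone and hopeless']).shape

Output:

(1, 2097152)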
Model
For the model, I used the stochastic gradient descent (SGD) classifier algorithm.
from sklearn.linear_model import SGDClassifier

# loss='log' gives logistic regression trained with SGD
# (renamed to loss='log_loss' in newer scikit-learn versions)
clf = SGDClassifier(loss='log', random_state=1)
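A nice property of this pairing is that both the Hashing Vectorizer and SGDClassifier support out-of-core learning through partial_fit, so the same pipeline could be trained on data too large for memory. A minimal sketch of that pattern (the chunked loop is my own illustration, not part of the original pipeline):

import numpy as np
import pandas as pd

classes = np.array([0, 1])
# Read the CSV in chunks; the stateless Hashing Vectorizer can transform
# each chunk independently, and partial_fit updates the model incrementally.
for chunk in pd.read_csv('data.csv', chunksize=1000):
    X_chunk = vect.transform(chunk['tweet'].apply(preprocess_tweet))
    clf.partial_fit(X_chunk, chunk['label'], classes=classes)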
Training and Validation
X = df["tweet"].to_list()
y = df['label']
I used 80% of the data for training and 20% for testing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.20,
                                                    random_state=0)
Then I transformed the text data into vectors with the Hashing Vectorizer created earlier:
X_train = vect.transform(X_train)
X_test = vect.transform(X_test)
Finally, I fit the data to the algorithm:
classes = np.array([0, 1])
# The full set of class labels must be supplied on the first call to partial_fit
clf.partial_fit(X_train, y_train, classes=classes)
Let's test the accuracy on our test data:
print('Accuracy: %.3f' % clf.score(X_test, y_test))
Output:
Accuracy: 0.912
I got an accuracy of 91%, which is fair enough. After that, I updated the model with the test data as well:
clf = clf.partial_fit(X_test, y_test)
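This is also where the pickle import from the top comes in handy: the trained classifier can be persisted to disk and reloaded later for inference. A minimal sketch (the file name is my own choice):

# Save the trained classifier. The Hashing Vectorizer is stateless,
# so only the model itself needs to be serialized.
with open('classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)

# Later, reload it to make predictions:
with open('classifier.pkl', 'rb') as f:
    clf = pickle.load(f)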
Testing and Making Predictions
I passed the text “I’ll kill myself am tired of living depressed and alone” to the model.
label = {0:'negative', 1:'positive'}
example = ["I'll kill myself am tired of living depressed and alone"]
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%'
      % (label[clf.predict(X)[0]], np.max(clf.predict_proba(X)) * 100))
And I got the output:
Prediction: positive
Probability: 93.76%
And when I used the text “It’s such a hot day, I’d like to have ice cream and visit the park”, I got the following prediction:
Prediction: negative
Probability: 97.91%
The model predicted both cases accurately. And that's how you build a simple suicidal tweet classifier.
You can find the notebook I used for this article here.
Thanks for reading 😊
Originally published at: https://towardsdatascience.com/building-a-suicidal-tweet-classifier-using-nlp-ff6ccd77e971