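Implementing Classic Text Classification Algorithms on a Chinese Spam SMS Dataset

Spam SMS messages seriously disrupt daily life, and fraudulent messages in particular threaten people's information and financial security. A mechanism that automatically intercepts and filters spam therefore has considerable practical value. Using a Chinese spam SMS dataset, this article compares the spam classification performance of seven text classification algorithms: Naive Bayes, logistic regression, random forest, SVM, LSTM, BiLSTM, and BERT.

1. Dataset Setup and Analysis

The dataset contains 679,365 normal messages and 75,478 spam messages, so spam accounts for about 10% of the 754,843 messages in total. The dataset is randomly split into a training set and a test set at a ratio of 7:3. The distribution of the two sets is shown in the table below: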

Category	Training set	Test set
Normal messages (positive class)	475,560	203,805
Spam messages (negative class)	52,830	22,648
Total	528,390	226,453
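
A random 7:3 split of this kind can be sketched with the standard library alone. This is an illustrative assumption rather than the script used to produce the table: the function name `split_dataset`, the seed, the label encoding (0 = normal, 1 = spam), and the toy rows are all hypothetical.

```python
import random


def split_dataset(rows, train_ratio=0.7, seed=42):
    """Shuffle the rows and split them into (train, test) at the given ratio."""
    rows = list(rows)                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)      # seeded for reproducibility
    cut = round(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]


# Toy data: 90 normal messages (label 0) and 10 spam messages (label 1)
rows = [(0, f"normal message {i}") for i in range(90)]
rows += [(1, f"spam message {i}") for i in range(10)]

train, test = split_dataset(rows)
print(len(train), len(test))
print("spam in train:", sum(label for label, _ in train))
```

A purely random split keeps the class ratio only approximately; splitting each class separately (a stratified split) would keep it exact.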

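In addition, word clouds of the normal and spam messages in the training set give an intuitive picture of the textual features of each class. The figure below shows a word cloud of 200 words randomly selected from the 500 most frequent words in normal messages:

Word cloud of normal messages

The figure below shows a word cloud of 200 words randomly selected from the 500 most frequent words in spam messages:

Word cloud of spam messages

The frequent terms of normal and spam messages differ clearly. Normal messages mostly relate to daily life: personal feelings (e.g. “哈哈哈” (hahaha), “宝宝” (baby)), current affairs (e.g. “记者” (reporter), “发布” (release)), and everyday needs (e.g. “飞机” (airplane), “医疗” (medical care)). Spam messages mostly relate to advertising and marketing: the size of a promotion (e.g. “元起” (prices starting from), “钜” (huge), “超值” (great value), “最低” (lowest)), time pressure (e.g. “赶紧” (hurry), “机会” (opportunity)), promotional tactics (e.g. “抽奖” (prize draw), “话费” (phone credit)), and seasonal occasions (e.g. “妇女节” (Women's Day), “三月” (March)).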
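
The vocabulary selection behind the word clouds (take the 500 most frequent words, then draw 200 of them at random) can be sketched as follows. The function name `sample_frequent_words`, the seed, and the toy tokens are hypothetical, and the plotting tool used for the figures is not stated in this article.

```python
import random
from collections import Counter


def sample_frequent_words(tokens, top_k=500, n=200, seed=0):
    """Pick n random words from the top_k most frequent tokens."""
    top = Counter(tokens).most_common(top_k)            # [(word, count), ...]
    chosen = random.Random(seed).sample(top, min(n, len(top)))
    return dict(chosen)                                  # word -> frequency


# Toy token stream; real input would be the segmented training messages
tokens = ["抽奖"] * 5 + ["话费"] * 3 + ["超值"] * 2 + ["机会"]
freqs = sample_frequent_words(tokens, top_k=3, n=2)
print(freqs)
```

The resulting word-to-frequency dictionary is the kind of input that word cloud libraries typically accept, for example `WordCloud.generate_from_frequencies` in the `wordcloud` package.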

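2. Algorithm Implementation

On this dataset, the comparison covers four traditional machine learning classifiers (Naive Bayes, logistic regression, random forest, and SVM) and three deep learning models (LSTM, BiLSTM, and the pre-trained model BERT). The strengths and weaknesses of the seven algorithms are summarized in the table below: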

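Algorithm	Strengths	Weaknesses
Naive Bayes	Solid mathematical foundation; simple to implement; efficient in both training and prediction.	The conditional independence assumption rarely holds in practice, so classification suffers when features are strongly correlated; the assumed prior distribution affects the results; performs poorly on class-imbalanced data.
Logistic regression	Simple to implement; fast to train.	Hard to fit non-linear data; performs poorly when the feature space is very large; the decision threshold is hard to choose; prone to underfitting.
Random forest	Training is highly parallelizable, which gives a speed advantage on large datasets; handles high-dimensional data; provides an importance score for each feature.	Prone to overfitting on noisy data.
SVM	Handles both linear and non-linear data; generalizes fairly well.	Parameter tuning and kernel selection rely heavily on experience and are somewhat arbitrary.
LSTM	Uses word-order information.	Uses word-order information in the forward direction only.
BiLSTM	Uses context from both directions.	Takes a long time to train to convergence.
BERT	Captures contextual information more effectively.	The [MASK] token used in pre-training creates a mismatch between pre-training and fine-tuning, which affects performance; takes even longer to converge.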

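The implementation details of each algorithm are described in turn below.

2.1 Naive Bayes

First, the SMS text is segmented with the jieba tokenizer and stop words are removed. Next, unigram and bigram features are extracted and the segmented text is vectorized with TF-IDF. Finally, a Naive Bayes classifier is trained. The model is scikit-learn's MultinomialNB with default parameters: the feature likelihoods follow a multinomial distribution, Laplace smoothing is applied (alpha=1.0), and the class priors are estimated from the class frequencies in the training data (fit_prior=True) rather than being assumed equal.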
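
Before the full script, the roles of Laplace smoothing and the class prior can be illustrated on a toy corpus. The sketch below is a hand-written multinomial Naive Bayes on raw word counts, not the scikit-learn implementation; the function names, the toy documents, and the "ham"/"spam" labels are hypothetical.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Return (log_prior, log_likelihood) for a list of tokenized documents."""
    vocab = sorted({w for doc in docs for w in doc})
    class_sizes = Counter(labels)
    # Class priors estimated from class frequencies
    log_prior = {c: math.log(n / len(labels)) for c, n in class_sizes.items()}
    word_counts = {c: Counter() for c in class_sizes}
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
    log_likelihood = {}
    for c, counts in word_counts.items():
        # Laplace smoothing: add alpha to every word count
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest posterior score (unknown words are skipped)."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in doc if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


docs = [["今晚", "吃饭"], ["明天", "开会"], ["抽奖", "话费"], ["今晚", "开会"]]
labels = ["ham", "ham", "spam", "ham"]
log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print(predict(["抽奖", "话费"], log_prior, log_likelihood))
```

Without smoothing, any word unseen in a class would give that class a probability of zero; with alpha = 1 every word keeps a small non-zero likelihood.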

The code is as follows:

# -*- coding: utf-8 -*-
import io
import sys
from time import time

import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Force UTF-8 console output so that Chinese text prints correctly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')


def stopwordslist(filepath):
    """Read the stop word list (one word per line) into a set for fast lookup."""
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}


if __name__ == '__main__':
    # Load the training set
    print("Loading train dataset ...")
    t = time()
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Load the test set
    print("Loading test dataset ...")
    t = time()
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the labeled samples of each class in the training set
    label_counts = train_data['labels'].value_counts()
    df_label = pd.DataFrame({'labels': label_counts.index, 'count': label_counts.values})
    print(df_label)

    # Load the stop words
    print("Loading stopwords ...")
    t = time()
    stopwords = stopwordslist("stopwords.txt")
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Segment the text and filter out stop words
    print("Starting word segmentation on train dataset...")
    t = time()
    X_train = X_train.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Starting word segmentation on test dataset...")
    t = time()
    X_test = X_test.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Build TF-IDF vectors from unigrams and bigrams
    print("Vectorizing train dataset...")
    t = time()
    tfidf = TfidfVectorizer(norm='l2', ngram_range=(1, 2))
    X_train = tfidf.fit_transform(X_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Vectorizing test dataset...")
    t = time()
    X_test = tfidf.transform(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print(X_train.shape)
    print(X_test.shape)
    print('-----------------------------')
    print(X_train)
    print('-----------------------------')
    print(X_test)

    # Train the model
    print("Training model...")
    t = time()
    model = MultinomialNB()
    model.fit(X_train, y_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Predicting test dataset...")
    t = time()
    y_pred = model.predict(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Build the confusion matrix and report the metrics
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, digits=4))
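
The vectorization step in the listing above can be made concrete with a small hand-written version. The sketch assumes the smoothed IDF that scikit-learn documents for TfidfVectorizer's defaults, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization; the helper names and the toy corpus are hypothetical. Unlike TfidfVectorizer's default token pattern, which keeps only tokens of two or more word characters, this sketch keeps every token.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Unigrams and bigrams; bigram tokens are joined by a space."""
    feats = []
    for n in range(1, n_max + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats


def tfidf(corpus):
    """corpus: list of token lists -> list of {feature: weight} dicts (L2-normalized)."""
    docs = [Counter(ngrams(tokens)) for tokens in corpus]
    n_docs = len(docs)
    df = Counter(f for doc in docs for f in doc)        # document frequency
    idf = {f: math.log((1 + n_docs) / (1 + d)) + 1 for f, d in df.items()}
    vectors = []
    for doc in docs:
        weights = {f: tf * idf[f] for f, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({f: w / norm for f, w in weights.items()})
    return vectors


corpus = [["超值", "话费", "抽奖"], ["今晚", "一起", "吃饭"], ["话费", "抽奖", "机会"]]
vectors = tfidf(corpus)
for vec in vectors:
    print({f: round(w, 3) for f, w in vec.items()})
```

Features that occur in fewer documents receive a larger weight, which is why “超值” outweighs “话费” in the first vector.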

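2.2 Logistic Regression

The text is vectorized in the same way as for Naive Bayes. The model is scikit-learn's LogisticRegression with default parameters apart from a fixed random seed: the inverse regularization strength C is 1, the penalty is L2, and the stopping tolerance is 0.0001.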
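
For reference, with labels encoded as $y_i \in \{-1, +1\}$, the L2-regularized objective that these settings correspond to can be written as follows. This is the form given in the scikit-learn user guide up to a constant scaling, stated here as an aid to reading the parameters rather than as a description of the solver's internals:

$$\min_{w,b}\; \frac{1}{2} w^{\top} w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i (w^{\top} x_i + b)\right)\right)$$

A larger C puts more weight on fitting the training data and less on keeping the weights small; with C = 1 the two terms are weighted equally.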

The code is as follows:

# -*- coding: utf-8 -*-
import io
import sys
from time import time

import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Force UTF-8 console output so that Chinese text prints correctly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')


def stopwordslist(filepath):
    """Read the stop word list (one word per line) into a set for fast lookup."""
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}


if __name__ == '__main__':
    # Load the training set
    print("Loading train dataset ...")
    t = time()
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Load the test set
    print("Loading test dataset ...")
    t = time()
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the labeled samples of each class in the training set
    label_counts = train_data['labels'].value_counts()
    df_label = pd.DataFrame({'labels': label_counts.index, 'count': label_counts.values})
    print(df_label)

    # Load the stop words
    print("Loading stopwords ...")
    t = time()
    stopwords = stopwordslist("stopwords.txt")
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Segment the text and filter out stop words
    print("Starting word segmentation on train dataset...")
    t = time()
    X_train = X_train.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Starting word segmentation on test dataset...")
    t = time()
    X_test = X_test.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Build TF-IDF vectors from unigrams and bigrams
    print("Vectorizing train dataset...")
    t = time()
    tfidf = TfidfVectorizer(norm='l2', ngram_range=(1, 2))
    X_train = tfidf.fit_transform(X_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Vectorizing test dataset...")
    t = time()
    X_test = tfidf.transform(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print(X_train.shape)
    print(X_test.shape)
    print('-----------------------------')
    print(X_train)
    print('-----------------------------')
    print(X_test)

    # Train the model
    print("Training model...")
    t = time()
    model = LogisticRegression(random_state=0)
    model.fit(X_train, y_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Predicting test dataset...")
    t = time()
    y_pred = model.predict(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Build the confusion matrix and report the metrics
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, digits=4))

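2.3 Random Forest

The text is vectorized in the same way as for Naive Bayes. The model is scikit-learn's RandomForestClassifier with default parameters: the forest contains 100 decision trees, out-of-bag samples are not used to evaluate the model, and the CART trees use Gini impurity as the split criterion.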
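
The Gini criterion mentioned above can be shown with a short calculation. This is a minimal sketch of the impurity measure only, not of scikit-learn's tree builder; the function names and the toy labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted impurity of a candidate split into two child nodes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# A node with the same 9:1 class ratio as the dataset
parent = ["ham"] * 9 + ["spam"]
print(round(gini(parent), 3))                 # 1 - (0.9^2 + 0.1^2) = 0.18
print(split_gini(["ham"] * 9, ["spam"]))      # a perfect split has impurity 0.0
```

A tree picks, at each node, the split that reduces this weighted impurity the most. With a 9:1 class ratio the parent impurity is already low (0.18), so the gain available from isolating spam is small in absolute terms.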

The code is as follows:

# -*- coding: utf-8 -*-
import io
import sys
from time import time

import jieba
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Force UTF-8 console output so that Chinese text prints correctly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')


def stopwordslist(filepath):
    """Read the stop word list (one word per line) into a set for fast lookup."""
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}


if __name__ == '__main__':
    # Load the training set
    print("Loading train dataset ...")
    t = time()
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Load the test set
    print("Loading test dataset ...")
    t = time()
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the labeled samples of each class in the training set
    label_counts = train_data['labels'].value_counts()
    df_label = pd.DataFrame({'labels': label_counts.index, 'count': label_counts.values})
    print(df_label)

    # Load the stop words
    print("Loading stopwords ...")
    t = time()
    stopwords = stopwordslist("stopwords.txt")
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Segment the text and filter out stop words
    print("Starting word segmentation on train dataset...")
    t = time()
    X_train = X_train.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Starting word segmentation on test dataset...")
    t = time()
    X_test = X_test.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Build TF-IDF vectors from unigrams and bigrams
    print("Vectorizing train dataset...")
    t = time()
    tfidf = TfidfVectorizer(norm='l2', ngram_range=(1, 2))
    X_train = tfidf.fit_transform(X_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Vectorizing test dataset...")
    t = time()
    X_test = tfidf.transform(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print(X_train.shape)
    print(X_test.shape)
    print('-----------------------------')
    print(X_train)
    print('-----------------------------')
    print(X_test)

    # Train the model
    print("Training model...")
    t = time()
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Predicting test dataset...")
    t = time()
    y_pred = model.predict(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Build the confusion matrix and report the metrics
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, digits=4))

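2.4 SVM

The text is vectorized in the same way as for Naive Bayes. The model is scikit-learn's LinearSVC with default parameters: the kernel is linear, the penalty parameter C is 1, the regularization is L2, the dual formulation of the optimization problem is solved, the maximum number of iterations is 1000, and the stopping tolerance is 0.0001.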
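
To make the roles of C and the L2 term concrete, the sketch below evaluates a linear SVM objective with the squared hinge loss, which is LinearSVC's default loss. It is a simplified illustration (for instance, it leaves the intercept unregularized) and not the solver itself; the function names and toy data are hypothetical.

```python
def squared_hinge(y, score):
    """Squared hinge loss for a label y in {-1, +1} and a decision score w.x + b."""
    return max(0.0, 1.0 - y * score) ** 2


def objective(w, b, X, y, C=1.0):
    """0.5 * ||w||^2 + C * sum of squared hinge losses over the samples."""
    reg = 0.5 * sum(wi * wi for wi in w)
    loss = sum(
        squared_hinge(yi, sum(wi * xi for wi, xi in zip(w, x)) + b)
        for x, yi in zip(X, y)
    )
    return reg + C * loss


X = [[1.0, 0.0], [0.0, 1.0]]
y = [+1, -1]
# Both samples sit exactly on the margin, so only the L2 term contributes
print(objective([1.0, -1.0], 0.0, X, y))
```

Samples classified correctly with a margin of at least 1 contribute nothing to the loss, which is why only samples near or beyond the margin influence the fitted boundary.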

The code is as follows:

# -*- coding: utf-8 -*-
import io
import sys
from time import time

import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.svm import LinearSVC

# Force UTF-8 console output so that Chinese text prints correctly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')


def stopwordslist(filepath):
    """Read the stop word list (one word per line) into a set for fast lookup."""
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}


if __name__ == '__main__':
    # Load the training set
    print("Loading train dataset ...")
    t = time()
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Load the test set
    print("Loading test dataset ...")
    t = time()
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the labeled samples of each class in the training set
    label_counts = train_data['labels'].value_counts()
    df_label = pd.DataFrame({'labels': label_counts.index, 'count': label_counts.values})
    print(df_label)

    # Load the stop words
    print("Loading stopwords ...")
    t = time()
    stopwords = stopwordslist("stopwords.txt")
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Segment the text and filter out stop words
    print("Starting word segmentation on train dataset...")
    t = time()
    X_train = X_train.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Starting word segmentation on test dataset...")
    t = time()
    X_test = X_test.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Build TF-IDF vectors from unigrams and bigrams
    print("Vectorizing train dataset...")
    t = time()
    tfidf = TfidfVectorizer(norm='l2', ngram_range=(1, 2))
    X_train = tfidf.fit_transform(X_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Vectorizing test dataset...")
    t = time()
    X_test = tfidf.transform(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print(X_train.shape)
    print(X_test.shape)
    print('-----------------------------')
    print(X_train)
    print('-----------------------------')
    print(X_test)

    # Train the model
    print("Training model...")
    t = time()
    model = LinearSVC()
    model.fit(X_train, y_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Predicting test dataset...")
    t = time()
    y_pred = model.predict(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Build the confusion matrix and report the metrics
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, digits=4))

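2.5 LSTM

First, the SMS text is segmented with the jieba tokenizer and stop words are removed. The vocabulary is then limited to the 50,000 most frequent words and the maximum sequence length is set to 100. 200-dimensional Tencent word vectors are used to build the weight matrix of the embedding layer for the SMS texts. SpatialDropout1D is applied to the output of the embedding layer, randomly zeroing 1D feature maps at a rate of 0.2. The result is fed into an LSTM layer with 300 units, followed by a fully connected layer that outputs the class through a softmax function. The loss function is cross-entropy, the batch size is 64, and the model is trained for 10 epochs.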
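
Fixing the sequence length at 100 means that short messages are padded and long ones are cut. The sketch below mimics the default behaviour of Keras `pad_sequences` (padding and truncation both happen at the start of the sequence) in plain Python; the helper name `pad_or_truncate` and the toy index sequences are hypothetical.

```python
def pad_or_truncate(seq, maxlen=100, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` ids (maxlen >= 1)."""
    seq = list(seq)[-maxlen:]                       # drop ids from the front if too long
    return [value] * (maxlen - len(seq)) + seq      # pad at the front if too short


print(pad_or_truncate([5, 8, 2], maxlen=5))               # [0, 0, 5, 8, 2]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5))   # [3, 4, 5, 6, 7]
```

Index 0 is reserved for padding, which is why word indices produced by the tokenizer start at 1.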

The code is as follows:

# -*- coding: utf-8 -*-
import io
import sys
from time import time

import jieba
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from keras.callbacks import EarlyStopping
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Force UTF-8 console output so that Chinese text prints correctly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')


def stopwordslist(filepath):
    """Read the stop word list (one word per line) into a set for fast lookup."""
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}


if __name__ == '__main__':
    # Load the training and test sets
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the labeled samples of each class in the training set
    label_counts = train_data['labels'].value_counts()
    df_label = pd.DataFrame({'labels': label_counts.index, 'count': label_counts.values})
    print(df_label)

    # Load the stop words
    stopwords = stopwordslist("stopwords.txt")

    # Segment the text and filter out stop words
    X_train = X_train.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    X_test = X_test.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))

    # Keep only the 50,000 most frequent words
    MAX_NB_WORDS = 50000
    # Maximum length of each message, in tokens
    MAX_SEQUENCE_LENGTH = 100
    # Dimension of the embedding layer
    EMBEDDING_DIM = 200

    tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)
    tokenizer.fit_on_texts(X_train)
    word_index = tokenizer.word_index
    print('There are %s different words.' % len(word_index))

    X_train = tokenizer.texts_to_sequences(X_train)
    X_test = tokenizer.texts_to_sequences(X_test)

    # Pad or truncate every sequence to the same length
    X_train = pad_sequences(X_train, maxlen=MAX_SEQUENCE_LENGTH)
    X_test = pad_sequences(X_test, maxlen=MAX_SEQUENCE_LENGTH)

    # One-hot encode the labels
    y_train = pd.get_dummies(y_train).values
    y_test = pd.get_dummies(y_test).values
    print(X_train.shape, y_train.shape)
    print(X_test.shape, y_test.shape)

    # Load the Tencent word vectors and fill the embedding matrix
    wv_from_text = KeyedVectors.load_word2vec_format('tencent.txt', binary=False, unicode_errors='ignore')
    embedding_matrix = np.zeros((MAX_NB_WORDS, EMBEDDING_DIM))
    for word, i in word_index.items():
        if i >= MAX_NB_WORDS:  # valid rows are 0 .. MAX_NB_WORDS - 1
            continue
        try:
            embedding_matrix[i] = wv_from_text.get_vector(word)
        except KeyError:  # word not covered by the pre-trained vectors
            continue
    del wv_from_text

    # Define the model
    print("Training model...")
    t = time()
    model = Sequential()
    model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X_train.shape[1], weights=[embedding_matrix], trainable=False))
    model.add(SpatialDropout1D(0.2))
    model.add(LSTM(300, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())

    # Train with early stopping on the validation loss
    epochs = 10
    batch_size = 64
    history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_split=0.1,
                        callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    accr = model.evaluate(X_test, y_test)
    print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0], accr[1]))

    print("Predicting test dataset...")
    t = time()
    y_pred = model.predict(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Convert probabilities and one-hot labels back to class indices
    y_pred = y_pred.argmax(axis=1)
    y_test = y_test.argmax(axis=1)

    # Build the confusion matrix and report the metrics
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, digits=4))
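
The embedding matrix built in the listing above maps each word index to its pre-trained vector. The sketch below reproduces that lookup with plain Python containers so that the indexing rule is easy to check; the function name `build_embedding_matrix` and the two-dimensional toy vectors are hypothetical stand-ins for the 200-dimensional Tencent vectors.

```python
def build_embedding_matrix(word_index, vectors, num_words, dim):
    """Row i holds the vector of the word with index i; uncovered words stay all-zero."""
    matrix = [[0.0] * dim for _ in range(num_words)]
    for word, i in word_index.items():
        if i >= num_words:          # rows are 0 .. num_words - 1, so skip larger indices
            continue
        if word in vectors:         # words without a pre-trained vector keep zeros
            matrix[i] = list(vectors[word])
    return matrix


word_index = {"话费": 1, "抽奖": 2, "罕见词": 3}      # index 0 is reserved for padding
vectors = {"话费": [0.1, 0.2], "抽奖": [0.3, 0.4]}
matrix = build_embedding_matrix(word_index, vectors, num_words=3, dim=2)
print(matrix)
```

Because the matrix has exactly `num_words` rows, an index equal to `num_words` is already out of range and has to be skipped along with larger ones.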

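2.6 BiLSTM

The settings are essentially the same as for the LSTM, except that the unidirectional LSTM is replaced with a bidirectional one and the model is trained for 60 epochs (note that the listing below sets epochs = 10 and uses early stopping).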
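
The cost of going bidirectional can be estimated by counting trainable weights. The sketch assumes the standard LSTM parameter count (four gates, each with input weights, recurrent weights, and a bias) and that the bidirectional wrapper concatenates the forward and backward outputs, which is the Keras default; the function names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Trainable weights of one LSTM layer: 4 gates x (input + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units


EMBEDDING_DIM, UNITS, CLASSES = 200, 300, 2

lstm_total = lstm_params(EMBEDDING_DIM, UNITS) + dense_params(UNITS, CLASSES)
# Two LSTMs (forward and backward); the classifier sees 2 * UNITS features
bilstm_total = 2 * lstm_params(EMBEDDING_DIM, UNITS) + dense_params(2 * UNITS, CLASSES)
print(lstm_total, bilstm_total)   # 601802 1203602
```

The frozen embedding layer is excluded because its weights are not trained. Roughly doubling the trainable weights is consistent with the longer training time noted for BiLSTM in the comparison table.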

The code is as follows:

# -*- coding: utf-8 -*-
import io
import sys

import jieba
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from keras.callbacks import EarlyStopping
from keras.layers import Bidirectional, Dense, Embedding, LSTM, SpatialDropout1D
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Force UTF-8 console output so that Chinese text prints correctly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')


def stopwordslist(filepath):
    """Read the stop word list (one word per line) into a set for fast lookup."""
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}


if __name__ == '__main__':
    # Load the training and test sets
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the labeled samples of each class in the training set
    label_counts = train_data['labels'].value_counts()
    df_label = pd.DataFrame({'labels': label_counts.index, 'count': label_counts.values})
    print(df_label)

    # Load the stop words
    stopwords = stopwordslist("stopwords.txt")

    # Segment the text and filter out stop words
    X_train = X_train.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    X_test = X_test.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))

    # Keep only the 50,000 most frequent words
    MAX_NB_WORDS = 50000
    # Maximum length of each message, in tokens
    MAX_SEQUENCE_LENGTH = 100
    # Dimension of the embedding layer
    EMBEDDING_DIM = 200

    tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)
    tokenizer.fit_on_texts(X_train)
    word_index = tokenizer.word_index
    print('There are %s different words.' % len(word_index))

    X_train = tokenizer.texts_to_sequences(X_train)
    X_test = tokenizer.texts_to_sequences(X_test)

    # Pad or truncate every sequence to the same length
    X_train = pad_sequences(X_train, maxlen=MAX_SEQUENCE_LENGTH)
    X_test = pad_sequences(X_test, maxlen=MAX_SEQUENCE_LENGTH)

    # One-hot encode the labels
    y_train = pd.get_dummies(y_train).values
    y_test = pd.get_dummies(y_test).values
    print(X_train.shape, y_train.shape)
    print(X_test.shape, y_test.shape)

    # Load the Tencent word vectors and fill the embedding matrix
    wv_from_text = KeyedVectors.load_word2vec_format('tencent.txt', binary=False, unicode_errors='ignore')
    embedding_matrix = np.zeros((MAX_NB_WORDS, EMBEDDING_DIM))
    for word, i in word_index.items():
        if i >= MAX_NB_WORDS:  # valid rows are 0 .. MAX_NB_WORDS - 1
            continue
        try:
            embedding_matrix[i] = wv_from_text.get_vector(word)
        except KeyError:  # word not covered by the pre-trained vectors
            continue
    del wv_from_text

    # Define the model
    model = Sequential()
    model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X_train.shape[1], weights=[embedding_matrix], trainable=False))
    model.add(SpatialDropout1D(0.2))
    model.add(Bidirectional(LSTM(300)))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())

    # Train with early stopping on the validation loss
    epochs = 10
    batch_size = 64
    history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_split=0.1,
                        callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])

    accr = model.evaluate(X_test, y_test)
    print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0], accr[1]))

    # Convert probabilities and one-hot labels back to class indices
    y_pred = model.predict(X_test)
    y_pred = y_pred.argmax(axis=1)
    y_test = y_test.argmax(axis=1)

    # Build the confusion matrix and report the metrics
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, digits=4))

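2.7 BERT

The BERT-Base-Chinese pre-trained model is fine-tuned on the training set with a learning rate of 1e-5, a maximum sequence length of 128, a batch size of 8, and 2 training epochs.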
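
These settings translate directly into the amount of work done during fine-tuning. The arithmetic below is a rough sketch that assumes one optimizer step per batch (no gradient accumulation, single device) and takes the training set size from the dataset table; the variable names are hypothetical.

```python
import math

TRAIN_SIZE = 528_390      # training messages, from the dataset table
BATCH_SIZE = 8
EPOCHS = 2
MAX_SEQ_LENGTH = 128

steps_per_epoch = math.ceil(TRAIN_SIZE / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
content_tokens = MAX_SEQ_LENGTH - 2   # [CLS] and [SEP] occupy two positions

print(steps_per_epoch, total_steps, content_tokens)   # 66049 132098 126
```

Since the Chinese BERT vocabulary splits Chinese text into individual characters, a limit of 128 leaves room for about 126 characters per message before truncation.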

The code is as follows:

import pandas as pd
from simpletransformers.model import TransformerModel
from sklearn.metrics import f1_score, accuracy_score


def f1_multiclass(labels, preds):
    """Micro-averaged F1, passed to eval_model as an extra metric."""
    return f1_score(labels, preds, average='micro')


if __name__ == '__main__':
    # Load the training set
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    # Load the test set
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))

    # Build the model (bert-base-chinese)
    model = TransformerModel('bert', 'bert-base-chinese', num_labels=2,
                             args={'learning_rate': 1e-5, 'num_train_epochs': 2,
                                   'reprocess_input_data': True, 'overwrite_output_dir': True, 'fp16': False})
    # bert-base-multilingual: replace the first two arguments with 'bert', 'bert-base-multilingual-cased'
    # roberta: replace the first two arguments with 'roberta', 'roberta-base'
    # xlmroberta: replace the first two arguments with 'xlmroberta', 'xlm-roberta-base'

    # Fine-tune the model
    model.train_model(train_data)

    # Evaluate on the test set
    result, model_outputs, wrong_predictions = model.eval_model(test_data, f1=f1_multiclass, acc=accuracy_score)

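3. Comparison of Results

To analyse the algorithms quantitatively, normal messages are treated as positive samples, with P (Positive) denoting their number, and spam messages as negative samples, with N (Negative) denoting their number. T (True) marks samples that are classified correctly and F (False) marks samples that are misclassified. Accordingly, true positives (TP) are normal messages classified correctly; false positives (FP) are spam messages mistaken for normal messages; true negatives (TN) are spam messages classified correctly; and false negatives (FN) are normal messages mistaken for spam. It follows that P = TP + FN and N = TN + FP. On this basis, the experiments use the following five evaluation metrics: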

(1) Weighted-average precision (Precision-weighted), computed as:
Precision-weighted $= (Precision_P*P+Precision_N*N)/(P+N)$
where $Precision_P=TP/(TP+FP)$ and $Precision_N=TN/(TN+FN)$.

(2) Weighted-average recall (Recall-weighted), computed as:
Recall-weighted $= (Recall_P*P+Recall_N*N)/(P+N)$
where $Recall_P=TP/(TP+FN)$ and $Recall_N=TN/(TN+FP)$.

(3) Weighted-average F1 score (F1-score-weighted), computed as:
F1-score-weighted $= (F1_P*P+F1_N*N)/(P+N)$
where
$F1_P=2*Precision_P*Recall_P/(Precision_P+Recall_P)$ and
$F1_N=2*Precision_N*Recall_N/(Precision_N+Recall_N)$.

(4) False negative rate (FNR), computed as:
FNR $= FN / (TP + FN)$, i.e. the number of normal messages predicted as spam divided by the actual number of normal messages.

(5) True negative rate (TNR), computed as:
TNR $= TN / (TN + FP)$, i.e. the number of spam messages identified correctly divided by the actual number of spam messages; this is also the recall of the spam class.
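
The five definitions above can be computed directly from the four confusion counts. The sketch below follows the formulas term by term; the function name `evaluate` and the example counts are hypothetical and are not taken from the experiments.

```python
def evaluate(tp, fp, tn, fn):
    """Five metrics from confusion counts (positive = normal, negative = spam)."""
    p, n = tp + fn, tn + fp                       # actual class sizes P and N
    prec_p, prec_n = tp / (tp + fp), tn / (tn + fn)
    rec_p, rec_n = tp / p, tn / n
    f1_p = 2 * prec_p * rec_p / (prec_p + rec_p)
    f1_n = 2 * prec_n * rec_n / (prec_n + rec_n)

    def weighted(a, b):
        """Average of the per-class values, weighted by class size."""
        return (a * p + b * n) / (p + n)

    return {
        "precision_weighted": weighted(prec_p, prec_n),
        "recall_weighted": weighted(rec_p, rec_n),
        "f1_weighted": weighted(f1_p, f1_n),
        "fnr": fn / p,
        "tnr": tn / n,
    }


# Hypothetical counts: 900 normal and 100 spam messages
metrics = evaluate(tp=890, fp=20, tn=80, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Substituting the definitions shows that the weighted recall reduces to (TP + TN) / (P + N), which is the overall accuracy. On imbalanced data it is dominated by the majority class, so FNR and TNR are reported alongside it.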

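For spam filtering, a good classifier should achieve a weighted precision, weighted recall, weighted F1 score, and TNR that are as high as possible, which means a high interception rate for spam. At the same time, the FNR must be kept as low as possible, which means that few normal messages are mistaken for spam. The reason is that, for users, a normal message mistaken for spam is less tolerable than a spam message mistaken for a normal one; for carriers, it is preferable to let some spam through than to disrupt normal use.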

Model	Precision (weighted)	Recall (weighted)	F1 (weighted)	FNR	TNR
Naive Bayes	0.9764	0.9761	0.9748	0.0010	0.7700
Logistic regression	0.9886	0.9887	0.9887	0.0061	0.9414
Random forest	0.9809	0.9808	0.9800	0.0012	0.8181
SVM	0.9925	0.9924	0.9924	0.0052	0.9713
LSTM	0.9963	0.9963	0.9963	0.0015	0.9771
BiLSTM	0.9964	0.9964	0.9964	0.0009	0.9720
BERT	0.9991	0.9991	0.9991	0.0002	0.9926
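
The columns of the table are linked: since the recall of normal messages is 1 - FNR and the recall of spam is TNR, the weighted recall can be recomputed from the last two columns. The sketch below does this as a consistency check, assuming the test set sizes from the dataset table (203,805 normal and 22,648 spam messages); the helper name `weighted_recall` is hypothetical.

```python
P, N = 203_805, 22_648     # normal and spam messages in the test set

# (model, reported weighted recall, FNR, TNR), copied from the results table
results = [
    ("Naive Bayes", 0.9761, 0.0010, 0.7700),
    ("Logistic regression", 0.9887, 0.0061, 0.9414),
    ("Random forest", 0.9808, 0.0012, 0.8181),
    ("SVM", 0.9924, 0.0052, 0.9713),
    ("LSTM", 0.9963, 0.0015, 0.9771),
    ("BiLSTM", 0.9964, 0.0009, 0.9720),
    ("BERT", 0.9991, 0.0002, 0.9926),
]


def weighted_recall(fnr, tnr, p=P, n=N):
    """Recall_P = 1 - FNR and Recall_N = TNR, weighted by class size."""
    return ((1 - fnr) * p + tnr * n) / (p + n)


for name, reported, fnr, tnr in results:
    print(f"{name}: reported {reported:.4f}, recomputed {weighted_recall(fnr, tnr):.4f}")
```

The recomputed values agree with the reported ones up to rounding. The check also shows why Naive Bayes still reaches a weighted recall above 0.97 with a TNR of only 0.77: spam makes up just 10% of the weight.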

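The table above gives the experimental results of the seven text classification algorithms. The following observations can be made.

First, BERT has the highest weighted F1 score and TNR together with the lowest FNR, so it filters spam best. A likely reason is that BERT is pre-trained on large general-purpose corpora and therefore captures textual features more effectively.

Second, BiLSTM and LSTM have nearly identical weighted F1 scores, so their overall classification performance is similar, but their FNR and TNR differ: BiLSTM misclassifies fewer normal messages (FNR of 0.0009 versus 0.0015), while LSTM intercepts more spam (TNR of 0.9771 versus 0.9720).

Third, SVM and logistic regression have similar weighted F1 scores, but SVM performs somewhat better: it is ahead of logistic regression on all five metrics, namely weighted precision, weighted recall, weighted F1, FNR, and TNR. A possible reason is that SVM depends only on the support vectors, the few samples most relevant to the decision boundary, whereas logistic regression takes all samples into account and is therefore more sensitive to outliers and to class imbalance.

Fourth, Naive Bayes and random forest perform comparatively poorly on weighted F1 and TNR. A possible reason is that the imbalance between the two classes biases both models toward normal messages: they achieve very low FNRs (0.0010 and 0.0012) but miss a considerable share of spam, which suggests some overfitting to the normal class. In addition, the conditional independence assumption of Naive Bayes does not hold in practice, which also affects its classification performance to some extent.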