1. Dataset Setup and Analysis
Class | Training set | Test set |
Normal SMS (positive class) | 475,560 | 203,805 |
Spam SMS (negative class) | 52,830 | 22,648 |
Total | 528,390 | 226,453 |
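The counts in the table imply a split of roughly 70/30 between training and test data, and a spam share of about 10% in both splits. A quick check of that arithmetic, as a minimal sketch (the dictionary layout and variable names are illustrative and not part of the original scripts):

```python
# Class counts copied from the table above.
counts = {
    "train": {"normal": 475_560, "spam": 52_830},
    "test": {"normal": 203_805, "spam": 22_648},
}

# Total messages per split, spam share per split, and the training fraction.
totals = {split: sum(c.values()) for split, c in counts.items()}
spam_share = {split: c["spam"] / totals[split] for split, c in counts.items()}
train_fraction = totals["train"] / sum(totals.values())

for split in counts:
    print(f"{split}: total={totals[split]:,} spam share={spam_share[split]:.4f}")
print(f"train fraction of all data: {train_fraction:.4f}")
```

With about nine normal messages for every spam message, a classifier that labels everything as normal would already reach roughly 90% accuracy, so accuracy alone says little about how well spam is caught.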
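2. Algorithm Implementation
Algorithm | Advantages | Disadvantages |
Naive Bayes | Solid mathematical foundation; simple to implement; efficient in both training and prediction. | Real features often violate the conditional-independence assumption, so classification degrades when features are strongly correlated; the assumed prior distribution affects classification quality; performs poorly on class-imbalanced data. |
Logistic regression | Simple to implement; fast to train. | Hard to fit non-linear data; performs poorly when the feature space is very large; the decision threshold is hard to choose, and the model is prone to underfitting. |
Random forest | Training is highly parallelizable, which gives a speed advantage on large datasets; handles high-dimensional data; provides an importance score for each feature. | Prone to overfitting when the data is noisy. |
SVM | Handles both linear and non-linear data; generalizes fairly well. | Parameter tuning and kernel selection depend largely on experience and are somewhat arbitrary. |
LSTM | Incorporates word-order information. | Uses word order in the forward direction only. |
BiLSTM | Incorporates contextual information from both directions. | Needs a long training time to converge. |
BERT | Stronger at capturing contextual information. | The [MASK] token used in pre-training creates a mismatch between pre-training and fine-tuning, which hurts performance; needs even more time to converge. |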
2.1 Naive Bayes
# -*- coding: utf-8 -*-
import io
import sys
from time import time

import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Re-wrap stdout as UTF-8 so Chinese text prints correctly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')


def stopwordslist(filepath):
    # Read the stopword list (one word per line); a set makes lookups fast
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}


if __name__ == '__main__':
    # Load the training set
    print("Loading train dataset ...")
    t = time()
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Load the test set
    print("Loading test dataset ...")
    t = time()
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the labeled samples per class in the training set
    counts = train_data['labels'].value_counts()
    df_label = pd.DataFrame({'labels': counts.index, 'count': counts.values})
    print(df_label)

    # Load the stopwords
    print("Loading stopwords ...")
    t = time()
    stopwords = stopwordslist("stopwords.txt")
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Segment the text with jieba and filter out stopwords
    print("Starting word segmentation on train dataset...")
    t = time()
    X_train = X_train.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Starting word segmentation on test dataset...")
    t = time()
    X_test = X_test.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Build TF-IDF vectors over unigrams and bigrams
    print("Vectorizing train dataset...")
    t = time()
    tfidf = TfidfVectorizer(norm='l2', ngram_range=(1, 2))
    X_train = tfidf.fit_transform(X_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Vectorizing test dataset...")
    t = time()
    X_test = tfidf.transform(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print(X_train.shape)
    print(X_test.shape)
    print('-----------------------------')
    print(X_train)
    print('-----------------------------')
    print(X_test)

    # Train the model
    print("Training model...")
    t = time()
    model = MultinomialNB()
    model.fit(X_train, y_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Predicting test dataset...")
    t = time()
    y_pred = model.predict(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Build the confusion matrix and report the metrics
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, digits=4))
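The vectorization step in the script relies on `TfidfVectorizer` defaults that matter for segmented Chinese text. The sketch below re-implements that step in plain Python for illustration only; it is not scikit-learn's code, and the function names `tokenize`, `ngrams`, and `tfidf_sketch` are hypothetical. It assumes the library's documented defaults: the token pattern `(?u)\b\w\w+\b` (tokens of two or more word characters), raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization. The two toy messages are made up.

```python
import math
import re
from collections import Counter

# Default token pattern: only tokens with two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase and extract tokens; single-character tokens are dropped."""
    return TOKEN_PATTERN.findall(text.lower())


def ngrams(tokens, lo=1, hi=2):
    """Return all n-grams for lo <= n <= hi, each joined with a space."""
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf_sketch(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    term_counts = [Counter(ngrams(tokenize(d))) for d in docs]
    n_docs = len(docs)
    # Iterating a Counter yields each distinct term once per document.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for counts in term_counts:
        weights = {t: tf * idf[t] for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Toy, already-segmented messages (made up for illustration).
docs = ["免费 领取 红包", "我 明天 去 领取 快递"]
print(tokenize(docs[1]))              # the single-character words are gone
print(sorted(tfidf_sketch(docs)[0]))  # unigrams and bigrams of the first message
```

One consequence for this pipeline: single-character words produced by jieba (for example 我 or 去) never reach the model unless `token_pattern` is changed, and bigrams are formed over the tokens that remain.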
2.2 Logistic Regression
# -*- coding: utf-8 -*-
import io
import sys
from time import time

import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Re-wrap stdout as UTF-8 so Chinese text prints correctly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')


def stopwordslist(filepath):
    # Read the stopword list (one word per line); a set makes lookups fast
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}


if __name__ == '__main__':
    # Load the training set
    print("Loading train dataset ...")
    t = time()
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Load the test set
    print("Loading test dataset ...")
    t = time()
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the labeled samples per class in the training set
    counts = train_data['labels'].value_counts()
    df_label = pd.DataFrame({'labels': counts.index, 'count': counts.values})
    print(df_label)

    # Load the stopwords
    print("Loading stopwords ...")
    t = time()
    stopwords = stopwordslist("stopwords.txt")
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Segment the text with jieba and filter out stopwords
    print("Starting word segmentation on train dataset...")
    t = time()
    X_train = X_train.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Starting word segmentation on test dataset...")
    t = time()
    X_test = X_test.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Build TF-IDF vectors over unigrams and bigrams
    print("Vectorizing train dataset...")
    t = time()
    tfidf = TfidfVectorizer(norm='l2', ngram_range=(1, 2))
    X_train = tfidf.fit_transform(X_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Vectorizing test dataset...")
    t = time()
    X_test = tfidf.transform(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print(X_train.shape)
    print(X_test.shape)
    print('-----------------------------')
    print(X_train)
    print('-----------------------------')
    print(X_test)

    # Train the model
    print("Training model...")
    t = time()
    model = LogisticRegression(random_state=0)
    model.fit(X_train, y_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Predicting test dataset...")
    t = time()
    y_pred = model.predict(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Build the confusion matrix and report the metrics
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, digits=4))
2.3 Random Forest
# -*- coding: utf-8 -*-
import io
import sys
from time import time

import jieba
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Re-wrap stdout as UTF-8 so Chinese text prints correctly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')


def stopwordslist(filepath):
    # Read the stopword list (one word per line); a set makes lookups fast
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}


if __name__ == '__main__':
    # Load the training set
    print("Loading train dataset ...")
    t = time()
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Load the test set
    print("Loading test dataset ...")
    t = time()
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the labeled samples per class in the training set
    counts = train_data['labels'].value_counts()
    df_label = pd.DataFrame({'labels': counts.index, 'count': counts.values})
    print(df_label)

    # Load the stopwords
    print("Loading stopwords ...")
    t = time()
    stopwords = stopwordslist("stopwords.txt")
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Segment the text with jieba and filter out stopwords
    print("Starting word segmentation on train dataset...")
    t = time()
    X_train = X_train.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Starting word segmentation on test dataset...")
    t = time()
    X_test = X_test.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Build TF-IDF vectors over unigrams and bigrams
    print("Vectorizing train dataset...")
    t = time()
    tfidf = TfidfVectorizer(norm='l2', ngram_range=(1, 2))
    X_train = tfidf.fit_transform(X_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Vectorizing test dataset...")
    t = time()
    X_test = tfidf.transform(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print(X_train.shape)
    print(X_test.shape)
    print('-----------------------------')
    print(X_train)
    print('-----------------------------')
    print(X_test)

    # Train the model
    print("Training model...")
    t = time()
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Predicting test dataset...")
    t = time()
    y_pred = model.predict(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Build the confusion matrix and report the metrics
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, digits=4))
2.4 SVM
# -*- coding: utf-8 -*-
import io
import sys
from time import time

import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.svm import LinearSVC

# Re-wrap stdout as UTF-8 so Chinese text prints correctly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')


def stopwordslist(filepath):
    # Read the stopword list (one word per line); a set makes lookups fast
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}


if __name__ == '__main__':
    # Load the training set
    print("Loading train dataset ...")
    t = time()
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Load the test set
    print("Loading test dataset ...")
    t = time()
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the labeled samples per class in the training set
    counts = train_data['labels'].value_counts()
    df_label = pd.DataFrame({'labels': counts.index, 'count': counts.values})
    print(df_label)

    # Load the stopwords
    print("Loading stopwords ...")
    t = time()
    stopwords = stopwordslist("stopwords.txt")
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Segment the text with jieba and filter out stopwords
    print("Starting word segmentation on train dataset...")
    t = time()
    X_train = X_train.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Starting word segmentation on test dataset...")
    t = time()
    X_test = X_test.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Build TF-IDF vectors over unigrams and bigrams
    print("Vectorizing train dataset...")
    t = time()
    tfidf = TfidfVectorizer(norm='l2', ngram_range=(1, 2))
    X_train = tfidf.fit_transform(X_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Vectorizing test dataset...")
    t = time()
    X_test = tfidf.transform(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print(X_train.shape)
    print(X_test.shape)
    print('-----------------------------')
    print(X_train)
    print('-----------------------------')
    print(X_test)

    # Train the model
    print("Training model...")
    t = time()
    model = LinearSVC()
    model.fit(X_train, y_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Predicting test dataset...")
    t = time()
    y_pred = model.predict(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Build the confusion matrix and report the metrics
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, digits=4))
2.5 LSTM
# -*- coding: utf-8 -*-
import io
import sys
from time import time

import jieba
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from keras.callbacks import EarlyStopping
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Re-wrap stdout as UTF-8 so Chinese text prints correctly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')


def stopwordslist(filepath):
    # Read the stopword list (one word per line); a set makes lookups fast
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}


if __name__ == '__main__':
    # Load the training and test sets
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the labeled samples per class in the training set
    counts = train_data['labels'].value_counts()
    df_label = pd.DataFrame({'labels': counts.index, 'count': counts.values})
    print(df_label)

    # Load the stopwords, then segment the text and filter them out
    stopwords = stopwordslist("stopwords.txt")
    X_train = X_train.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    X_test = X_test.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))

    # Keep only the 50,000 most frequent words
    MAX_NB_WORDS = 50000
    # Maximum length (in tokens) of each message
    MAX_SEQUENCE_LENGTH = 100
    # Dimension of the Embedding layer
    EMBEDDING_DIM = 200

    tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)
    tokenizer.fit_on_texts(X_train)
    word_index = tokenizer.word_index
    print('There are %s different words.' % len(word_index))

    X_train = tokenizer.texts_to_sequences(X_train)
    X_test = tokenizer.texts_to_sequences(X_test)
    # Pad or truncate so that every sequence has the same length
    X_train = pad_sequences(X_train, maxlen=MAX_SEQUENCE_LENGTH)
    X_test = pad_sequences(X_test, maxlen=MAX_SEQUENCE_LENGTH)

    # One-hot encode the labels
    y_train = pd.get_dummies(y_train).values.astype('float32')
    y_test = pd.get_dummies(y_test).values.astype('float32')
    print(X_train.shape, y_train.shape)
    print(X_test.shape, y_test.shape)

    # Load the Tencent word vectors
    wv_from_text = KeyedVectors.load_word2vec_format('tencent.txt', binary=False, unicode_errors='ignore')
    embedding_matrix = np.zeros((MAX_NB_WORDS, EMBEDDING_DIM))
    for word, i in word_index.items():
        # Valid rows are 0 .. MAX_NB_WORDS - 1, so skip any index beyond that
        if i >= MAX_NB_WORDS:
            continue
        # Words missing from the pretrained vectors keep an all-zero row
        if word in wv_from_text:
            embedding_matrix[i] = wv_from_text[word]
    del wv_from_text

    # Define and train the model
    print("Training model...")
    t = time()
    model = Sequential()
    model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X_train.shape[1],
                        weights=[embedding_matrix], trainable=False))
    model.add(SpatialDropout1D(0.2))
    model.add(LSTM(300, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()

    epochs = 10
    batch_size = 64
    history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_split=0.1,
                        callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    accr = model.evaluate(X_test, y_test)
    print('Test set\n Loss: {:0.3f}\n Accuracy: {:0.3f}'.format(accr[0], accr[1]))

    print("Predicting test dataset...")
    t = time()
    y_pred = model.predict(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    # Convert class probabilities and one-hot labels back to class indices
    y_pred = y_pred.argmax(axis=1)
    y_test = y_test.argmax(axis=1)

    # Build the confusion matrix and report the metrics
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, digits=4))
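The padding step in the script can be pictured with a small plain-Python stand-in. This is an illustrative sketch, not Keras code: `pad_pre` is a hypothetical helper that mimics `pad_sequences` under the assumption that its default settings apply (`padding='pre'`, `truncating='pre'`, `value=0`), so short messages get zeros on the left and long ones keep their last `maxlen` tokens. The word ids are made up.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` ids of each sequence."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                             # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)   # pad at the front
    return padded


# Toy word-id sequences (made up); id 0 is reserved for padding.
sequences = [[5, 12, 7], [3, 9, 4, 8, 2, 6]]
for row in pad_pre(sequences, maxlen=5):
    print(row)
```

Because the tokenizer's word indices start at 1, row 0 of the embedding matrix is never assigned a word vector, so padded positions map to an all-zero vector in the frozen embedding layer.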
2.6 BiLSTM
# -*- coding: utf-8 -*-
import io
import sys

import jieba
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from keras.callbacks import EarlyStopping
from keras.layers import Bidirectional, Dense, Embedding, LSTM, SpatialDropout1D
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Re-wrap stdout as UTF-8 so Chinese text prints correctly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')


def stopwordslist(filepath):
    # Read the stopword list (one word per line); a set makes lookups fast
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}


if __name__ == '__main__':
    # Load the training and test sets
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the labeled samples per class in the training set
    counts = train_data['labels'].value_counts()
    df_label = pd.DataFrame({'labels': counts.index, 'count': counts.values})
    print(df_label)

    # Load the stopwords, then segment the text and filter them out
    stopwords = stopwordslist("stopwords.txt")
    X_train = X_train.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))
    X_test = X_test.apply(lambda x: " ".join(w for w in jieba.cut(x) if w not in stopwords))

    # Keep only the 50,000 most frequent words
    MAX_NB_WORDS = 50000
    # Maximum length (in tokens) of each message
    MAX_SEQUENCE_LENGTH = 100
    # Dimension of the Embedding layer
    EMBEDDING_DIM = 200

    tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)
    tokenizer.fit_on_texts(X_train)
    word_index = tokenizer.word_index
    print('There are %s different words.' % len(word_index))

    X_train = tokenizer.texts_to_sequences(X_train)
    X_test = tokenizer.texts_to_sequences(X_test)
    # Pad or truncate so that every sequence has the same length
    X_train = pad_sequences(X_train, maxlen=MAX_SEQUENCE_LENGTH)
    X_test = pad_sequences(X_test, maxlen=MAX_SEQUENCE_LENGTH)

    # One-hot encode the labels
    y_train = pd.get_dummies(y_train).values.astype('float32')
    y_test = pd.get_dummies(y_test).values.astype('float32')
    print(X_train.shape, y_train.shape)
    print(X_test.shape, y_test.shape)

    # Load the Tencent word vectors
    wv_from_text = KeyedVectors.load_word2vec_format('tencent.txt', binary=False, unicode_errors='ignore')
    embedding_matrix = np.zeros((MAX_NB_WORDS, EMBEDDING_DIM))
    for word, i in word_index.items():
        # Valid rows are 0 .. MAX_NB_WORDS - 1, so skip any index beyond that
        if i >= MAX_NB_WORDS:
            continue
        # Words missing from the pretrained vectors keep an all-zero row
        if word in wv_from_text:
            embedding_matrix[i] = wv_from_text[word]
    del wv_from_text

    # Define the model
    model = Sequential()
    model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X_train.shape[1],
                        weights=[embedding_matrix], trainable=False))
    model.add(SpatialDropout1D(0.2))
    model.add(Bidirectional(LSTM(300)))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()

    epochs = 10
    batch_size = 64
    history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_split=0.1,
                        callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])

    accr = model.evaluate(X_test, y_test)
    print('Test set\n Loss: {:0.3f}\n Accuracy: {:0.3f}'.format(accr[0], accr[1]))

    # Convert class probabilities and one-hot labels back to class indices
    y_pred = model.predict(X_test)
    y_pred = y_pred.argmax(axis=1)
    y_test = y_test.argmax(axis=1)

    # Build the confusion matrix and report the metrics
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, digits=4))
2.7 BERT
import pandas as pd
from simpletransformers.model import TransformerModel
from sklearn.metrics import f1_score, accuracy_score


def f1_multiclass(labels, preds):
    # Micro-averaged F1 over all classes
    return f1_score(labels, preds, average='micro')


if __name__ == '__main__':
    # Load the training set
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    # Load the test set
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))

    # Build the model: bert-base-chinese
    model = TransformerModel('bert', 'bert-base-chinese', num_labels=2,
                             args={'learning_rate': 1e-5, 'num_train_epochs': 2,
                                   'reprocess_input_data': True, 'overwrite_output_dir': True, 'fp16': False})
    # bert-base-multilingual: replace the first two arguments with 'bert', 'bert-base-multilingual-cased'
    # roberta: replace the first two arguments with 'roberta', 'roberta-base'
    # xlmroberta: replace the first two arguments with 'xlmroberta', 'xlm-roberta-base'

    # Train the model
    model.train_model(train_data)

    # Evaluate on the test set with the extra f1 and acc metrics
    result, model_outputs, wrong_predictions = model.eval_model(test_data, f1=f1_multiclass, acc=accuracy_score)
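The `f1_multiclass` helper asks for micro-averaged F1. For single-label classification, micro-averaging pools true positives, false positives, and false negatives over all classes, and the result equals plain accuracy, so the `f1` and `acc` values returned by the evaluation should coincide. A minimal plain-Python sketch of that identity (the function `micro_f1` and the toy labels are illustrative and not part of the original script):

```python
def micro_f1(labels, preds):
    """Micro-averaged F1: pool TP/FP/FN over every class, then compute F1."""
    classes = set(labels) | set(preds)
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == p)
    # Each wrong prediction counts once as a false positive (for the
    # predicted class) and once as a false negative (for the true class).
    fp = sum(1 for c in classes for y, p in pairs if p == c and y != c)
    fn = sum(1 for c in classes for y, p in pairs if y == c and p != c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


labels = [0, 0, 0, 0, 1, 1, 0, 1]   # toy ground truth
preds = [0, 0, 1, 0, 1, 0, 0, 1]    # toy predictions
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
print(micro_f1(labels, preds), accuracy)
```

A class-sensitive summary would need a different averaging mode, such as the weighted average used in section 3.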
3. Results Comparison
To analyze the algorithms quantitatively, treat normal SMS messages as positive samples, with count P (Positive), and spam messages as negative samples, with count N (Negative); let T (True) be the number of samples the classifier labels correctly and F (False) the number it labels incorrectly. Then true positives (TP) is the number of normal messages classified correctly; false positives (FP) is the number of spam messages mistaken for normal messages; true negatives (TN) is the number of spam messages classified correctly; and false negatives (FN) is the number of normal messages mistaken for spam. On this basis, the experiments use the following five evaluation metrics:
(1) Weighted-average precision (Precision-weighted), computed as:
Precision-weighted $= (Precision_P \cdot P + Precision_N \cdot N)/(P+N)$
where $Precision_P = TP/(TP+FP)$ and $Precision_N = TN/(TN+FN)$.
(2) Weighted-average recall (Recall-weighted), computed as:
Recall-weighted $= (Recall_P \cdot P + Recall_N \cdot N)/(P+N)$
where $Recall_P = TP/(TP+FN)$ and $Recall_N = TN/(TN+FP)$.
(3) Weighted-average F1 score (F1-score-weighted), computed as:
F1-score-weighted $= (F1_P \cdot P + F1_N \cdot N)/(P+N)$
where $F1_P = 2 \cdot Precision_P \cdot Recall_P/(Precision_P + Recall_P)$ and
$F1_N = 2 \cdot Precision_N \cdot Recall_N/(Precision_N + Recall_N)$.
(4) False negative rate (FNR), computed as:
FNR $= FN/(TP+FN)$, i.e. the number of normal messages predicted as spam divided by the actual number of normal messages.
(5) True negative rate (TNR), computed as:
TNR $= TN/(TN+FP)$, i.e. the number of spam messages identified correctly divided by the actual number of spam messages; this is also the recall for spam.
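The five definitions above translate directly into code. The sketch below is a minimal plain-Python version under the same convention (normal SMS is the positive class); the function name `sms_metrics` and the confusion counts in the usage lines are made up for illustration and are not results from the experiments.

```python
def sms_metrics(tp, fn, fp, tn):
    """Compute the five metrics with normal SMS as the positive class."""
    p, n = tp + fn, tn + fp   # actual positives (normal) / negatives (spam)

    precision_p, precision_n = tp / (tp + fp), tn / (tn + fn)
    recall_p, recall_n = tp / (tp + fn), tn / (tn + fp)
    f1_p = 2 * precision_p * recall_p / (precision_p + recall_p)
    f1_n = 2 * precision_n * recall_n / (precision_n + recall_n)

    def weighted(pos, neg):
        """Average two per-class scores, weighted by class size."""
        return (pos * p + neg * n) / (p + n)

    return {
        "precision_weighted": weighted(precision_p, precision_n),
        "recall_weighted": weighted(recall_p, recall_n),
        "f1_weighted": weighted(f1_p, f1_n),
        "fnr": fn / (tp + fn),   # normal messages flagged as spam
        "tnr": tn / (tn + fp),   # spam messages correctly caught
    }


# Toy confusion counts, not taken from the experiments.
for name, value in sms_metrics(tp=90, fn=10, fp=5, tn=45).items():
    print(f"{name}: {value:.4f}")
```

Note that the weighted recall reduces to (TP + TN)/(P + N), which is overall accuracy, so FNR and TNR carry the per-class detail that the weighted recall hides.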
Model | Weighted precision | Weighted recall | Weighted F1 | FNR | TNR |
Naive Bayes | 0.9764 | 0.9761 | 0.9748 | 0.0010 | 0.7700 |
Logistic regression | 0.9886 | 0.9887 | 0.9887 | 0.0061 | 0.9414 |
Random forest | 0.9809 | 0.9808 | 0.9800 | 0.0012 | 0.8181 |
SVM | 0.9925 | 0.9924 | 0.9924 | 0.0052 | 0.9713 |
LSTM | 0.9963 | 0.9963 | 0.9963 | 0.0015 | 0.9771 |
BiLSTM | 0.9964 | 0.9964 | 0.9964 | 0.0009 | 0.9720 |
BERT | 0.9991 | 0.9991 | 0.9991 | 0.0002 | 0.9926 |
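Weighted recall is (TP + TN)/(P + N), i.e. overall accuracy, so it can be recomputed from each row's FNR and TNR together with the test-set class counts from section 1. That gives a quick consistency check on the table. The sketch below does this and also converts the rates into approximate error counts; those counts are estimates derived from the rounded rates, not figures reported by the experiments, and the variable names are illustrative.

```python
P, N = 203_805, 22_648   # normal / spam messages in the test set (section 1)

# model: (reported weighted recall, FNR, TNR), copied from the table above
results = {
    "Naive Bayes": (0.9761, 0.0010, 0.7700),
    "Logistic regression": (0.9887, 0.0061, 0.9414),
    "Random forest": (0.9808, 0.0012, 0.8181),
    "SVM": (0.9924, 0.0052, 0.9713),
    "LSTM": (0.9963, 0.0015, 0.9771),
    "BiLSTM": (0.9964, 0.0009, 0.9720),
    "BERT": (0.9991, 0.0002, 0.9926),
}

estimates = {}
for model, (recall_w, fnr, tnr) in results.items():
    implied = ((1 - fnr) * P + tnr * N) / (P + N)   # accuracy implied by FNR and TNR
    normal_flagged = round(fnr * P)                 # approx. normal messages marked as spam
    spam_missed = round((1 - tnr) * N)              # approx. spam messages let through
    estimates[model] = (implied, normal_flagged, spam_missed)
    print(f"{model:20s} reported={recall_w:.4f} implied={implied:.4f} "
          f"normal flagged~{normal_flagged} spam missed~{spam_missed}")
```

Read this way, Naive Bayes and random forest combine a low FNR with a much lower TNR (0.77 and 0.82), whereas BERT has the best value on both rates.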