Table of Contents
- 1. Text Processing
- 2. Text to Sequences
- 3. Dataset Splitting
- 4. Building the RNN Model
- 5. Training
- 6. Testing

Reference: 基于深度学习的自然语言处理 (Deep Learning for Natural Language Processing)
1. Text Processing
Data preview:
# Papers by two authors (A, B), labeled 0 and 1
A = 0        # hamilton
B = 1        # madison
UNKNOWN = -1

# Merge all papers by the same author into one string
import os

textA, textB = '', ''
for file in os.listdir('./papers/A'):
    textA += preprocessing('./papers/A/' + file)
for file in os.listdir('./papers/B'):
    textB += preprocessing('./papers/B/' + file)
- Merge each author's documents, strip \n and extra spaces, and remove the author names (to prevent label leakage).

def preprocessing(file_path):
    with open(file_path, 'r') as f:
        lines = f.readlines()
    # skip the title line, flatten newlines, lowercase,
    # and remove the author names so they cannot leak the label
    text = ' '.join(lines[1:]).replace('\n', ' ').lower()
    text = text.replace('hamilton', '').replace('madison', '')
    text = ' '.join(text.split())  # collapse repeated spaces
    return text
print("文本A的长度:{}".format(len(textA)))
print("文本B的长度:{}".format(len(textB)))文本A的长度:216394
文本B的长度:230867
2. Text to Sequences
- Use a character-level tokenizer: char_level=True
from keras.preprocessing.text import Tokenizer

char_tokenizer = Tokenizer(char_level=True)
char_tokenizer.fit_on_texts(textA + textB)                   # fit the tokenizer
long_seq_a = char_tokenizer.texts_to_sequences([textA])[0]   # text -> id sequence
long_seq_b = char_tokenizer.texts_to_sequences([textB])[0]
Xa, ya = make_subsequence(long_seq_a, A)                     # split into equal-length subsequence samples
Xb, yb = make_subsequence(long_seq_b, B)
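To see what a character-level tokenizer actually produces, here is a minimal sketch on a toy string (toy_tok and the example text are my own additions, not part of the original data): every character, including spaces and punctuation, gets an integer id ordered by frequency.

# Toy illustration of char_level=True (hypothetical example, not the Federalist data)
from keras.preprocessing.text import Tokenizer

toy_tok = Tokenizer(char_level=True)
toy_tok.fit_on_texts(['to be or not to be'])
print(toy_tok.word_index)                        # e.g. {' ': 1, 'o': 2, 't': 3, 'b': 4, 'e': 5, ...}
print(toy_tok.texts_to_sequences(['to be'])[0])  # the same text as a list of character ids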
- Split the id sequence into equal-length subsequence samples.

SEQ_LEN = 30  # length of each subsequence (a hyperparameter)

import numpy as np

def make_subsequence(long_seq, label, seq_len=SEQ_LEN):
    numofsubseq = len(long_seq) - seq_len + 1   # number of windows the slider can produce
    X = np.zeros((numofsubseq, seq_len))        # data
    y = np.zeros((numofsubseq, 1))              # labels
    for i in range(numofsubseq):
        X[i] = long_seq[i:i+seq_len]            # sliding window of size seq_len
        y[i] = label
    return X, y
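A quick sanity check of the sliding window on a toy sequence (X_demo / y_demo and the input values are hypothetical, just to show the shape of the output): a sequence of length 5 with seq_len=3 yields 5 - 3 + 1 = 3 samples.

X_demo, y_demo = make_subsequence([1, 2, 3, 4, 5], label=0, seq_len=3)
print(X_demo)  # [[1. 2. 3.]
               #  [2. 3. 4.]
               #  [3. 4. 5.]]
print(y_demo)  # [[0.] [0.] [0.]]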
print('Number of distinct characters: {}'.format(len(char_tokenizer.word_index))) # 52
# {' ': 1, 'e': 2, 't': 3, 'o': 4, 'i': 5, 'n': 6, 'a': 7, 's': 8, 'r': 9, 'h': 10,
# 'l': 11, 'd': 12, 'c': 13, 'u': 14, 'f': 15, 'm': 16, 'p': 17, 'b': 18, 'y': 19, 'w': 20,
# ',': 21, 'g': 22, 'v': 23, '.': 24, 'x': 25, 'k': 26, 'j': 27, ';': 28, 'q': 29, 'z': 30,
# '-': 31, '?': 32, '"': 33, '1': 34, ':': 35, '8': 36, '7': 37, '(': 38, ')': 39, '2': 40,
# '0': 41, '3': 42, '4': 43, '6': 44, "'": 45, '!': 46, ']': 47, '5': 48, '[': 49, '@': 50,
# '9': 51, '%': 52}
print('Training set from A: {}'.format(Xa.shape))
print('Training set from B: {}'.format(Xb.shape))

Training set from A: (216365, 30)
Training set from B: (230838, 30)

As expected, a text of length L yields L - SEQ_LEN + 1 samples, e.g. 216394 - 30 + 1 = 216365 for text A.
3. Dataset Splitting
- Mix the A and B datasets

# stack the A and B training data together
X = np.vstack((Xa, Xb))
y = np.vstack((ya, yb))
- Train/test split

# split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
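A small variant worth noting (my addition, not in the original): passing random_state makes the split reproducible, and stratify=y keeps the A/B ratio identical in the training and test sets.

# Reproducible, class-balanced split (sketch)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)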
4. Building the RNN Model

from keras.models import Sequential
from keras.layers import SimpleRNN, Dense, Embedding

Embedding_dim = 128  # dimension of the embedding output
RNN_size = 256       # number of RNN units

model = Sequential()
model.add(Embedding(input_dim=len(char_tokenizer.word_index)+1,
                    output_dim=Embedding_dim,
                    input_length=SEQ_LEN))
model.add(SimpleRNN(units=RNN_size, return_sequences=False))  # return only the last output in the output sequence
model.add(Dense(1, activation='sigmoid'))  # binary classification
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
Model structure:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 30, 128) 6784
_________________________________________________________________
simple_rnn (SimpleRNN) (None, 256) 98560
_________________________________________________________________
dense (Dense) (None, 1) 257
=================================================================
Total params: 105,601
Trainable params: 105,601
Non-trainable params: 0
_________________________________________________________________
If return_sequences=True, the output shapes of the last two layers become the following (a sequence-length dimension is added):
simple_rnn_1 (SimpleRNN) (None, 30, 256) 98560
_________________________________________________________________
dense_1 (Dense) (None, 30, 1) 257
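As a back-of-the-envelope check (my own calculation, not from the original post), the parameter counts in the summary follow directly from the layer sizes: the embedding holds one vector per id (52 characters plus the padding id 0), the SimpleRNN has input weights, recurrent weights and a bias, and the dense head maps 256 units to a single sigmoid output.

vocab, emb_dim, rnn = 52, 128, 256

emb_params   = (vocab + 1) * emb_dim         # 53 * 128 = 6784
rnn_params   = emb_dim*rnn + rnn*rnn + rnn   # input weights + recurrent weights + bias = 98560
dense_params = rnn * 1 + 1                   # weights + bias = 257
print(emb_params, rnn_params, dense_params, emb_params + rnn_params + dense_params)
# 6784 98560 257 105601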
5. Training

batch_size = 4096  # number of samples per gradient update
epochs = 20        # number of training epochs

history = model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs,
                    validation_data=(X_test, y_test), verbose=1)
Epoch 1/20
88/88 [==============================] - 59s 669ms/step - loss: 0.6877 - accuracy: 0.5436 - val_loss: 0.6856 - val_accuracy: 0.5540
Epoch 2/20
88/88 [==============================] - 56s 634ms/step - loss: 0.6830 - accuracy: 0.5564 - val_loss: 0.6844 - val_accuracy: 0.5550
Epoch 3/20
88/88 [==============================] - 56s 633ms/step - loss: 0.6825 - accuracy: 0.5577 - val_loss: 0.6829 - val_accuracy: 0.5563
Epoch 4/20
88/88 [==============================] - 56s 634ms/step - loss: 0.6816 - accuracy: 0.5585 - val_loss: 0.6788 - val_accuracy: 0.5641
Epoch 5/20
88/88 [==============================] - 56s 637ms/step - loss: 0.6714 - accuracy: 0.5813 - val_loss: 0.6670 - val_accuracy: 0.5877
Epoch 6/20
88/88 [==============================] - 56s 637ms/step - loss: 0.6532 - accuracy: 0.6113 - val_loss: 0.6435 - val_accuracy: 0.6235
Epoch 7/20
88/88 [==============================] - 57s 648ms/step - loss: 0.6287 - accuracy: 0.6424 - val_loss: 0.6159 - val_accuracy: 0.6563
Epoch 8/20
88/88 [==============================] - 55s 620ms/step - loss: 0.5932 - accuracy: 0.6807 - val_loss: 0.5747 - val_accuracy: 0.6971
Epoch 9/20
88/88 [==============================] - 54s 615ms/step - loss: 0.5383 - accuracy: 0.7271 - val_loss: 0.5822 - val_accuracy: 0.7178
Epoch 10/20
88/88 [==============================] - 56s 632ms/step - loss: 0.4803 - accuracy: 0.7687 - val_loss: 0.4536 - val_accuracy: 0.7846
Epoch 11/20
88/88 [==============================] - 61s 690ms/step - loss: 0.3979 - accuracy: 0.8190 - val_loss: 0.3940 - val_accuracy: 0.8195
Epoch 12/20
88/88 [==============================] - 60s 687ms/step - loss: 0.3257 - accuracy: 0.8572 - val_loss: 0.3248 - val_accuracy: 0.8564
Epoch 13/20
88/88 [==============================] - 59s 668ms/step - loss: 0.2637 - accuracy: 0.8897 - val_loss: 0.2980 - val_accuracy: 0.8742
Epoch 14/20
88/88 [==============================] - 56s 638ms/step - loss: 0.2154 - accuracy: 0.9115 - val_loss: 0.2326 - val_accuracy: 0.9023
Epoch 15/20
88/88 [==============================] - 56s 639ms/step - loss: 0.1822 - accuracy: 0.9277 - val_loss: 0.2112 - val_accuracy: 0.9130
Epoch 16/20
88/88 [==============================] - 56s 640ms/step - loss: 0.1504 - accuracy: 0.9412 - val_loss: 0.1803 - val_accuracy: 0.9267
Epoch 17/20
88/88 [==============================] - 58s 660ms/step - loss: 0.1298 - accuracy: 0.9499 - val_loss: 0.1662 - val_accuracy: 0.9331
Epoch 18/20
88/88 [==============================] - 57s 643ms/step - loss: 0.1132 - accuracy: 0.9567 - val_loss: 0.1643 - val_accuracy: 0.9358
Epoch 19/20
88/88 [==============================] - 58s 659ms/step - loss: 0.1018 - accuracy: 0.9613 - val_loss: 0.1409 - val_accuracy: 0.9441
Epoch 20/20
88/88 [==============================] - 57s 642ms/step - loss: 0.0907 - accuracy: 0.9659 - val_loss: 0.1325 - val_accuracy: 0.9475
- Plot the training history
import pandas as pd
import matplotlib.pyplot as plt
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.gca().set_ylim(0, 1) # set the vertical range to [0-1]
plt.show()
6. Testing

# testing
for file in os.listdir('./papers/Unknown'):
    # preprocess the test text
    unk_file = preprocessing('./papers/Unknown/' + file)
    # convert the text to an id sequence
    unk_file_seq = char_tokenizer.texts_to_sequences([unk_file])[0]
    # extract fixed-length subsequences to form multiple samples
    X_unk, _ = make_subsequence(unk_file_seq, UNKNOWN)
    # predict
    y_pred = model.predict(X_unk)
    y_pred = y_pred > 0.5
    votesA = np.sum(y_pred == 0)
    votesB = np.sum(y_pred == 1)
    print("Paper {} is predicted to be written by {}, votes {} : {}".format(
        file,
        "A:hamilton" if votesA > votesB else "B:madison",
        max(votesA, votesB),
        min(votesA, votesB)))
Output: the authors of all 5 papers were predicted correctly.

Paper paper_1.txt is predicted to be written by B:madison, votes 12211 : 8563
Paper paper_2.txt is predicted to be written by B:madison, votes 10899 : 8747
Paper paper_3.txt is predicted to be written by A:hamilton, votes 7041 : 6343
Paper paper_4.txt is predicted to be written by A:hamilton, votes 5063 : 4710
Paper paper_5.txt is predicted to be written by A:hamilton, votes 6878 : 4876
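A possible variant of the aggregation step (my addition, not part of the original post): instead of hard-voting each window, average the predicted probabilities so the model's confidence is taken into account. The sketch below assumes model and X_unk as defined above.

# Aggregate by mean predicted probability per paper instead of counting hard votes
probs = model.predict(X_unk).ravel()   # per-window P(label == 1), i.e. P(B: madison)
mean_prob_B = probs.mean()
author = "B:madison" if mean_prob_B > 0.5 else "A:hamilton"
print("mean P(B) = {:.3f} -> {}".format(mean_prob_B, author))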