pytorch 实现中文文本分类

🍨 本文为[🔗365天深度学习训练营学习记录博客🍦 参考文章:365天深度学习训练营🍖 原作者:[K同学啊 | 接辅导、项目定制]\n🚀 文章来源:[K同学的学习圈子](https://www.yuque.com/mingtian-fkmxf/zxwb45)
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms, datasets
import os, PIL, pathlib, warningswarnings.filterwarnings("ignore")  # 忽略警告信息# win10系统
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

train.csv 链接:https://pan.baidu.com/s/1Vnyvo5T5eSuzb0VwTsznqA?pwd=fqok 提取码:fqok 
import pandas as pd# 加载自定义中文数据集
train_data = pd.read_csv('D:/train.csv', sep='\t', header=None)
train_data.head()# 构建数据集迭代器
def coustom_data_iter(texts, labels):for x, y in zip(texts, labels):yield x, ytrain_iter = coustom_data_iter(train_data[0].values[:], train_data[1].values[:])

1.构建词典:

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import jieba# 中文分词方法
tokenizer = jieba.lcutdef yield_tokens(data_iter):for text, in data_iter:yield tokenizer(text)vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

 调用vocab(词汇表)对一个中文句子进行索引转换,这个句子被分词后得到的词汇列表会被转换成它们在词汇表中的索引。

print(vocab(['我', '想', '看', '书', '和', '你', '一起', '看', '电影', '的', '新款', '视频']))

生成一个标签列表,用于查看在数据集中所有可能的标签类型。 

label_name = list(set(train_data[1].values[:]))
print(label_name)

 创建了两个lambda函数,一个用于将文本转换成词汇索引,另一个用于将标签文本转换成它们在label_name列表中的索引。

text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: label_name.index(x)print(text_pipeline('我想看新闻或者上网站看最新的游戏视频'))
print(label_pipeline('Video-Play'))

2.生成数据批次和迭代器

from torch.utils.data import DataLoaderdef collate_batch(batch):label_list, text_list, offsets = [], [], [0]for (_text, _label) in batch:# 标签列表label_list.append(label_pipeline(_label))# 文本列表processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)text_list.append(processed_text)# 偏移量,即词汇的起始位置offsets.append(processed_text.size(0))label_list = torch.tensor(label_list, dtype=torch.int64)text_list = torch.cat(text_list)offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)  # 累计偏移量dim中维度元素的累计和return text_list.to(device), label_list.to(device), offsets.to(device)# 数据加载器,调用示例
dataloader = DataLoader(train_iter,batch_size=8,shuffle=False,collate_fn=collate_batch)
  • collate_batch函数用于处理数据加载器中的批次。它接收一个批次的数据,处理它,并返回适合模型训练的数据格式。
  • 在这个函数内部,它遍历批次中的每个文本和标签对,将标签添加到label_list,将文本通过text_pipeline函数处理后转换为tensor,并添加到text_list
  • offsets列表用于存储每个文本的长度,这对于后续的文本处理非常有用,尤其是当你需要知道每个文本在拼接的大tensor中的起始位置时。
  • text_listtorch.cat进行拼接,形成一个连续的tensor。
  • offsets列表的最后一个元素不包括,然后使用cumsum函数在第0维计算累积和,这为每个序列提供了一个累计的偏移量。

3.搭建模型与初始化

from torch import nnclass TextClassificationModel(nn.Module):def __init__(self, vocab_size, embed_dim, num_class):super(TextClassificationModel, self).__init__()self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)self.fc = nn.Linear(embed_dim, num_class)self.init_weights()def init_weights(self):initrange = 0.5self.embedding.weight.data.uniform_(-initrange, initrange)self.fc.weight.data.uniform_(-initrange, initrange)self.fc.bias.data.zero_()def forward(self, text, offsets):embedded = self.embedding(text, offsets)return self.fc(embedded)num_class = len(label_name)  # 类别数,根据label_name的长度确定
vocab_size = len(vocab)      # 词汇表的大小,根据vocab的长度确定
em_size = 64                 # 嵌入向量的维度设置为64
model = TextClassificationModel(vocab_size, em_size, num_class).to(device)  # 创建模型实例并移动到计算设备

4.模型训练及评估函数

trainevaluate分别用于训练和评估文本分类模型。

训练函数 train 的工作流程如下:

  1. 将模型设置为训练模式。
  2. 初始化总准确率、训练损失和总计数变量。
  3. 记录训练开始的时间。
  4. 遍历数据加载器,对每个批次:
    • 进行预测。
    • 清零优化器的梯度。
    • 计算损失(使用一个损失函数,例如交叉熵)。
    • 反向传播计算梯度。
    • 通过梯度裁剪防止梯度爆炸。
    • 执行一步优化器更新模型权重。
  5. 更新总准确率和总损失。
  6. 每隔一定间隔,打印训练进度和统计信息。

评估函数 evaluate 的工作流程如下:

  1. 将模型设置为评估模式。
  2. 初始化总准确率和总损失。
  3. 不计算梯度(为了节省内存和计算资源)。
  4. 遍历数据加载器,对每个批次:
    • 进行预测。
    • 计算损失。
    • 更新总准确率和总损失。
  5. 返回整体的准确率和平均损失。

代码实现:

import timedef train(dataloader):model.train()  # 切换到训练模式total_acc, train_loss, total_count = 0, 0, 0log_interval = 50start_time = time.time()for idx, (text, label, offsets) in enumerate(dataloader):predicted_label = model(text, offsets)optimizer.zero_grad()  # 梯度归零loss = criterion(predicted_label, label)  # 计算损失loss.backward()  # 反向传播torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)  # 梯度裁剪optimizer.step()  # 优化器更新权重# 记录acc和losstotal_acc += (predicted_label.argmax(1) == label).sum().item()train_loss += loss.item()total_count += label.size(0)if idx % log_interval == 0 and idx > 0:elapsed = time.time() - start_timeprint('| epoch {:3d} | {:5d}/{:5d} batches ''| accuracy {:8.3f} | loss {:8.5f}'.format(epoch, idx, len(dataloader),total_acc/total_count, train_loss/total_count))total_acc, train_loss, total_count = 0, 0, 0start_time = time.time()def evaluate(dataloader):model.eval()  # 切换到评估模式total_acc, total_count = 0, 0with torch.no_grad():for idx, (text, label, offsets) in enumerate(dataloader):predicted_label = model(text, offsets)loss = criterion(predicted_label, label)  # 计算losstotal_acc += (predicted_label.argmax(1) == label).sum().item()total_count += label.size(0)return total_acc/total_count, total_count

5.模型训练

  • 设置训练的轮数、学习率和批次大小。
  • 定义交叉熵损失函数、随机梯度下降优化器和学习率调度器。
  • 将训练数据转换为一个map样式的数据集,并将其分成训练集和验证集。
  • 创建训练和验证的数据加载器。
  • 开始训练循环,每个epoch都会训练模型并在验证集上评估模型的准确率和损失。
  • 如果验证准确率没有提高,则按计划降低学习率。
  • 打印每个epoch结束时的统计信息,包括时间、准确率、损失和学习率。
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset
# 参数设置
EPOCHS = 10  # epoch数量
LR = 5  # 学习速率
BATCH_SIZE = 64  # 训练的batch大小# 设置损失函数、优化器和调度器
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None# 准备数据集
train_iter = coustom_data_iter(train_data[0].values[:], train_data[1].values[:])
train_dataset = to_map_style_dataset(train_iter)split_train_, split_valid_ = random_split(train_dataset,[int(len(train_dataset)*0.8), int(len(train_dataset)*0.2)])train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,shuffle=True, collate_fn=collate_batch)valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,shuffle=True, collate_fn=collate_batch)# 训练循环
for epoch in range(1, EPOCHS + 1):epoch_start_time = time.time()train(train_dataloader)val_acc, val_loss = evaluate(valid_dataloader)# 更新学习率的策略lr = optimizer.state_dict()['param_groups'][0]['lr']if total_accu is not None and total_accu > val_acc:scheduler.step()else:total_accu = val_accprint('-' * 69)print('| end of epoch {:3d} | time: {:4.2f}s | ''valid accuracy {:4.3f} | valid loss {:4.3f} | lr {:4.6f}'.format(epoch, time.time() - epoch_start_time, val_acc, val_loss, lr))print('-' * 69)

 运行结果:

| epoch   1 |    50/  152 batches | accuracy    0.423 | loss  0.03079
| epoch   1 |   100/  152 batches | accuracy    0.700 | loss  0.01912
| epoch   1 |   150/  152 batches | accuracy    0.776 | loss  0.01347
---------------------------------------------------------------------
| end of epoch   1 | time: 1.53s | valid accuracy 0.777 | valid loss 2420.000 | lr 5.000000
| epoch   2 |    50/  152 batches | accuracy    0.812 | loss  0.01056
| epoch   2 |   100/  152 batches | accuracy    0.843 | loss  0.00871
| epoch   2 |   150/  152 batches | accuracy    0.844 | loss  0.00846
---------------------------------------------------------------------
| end of epoch   2 | time: 1.45s | valid accuracy 0.842 | valid loss 2420.000 | lr 5.000000
| epoch   3 |    50/  152 batches | accuracy    0.883 | loss  0.00653
| epoch   3 |   100/  152 batches | accuracy    0.879 | loss  0.00634
| epoch   3 |   150/  152 batches | accuracy    0.883 | loss  0.00627
---------------------------------------------------------------------
| end of epoch   3 | time: 1.44s | valid accuracy 0.865 | valid loss 2420.000 | lr 5.000000
| epoch   4 |    50/  152 batches | accuracy    0.912 | loss  0.00498
| epoch   4 |   100/  152 batches | accuracy    0.906 | loss  0.00495
| epoch   4 |   150/  152 batches | accuracy    0.915 | loss  0.00461
---------------------------------------------------------------------
| end of epoch   4 | time: 1.50s | valid accuracy 0.876 | valid loss 2420.000 | lr 5.000000
| epoch   5 |    50/  152 batches | accuracy    0.935 | loss  0.00386
| epoch   5 |   100/  152 batches | accuracy    0.934 | loss  0.00390
| epoch   5 |   150/  152 batches | accuracy    0.932 | loss  0.00362
---------------------------------------------------------------------
| end of epoch   5 | time: 1.59s | valid accuracy 0.881 | valid loss 2420.000 | lr 5.000000
| epoch   6 |    50/  152 batches | accuracy    0.947 | loss  0.00313
| epoch   6 |   100/  152 batches | accuracy    0.949 | loss  0.00307
| epoch   6 |   150/  152 batches | accuracy    0.949 | loss  0.00286
---------------------------------------------------------------------
| end of epoch   6 | time: 1.68s | valid accuracy 0.891 | valid loss 2420.000 | lr 5.000000
| epoch   7 |    50/  152 batches | accuracy    0.960 | loss  0.00243
| epoch   7 |   100/  152 batches | accuracy    0.963 | loss  0.00224
| epoch   7 |   150/  152 batches | accuracy    0.959 | loss  0.00252
---------------------------------------------------------------------
| end of epoch   7 | time: 1.53s | valid accuracy 0.892 | valid loss 2420.000 | lr 5.000000
| epoch   8 |    50/  152 batches | accuracy    0.972 | loss  0.00186
| epoch   8 |   100/  152 batches | accuracy    0.974 | loss  0.00184
| epoch   8 |   150/  152 batches | accuracy    0.967 | loss  0.00201
---------------------------------------------------------------------
| end of epoch   8 | time: 1.43s | valid accuracy 0.895 | valid loss 2420.000 | lr 5.000000
| epoch   9 |    50/  152 batches | accuracy    0.981 | loss  0.00138
| epoch   9 |   100/  152 batches | accuracy    0.977 | loss  0.00165
| epoch   9 |   150/  152 batches | accuracy    0.980 | loss  0.00147
---------------------------------------------------------------------
| end of epoch   9 | time: 1.48s | valid accuracy 0.900 | valid loss 2420.000 | lr 5.000000
| epoch  10 |    50/  152 batches | accuracy    0.987 | loss  0.00117
| epoch  10 |   100/  152 batches | accuracy    0.985 | loss  0.00121
| epoch  10 |   150/  152 batches | accuracy    0.984 | loss  0.00121
---------------------------------------------------------------------
| end of epoch  10 | time: 1.45s | valid accuracy 0.902 | valid loss 2420.000 | lr 5.000000
---------------------------------------------------------------------

6.模型评估

test_acc, test_loss = evaluate(valid_dataloader)
print('模型的准确率: {:5.4f}'.format(test_acc))

7.模型测试

def predict(text, text_pipeline):with torch.no_grad():text = torch.tensor(text_pipeline(text))output = model(text, torch.tensor([0]))return output.argmax(1).item()# 示例文本字符串
# ex_text_str = "例句输入——这是一个待预测类别的示例句子"
ex_text_str = "这不仅影响到我们的方案是否可行13号的"model = model.to("cpu")print("该文本的类别是: %s" % label_name[predict(ex_text_str, text_pipeline)])

 8.全部代码(部分修改):

import torch
import torch.nn as nn
import torchvision
from torchvision import transforms, datasets
import os, PIL, pathlib, warningswarnings.filterwarnings("ignore")  # 忽略警告信息# win10系统
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)import pandas as pd# 加载自定义中文数据集
train_data = pd.read_csv('D:/train.csv', sep='\t', header=None)
train_data.head()# 构建数据集迭代器
def custom_data_iter(texts, labels):for x, y in zip(texts, labels):yield x, ytrain_iter = custom_data_iter(train_data[0].values[:], train_data[1].values[:])from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import jieba# 中文分词方法
tokenizer = jieba.lcutdef yield_tokens(data_iter):for text,_ in data_iter:yield tokenizer(text)vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])print(vocab(['我', '想', '看', '书', '和', '你', '一起', '看', '电影', '的', '新款', '视频']))label_name = list(set(train_data[1].values[:]))
print(label_name)text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: label_name.index(x)print(text_pipeline('我想看新闻或者上网站看最新的游戏视频'))
print(label_pipeline('Video-Play'))from torch.utils.data import DataLoaderdef collate_batch(batch):label_list, text_list, offsets = [], [], [0]for (_text, _label) in batch:# 标签列表label_list.append(label_pipeline(_label))# 文本列表processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)text_list.append(processed_text)# 偏移量,即词汇的起始位置offsets.append(processed_text.size(0))label_list = torch.tensor(label_list, dtype=torch.int64)text_list = torch.cat(text_list)offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)  # 累计偏移量dim中维度元素的累计和return text_list.to(device), label_list.to(device), offsets.to(device)# 数据加载器,调用示例
dataloader = DataLoader(train_iter,batch_size=8,shuffle=False,collate_fn=collate_batch)from torch import nnclass TextClassificationModel(nn.Module):def __init__(self, vocab_size, embed_dim, num_class):super(TextClassificationModel, self).__init__()self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)self.fc = nn.Linear(embed_dim, num_class)self.init_weights()def init_weights(self):initrange = 0.5self.embedding.weight.data.uniform_(-initrange, initrange)self.fc.weight.data.uniform_(-initrange, initrange)self.fc.bias.data.zero_()def forward(self, text, offsets):embedded = self.embedding(text, offsets)return self.fc(embedded)
num_class = len(label_name)
vocab_size = len(vocab)
em_size = 64
model = TextClassificationModel(vocab_size, em_size, num_class).to(device)import timedef train(dataloader):model.train()  # 切换到训练模式total_acc, train_loss, total_count = 0, 0, 0log_interval = 50start_time = time.time()for idx, (text, label, offsets) in enumerate(dataloader):predicted_label = model(text, offsets)optimizer.zero_grad()  # 梯度归零loss = criterion(predicted_label, label)  # 计算损失loss.backward()  # 反向传播torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)  # 梯度裁剪optimizer.step()  # 优化器更新权重# 记录acc和losstotal_acc += (predicted_label.argmax(1) == label).sum().item()train_loss += loss.item()total_count += label.size(0)if idx % log_interval == 0 and idx > 0:elapsed = time.time() - start_timeprint('| epoch {:3d} | {:5d}/{:5d} batches ''| accuracy {:8.3f} | loss {:8.5f}'.format(epoch, idx, len(dataloader),total_acc/total_count, train_loss/total_count))total_acc, train_loss, total_count = 0, 0, 0start_time = time.time()def evaluate(dataloader):model.eval()  # 切换到评估模式total_acc, total_count = 0, 0with torch.no_grad():for idx, (text, label, offsets) in enumerate(dataloader):predicted_label = model(text, offsets)loss = criterion(predicted_label, label)  # 计算losstotal_acc += (predicted_label.argmax(1) == label).sum().item()total_count += label.size(0)return total_acc/total_count, total_countfrom torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset
# 参数设置
EPOCHS = 10  # epoch数量
LR = 5  # 学习速率
BATCH_SIZE = 64  # 训练的batch大小# 设置损失函数、优化器和调度器
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None# 准备数据集
train_iter = custom_data_iter(train_data[0].values[:], train_data[1].values[:])
train_dataset = to_map_style_dataset(train_iter)split_train_, split_valid_ = random_split(train_dataset,[int(len(train_dataset)*0.8), int(len(train_dataset)*0.2)])train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,shuffle=True, collate_fn=collate_batch)valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,shuffle=True, collate_fn=collate_batch)# 训练循环
for epoch in range(1, EPOCHS + 1):epoch_start_time = time.time()train(train_dataloader)val_acc, val_loss = evaluate(valid_dataloader)# 更新学习率的策略lr = optimizer.state_dict()['param_groups'][0]['lr']if total_accu is not None and total_accu > val_acc:scheduler.step()else:total_accu = val_accprint('-' * 69)print('| end of epoch {:3d} | time: {:4.2f}s | ''valid accuracy {:4.3f} | valid loss {:4.3f} | lr {:4.6f}'.format(epoch, time.time() - epoch_start_time, val_acc, val_loss, lr))print('-' * 69)test_acc, test_loss = evaluate(valid_dataloader)
print('模型的准确率: {:5.4f}'.format(test_acc))def predict(text, text_pipeline):with torch.no_grad():text = torch.tensor(text_pipeline(text))output = model(text, torch.tensor([0]))return output.argmax(1).item()# 示例文本字符串
# ex_text_str = "例句输入——这是一个待预测类别的示例句子"
ex_text_str = "这不仅影响到我们的方案是否可行13号的"model = model.to("cpu")print("该文本的类别是: %s" % label_name[predict(ex_text_str, text_pipeline)])

9.代码改进及优化

9.1优化器: 尝试不同的优化算法,如Adam、RMSprop替换原来的SGD优化器部分

9.1.1使用Adam优化器:

import torch
import torch.nn as nn
import torchvision
from torchvision import transforms, datasets
import os, PIL, pathlib, warningswarnings.filterwarnings("ignore")  # 忽略警告信息# win10系统
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)import pandas as pd# 加载自定义中文数据集
train_data = pd.read_csv('D:/train.csv', sep='\t', header=None)
train_data.head()# 构建数据集迭代器
def custom_data_iter(texts, labels):for x, y in zip(texts, labels):yield x, ytrain_iter = custom_data_iter(train_data[0].values[:], train_data[1].values[:])from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import jieba# 中文分词方法
tokenizer = jieba.lcutdef yield_tokens(data_iter):for text,_ in data_iter:yield tokenizer(text)vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])print(vocab(['我', '想', '看', '书', '和', '你', '一起', '看', '电影', '的', '新款', '视频']))label_name = list(set(train_data[1].values[:]))
print(label_name)text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: label_name.index(x)print(text_pipeline('我想看新闻或者上网站看最新的游戏视频'))
print(label_pipeline('Video-Play'))from torch.utils.data import DataLoaderdef collate_batch(batch):label_list, text_list, offsets = [], [], [0]for (_text, _label) in batch:# 标签列表label_list.append(label_pipeline(_label))# 文本列表processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)text_list.append(processed_text)# 偏移量,即词汇的起始位置offsets.append(processed_text.size(0))label_list = torch.tensor(label_list, dtype=torch.int64)text_list = torch.cat(text_list)offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)  # 累计偏移量dim中维度元素的累计和return text_list.to(device), label_list.to(device), offsets.to(device)# 数据加载器,调用示例
dataloader = DataLoader(train_iter,batch_size=8,shuffle=False,collate_fn=collate_batch)from torch import nnclass TextClassificationModel(nn.Module):def __init__(self, vocab_size, embed_dim, num_class):super(TextClassificationModel, self).__init__()self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)self.fc = nn.Linear(embed_dim, num_class)self.init_weights()def init_weights(self):initrange = 0.5self.embedding.weight.data.uniform_(-initrange, initrange)self.fc.weight.data.uniform_(-initrange, initrange)self.fc.bias.data.zero_()def forward(self, text, offsets):embedded = self.embedding(text, offsets)return self.fc(embedded)
num_class = len(label_name)
vocab_size = len(vocab)
em_size = 64
model = TextClassificationModel(vocab_size, em_size, num_class).to(device)import timedef train(dataloader):model.train()  # 切换到训练模式total_acc, train_loss, total_count = 0, 0, 0log_interval = 50start_time = time.time()for idx, (text, label, offsets) in enumerate(dataloader):predicted_label = model(text, offsets)optimizer.zero_grad()  # 梯度归零loss = criterion(predicted_label, label)  # 计算损失loss.backward()  # 反向传播torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)  # 梯度裁剪optimizer.step()  # 优化器更新权重# 记录acc和losstotal_acc += (predicted_label.argmax(1) == label).sum().item()train_loss += loss.item()total_count += label.size(0)if idx % log_interval == 0 and idx > 0:elapsed = time.time() - start_timeprint('| epoch {:3d} | {:5d}/{:5d} batches ''| accuracy {:8.3f} | loss {:8.5f}'.format(epoch, idx, len(dataloader),total_acc/total_count, train_loss/total_count))total_acc, train_loss, total_count = 0, 0, 0start_time = time.time()def evaluate(dataloader):model.eval()  # 切换到评估模式total_acc, total_count = 0, 0with torch.no_grad():for idx, (text, label, offsets) in enumerate(dataloader):predicted_label = model(text, offsets)loss = criterion(predicted_label, label)  # 计算losstotal_acc += (predicted_label.argmax(1) == label).sum().item()total_count += label.size(0)return total_acc/total_count, total_countfrom torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset
# 参数设置
EPOCHS = 10  # epoch数量
LR = 5  # 学习速率
BATCH_SIZE = 64  # 训练的batch大小# 设置损失函数、优化器和调度器
criterion = torch.nn.CrossEntropyLoss()
#optimizer = torch.optim.SGD(model.parameters(), lr=LR)
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None# 准备数据集
train_iter = custom_data_iter(train_data[0].values[:], train_data[1].values[:])
train_dataset = to_map_style_dataset(train_iter)split_train_, split_valid_ = random_split(train_dataset,[int(len(train_dataset)*0.8), int(len(train_dataset)*0.2)])train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,shuffle=True, collate_fn=collate_batch)valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,shuffle=True, collate_fn=collate_batch)# 训练循环
for epoch in range(1, EPOCHS + 1):epoch_start_time = time.time()train(train_dataloader)val_acc, val_loss = evaluate(valid_dataloader)# 更新学习率的策略lr = optimizer.state_dict()['param_groups'][0]['lr']if total_accu is not None and total_accu > val_acc:scheduler.step()else:total_accu = val_accprint('-' * 69)print('| end of epoch {:3d} | time: {:4.2f}s | ''valid accuracy {:4.3f} | valid loss {:4.3f} | lr {:4.6f}'.format(epoch, time.time() - epoch_start_time, val_acc, val_loss, lr))print('-' * 69)test_acc, test_loss = evaluate(valid_dataloader)
print('模型的准确率: {:5.4f}'.format(test_acc))def predict(text, text_pipeline):with torch.no_grad():text = torch.tensor(text_pipeline(text))output = model(text, torch.tensor([0]))return output.argmax(1).item()# 示例文本字符串
# ex_text_str = "例句输入——这是一个待预测类别的示例句子"
ex_text_str = "这不仅影响到我们的方案是否可行13号的"model = model.to("cpu")print("该文本的类别是: %s" % label_name[predict(ex_text_str, text_pipeline)])

 效果略差于SGD优化器

9.1.2调参:

 效果较SGD优化器提升1个百分点 

 9.1.2使用RMSprop优化器:

import torch
import torch.nn as nn
import torchvision
from torchvision import transforms, datasets
import os, PIL, pathlib, warningswarnings.filterwarnings("ignore")  # 忽略警告信息# win10系统
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)import pandas as pd# 加载自定义中文数据集
train_data = pd.read_csv('D:/train.csv', sep='\t', header=None)
train_data.head()# 构建数据集迭代器
def custom_data_iter(texts, labels):for x, y in zip(texts, labels):yield x, ytrain_iter = custom_data_iter(train_data[0].values[:], train_data[1].values[:])from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import jieba# 中文分词方法
tokenizer = jieba.lcutdef yield_tokens(data_iter):for text,_ in data_iter:yield tokenizer(text)vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])print(vocab(['我', '想', '看', '书', '和', '你', '一起', '看', '电影', '的', '新款', '视频']))label_name = list(set(train_data[1].values[:]))
print(label_name)text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: label_name.index(x)print(text_pipeline('我想看新闻或者上网站看最新的游戏视频'))
print(label_pipeline('Video-Play'))from torch.utils.data import DataLoaderdef collate_batch(batch):label_list, text_list, offsets = [], [], [0]for (_text, _label) in batch:# 标签列表label_list.append(label_pipeline(_label))# 文本列表processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)text_list.append(processed_text)# 偏移量,即词汇的起始位置offsets.append(processed_text.size(0))label_list = torch.tensor(label_list, dtype=torch.int64)text_list = torch.cat(text_list)offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)  # 累计偏移量dim中维度元素的累计和return text_list.to(device), label_list.to(device), offsets.to(device)# 数据加载器,调用示例
dataloader = DataLoader(train_iter,batch_size=8,shuffle=False,collate_fn=collate_batch)from torch import nnclass TextClassificationModel(nn.Module):def __init__(self, vocab_size, embed_dim, num_class):super(TextClassificationModel, self).__init__()self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)self.fc = nn.Linear(embed_dim, num_class)self.init_weights()def init_weights(self):initrange = 0.5self.embedding.weight.data.uniform_(-initrange, initrange)self.fc.weight.data.uniform_(-initrange, initrange)self.fc.bias.data.zero_()def forward(self, text, offsets):embedded = self.embedding(text, offsets)return self.fc(embedded)
num_class = len(label_name)
vocab_size = len(vocab)
em_size = 64
model = TextClassificationModel(vocab_size, em_size, num_class).to(device)import timedef train(dataloader):model.train()  # 切换到训练模式total_acc, train_loss, total_count = 0, 0, 0log_interval = 50start_time = time.time()for idx, (text, label, offsets) in enumerate(dataloader):predicted_label = model(text, offsets)optimizer.zero_grad()  # 梯度归零loss = criterion(predicted_label, label)  # 计算损失loss.backward()  # 反向传播torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)  # 梯度裁剪optimizer.step()  # 优化器更新权重# 记录acc和losstotal_acc += (predicted_label.argmax(1) == label).sum().item()train_loss += loss.item()total_count += label.size(0)if idx % log_interval == 0 and idx > 0:elapsed = time.time() - start_timeprint('| epoch {:3d} | {:5d}/{:5d} batches ''| accuracy {:8.3f} | loss {:8.5f}'.format(epoch, idx, len(dataloader),total_acc/total_count, train_loss/total_count))total_acc, train_loss, total_count = 0, 0, 0start_time = time.time()def evaluate(dataloader):model.eval()  # 切换到评估模式total_acc, total_count = 0, 0with torch.no_grad():for idx, (text, label, offsets) in enumerate(dataloader):predicted_label = model(text, offsets)loss = criterion(predicted_label, label)  # 计算losstotal_acc += (predicted_label.argmax(1) == label).sum().item()total_count += label.size(0)return total_acc/total_count, total_countfrom torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset
# 参数设置
#EPOCHS = 10  # epoch数量
#LR = 5  # 学习速率
#BATCH_SIZE = 64  # 训练的batch大小
EPOCHS = 10  # epoch数量
LR = 0.001  # 通常Adam的学习率设置为一个较小的值,例如0.001
BATCH_SIZE = 64  # 训练的batch大小
# 设置损失函数、优化器和调度器
criterion = torch.nn.CrossEntropyLoss()
#optimizer = torch.optim.SGD(model.parameters(), lr=LR)
#optimizer = torch.optim.Adam(model.parameters(), lr=LR)
optimizer = torch.optim.RMSprop(model.parameters(), lr=LR)scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None# 准备数据集
train_iter = custom_data_iter(train_data[0].values[:], train_data[1].values[:])
train_dataset = to_map_style_dataset(train_iter)split_train_, split_valid_ = random_split(train_dataset,[int(len(train_dataset)*0.8), int(len(train_dataset)*0.2)])train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,shuffle=True, collate_fn=collate_batch)valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,shuffle=True, collate_fn=collate_batch)# 训练循环
for epoch in range(1, EPOCHS + 1):epoch_start_time = time.time()train(train_dataloader)val_acc, val_loss = evaluate(valid_dataloader)# 更新学习率的策略lr = optimizer.state_dict()['param_groups'][0]['lr']if total_accu is not None and total_accu > val_acc:scheduler.step()else:total_accu = val_accprint('-' * 69)print('| end of epoch {:3d} | time: {:4.2f}s | ''valid accuracy {:4.3f} | valid loss {:4.3f} | lr {:4.6f}'.format(epoch, time.time() - epoch_start_time, val_acc, val_loss, lr))print('-' * 69)test_acc, test_loss = evaluate(valid_dataloader)
print('模型的准确率: {:5.4f}'.format(test_acc))def predict(text, text_pipeline):with torch.no_grad():text = torch.tensor(text_pipeline(text))output = model(text, torch.tensor([0]))return output.argmax(1).item()# 示例文本字符串
# ex_text_str = "例句输入——这是一个待预测类别的示例句子"
ex_text_str = "这不仅影响到我们的方案是否可行13号的"model = model.to("cpu")print("该文本的类别是: %s" % label_name[predict(ex_text_str, text_pipeline)])

 最佳训练结果略优于其他两种优化器

9.2使用预训练的词嵌入,如Word2Vec、GloVe或者直接使用预训练的语言模型,如BERT,作为特征提取器

在原始代码中使用预训练的词嵌入或BERT模型,需要在定义模型类TextClassificationModel之前加载嵌入,并相应地修改该类。以下是整个流程的步骤:

  • 加载预训练嵌入:

    • 如果使用Word2Vec或GloVe,加载词嵌入并创建一个嵌入层。
    • 如果使用BERT,加载BERT模型和分词器。
  • 修改模型定义:

    • 对于Word2Vec或GloVe,替换模型中的nn.EmbeddingBag为使用预训练嵌入的层。
    • 对于BERT,定义一个新的模型类,其中包含BERT模型和一个分类层。
  • 修改数据预处理:

    • 对于BERT,使用BERT分词器处理文本。
  • 更新训练和评估函数:

    • 适应BERT模型的输入格式。
  • 修改模型初始化:

    • 使用新的模型定义来创建模型实例。
9.2.1使用预训练的词嵌入

如果要使用预训练的Word2Vec或GloVe词嵌入,需要在模型定义之前加载词嵌入,并替换嵌入层,并将它们设置为模型中nn.Embedding层的初始权重。

 替换选中部分

from torchtext.vocab import GloVe# 加载GloVe词嵌入
embedding_glove = GloVe(name='6B', dim=100)def get_embedding(word):return embedding_glove.vectors[embedding_glove.stoi[word]]# 用预训练的嵌入来替换模型中的初始权重
def create_emb_layer(weights_matrix, non_trainable=False):num_embeddings, embedding_dim = weights_matrix.size()emb_layer = nn.Embedding.from_pretrained(weights_matrix, freeze=non_trainable)return emb_layer# 创建权重矩阵
weights_matrix = torch.zeros((vocab_size, em_size))
for i, word in enumerate(vocab.get_itos()):try:weights_matrix[i] = get_embedding(word)except KeyError:# 对于词汇表中不存在于GloVe的词,随机初始化一个嵌入weights_matrix[i] = torch.randn(em_size)# 重写模型定义以使用预训练的嵌入
class TextClassificationModel(nn.Module):def __init__(self, vocab_size, embed_dim, num_class):super(TextClassificationModel, self).__init__()self.embedding = create_emb_layer(weights_matrix, True)  # 设置为True表示不训练嵌入self.fc = nn.Linear(embed_dim, num_class)def forward(self, text, offsets):embedded = self.embedding(text, offsets)return self.fc(embedded)

 创建模型实例:

# 创建新的模型实例(Word2Vec/GloVe或BERT)
model = TextClassificationModel(vocab_size, em_size, num_class).to(device)
# 或者对于BERT
# model = BertTextClassificationModel(num_class).to(device)

运行展示:

运行后自动下载GloVe嵌入截图

9.2.2 使用BERT预训练模型(同上)
from transformers import BertModel, BertTokenizer# 加载预训练的BERT模型和分词器
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
bert_model = BertModel.from_pretrained('bert-base-chinese')class BertTextClassificationModel(nn.Module):def __init__(self, num_class):super(BertTextClassificationModel, self).__init__()self.bert = bert_modelself.fc = nn.Linear(self.bert.config.hidden_size, num_class)def forward(self, text, offsets):# 因为BERT需要特殊的输入格式,所以您需要在这里调整text的处理方式# 这里仅是一个示例,您需要根据实际情况进行调整inputs = bert_tokenizer(text, return_tensors='pt', padding=True, truncation=True)outputs = self.bert(**inputs)# 使用CLS标记的输出来进行分类cls_output = outputs.last_hidden_state[:, 0, :]return self.fc(cls_output)

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/649766.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

MySQL十部曲之一:MySQL概述及手册说明

文章目录 数据库、数据库管理系统以及SQL之间的关系关系型数据库与非关系型数据库手册语法约定 数据库、数据库管理系统以及SQL之间的关系 名称说明数据库&#xff08;Database&#xff09;即存储数据的仓库&#xff0c;其本质是一个文件系统。它保存了一系列有组织的数据。数…

今日AI大热潮,明日智能风向标

每周跟踪AI热点新闻动向和震撼发展 想要探索生成式人工智能的前沿进展吗&#xff1f;订阅我们的简报&#xff0c;深入解析最新的技术突破、实际应用案例和未来的趋势。与全球数同行一同&#xff0c;从行业内部的深度分析和实用指南中受益。不要错过这个机会&#xff0c;成为AI领…

mybatis-plus常用使用方法

** mybaits-plus常用使用方法 ** 常用三层分别继承方法 1.1mapper层&#xff08;接口定义层&#xff09;可以用BaseMapper<> 例如&#xff1a; 1.2.里面常用的封装方法有 1.3常用方法介绍 【添加数据&#xff1a;&#xff08;增&#xff09;】int insert(T entity);…

RUST笔记:candle使用基础

candle介绍 candle是huggingface开源的Rust的极简 ML 框架。 candle-矩阵乘法示例 cargo new myapp cd myapp cargo add --git https://github.com/huggingface/candle.git candle-core cargo build # 测试&#xff0c;或执行 cargo ckeckmain.rs use candle_core::{Device…

力扣hot100 课程表 拓扑序列

Problem: 207. 课程表 文章目录 思路复杂度Code 思路 &#x1f468;‍&#x1f3eb; 三叶题解 复杂度 时间复杂度: O ( n m ) O(nm) O(nm) 空间复杂度: O ( n m ) O(nm) O(nm) Code class Solution{int N 100010, M 5010, idx;int[] in new int[N];// in[i] 表示节…

MySQL十部曲之四:MySQL中的数据类型

文章目录 前言概述数字类型数字类型语法数字类型字面量十六进制字面量位字面量布尔字面量 数字类型的属性超出范围和溢出处理 时间和日期类型时间和日期类型语法DATE、DATETIME和TIMESTAMP的异同TIMESTAMP和DATETIME的自动初始化和更新时间和日期字面量 字符串类型字符串类型语…

解读BEVFormer,新一代CV工作的基石

文章出处 BEVFormer这篇文章很有划时代的意义&#xff0c;改变了许多视觉领域工作的pipeline[2203.17270] BEVFormer: Learning Birds-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers (arxiv.org)https://arxiv.org/abs/2203.17270 BEV …

php mysql字段默认值使用问题

前提是使用了事务&#xff0c;在第一个阶段 是A表操作保存&#xff0c;第二阶段操作B表&#xff0c;操作B表的时候使用了A表的一个字段&#xff0c;这个字段在第一阶段没有设置值&#xff0c;保存的时候使用字段默认值。 【这种情况 最好是在第一阶段 把后面要使用的字段设置好…

深度学习-搭建Colab环境

Google Colab(Colaboratory) 是一个免费的云端环境&#xff0c;旨在帮助开发者和研究人员轻松进行机器学习和数据科学工作。它提供了许多优势&#xff0c;使得编写、执行和共享代码变得更加简单和高效。Colab 在云端提供了预配置的环境&#xff0c;可以直接开始编写代码&#x…

Python静态web服务器实战

准备html页面&#xff0c;包含两个页面(index.html, index2.html)和一个404(404html)页面&#xff0c;目录示意&#xff1a; 1.返回固定页面 with open("website/index.html","r") as file: import socket# # 返回固定的页面 website/index.html if __na…

arcgis 计算面积(计算经纬度、算数等同理)

arcgis 计算面积 先定义一个新的变量&#xff0c;例如&#xff1a;area 选中&#xff0c;右击&#xff0c;选择“打开属性表格”&#xff0c;在打开的属性表格中单击最左边的按钮&#xff0c;选择“添加字段” 定义新的字段为浮点型变量&#xff0c;定义变量名为area&#xff…

访问者模式-C#实现

该实例基于WPF实现&#xff0c;直接上代码&#xff0c;下面为三层架构的代码。 一 Model using System; using System.Collections.Generic; using System.Linq; using System.Text; using System.Threading.Tasks;namespace 设计模式练习.Model.访问者模式 {public class Com…

网络安全全栈培训笔记(58-服务攻防-应用协议设备KibanaZabbix远控向日葵VNCTV)

第58天 服务攻防-应用协议&设备Kibana&Zabbix&远控向日葵&VNC&TV 知识点&#xff1a; 1、远程控制第三方应用安全 2、三方应用-向日葵&VNC&TV 3、设备平台-Zabbix&Kibanai漏洞 章节内容&#xff1a; 常见版务应用的安全测试&#xff1a; 1…

[蓝桥杯]真题讲解:冶炼金属(暴力+二分)

蓝桥杯真题视频讲解&#xff1a;冶炼金属&#xff08;暴力做法与二分做法&#xff09; 一、视频讲解二、暴力代码三、正解代码 一、视频讲解 视频讲解 二、暴力代码 //暴力代码 #include<bits/stdc.h> #define endl \n #define deb(x) cout << #x << &qu…

Flink面试题

0. 思维导图 1. 简单介绍一下Flink♥♥ Flink是一个分布式的计算框架&#xff0c;主要用于对有界和无界数据流进行有状态计算&#xff0c;其中有界数据流就是值离线数据&#xff0c;有明确的开始和结束时间&#xff0c;无界数据流就是指实时数据&#xff0c;源源不断没有界限&a…

【c语言】详解操作符(上)

1. 操作符的分类 2. 原码、反码、补码 整数的2进制表示方法有三种&#xff0c;即原码、反码、补码 有符号整数的三种表示方法均有符号位和数值位两部分&#xff0c;2进制序列中&#xff0c;最高位的1位是被当做符号位其余都是数值位。 符号位都是用0表示“正”&#xff0c;用…

JavaScript高级:闭包

1 概念 一个函数对周围状态的引用&#xff0c;捆绑在一起&#xff0c;内层函数中可以访问到外层函数的作用域。 简单理解&#xff1a;闭包 内层函数 外层函数的变量 先看个简单的代码&#xff1a; function outer() {let a 1function inner() {console.log(a)} } outer(…

Mysql-日志介绍 日志配置

环境部署 docker run -d -p 3306:3306 --privilegedtrue -v $(pwd)/logs:/var/lib/logs -v $(pwd)/conf:/etc/mysql/conf.d -v $(pwd)/data:/var/lib/mysql -e MYSQL_ROOT_PASSWORD654321 --name mysql mysql:5.7运行指令的目录下新建好这些文件&#xff1a; 日志类型 日…

顶顶通呼叫中心中间件机器人压力测试配置(mod_cti基于FreeSWITCH)

介绍 顶顶通呼叫中心中间件机器人压力测试(mod_cit基于FreeSWITCH) 一、配置acl.conf 打开ccadmin-》点击配置文件-》点击acl.conf-》我这里是已经配置好了的&#xff0c;这里的192.168.31.145是我自己的内网IP&#xff0c;你们还需要自行修改 二、配置线路 打开ccadmin-&g…

mac/macos上编译electron源码

官方教程&#xff1a;Build Instructions | Electron 准备工作这里不写了&#xff0c;参考官方文档&#xff0c;还有上一篇windows编译electron electron源码下载及编译-CSDN博客 差不多步骤&#xff0c;直接来 网络记得使用魔法 下载编译步骤 0. 选择目录很重要&#xff0…