Transformer Learning (4)

The previous article finished implementing the remaining Transformer components, so in this article we can start training.
This article mainly covers what needs to be done at training time, including defining the loss function, adjusting the learning rate, choosing the optimizer, and so on.
The next article will look at how to parallelize training across multiple GPUs to speed things up.

Dataset Overview

I found a Chinese-English WMT translation dataset online; the data format is as follows:

[["english sentence", "中文语句"], ["english sentence", "中文语句"]
]

The training, validation, and test splits contain 176,943, 25,278, and 50,556 samples respectively.
Download link: https://download.csdn.net/download/yjw123456/88694140 (a fixed price of only 5 CSDN points). (PS: I don't think this is necessary; ready-made datasets are available online, why not just use those?)
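To sanity-check a downloaded split, a throwaway snippet like the following can be used (the file layout is assumed from the format shown above):

import json

with open("data/wmt/dev.json", encoding="utf-8") as f:
    pairs = json.load(f)

print(len(pairs))  # number of sentence pairs in this split, e.g. 25278 for dev
print(pairs[0])    # ['english sentence', '中文语句']

The pairs are then read into a pandas DataFrame and vectorized with the two sentencepiece tokenizers: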

import json

import pandas as pd
import sentencepiece as spm
from tqdm import tqdm


def build_dataframe_from_json(
    json_path: str,
    source_tokenizer: spm.SentencePieceProcessor = None,
    target_tokenizer: spm.SentencePieceProcessor = None,
) -> pd.DataFrame:
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    df = pd.DataFrame(data, columns=["source", "target"])

    def _source_vectorize(text: str) -> list[int]:
        return source_tokenizer.EncodeAsIds(text, add_bos=True, add_eos=True)

    def _target_vectorize(text: str) -> list[int]:
        return target_tokenizer.EncodeAsIds(text, add_bos=True, add_eos=True)

    tqdm.pandas()
    if source_tokenizer:
        df["source_indices"] = df.source.progress_apply(lambda x: _source_vectorize(x))
    if target_tokenizer:
        df["target_indices"] = df.target.progress_apply(lambda x: _target_vectorize(x))
    return df

The original text is kept in the DataFrame so that computing the BLEU score later is a bit easier; of course, the encoded indices can also be decoded back into the original text.
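For example (a throwaway snippet, assuming df is a DataFrame returned by build_dataframe_from_json and source_tokenizer is the loaded sentencepiece model):

row = df.iloc[0]
print(row.source)                                            # original English sentence
print(row.source_indices)                                    # e.g. [2, 1038, 57, ..., 3]
print(source_tokenizer.decode_ids(row.source_indices[1:-1])) # strip <s>/</s>, decode back to text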

What remains is to load the dataset through a data loader; the relevant code is as follows:

import os

assert os.path.exists(
    train_args.src_tokenizer_file
), "should first run train_tokenizer.py to train the tokenizer"
assert os.path.exists(
    train_args.tgt_tokenizer_path
), "should first run train_tokenizer.py to train the tokenizer"

source_tokenizer = spm.SentencePieceProcessor(model_file=train_args.src_tokenizer_file)
target_tokenizer = spm.SentencePieceProcessor(model_file=train_args.tgt_tokenizer_path)

if train_args.only_test:
    train_args.use_wandb = False

if train_args.cuda:
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
else:
    device = torch.device("cpu")

print(f"source tokenizer size: {source_tokenizer.vocab_size()}")
print(f"target tokenizer size: {target_tokenizer.vocab_size()}")

set_random_seed(12345)

train_dataframe_path = os.path.join(train_args.save_dir, train_args.dataframe_file.format("train"))
test_dataframe_path = os.path.join(train_args.save_dir, train_args.dataframe_file.format("test"))
valid_dataframe_path = os.path.join(train_args.save_dir, train_args.dataframe_file.format("dev"))

if os.path.exists(train_dataframe_path) and train_args.use_dataframe_cache:
    train_df, test_df, valid_df = (
        pd.read_pickle(train_dataframe_path),
        pd.read_pickle(test_dataframe_path),
        pd.read_pickle(valid_dataframe_path),
    )
    print("Loads cached dataframes.")
else:
    print("Create new dataframes.")
    valid_df = build_dataframe_from_json(
        f"{train_args.dataset_path}/dev.json", source_tokenizer, target_tokenizer
    )
    print("Create valid dataframe")
    test_df = build_dataframe_from_json(
        f"{train_args.dataset_path}/test.json", source_tokenizer, target_tokenizer
    )
    print("Create test dataframe")
    train_df = build_dataframe_from_json(
        f"{train_args.dataset_path}/train.json", source_tokenizer, target_tokenizer
    )
    print("Create train dataframe")

    train_df.to_pickle(train_dataframe_path)
    test_df.to_pickle(test_dataframe_path)
    valid_df.to_pickle(valid_dataframe_path)

pad_idx = model_args.pad_idx

train_dataset = NMTDataset(train_df, pad_idx)
valid_dataset = NMTDataset(valid_df, pad_idx)
test_dataset = NMTDataset(test_df, pad_idx)

train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    batch_size=train_args.batch_size,
    collate_fn=train_dataset.collate_fn,
)
valid_dataloader = DataLoader(
    valid_dataset,
    shuffle=False,
    batch_size=train_args.batch_size,
    collate_fn=valid_dataset.collate_fn,
)
test_dataloader = DataLoader(
    test_dataset,
    batch_size=train_args.batch_size,
    collate_fn=test_dataset.collate_fn,
)

Once the data is ready, we can start training.

Model Training

Label Smoothing

Transformer training uses label smoothing, a technique that keeps the model from predicting labels too confidently and improves its otherwise weak generalization.
In short, it lowers the probability of the target class (which is 1, i.e. 100%, in the one-hot encoding) and redistributes the removed mass to the other classes.
The following is summarized from the paper in reference 8; feel free to skip it if you are not interested.


So we need a mechanism to make the model less confident. Although this conflicts somewhat with maximizing the log-likelihood of the training labels, it does regularize the model and makes it more adaptable.

Concretely, for a training example with ground-truth label y, label-smoothing regularization (LSR) replaces the one-hot target distribution q(k) = δ_{k,y} with a mixture of q and a prior distribution u(k) over the K classes (usually uniform, u(k) = 1/K), weighted by a smoothing factor ε:

$$q'(k) = (1-\epsilon)\,\delta_{k,y} + \epsilon\, u(k)$$

The cross-entropy against the smoothed targets then decomposes into

$$H(q', p) = -\sum_{k=1}^{K} \log p(k)\, q'(k) = (1-\epsilon)\,H(q, p) + \epsilon\,H(u, p)$$

In this way, LSR can be viewed as replacing the single cross-entropy loss H(q, p) with a weighted sum of two losses, H(q, p) and H(u, p). During training, if the model predicts the true label distribution very confidently, H(q, p) approaches 0 but H(u, p) increases sharply, so label smoothing prevents the model from becoming over-confident. The second loss term penalizes the deviation of the predicted distribution p from the prior u. Note that this deviation can equivalently be captured by the KL divergence. Why is that?

$$H(u, p) = D_{\mathrm{KL}}(u \parallel p) + H(u)$$

Since the entropy H(u) of the distribution u is fixed, H(u, p) depends only on the KL divergence. When u is the uniform distribution, H(u, p) measures how dissimilar the predicted distribution p is from uniform, which could also be measured (though not equivalently) by the negative entropy -H(p).

PyTorch has supported label smoothing since version 1.10:

nn.CrossEntropyLoss(ignore_index = pad_idx, reduction="sum", label_smoothing=0.1)

It is enabled by passing ignore_index as the pad index, reduction="sum", and a label_smoothing value.
But that alone is not enough: when using CrossEntropyLoss we need to flatten the model's output and the label token indices, so we define the following loss class to wrap CrossEntropyLoss:

class LabelSmoothingLoss(nn.Module):
    def __init__(self, label_smoothing: float = 0.0, pad_idx: int = 0) -> None:
        super().__init__()
        self.loss_func = nn.CrossEntropyLoss(
            ignore_index=pad_idx, label_smoothing=label_smoothing
        )

    def forward(self, logits: Tensor, labels: Tensor) -> Tensor:
        vocab_size = logits.shape[-1]
        logits = logits.reshape(-1, vocab_size)
        labels = labels.reshape(-1).long()
        return self.loss_func(logits, labels)
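As a quick sanity check (a minimal sketch with made-up shapes, not taken from the training code), the wrapper accepts (batch_size, seq_len, vocab_size) logits and (batch_size, seq_len) labels directly:

import torch

# hypothetical shapes: a batch of 2 sequences of length 5 over a vocabulary of 100 tokens
logits = torch.randn(2, 5, 100)
labels = torch.randint(0, 100, (2, 5))

criterion = LabelSmoothingLoss(label_smoothing=0.1, pad_idx=0)
loss = criterion(logits, labels)  # a scalar; positions whose label equals pad_idx are ignored
print(loss.item())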

Note that on the dataset used in this article, label smoothing actually made results worse, so it was not used during training.

Learning Rate & Optimizer

Following the original paper, we use the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$:

from torch.optim import Adam

optimizer = Adam(model.parameters(), betas=(0.9, 0.98), eps=1e-9)

and adjust the learning rate with a warm-up strategy:
$$\text{lrate} = d_{\text{model}}^{-0.5} \cdot \min\left(\text{step\_num}^{-0.5},\ \text{step\_num} \cdot \text{warmup\_steps}^{-1.5}\right)$$
The learning rate first grows linearly for a fixed number of warmup_steps (the warm-up phase), and afterwards decays in proportion to the inverse square root of step_num.

We can wrap the Adam optimizer with a scheduler that supports warm-up and learning-rate decay.

class WarmupScheduler(_LRScheduler):
    def __init__(
        self,
        optimizer,
        warmup_steps: int,
        d_model: int,
        factor: float = 1.0,
        last_epoch: int = -1,
        verbose: bool = False,
    ) -> None:
        """
        Args:
            optimizer (Optimizer): wrapped optimizer.
            warmup_steps (int): warmup steps.
            d_model (int): dimension of embeddings.
            factor (float, optional): scales the learning rate. Defaults to 1.0.
            last_epoch (int, optional): the index of the last epoch. Defaults to -1.
            verbose (bool, optional): if True, prints a message to stdout for each update. Defaults to False.
        """
        self.warmup_steps = warmup_steps
        self.d_model = d_model
        self.num_param_groups = len(optimizer.param_groups)
        self.factor = factor
        super().__init__(optimizer, last_epoch, verbose)

    def get_lr(self) -> list[float]:
        lr = (
            self.factor
            * self.d_model**-0.5
            * min(self._step_count**-0.5, self._step_count * self.warmup_steps**-1.5)
        )
        return [lr] * self.num_param_groups

Here we implement it by subclassing _LRScheduler, with a factor parameter to scale the learning rate; for small datasets a value of 0.5 is worth trying. We can plot how the learning rate evolves:
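Below is a minimal plotting sketch (not part of the training code; the (d_model, warmup_steps) combinations are illustrative):

import matplotlib.pyplot as plt
import torch
from torch.optim import Adam

steps = 20000
for d_model, warmup_steps in [(512, 4000), (512, 8000), (256, 4000)]:
    # a single dummy parameter is enough to drive the optimizer/scheduler
    optimizer = Adam([torch.nn.Parameter(torch.zeros(1))], lr=0)
    scheduler = WarmupScheduler(optimizer, warmup_steps=warmup_steps, d_model=d_model)
    lrs = []
    for _ in range(steps):
        lrs.append(scheduler.get_last_lr()[0])
        optimizer.step()
        scheduler.step()
    plt.plot(lrs, label=f"d_model={d_model}, warmup={warmup_steps}")

plt.xlabel("step")
plt.ylabel("learning rate")
plt.legend()
plt.show()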
The resulting curves confirm that the learning rate does grow from 0 and, after 4000 steps (for warmup_steps=4000), begins to decay gradually.

Why does the formula above achieve this effect? It behaves as though it contained an IF-ELSE. To make it more intuitive, we can rewrite it as:

$$\text{lrate} = d_{\text{model}}^{-0.5} \cdot \min\left(\frac{1}{\sqrt{\text{step\_num}}},\ \frac{\text{step\_num}}{\text{warmup\_steps}^{1.5}}\right)$$

Now it is easier to see. With warmup_steps = 4000, warmup_steps ** 1.5 = 252982.21. While the training step step_num is smaller than warmup_steps, the right-hand term inside min is smaller than the left-hand one and increases linearly with the step number; when step_num reaches warmup_steps, the two terms are equal; once step_num exceeds warmup_steps, the left-hand term becomes the smaller one and decreases (non-linearly) as training proceeds. This produces exactly the behaviour seen in the plot. The formula also shows that the larger the embedding dimension d_model or warmup_steps, the smaller the peak learning rate, and the larger warmup_steps, the more slowly the learning rate ramps up at the beginning.
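A quick numeric check of the peak learning rate (a throwaway calculation, not from the training code):

d_model, warmup_steps = 512, 4000

# at step_num == warmup_steps both terms inside min() are equal,
# so the peak learning rate is d_model**-0.5 * warmup_steps**-0.5
peak_lr = d_model**-0.5 * warmup_steps**-0.5
print(f"{peak_lr:.6f}")  # ≈ 0.000699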

Training the Tokenizer

As mentioned above, we use the sentencepiece toolkit for tokenization. First, the Chinese and English sentences are read into memory separately.

import json


def get_mt_pairs(data_dir: str, splits=["train", "dev", "test"]):
    english_sentences = []
    chinese_sentences = []
    """
    json content:
    [["english sentence", "中文语句"], ["english sentence", "中文语句"]]
    """
    for split in splits:
        with open(f"{data_dir}/{split}.json", "r", encoding="utf-8") as f:
            data = json.load(f)
            for pair in data:
                english_sentences.append(pair[0] + "\n")
                chinese_sentences.append(pair[1] + "\n")

    assert len(chinese_sentences) == len(english_sentences)
    print(f"the total number of sentences: {len(chinese_sentences)}")

    return chinese_sentences, english_sentences

Next, define a training function; here multiple processes are used to train both tokenizers at the same time:

def train_tokenizer(
    source_corpus_path: str,
    target_corpus_path: str,
    source_vocab_size: int,
    target_vocab_size: int,
    source_character_coverage: float = 1.0,
    target_character_coverage: float = 0.9995,
) -> None:
    with ProcessPoolExecutor() as executor:
        futures = [
            executor.submit(
                train_sentencepiece_bpe,
                source_corpus_path,
                "model_storage/source",
                source_vocab_size,
                source_character_coverage,
            ),
            executor.submit(
                train_sentencepiece_bpe,
                target_corpus_path,
                "model_storage/target",
                target_vocab_size,
                target_character_coverage,
            ),
        ]
        for future in futures:
            future.result()

    sp = spm.SentencePieceProcessor()

    source_text = """Tesla is recalling nearly all 2 million of its cars on US roads to limit the use of its Autopilot feature following a two-year probe by US safety regulators of roughly 1,000 crashes in which the feature was engaged. The limitations on Autopilot serve as a blow to Tesla's efforts to market its vehicles to buyers willing to pay extra to have their cars do the driving for them."""
    sp.load("model_storage/source.model")
    print(sp.encode_as_pieces(source_text))
    ids = sp.encode_as_ids(source_text)
    print(ids)
    print(sp.decode_ids(ids))

    target_text = """新华社北京1月2日电(记者丁雅雯、李唐宁)2024年元旦假期,旅游消费十分火爆。旅游平台数据显示,旅游相关产品订单量大幅增长,“异地跨年”“南北互跨”成关键词。业内人士认为,元旦假期旅游“开门红”彰显消费潜力,预计2024年旅游消费有望保持上升势头。"""
    sp.load("model_storage/target.model")
    print(sp.encode_as_pieces(target_text))
    ids = sp.encode_as_ids(target_text)
    print(ids)
    print(sp.decode_ids(ids))
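The helper train_sentencepiece_bpe is not shown in this article. A minimal sketch of what it might look like, based on the public sentencepiece API and the special-token order visible in the generated .vocab file below (<pad>=0, <unk>=1, <s>=2, </s>=3), is:

import sentencepiece as spm


def train_sentencepiece_bpe(
    corpus_path: str, model_prefix: str, vocab_size: int, character_coverage: float
) -> None:
    spm.SentencePieceTrainer.train(
        input=corpus_path,
        model_prefix=model_prefix,  # produces {model_prefix}.model and {model_prefix}.vocab
        vocab_size=vocab_size,
        character_coverage=character_coverage,
        model_type="bpe",
        # special token ids, matching the order in the generated .vocab file
        pad_id=0,
        unk_id=1,
        bos_id=2,
        eos_id=3,
    )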

Finally, run the training code:

if __name__ == "__main__":
    make_dirs(train_args.save_dir)
    chinese_sentences, english_sentences = get_mt_pairs(
        data_dir=train_args.dataset_path, splits=["train", "dev", "test"]
    )
    with open(f"{train_args.dataset_path}/corpus.ch", "w", encoding="utf-8") as f:
        f.writelines(chinese_sentences)
    with open(f"{train_args.dataset_path}/corpus.en", "w", encoding="utf-8") as f:
        f.writelines(english_sentences)

    train_tokenizer(
        f"{train_args.dataset_path}/corpus.en",
        f"{train_args.dataset_path}/corpus.ch",
        source_vocab_size=model_args.source_vocab_size,
        target_vocab_size=model_args.target_vocab_size,
    )
['▁Tesla', '▁is', '▁recalling', '▁nearly', '▁all', '▁2', '▁million', '▁of', '▁its', '▁cars', '▁on', '▁US', '▁roads', '▁to', '▁limit', '▁the', '▁use', '▁of', '▁its', '▁Aut', 'op', 'ilot', '▁feature', '▁following', '▁a', 
'▁two', '-', 'year', '▁probe', '▁by', '▁US', '▁safety', '▁regulators', '▁of', '▁roughly', '▁1,000', '▁crashes', '▁in', '▁which', '▁the', '▁feature', '▁was', '▁engaged', '.', '▁The', '▁limitations', '▁on', '▁Aut', 'op', 
'ilot', '▁serve', '▁as', '▁a', '▁blow', '▁to', '▁Tesla', '’', 's', '▁efforts', '▁to', '▁market', '▁its', '▁vehicles', '▁to', '▁buyers', '▁willing', '▁to', '▁pay', '▁extra', '▁to', '▁have', '▁their', '▁cars', '▁do', '▁the', '▁driving', '▁for', '▁them', '.']
[22941, 59, 20252, 2225, 255, 216, 1132, 34, 192, 5944, 81, 247, 6980, 31, 3086, 10, 894, 34, 192, 5296, 177, 31299, 6959, 2425, 6, 600, 31847, 2541, 22423, 144, 247, 3474, 4270, 34, 2665, 8980, 23659, 26, 257, 10, 6959, 219, 5037, 31843, 99, 10725, 81, 5296, 177, 31299, 3343, 98, 6, 6296, 31, 22941, 31849, 31827, 1369, 31, 404, 192, 6287, 31, 10106, 2207, 31, 1129, 2904, 31, 147, 193, 5944, 295, 10, 4253, 75, 437, 31843]
Tesla is recalling nearly all 2 million of its cars on US roads to limit the use of its Autopilot feature following a two-year probe by US safety regulators of roughly 1,000 crashes in which the feature was engaged. The limitations on Autopilot serve as a blow to Tesla’s efforts to market its vehicles to buyers willing to pay extra to have their cars do the driving for them.
['▁新', '华', '社', '北京', '1', '月', '2', '日', '电', '(', '记者', '丁', '雅', '雯', '、', '李', '唐', '宁', ')', '20', '24', '年', '元', '旦', '假期', ',', '旅游', '消费', '十分', '火', '爆', '。', '旅游', '平台', '数据显示', ',', '旅游', '相关', '产品', '订单', '', '大幅增长', ',', '', '', '', '', '', '', '南北', '', '', '', '', '关键', '', '', '', '', '', '人士', '认为', ',', '', '', '假期', '旅游', '', '', '', '', '', '彰显', '消费', '潜力', ',', '预计', '20', '24', '', '旅游', '消费', '有望', '保持', '上升', '势头', '。']
[1460, 29568, 28980, 2200, 28770, 29048, 28779, 28930, 29275, 28786, 2539, 29953, 30003, 1, 28758, 30345, 30229, 30365, 28787, 10, 3137, 28747, 28934, 29697, 18645, 28723, 4054, 266, 651, 29672, 29541, 28724, 4054, 2269, 12883, 28723, 4054, 521, 640, 25619, 28937, 22184, 710, 29596, 28765, 29649, 28747, 28811, 28809, 9356, 29410, 29649, 28811, 28762, 318, 29859, 28724, 28722, 28825, 28922, 1196, 64, 28723, 28934, 29697, 18645, 4054, 28809, 28889, 29208, 30060, 28811, 9466, 266, 1899, 28723, 1321, 10, 3137, 28747, 4054, 266, 4485, 398, 543, 4315, 28724]
新华社北京12日电(记者丁雅 ⁇ 、李唐宁)2024年元旦假期,旅游消费十分火爆。旅游平台数据显示,旅游相关产品订单量大幅增长,“异地跨年”“南北互跨”成关键词。 业内人士认为,元旦假期旅游“开门红”彰显消费潜力,预计2024年旅游消费有望保持上升势头。

As we can see, the tokenizer cannot correctly recognize the character 雯 because it does not appear in our corpus, which is why training the tokenizer on a sufficiently large corpus is important. We can ignore this issue for now. The whole tokenizer training takes only a few minutes. Each tokenizer produces two files: a model file and a vocabulary file. For example, the Chinese .vocab file looks like this:

<pad> 0
<unk> 0
<s> 0
</s> 0
—— -0
经济 -1
国家 -2
美国 -3
▁但 -4
一个 -5
20 -6
我们 -7
政府 -8
中国 -9
可能 -10
他们 -11
欧洲 -12
问题 -13
...

We now have trained BPE tokenizers. The most common operations are:

sp.load("model_storage/source.model") # 加载分词器
print(sp.encode_as_pieces(source_text)) # 对文本分词
ids = sp.encode_as_ids(source_text) # 分词并编码成ID序列
print(sp.decode_ids(ids)) # ID序列还原成文本

Defining the Data Loader

@dataclass
class Batch:
    source: Tensor
    target: Tensor
    labels: Tensor
    num_tokens: int
    src_text: str = None
    tgt_text: str = None


class NMTDataset(Dataset):
    """Dataset for translation"""

    def __init__(self, text_df: pd.DataFrame, pad_idx: int = 0) -> None:
        """
        Args:
            text_df (pd.DataFrame): a DataFrame which contains the processed source and target sentences
        """
        # sorted by target length
        # text_df = text_df.iloc[text_df["target"].apply(len).sort_values().index]
        self.text_df = text_df
        self.padding_index = pad_idx

    def __getitem__(self, index: int) -> Tuple[list[int], list[int], str, str]:
        row = self.text_df.iloc[index]
        return (row.source_indices, row.target_indices, row.source, row.target)

    def collate_fn(self, batch: list[Tuple[list[int], list[int], str, str]]) -> Batch:
        source_indices = [x[0] for x in batch]
        target_indices = [x[1] for x in batch]
        source_text = [x[2] for x in batch]
        target_text = [x[3] for x in batch]

        source_indices = [torch.LongTensor(indices) for indices in source_indices]
        target_indices = [torch.LongTensor(indices) for indices in target_indices]

        # The <eos> token comes before any <pad> tokens to ensure the model can correctly identify the end of a sentence.
        source = pad_sequence(source_indices, padding_value=self.padding_index, batch_first=True)
        target = pad_sequence(target_indices, padding_value=self.padding_index, batch_first=True)

        labels = target[:, 1:]
        target = target[:, :-1]

        num_tokens = (labels != self.padding_index).data.sum()

        return Batch(source, target, labels, num_tokens, source_text, target_text)

    def __len__(self) -> int:
        return len(self.text_df)

First we define the dataset class. Working with the data as a DataFrame is convenient; here we assume the content passed in has already been vectorized by the tokenizers.
We also need to implement collate_fn ourselves to convert the data into the format we need.
Specifically, the source and target index sequences are first converted to Tensors and then padded to the maximum length within the batch, so that maximum length differs from batch to batch. Suppose the data in a batch of size 2 is:

[[2, 12342, 123, 323, 3, 0, 0, 0],
 [2, 222, 23, 12, 123, 22, 22, 3]]

Here 2 and 3 are the IDs of bos and eos respectively, and 0 is the padding ID. Note that the eos id (3) comes before the pad ids (0), so the model can correctly tell where a sentence ends.

After padding we obtain data of shape (batch_size, seq_len), where seq_len is the maximum length within the batch.

The source can be fed to the encoder directly, but the decoder input and the prediction targets require some care.
For example, suppose the sentence to translate is:

['<bos>', '我', '喜', '欢', '打', '篮', '球', '。', '<eos>', '<pad>']

Note the padding token at the end. The decoder input target drops the last token of the sequence, here the <pad>, giving:

target = ['<bos>', '我', '喜', '欢', '打', '篮', '球', '。', '<eos>']

The labels we want to predict drop the first token of the sequence instead, giving:

labels = ['我', '喜', '欢', '打', '篮', '球', '。', '<eos>', '<pad>']

That is, given the input <bos> and the encoder's output, the decoder should predict '我'; (with masking) given the input [<bos>, '我'] it should predict '喜'; …; and given the input ['<bos>', '我', '喜', '欢', '打', '篮', '球', '。'] it should predict the end-of-sentence token.
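A tiny sketch of this shift on the example batch above (illustrative only):

import torch

padded = torch.LongTensor([[2, 12342, 123, 323, 3, 0, 0, 0],
                           [2, 222, 23, 12, 123, 22, 22, 3]])

target = padded[:, :-1]  # decoder input: drop the last token
labels = padded[:, 1:]   # prediction targets: drop the first token

print(target.shape, labels.shape)  # torch.Size([2, 7]) torch.Size([2, 7])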

With this class in place, defining the data loader is simple:

DataLoader(
    dataset,  # an instance of the dataset class
    shuffle=True,
    batch_size=32,
    collate_fn=dataset.collate_fn,
)

Defining the Training Functions

Define the training and evaluation functions:

def train(
    model: nn.Module,
    data_loader: DataLoader,
    criterion: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    device: torch.device,
    clip: float,
    scheduler: torch.optim.lr_scheduler._LRScheduler,
) -> float:
    model.train()  # train mode
    total_loss = 0.0

    tqdm_iter = tqdm(data_loader)

    for batch in tqdm_iter:
        source = batch.source.to(device)
        target = batch.target.to(device)
        labels = batch.labels.to(device)

        logits = model(source, target)

        # loss calculation
        loss = criterion(logits, labels)
        loss.backward()

        if clip:
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        total_loss += loss.item()
        description = f" TRAIN loss={loss.item():.6f}, learning rate={scheduler.get_last_lr()[0]:.7f}"
        del loss
        tqdm_iter.set_description(description)

    # average training loss
    avg_loss = total_loss / len(data_loader)
    return avg_loss


@torch.no_grad()
def evaluate(
    model: nn.Module,
    data_loader: DataLoader,
    device: torch.device,
    criterion: torch.nn.Module,
) -> float:
    model.eval()
    total_loss = 0

    for batch in tqdm(data_loader):
        source = batch.source.to(device)
        target = batch.target.to(device)
        labels = batch.labels.to(device)

        # feed forward
        logits = model(source, target)

        # loss calculation
        loss = criterion(logits, labels)
        total_loss += loss.item()
        del loss

    # average validation loss
    avg_loss = total_loss / len(data_loader)
    return avg_loss

Greedy Search

Greedy search, or greedy decoding, simply picks the highest-probability token each time the next token is predicted. It is easy to implement, but it has to support batched operation, because we want to compute a BLEU score on the validation set after every training epoch.

def _greedy_search(
    self, src: Tensor, src_mask: Tensor, max_gen_len: int, keep_attentions: bool
):
    memory = self.transformer.encode(src, src_mask)

    batch_size = src.shape[0]
    device = src.device

    # keep track of which sequences are already finished
    unfinished_sequences = torch.ones(batch_size, dtype=torch.long, device=device)

    decoder_inputs = torch.LongTensor(batch_size, 1).fill_(self.bos_idx).to(device)

    eos_idx_tensor = torch.tensor([self.eos_idx]).to(device)

    finished = False

    while True:
        tgt_mask = self.generate_subsequent_mask(decoder_inputs.size(1), device)
        logits = self.lm_head(
            self.transformer.decode(
                decoder_inputs,
                memory,
                tgt_mask=tgt_mask,
                memory_mask=src_mask,
                keep_attentions=keep_attentions,
            )
        )

        next_tokens = torch.argmax(logits[:, -1, :], dim=-1)

        # finished sentences should have their next token be a pad token
        next_tokens = next_tokens * unfinished_sequences + self.pad_idx * (1 - unfinished_sequences)

        decoder_inputs = torch.cat([decoder_inputs, next_tokens[:, None]], dim=-1)

        # set sentence to finished if eos_idx was found
        unfinished_sequences = unfinished_sequences.mul(
            next_tokens.tile(eos_idx_tensor.shape[0], 1)
            .ne(eos_idx_tensor.unsqueeze(1))
            .prod(dim=0)
        )

        # all sentences have eos_idx
        if unfinished_sequences.max() == 0:
            finished = True

        if decoder_inputs.shape[-1] >= max_gen_len:
            finished = True

        if finished:
            break

    return decoder_inputs
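The calculate_bleu helper used later is not shown in this article. Here is a rough sketch under stated assumptions: the model exposes _greedy_search as above, the source padding mask uses a (batch, 1, 1, seq_len) layout (an assumption; reuse whatever mask your model's forward pass builds), and sacrebleu is installed. It decodes one sentence at a time for simplicity; batching it is straightforward.

import sacrebleu
import torch


@torch.no_grad()
def calculate_bleu(model, source_tokenizer, target_tokenizer, df, max_gen_len, device) -> float:
    model.eval()
    hypotheses, references = [], []

    for _, row in df.iterrows():
        src = torch.LongTensor(row.source_indices).unsqueeze(0).to(device)
        src_mask = (src != model.pad_idx).unsqueeze(1).unsqueeze(2)  # assumed mask layout

        output_ids = model._greedy_search(src, src_mask, max_gen_len, keep_attentions=False)
        # strip bos/eos/pad ids before decoding back to text
        ids = [i for i in output_ids[0].tolist() if i not in (model.bos_idx, model.eos_idx, model.pad_idx)]
        hypotheses.append(target_tokenizer.decode_ids(ids))
        references.append(row.target)

    # sacrebleu expects a list of hypotheses and a list of reference lists
    return sacrebleu.corpus_bleu(hypotheses, [references]).score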

Starting Training

Define the training arguments:

import os
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass
class TrainArgument:
    """Create a 'data' directory and store the dataset under it"""

    dataset_path: str = f"{os.path.dirname(__file__)}/data/wmt"
    save_dir = f"{os.path.dirname(__file__)}/model_storage"
    src_tokenizer_file: str = f"{save_dir}/source.model"
    tgt_tokenizer_path: str = f"{save_dir}/target.model"
    model_save_path: str = f"{save_dir}/best_transformer.pt"
    dataframe_file: str = "dataframe.{}.pkl"
    use_dataframe_cache: bool = True
    cuda: bool = True
    num_epochs: int = 40
    batch_size: int = 32
    gradient_accumulation_steps: int = 1
    grad_clipping: int = 0  # 0 means don't use gradient clipping
    betas: Tuple[float, float] = (0.9, 0.997)
    eps: float = 1e-6
    label_smoothing: float = 0
    warmup_steps: int = 6000
    warmup_factor: float = 0.5
    only_test: bool = False
    max_gen_len: int = 60
    use_wandb: bool = True
    patient: int = 5
    gpus = [1, 2, 3]
    seed = 12345
    calc_bleu_during_train: bool = True


@dataclass
class ModelArgument:
    d_model: int = 512  # dimension of embeddings
    n_heads: int = 8  # number of self attention heads
    num_encoder_layers: int = 6  # number of encoder layers
    num_decoder_layers: int = 6  # number of decoder layers
    d_ff: int = d_model * 4  # dimension of feed-forward network
    dropout: float = 0.1  # dropout ratio in the whole network
    max_positions: int = 5000  # supported max length of the sequence in positional encoding
    source_vocab_size: int = 32000
    target_vocab_size: int = 32000
    pad_idx: int = 0
    norm_first: bool = True


train_args = TrainArgument()
model_args = ModelArgument()

The choice of warmup_steps depends on the total number of training steps; it is usually set to about 5-10% of them. For this run the total is len(train_dataloader) * num_epochs = 5530 * 40 = 221,200 steps.

train_args = TrainArgument()

if __name__ == "__main__":
    assert os.path.exists(
        train_args.src_tokenizer_file
    ), "should first run train_tokenizer.py to train the tokenizer"
    assert os.path.exists(
        train_args.tgt_tokenizer_path
    ), "should first run train_tokenizer.py to train the tokenizer"

    source_tokenizer = spm.SentencePieceProcessor(model_file=train_args.src_tokenizer_file)
    target_tokenizer = spm.SentencePieceProcessor(model_file=train_args.tgt_tokenizer_path)

    print(f"source tokenizer size: {source_tokenizer.vocab_size()}")
    print(f"target tokenizer size: {target_tokenizer.vocab_size()}")

    pad_idx = target_tokenizer.pad_id()

    train_df = build_dataframe_from_json(
        f"{train_args.dataset_path}/train.json", source_tokenizer, target_tokenizer
    )
    valid_df = build_dataframe_from_json(
        f"{train_args.dataset_path}/dev.json", source_tokenizer, target_tokenizer
    )
    test_df = build_dataframe_from_json(
        f"{train_args.dataset_path}/test.json", source_tokenizer, target_tokenizer
    )

    train_dataset = NMTDataset(train_df, pad_idx)
    valid_dataset = NMTDataset(valid_df, pad_idx)
    test_dataset = NMTDataset(test_df, pad_idx)

    train_dataloader = DataLoader(
        train_dataset,
        batch_size=train_args.batch_size,
        shuffle=True,
        collate_fn=train_dataset.collate_fn,
    )
    valid_dataloader = DataLoader(
        valid_dataset,
        batch_size=train_args.batch_size,
        collate_fn=valid_dataset.collate_fn,
    )
    test_dataloader = DataLoader(
        test_dataset,
        batch_size=train_args.batch_size,
        collate_fn=test_dataset.collate_fn,
    )

    if train_args.cuda:
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    else:
        device = torch.device("cpu")

    model_args = ModelArgument()
    model_args.pad_idx = pad_idx
    model_args.source_vocab_size = source_tokenizer.vocab_size()
    model_args.target_vocab_size = target_tokenizer.vocab_size()

    model = Transformer(**asdict(model_args))
    print(model)
    print(f"The model has {count_parameters(model)} trainable parameters")
    model.to(device)

    if train_args.use_wandb:
        import wandb

        # start a new wandb run to track this script
        wandb.init(
            # set the wandb project where this run will be logged
            project="transformer",
            config={
                "architecture": "Transformer",
                "dataset": "en-cn",
                "epochs": train_args.num_epochs,
            },
        )

    train_criterion = LabelSmoothingLoss(train_args.label_smoothing, pad_idx)
    # no label smoothing for validation
    valid_criterion = LabelSmoothingLoss(0, pad_idx)

    optimizer = torch.optim.Adam(model.parameters(), betas=train_args.betas, eps=train_args.eps)
    scheduler = WarmupScheduler(
        optimizer,
        warmup_steps=train_args.warmup_steps,
        d_model=model_args.d_model,
        factor=train_args.warmup_factor,
    )

    best_loss = float("inf")

    print(f"begin train with arguments: {train_args}")
    print(f"total train steps: {len(train_dataloader) * train_args.num_epochs}")

    if not train_args.only_test:
        for epoch in range(train_args.num_epochs):
            train_loss = train(
                model,
                train_dataloader,
                train_criterion,
                optimizer,
                device,
                train_args.grad_clipping,
                scheduler,
            )
            valid_loss = evaluate(model, valid_dataloader, device, valid_criterion)

            print(
                f"end of epoch {epoch+1:3d} | train loss: {train_loss:.4f} valid loss: {valid_loss:.4f}"
            )

            if train_args.use_wandb:
                wandb.log({"train_loss": train_loss, "valid_loss": valid_loss})

            if valid_loss < best_loss:
                best_loss = valid_loss
                print(f"Save model with best valid loss: {best_loss:.4f}")
                torch.save(model.state_dict(), train_args.model_save_path)

    model.load_state_dict(torch.load(train_args.model_save_path))

    test_loss = evaluate(model, test_dataloader, device, valid_criterion)

    # calculate bleu score
    bleu_score = calculate_bleu(
        model,
        source_tokenizer,
        target_tokenizer,
        test_df,
        train_args.max_gen_len,
        device,
    )
    print(f"TEST loss={test_loss:.4f} bleu score: {bleu_score}")
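The training log below mentions early stopping ("Stop from early stopping."), which the loop above does not show. A hedged sketch of how the patient argument might be wired in (an assumption, not the author's exact code; the logged run actually tracks the best validation BLEU, since calc_bleu_during_train=True, and the same pattern applies with the BLEU score in place of the loss):

best_loss = float("inf")
patient_count = 0  # epochs since the last improvement

for epoch in range(train_args.num_epochs):
    train_loss = train(model, train_dataloader, train_criterion, optimizer, device,
                       train_args.grad_clipping, scheduler)
    valid_loss = evaluate(model, valid_dataloader, device, valid_criterion)

    if valid_loss < best_loss:
        best_loss = valid_loss
        patient_count = 0
        torch.save(model.state_dict(), train_args.model_save_path)
    else:
        patient_count += 1
        if patient_count >= train_args.patient:
            print("Stop from early stopping.")
            break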

begin train with arguments: {'d_model': 512, 'n_heads': 8, 'num_encoder_layers': 6, 'num_decoder_layers': 6, 'd_ff': 2048, 'dropout': 0.1, 'max_positions': 5000, 'source_vocab_size': 32000, 'target_vocab_size': 32000, 'pad_idx': 0, 'norm_first': True, 'dataset_path': 'nlp-in-action/transformers/transformer/data/wmt', 'src_tokenizer_file': 'nlp-in-action/transformers/transformer/model_storage/source.model', 'tgt_tokenizer_path': 'nlp-in-action/transformers/transformer/model_storage/target.model', 'model_save_path': 'nlp-in-action/transformers/transformer/model_storage/best_transformer.pt', 'dataframe_file': 'dataframe.{}.pkl', 'use_dataframe_cache': True, 'cuda': True, 'num_epochs': 40, 'batch_size': 32, 'gradient_accumulation_steps': 1, 'grad_clipping': 0, 'betas': (0.9, 0.997), 'eps': 1e-06, 'label_smoothing': 0, 'warmup_steps': 8000, 'warmup_factor': 1.0, 'only_test': False, 'max_gen_len': 60, 'use_wandb': True, 'patient': 5, 'calc_bleu_during_train': True}
total train steps: 221200
TRAIN loss=6.496174, learning rate=0.0002630: 100%|██████████| 5530/5530 [09:39<00:00, 9.54it/s]
100%|██████████| 790/790 [00:25<00:00, 30.93it/s]
100%|██████████| 790/790 [09:33<00:00, 1.38it/s]
end of epoch 1 | train loss: 7.5265 | valid loss: 6.4111 | valid bleu_score 2.73
Save model with best bleu score :2.73
TRAIN loss=5.051253, learning rate=0.0002101: 100%|██████████| 5530/5530 [09:41<00:00, 9.51it/s]
100%|██████████| 790/790 [00:25<00:00, 30.95it/s]
100%|██████████| 790/790 [08:29<00:00, 1.55it/s]
end of epoch 2 | train loss: 5.6566 | valid loss: 4.8901 | valid bleu_score 13.65
Save model with best bleu score :13.65
TRAIN loss=4.618272, learning rate=0.0001716: 100%|██████████| 5530/5530 [09:41<00:00, 9.51it/s]
100%|██████████| 790/790 [00:25<00:00, 30.95it/s]
100%|██████████| 790/790 [07:16<00:00, 1.81it/s]
end of epoch 3 | train loss: 4.4314 | valid loss: 4.1444 | valid bleu_score 19.75
Save model with best bleu score :19.75
TRAIN loss=3.363390, learning rate=0.0001486: 100%|██████████| 5530/5530 [09:42<00:00, 9.50it/s]
100%|██████████| 790/790 [00:25<00:00, 30.94it/s]
100%|██████████| 790/790 [07:27<00:00, 1.77it/s]
end of epoch 4 | train loss: 3.7425 | valid loss: 3.8078 | valid bleu_score 22.49
Save model with best bleu score :22.49
TRAIN loss=2.784010, learning rate=0.0001329: 100%|██████████| 5530/5530 [09:41<00:00, 9.51it/s]
100%|██████████| 790/790 [00:25<00:00, 30.92it/s]
100%|██████████| 790/790 [07:00<00:00, 1.88it/s]
end of epoch 5 | train loss: 3.3077 | valid loss: 3.6406 | valid bleu_score 23.61
Save model with best bleu score :23.61
TRAIN loss=2.984864, learning rate=0.0001213: 100%|██████████| 5530/5530 [09:42<00:00, 9.50it/s]
100%|██████████| 790/790 [00:25<00:00, 30.93it/s]
100%|██████████| 790/790 [07:01<00:00, 1.87it/s]
end of epoch 6 | train loss: 2.9858 | valid loss: 3.5483 | valid bleu_score 25.05
Save model with best bleu score :25.05
TRAIN loss=2.415353, learning rate=0.0001123: 100%|██████████| 5530/5530 [09:41<00:00, 9.51it/s]
100%|██████████| 790/790 [00:25<00:00, 30.94it/s]
100%|██████████| 790/790 [06:59<00:00, 1.88it/s]
end of epoch 7 | train loss: 2.7246 | valid loss: 3.5058 | valid bleu_score 25.26
Save model with best bleu score :25.26
TRAIN loss=2.376031, learning rate=0.0001051: 100%|██████████| 5530/5530 [09:41<00:00, 9.50it/s]
100%|██████████| 790/790 [00:25<00:00, 30.94it/s]
100%|██████████| 790/790 [07:05<00:00, 1.86it/s]
end of epoch 8 | train loss: 2.5033 | valid loss: 3.5067 | valid bleu_score 25.43
Save model with best bleu score :25.43
TRAIN loss=2.036147, learning rate=0.0000990: 100%|██████████| 5530/5530 [09:41<00:00, 9.51it/s]
100%|██████████| 790/790 [00:25<00:00, 30.97it/s]
100%|██████████| 790/790 [07:17<00:00, 1.81it/s]
end of epoch 9 | train loss: 2.3110 | valid loss: 3.5108 | valid bleu_score 25.49
Save model with best bleu score :25.49
TRAIN loss=2.295238, learning rate=0.0000940: 100%|██████████| 5530/5530 [09:40<00:00, 9.53it/s]
100%|██████████| 790/790 [00:25<00:00, 30.91it/s]
100%|██████████| 790/790 [07:11<00:00, 1.83it/s]
end of epoch 10 | train loss: 2.1405 | valid loss: 3.5340 | valid bleu_score 25.92
Save model with best bleu score :25.92
TRAIN loss=2.026224, learning rate=0.0000896: 100%|██████████| 5530/5530 [09:40<00:00, 9.52it/s]
100%|██████████| 790/790 [00:25<00:00, 30.94it/s]
100%|██████████| 790/790 [07:13<00:00, 1.82it/s]
end of epoch 11 | train loss: 1.9879 | valid loss: 3.5786 | valid bleu_score 25.53
TRAIN loss=1.975156, learning rate=0.0000858: 100%|██████████| 5530/5530 [09:41<00:00, 9.51it/s]
100%|██████████| 790/790 [00:25<00:00, 30.94it/s]
100%|██████████| 790/790 [06:52<00:00, 1.91it/s]
end of epoch 12 | train loss: 1.8505 | valid loss: 3.6214 | valid bleu_score 25.57
TRAIN loss=1.730956, learning rate=0.0000824: 100%|██████████| 5530/5530 [09:41<00:00, 9.50it/s]
100%|██████████| 790/790 [00:25<00:00, 30.97it/s]
100%|██████████| 790/790 [07:10<00:00, 1.83it/s]
end of epoch 13 | train loss: 1.7260 | valid loss: 3.6728 | valid bleu_score 25.59
TRAIN loss=1.944140, learning rate=0.0000794: 100%|██████████| 5530/5530 [09:40<00:00, 9.52it/s]
100%|██████████| 790/790 [00:25<00:00, 30.93it/s]
100%|██████████| 790/790 [07:15<00:00, 1.82it/s]
end of epoch 14 | train loss: 1.6129 | valid loss: 3.7186 | valid bleu_score 25.60
TRAIN loss=1.699621, learning rate=0.0000767: 100%|██████████| 5530/5530 [09:41<00:00, 9.51it/s]
100%|██████████| 790/790 [00:25<00:00, 30.95it/s]
100%|██████████| 790/790 [07:22<00:00, 1.79it/s]
end of epoch 15 | train loss: 1.5094 | valid loss: 3.7738 | valid bleu_score 25.44
Stop from early stopping.
100%|██████████| 1580/1580 [00:51<00:00, 30.91it/s]
100%|██████████| 1580/1580 [14:28<00:00, 1.82it/s]
TEST loss=3.5372 bleu score: 25.85
wandb: Waiting for W&B process to finish… (success).
wandb:
wandb: Run history:
wandb: train_loss █▆▄▄▃▃▂▂▂▂▂▁▁▁▁
wandb: valid_bleu_score ▁▄▆▇▇██████████
wandb: valid_loss █▄▃▂▁▁▁▁▁▁▁▁▁▂▂
wandb:
wandb: Run summary:
wandb: train_loss 1.50937
wandb: valid_bleu_score 25.44111
wandb: valid_loss 3.77379

On a single A10 GPU one epoch takes about 20 minutes; the run lasted 15 epochs, roughly 300 minutes, i.e. 5 hours in total. That is rather long and makes hyper-parameter tuning inconvenient.
The final BLEU score on the test set is 25.85.
In a later article we will look at how to reduce the overall time through, among other things, multi-GPU training and KV caching.
