基于Tensorflow来重现GPT v1模型

OpenAI推出的ChatGPT模型让我们看到了通用人工智能的发展潜力,我也找了GPT的相关论文来进行研究。OpenAI在2017年的论文Improving Language Understanding by Generative Pre-Training提出了GPT的第一个版本,我也基于这个论文来用Tensorflow进行了复现。

数据集的下载

GPT是基于bookcorpus数据集来进行预训练的。在huggingface.co网站里面提供了相关的数据集。以下代码是下载数据集并展示第一条数据

from datasets import load_dataset
dataset = load_dataset("bookcorpusopen", split="train")
dataset["train"][0]

这个数据集总共包括了17868本图书,其中每本图书对应title和text两个字段,我们将基于Text来进行训练。按照GPT论文的描述,其采用了BPE来对文本进行tokenizer,在Huggingface里面有一篇文章解释了BPE的原理和训练细节,Byte-Pair Encoding tokenization - Hugging Face NLP Course,这里我直接采用huggingface的tokenizer预训练好的gpt模型。

from transformers import OpenAIGPTTokenizer
block_size=513tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')def tokenize_function(examples):token_ids = [tokenizer(text) for text in examples["text"]]total_length = [len(t["input_ids"]) for t in token_ids]total_length = [(l//(block_size+1))*(block_size+1) for l in total_length]result = []label = []for i in range(len(total_length)):result.extend([token_ids[i]["input_ids"][j:j+block_size+1] for j in range(0, total_length[i], block_size+1)])return {"token_ids": result}ds_test = ds['train'].select(range(10000))tokenized_datasets = ds_test.map(tokenize_function, batched=True, num_proc=8, remove_columns=["title", "text"], batch_size=100
)tokenized_datasets.save_to_disk("data/boocorpusopen_10000_512tokens")

在以上代码中,我把数据集的每本书的text文本通过tokenizer来转化为token id,然后每513个tokenid保存为一条数据记录,因为在GPT论文中是对512个token进行训练的,因此我们在训练时取这513个token的前512个作为训练,然后对应的第2-513个token作为label,最后把处理后的数据集保存到本地。

因为我将要在tensorflow的模型中进行训练,还要把这个数据集转化为tensorflow dataset的格式。我们可以直接调用tokenized_datasets.to_tf_dataset函数来进行转化,但是我发现这样转换之后,要读取dataset的数据很慢。因此我先把数据集转化为TFRecords的文件格式,这样读取速度就能加快很多,以下的代码把每10万条记录保存为一个tfrecord文件,每个文件大约100M。

import tensorflow as tf
from tqdm import tqdmdef _int64_feature(value):"""Returns an int64_list from a bool / enum / int / uint."""return tf.train.Feature(int64_list=tf.train.Int64List(value=value))def serialize_example(token_ids):feature = {'token_ids': _int64_feature(token_ids)}example_proto = tf.train.Example(features=tf.train.Features(feature=feature))return example_proto.SerializeToString()records_num = 100000
count = 0
for record in tqdm(ds):if count%records_num == 0:writer = tf.io.TFRecordWriter("bookcorpus_"+str(count//records_num)+".tfrecords")writer.write(serialize_example(record['token_ids']))count += 1if count%records_num == 0:writer.close()
if writer:writer.close()

之后我们就可以读取数据了

data_dir = "/data/datasets/bookcorpus_tf/"
filenames = os.listdir(data_dir)
filenames = [data_dir+f for f in filenames]
tf_ds = tf.data.TFRecordDataset(filenames)
tf_ds = tf_ds\.map(_parse_function, num_parallel_calls=tf.data.AUTOTUNE)\.shuffle(buffer_size=batch_size*100)\.prefetch(tf.data.experimental.AUTOTUNE)\.batch(batch_size)

我们可以检查一下Batch的数据,取出数据并用tokenizer来解码

data = next(iter(tf_ds))
tokenizer.decode(data['token_ids'][0])

结果如下:

"i was sitting to the left of alex, and tinker was to his right, with lilla sitting to the right of her. tinker was leaning over to alex, chatting away, when her hand suddenly gently slid along his thigh, and onto his crotch. oops! now that was a surprise. unexpected, to say the least. right out of the blue. alex looked at me in desperation, and i initially laughed. we hadn't really thought about any sexual side to the situation, we had just considered the girls to be friends. plus, honestly, while they were very nice people, they weren't at all our cup of tea, so to speak. besides, what exactly was the situation down there, in the lady garden? had operations been done? would we end up comparing who had the bigger penis? we didn't know, and we didn't want to find out. it was time to get out of there. we both went into time - to - go mode. \n'you know, we got ta get up early in the morning, so we better hit the road.'i yelled to all and sundry. \n alex was already on his feet, yelling out something similar as well. we started heading to the door. the girls were calling after us, but i couldn't hear what they were saying, over the music, and the blood pumping in my head. i just waved back at them. \n it was with some relief that we found ourselves back out in the industrial wasteland. \n'fuck, what a surprise!'alex said as we ran off in the darkness.'i didn't see that coming. i hope they won't be upset with us. i thought they had sex for money, anyway.'\n'maybe on their night off they like a bit of young cock? fucked if i know. you should have seen the look on your face, man!'\n'shit, i don't want to think about it.'\n'don't worry, i won't be letting you forget this one.'\n we were pretty relieved, and happy to be out of that situation, and pretty much laughed about it all the way home. i would be pulling alex's leg over that one for a long time to come. mind you, probably lilla hadn't been far away from making a move on me, if it had all gone successfully with tinker and alex. sometimes it all comes down to who is the person closest to the door, or, as in that case, who is in the"

现在数据集我们已经准备好了。

建立GPT模型

根据论文的描述,GPT只采用了Transformer里面的Decoder,因为Encoder是通过查看整个训练数据的上下文来建立Token之间的联系的,但是对于文本生成来说,只能通过上文来预测之后的token,因此只能采用Decoder。论文的模型架构如下,共采用了12个Decoder组合而成,每个Decoder包含了12个Attention Head:

关于Transformer模型的解读,可以见我以前写的博客基于Tensorflow实现一个Transformer翻译器_tensorflow transformer_gzroy的博客-CSDN博客

首先是定义multi attention head,代码如下:

def scaled_dot_product_attention(q, k, v, mask):"""Calculate the attention weights.q, k, v must have matching leading dimensions.k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.The mask has different shapes depending on its type(padding or look ahead)but it must be broadcastable for addition.Args:q: query shape == (..., seq_len_q, depth)k: key shape == (..., seq_len_k, depth)v: value shape == (..., seq_len_v, depth_v)mask: Float tensor with shape broadcastableto (..., seq_len_q, seq_len_k). Defaults to None.Returns:output, attention_weights"""matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)# scale matmul_qkdk = tf.cast(tf.shape(k)[-1], tf.float32)scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)# add the mask to the scaled tensor.if mask is not None:scaled_attention_logits += (mask * -1e9)# softmax is normalized on the last axis (seq_len_k) so that the scores# add up to 1.attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)return output, attention_weightsclass MultiHeadAttention(tf.keras.layers.Layer):def __init__(self,*, d_model, num_heads):super(MultiHeadAttention, self).__init__()self.num_heads = num_headsself.d_model = d_modelassert d_model % self.num_heads == 0self.depth = d_model // self.num_headsself.wq = tf.keras.layers.Dense(d_model)self.wk = tf.keras.layers.Dense(d_model)self.wv = tf.keras.layers.Dense(d_model)self.dense = tf.keras.layers.Dense(d_model)def split_heads(self, x, batch_size):"""Split the last dimension into (num_heads, depth).Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)"""x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))return tf.transpose(x, perm=[0, 2, 1, 3])def call(self, v, k, q, mask):batch_size = tf.shape(q)[0]q = self.wq(q)  # (batch_size, seq_len, d_model)k = self.wk(k)  # (batch_size, seq_len, d_model)v = self.wv(v)  # (batch_size, seq_len, d_model)q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)# scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)# attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)concat_attention = tf.reshape(scaled_attention,(batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)return output, attention_weights

然后是Feed forward层,代码如下

def point_wise_feed_forward_network(d_model, dff):return tf.keras.Sequential([tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)])

定义一个decoder layer,把以上的两个层组合起来:

class DecoderLayer(tf.keras.layers.Layer):def __init__(self,*, d_model, num_heads, dff, rate=0.1):super(DecoderLayer, self).__init__()self.mha = MultiHeadAttention(d_model=d_model, num_heads=num_heads)self.ffn = point_wise_feed_forward_network(d_model, dff)self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)self.dropout1 = tf.keras.layers.Dropout(rate)self.dropout2 = tf.keras.layers.Dropout(rate)def call(self, x, training, look_ahead_mask):attn, attn_weights_block = self.mha(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)attn = self.dropout1(attn, training=training)out1 = self.layernorm1(attn + x)ffn_output = self.ffn(out1)  # (batch_size, target_seq_len, d_model)ffn_output = self.dropout2(ffn_output, training=training)out2 = self.layernorm2(ffn_output + out1)  # (batch_size, target_seq_len, d_model)return out2, attn_weights_bloc

最后就是定义一个GPT模型,模型里面包括了12个Decoder layer。

class Decoder(tf.keras.layers.Layer):def __init__(self,*, num_layers, d_model, num_heads, dff, target_vocab_size, rate=0.1):super(Decoder, self).__init__()self.d_model = d_modelself.num_layers = num_layersself.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)self.pos_encoding = tf.reshape(tf.range(target_vocab_size-block_size, target_vocab_size), shape=[1, -1])self.dec_layers = [DecoderLayer(d_model=d_model, num_heads=num_heads, dff=dff, rate=rate)for _ in range(num_layers)]self.dropout = tf.keras.layers.Dropout(rate)def call(self, x, training, look_ahead_mask):#seq_len = tf.shape(x)[1]attention_weights = {}x = self.embedding(x)  # (batch_size, block_size, d_model)x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))x += self.embedding(self.pos_encoding)x = self.dropout(x, training=training)for i in range(self.num_layers):x, block1 = self.dec_layers[i](x, training, look_ahead_mask)attention_weights[f'decoder_layer{i+1}_block1'] = block1# x.shape == (batch_size, target_seq_len, d_model)return x, attention_weightstarget_vocab_size = vocab_size + block_sizetransformer = Transformer(num_layers=num_layers,d_model=d_model,num_heads=num_heads,dff=dff,target_vocab_size=target_vocab_size,rate=dropout_rate)

解释一下上面的代码,输入的序列Token通过嵌入向量的变换,每个Token映射到一个768维的向量。然后这个序列的向量需要添加位置信息,按照论文的解释,这里没有采用正弦余弦的位置信息,而是采用嵌入向量的方式。例如词汇表由40000个词,对应40000个token,我们的输入序列是包括512个token,因此新增40000-40511这512个token对应输入序列的各个位置,然后把这个位置token对应的嵌入向量加到输入序列的嵌入向量中,使得输入包含每个token的位置信息。

训练模型

要训练模型,我们需要定义一个Loss函数,来计算模型的Loss值。

因为我们是根据给定上文的若干个token,模型预测下一个token,因此我们计算这个预测的CategoryCrossEntropy,代码如下:

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')def loss_function(real, pred):mask = tf.math.logical_not(tf.math.equal(real, 0))loss_ = loss_object(real, pred)mask = tf.cast(mask, dtype=loss_.dtype)loss_ *= maskreturn tf.reduce_sum(loss_)/tf.reduce_sum(mask)train_loss = tf.keras.metrics.Mean(name='train_loss')

为了在训练过程中了解模型的预测性能,我们还需要定义一个准确度的指标

def accuracy_function(real, pred):accuracies = tf.equal(real, tf.argmax(pred, axis=2))mask = tf.math.logical_not(tf.math.equal(real, 0))accuracies = tf.math.logical_and(mask, accuracies)accuracies = tf.cast(accuracies, dtype=tf.float32)mask = tf.cast(mask, dtype=tf.float32)return tf.reduce_sum(accuracies)/tf.reduce_sum(mask)train_accuracy = tf.keras.metrics.Mean(name='train_accuracy')

按照论文的描述,采用了Adam optimizer来优化模型,学习率在最初的2000个Batch的训练中由0增加到0.00025,然后采用余弦衰减,在100个Epoch后降为0。在新版的Tensorflow里面有一个新的CosineDecay可以直接调用

epoch_steps = 1680000//batch_size
epochs = 100
decay_steps = epoch_steps*epochs
initial_learning_rate = 0
warmup_steps = 2000
target_learning_rate = 0.00025
lr_warmup_decayed_fn = tf.keras.optimizers.schedules.CosineDecay(initial_learning_rate, decay_steps, warmup_target=target_learning_rate,warmup_steps=warmup_steps
)optimizer = tf.keras.optimizers.Adam(lr_warmup_decayed_fn, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

在训练过程中,我们要保存中间过程的训练结果,为此定义checkpoint

checkpoint_path = './checkpoints/train'#定义两个trackable object需要保存
ckpt = tf.train.Checkpoint(transformer=transformer, optimizer=optimizer)ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)# if a checkpoint exists, restore the latest checkpoint.
if ckpt_manager.latest_checkpoint:ckpt.restore(ckpt_manager.latest_checkpoint)print('Latest checkpoint restored!!')

定义一个训练函数,计算loss并调用optimizer来进行优化,如以下代码:

train_step_signature = [tf.TensorSpec(shape=(None, None), dtype=tf.int64),tf.TensorSpec(shape=(None, None), dtype=tf.int64)
]
@tf.function(input_signature=train_step_signature)
def train_step(inp, tar):with tf.GradientTape() as tape:predictions, _ = transformer(inp, training = True)loss = loss_function(tar, predictions)gradients = tape.gradient(loss, transformer.trainable_variables)optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))train_loss(loss)train_accuracy(accuracy_function(tar, predictions))

最后我们就可以进行模型的训练了,在训练过程中我们将每100个batch打印loss值和预测准确度,然后每个Epoch结束时用checkpoint来保存训练结果。

for epoch in range(EPOCHS):start = time.time()train_loss.reset_states()train_accuracy.reset_states()# inp -> portuguese, tar -> englishfor (batch, inputs) in enumerate(tf_ds):try:train_step(inputs[...,:-1], inputs[...,1:])except ValueError:print(inputs)breakif batch % 10 == 0:print(f'Epoch {epoch + 1} Batch {batch} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')if batch == 100:breakif (epoch + 1) % 5 == 0:ckpt_save_path = ckpt_manager.save()print(f'Saving checkpoint for epoch {epoch+1} at {ckpt_save_path}')print(f'Epoch {epoch + 1} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')print(f'Time taken for 1 epoch: {time.time() - start:.2f} secs\n')

训练结果

我本地的GPU是一个2080的卡,只有12GB内存。按照论文的描述,是采用了8块P600显卡训练了30天。我没有这么多的显卡资源,因此只简单的训练了几个Epoch,可以看到Loss值在训练过程中不断下降以及预测下一个Token的准确度的增加。稍后我将训练更多的EPOCH,然后在这个训练的模型上进行有监督的学习,以完成不同的NLP任务(例如文本分类,文本问答等等)。

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/213.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

Keepalived 安装与配置

安装 Keepalived apt -y install keepalived 里边有一个杠y,就是我安装的时候里面有yes,就直接是yes 添加 Keepalived 配置 安装好之后, 下一步就开始去来写这个配置文件了,就在这里面去建一个 etc 当中,就是在这个 etc 当中建一个…

认识企业级定时任务Quartz

文章目录 前言一、实现一个Quartz的小案例1.创建一个maven项目2.添加Quartz依赖3.创建一个配置文件配置Quartz信息4.创建一个Job类继承Job接口5.编写主方法逻辑进行测试6.测试运行结果 二、Job和JobDetail总结 前言 目前仍有大部分企业仍在使用Quartz这种定时任务框架&#xf…

45. 跳跃游戏 II (贪心)

题目链接:力扣 解题思路:贪心,尽可能地找到下一跳能够跳到的最远距离,这样到达终点时,所需跳跃次数最少 以nums [2,3,1,1,4,2]为例: 以当前位置begin作为起跳点,能够跳跃的最远距离为m&#…

7.7知识点

1.React.createElement() jsx语法是React.createElement()的语法糖 虚拟dom 本质是一个js对象(类型为Object) 虚拟dom属性比较少(轻),真实dom属性方法比较多,因为在react内部使用,没有挂载到真实dom中,所以不用很多真实dom中的属性。(…

MySQL每日一练:多表查询——连接查询、子查询

目录 1、首先创建员工表emp和部门表dept: dept表: emp表: 2、插入数据: dept表: emp表: 3、 按条件查找 1、首先创建员工表emp和部门表dept: dept表: create table dept (…

存储优化和Protobuf

存储优化和Protobuf 本文链接:https://blog.csdn.net/feather_wch/article/details/131649688 文章目录 存储优化和ProtobufAndroid存储方式Protobuf介绍使用原理 7z压缩 Android存储方式 1、Android存储方式 SharePreferenceSQLite: 复杂的数据结构File -> p…

以太网(Ethernet)入门了解

以太网(Ethernet)是一种常见的局域网(LAN)通信协议,它是由Xerox公司于1970年代中期开发的。以太网是一种基于广播技术的开放式网络协议,它允许设备在共享通信介质上进行通信。以下是关于以太网的基本概念、…

MySQL 多表查询练习

1.创建student和score表 CREATE TABLE student ( id INT(10) NOT NULL UNIQUE PRIMARY KEY , name VARCHAR(20) NOT NULL , sex VARCHAR(4) , birth YEAR, department VARCHAR(20) , address VARCHAR(50) );创建score表。SQL代码如下: CREATE TABLE s…

OpenCV 入门教程:Laplacian算子和Canny边缘检测

OpenCV 入门教程: Laplacian 算子和 Canny 边缘检测 导语一、Laplacian 算子二、Canny 边缘检测三、示例应用3.1 图像边缘检测3.2 边缘增强 总结 导语 边缘检测在图像处理和计算机视觉领域中起着重要的作用。 Laplacian 算子和 Canny 边缘检测是两种常用的边缘检测…

CAT1模块 EC800M HTTP使用总结记录

分享记录一下 CAT1 模块EC800 HTTP 协议使用流程 ...... by 矜辰所致目录 前言一、基础说明1.1 CAT1 与 4G1.2 EC800M 模块1.3 HTTP 二、开始使用2.1 硬件设计部分2.2 模块上电流程2.3 PDP 上下文2.3.1 什么是 SGSN 和 GGSN ? 三、 HTTP 流程3.1 客户端3.1.1 PDP 上…

Spring MVC JSON 数据交互

🔔 JSON 语法 | 菜鸟教程 为实现浏览器与控制器类之间的 JSON 数据交互, Spring MVC 提供了 MappingJackson2HttpMessageConverter 实现类默认处理 JSON 格式请求响应。该实现类利用 Jackson 开源包读写 JSON 数据,将 Java 对象转换为 JSO…

k8s发生故障,可以从哪些方面排查

目录 第一方面:审视集群状态 第二方面:追踪事件日志 第三方面:聚焦Pod状态 第四方面:检查网络连通性 第五方面:审视存储配置 第六方面:研究容器日志 第一方面:审视集群状态 K8S的集群状态…

Ubuntu18.04 系统安装 Docker

1、首先更新软件源: sudo apt-get updatesudo apt-get upgrade 2、安装Docker: sudo apt install docker -y 3、查看安装的Docker apt list docker 4、查看docker 进程 ps -ef|grep docker 5、查看docker 版本有问题 6、开启Docker服务 systemctl…

10_SPI_Flash 连续写实验

10_SPI_Flash 连续写实验 1. 实验目标2. 连续写方法3. 操作时序4. 流程框图4.1 顶层模块4.2 连续写模块 5. 波形图6. RTL6.1 flash_seq_wr_ctrl6.2 spi_flash_seq_wr 7. Testbench 1. 实验目标 使用页写指令,将串口发送过来的连续不定量数据写入 Flash。本实验中&a…

Web安全——数据库mysql学习

数据库mysql基础 Web安全分享一、数据库的基本操作1、MYSQL登录与退出2、MYSQL数据库的一些解释3、MYSQL注释符有三种: 二、数据库的一些基本操作1、数据库的增删改查(sql语句) 三、table 表的操作1、查看表结构2、查看表的内容3、建立表4、约束条件5、修改表的操作…

网络通信原理系统的认知(NEBASE第十四课)

1 物理层 第一层 物理层:建立、维护、断开物理连接,定义了接口及介质,实现了比特流的传输。 1.1传输层介质分类 有线介质:网线 (双绞线)光纤 无线: 无线电 1.2 双绞线 五类线 cat5 :适用 10…

“深入解析Redis的数据结构与应用场景“

标题:深入解析Redis的数据结构与应用场景 摘要:本文将深入解析Redis的数据结构与应用场景,包括字符串、列表、哈希表、集合和有序集合。通过详细介绍每种数据结构的特点和常见使用场景,帮助开发工程师更好地理解和应用Redis。 正…

Docker的使用

1. Docker的含义 1. 镜像:别人开发好的基础,类似C中的类 2. 容器:自己创建的实例 2.如何使用docker 可以从网上拉取相应的镜像,然后宿主机以下命令:创建Dockerfile文件docker build -t image_name:tag . 创建自己的do…

RxJava 2.0 全体系梳理(持续更新ing)

事件流 数据流 如何组合和变换流 如何订阅任何可观察的数据流 目录 一.源码解析 二.基本使用 步骤 一、创建Observable 二、创建Observer 三、使用subscribe进行订阅 四、取消订阅 Disposable 五种Observable(被观察者) Observable 和 Observer Cold Observable …

第二次CCF计算机软件能力认证

第一题:相邻数对 给定 n 个不同的整数,问这些数中有多少对整数,它们的值正好相差 1。 输出格式 输入的第一行包含一个整数 n,表示给定整数的个数。 第二行包含所给定的 n 个整数。 输出格式 输出一个整数,表示值正好相…