Chapter 4.1 Coding an LLM architecture

4 Implementing a GPT model from Scratch To Generate Text

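This chapter covers
1. Coding a GPT-like large language model (LLM) that can be trained to generate human-like text
2. Normalizing layer activations to stabilize neural network training
3. Adding shortcut connections in deep neural networks to train models more effectively
4. Implementing transformer blocks to create GPT models of various sizes
5. Computing the number of parameters and storage requirements of GPT models

In the previous chapter, we learned about and coded the multi-head attention mechanism, one of the core components of LLMs. In this chapter, we code the other building blocks of an LLM and assemble them into a GPT-like model.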

4.1 Coding an LLM architecture

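Models such as GPT and Llama are based on the decoder part of the original Transformer architecture, which is why these LLMs are often called "decoder-like" LLMs. Compared with conventional deep learning models, LLMs are larger, mainly because of their vast number of parameters rather than the amount of code: many of their components are simply repeated. The figure below provides a top-down view of a GPT-like LLM.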

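We will build the smallest GPT-2 model (124 million parameters) in detail. The implementation is written so that pretrained weights can later be loaded into it and so that it stays compatible with the larger model sizes.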

The configuration of the 124-million-parameter GPT-2 model is:
```
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias
}
```
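We use short variable names to keep later lines of code from getting too long:
1. "vocab_size" is the vocabulary size of 50,257 tokens, as used by the BPE tokenizer.
2. "context_length" is the maximum number of input tokens the model can handle, made possible by the positional embeddings.
3. "emb_dim" is the embedding size for token inputs; each token is converted into a 768-dimensional vector.
4. "n_heads" is the number of attention heads in the multi-head attention mechanism.
5. "n_layers" is the number of transformer blocks in the model.
6. "drop_rate" is the strength of the dropout mechanism discussed in chapter 3; 0.1 means that 10% of the hidden units are dropped during training to mitigate overfitting.
7. "qkv_bias" determines whether the Linear layers in the multi-head attention mechanism (chapter 3) include a bias vector when computing the query (Q), key (K), and value (V) tensors. We disable this option, which is standard practice in modern LLMs; however, we will revisit it in chapter 5 when loading OpenAI's pretrained GPT-2 weights into our reimplementation.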
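To make these settings more concrete, the following minimal sketch derives a few quantities from the configuration. It assumes the usual multi-head convention in which "emb_dim" is split evenly across the heads, and it counts only the weights of the two embedding tables; the variable names `head_dim`, `tok_emb_params`, and `pos_emb_params` are illustrative and not part of the original code.

```python
# Same configuration dictionary as in the text
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

cfg = GPT_CONFIG_124M

# Assumption: the embedding dimension is divided evenly among the heads
head_dim = cfg["emb_dim"] // cfg["n_heads"]

# An embedding table stores one emb_dim-sized vector per row
tok_emb_params = cfg["vocab_size"] * cfg["emb_dim"]
pos_emb_params = cfg["context_length"] * cfg["emb_dim"]

print(f"Dimensions per attention head: {head_dim}")             # 64
print(f"Token embedding weights:       {tok_emb_params:,}")     # 38,597,376
print(f"Positional embedding weights:  {pos_emb_params:,}")     # 786,432
```

The token embedding table alone already accounts for roughly 38.6 million of the model's weights, which shows how strongly the vocabulary size influences the parameter count.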

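The boxes in the figure below show the order in which we tackle the individual concepts needed to implement the final GPT architecture. We start with step 1, a placeholder GPT backbone that we call DummyGPTModel: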

import torch
import torch.nn as nn


class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        # Use a placeholder for TransformerBlock
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )

        # Use a placeholder for LayerNorm
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits


class DummyTransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # A simple placeholder

    def forward(self, x):
        # This block does nothing and just returns its input.
        return x


class DummyLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        # The parameters here are just to mimic the LayerNorm interface.

    def forward(self, x):
        # This layer does nothing and just returns its input.
        return x

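DummyGPTModel: a simplified GPT-like model implemented with PyTorch's nn.Module.
Model components: token embeddings, positional embeddings, a dropout layer, transformer blocks, layer normalization, and a linear output layer.
Configuration dictionary: the model configuration is passed in as a Python dictionary, such as GPT_CONFIG_124M.
forward method: describes the complete data flow from input to output. Compute embeddings → apply dropout → process through the transformer blocks → apply normalization → produce logits.
Placeholders: DummyLayerNorm and DummyTransformerBlock are components that still need to be implemented.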
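The shape bookkeeping in the forward method can be traced with a small standalone sketch. It uses the same PyTorch layers as DummyGPTModel, but with hypothetical toy sizes (a vocabulary of 100, a context length of 8, an embedding dimension of 16) and made-up token IDs chosen only to keep the example fast; these values are assumptions for illustration, not the GPT-2 settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes, chosen only for illustration
vocab_size, context_length, emb_dim = 100, 8, 16

tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)
out_head = nn.Linear(emb_dim, vocab_size, bias=False)

# Two made-up sequences of four token IDs each
in_idx = torch.tensor([[1, 2, 3, 4],
                       [5, 6, 7, 8]])
batch_size, seq_len = in_idx.shape

tok_embeds = tok_emb(in_idx)                 # (batch_size, seq_len, emb_dim)
pos_embeds = pos_emb(torch.arange(seq_len))  # (seq_len, emb_dim)

# Broadcasting adds the same positional vectors to every sequence in the batch
x = tok_embeds + pos_embeds                  # (batch_size, seq_len, emb_dim)

# The output head maps each token vector to one score per vocabulary entry
logits = out_head(x)                         # (batch_size, seq_len, vocab_size)

print(tok_embeds.shape)  # torch.Size([2, 4, 16])
print(pos_embeds.shape)  # torch.Size([4, 16])
print(logits.shape)      # torch.Size([2, 4, 100])
```

Because the positional embeddings have no batch dimension, the addition relies on broadcasting: every sequence in the batch receives the same position vectors.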

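Data flow: the figure below provides a high-level overview of how data flows through the GPT model.

We use the tiktoken tokenizer to tokenize a batch consisting of two text inputs for the GPT model: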

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)

Output:

tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])

Next, we initialize a 124-million-parameter DummyGPTModel instance and feed it the tokenized batch:

torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)

logits = model(batch)
print("Output shape:", logits.shape)
print(logits)

Output:

Output shape: torch.Size([2, 4, 50257])
tensor([[[-0.9289,  0.2748, -0.7557,  ..., -1.6070,  0.2702, -0.5888],
         [-0.4476,  0.1726,  0.5354,  ..., -0.3932,  1.5285,  0.8557],
         [ 0.5680,  1.6053, -0.2155,  ...,  1.1624,  0.1380,  0.7425],
         [ 0.0447,  2.4787, -0.8843,  ...,  1.3219, -0.0864, -0.5856]],

        [[-1.5474, -0.0542, -1.0571,  ..., -1.8061, -0.4494, -0.6747],
         [-0.8422,  0.8243, -0.1098,  ..., -0.1434,  0.2079,  1.2046],
         [ 0.1355,  1.1858, -0.1453,  ...,  0.0869, -0.1590,  0.1552],
         [ 0.1666, -0.8138,  0.2307,  ...,  2.5035, -0.3055, -0.3083]]],
       grad_fn=<UnsafeViewBackward0>)
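The output shape can be read off directly from the inputs: one row per text, one entry per token, and one score per vocabulary entry. The short sketch below recomputes that shape in plain Python from the token IDs printed earlier; the helper names `expected_shape` and `total_logits` are illustrative only.

```python
# Token IDs of the two tokenized texts, as printed above
batch = [
    [6109, 3626, 6100, 345],   # "Every effort moves you"
    [6109, 1110, 6622, 257],   # "Every day holds a"
]
vocab_size = 50257  # "vocab_size" from GPT_CONFIG_124M

batch_size = len(batch)     # number of texts
num_tokens = len(batch[0])  # tokens per text

# The output head produces one score per vocabulary entry for every token
expected_shape = (batch_size, num_tokens, vocab_size)
total_logits = batch_size * num_tokens * vocab_size

print(expected_shape)  # (2, 4, 50257)
print(total_logits)    # 402056
```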