LLM - Designing a 100B Pre-training Plan (PLM) with Scaling Laws, Tutorial (5)

Follow me on CSDN: https://spike.blog.csdn.net/
Article URL: https://spike.blog.csdn.net/article/details/145356022

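Disclaimer: This article is based on personal knowledge and public materials and is intended for academic exchange only. Discussion is welcome; reposting is not permitted.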

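Scaling Laws are empirical rules in the large-model field that describe how model performance (Loss) relates to model size N, data size D, and compute C. They show how performance changes as the parameter count, dataset size, and compute budget grow, and so guide how to allocate resources and tune the training process more efficiently. Besides predicting how models of different sizes will perform, they provide a theoretical basis for model design and training, and are an important foundation for developing and applying large models.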

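This article uses Scaling Laws to guide the pre-training plan of a 100B model, covering server resources, the 3D parallel strategy, the Transformer architecture, DeepNorm, the mixed-precision strategy, the EGS strategy, AdamW, warm-up, gradient clipping, samples, and positional encoding, so that the large model trains stably and efficiently.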

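Articles in this series:

Scaling Laws for large models: deriving the C=6ND formula
Scaling Laws for large models: different coefficients for CLM and MLM
Scaling Laws for large models: transfer learning and mixed training
Scaling Laws for large models: guiding model design and the experimental environment
Scaling Laws for large models: designing a 100B pre-training plan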

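Experimental details of the Our-100B PLM model:

Data size D: 940M sequences, about 200B tokens;
Model size N: 100B parameters;
Compute C: 1000B tokens, i.e. $6.2 \times 10^{23}$ FLOPs;
Training resources: 96 NVIDIA DGX servers with $8 \times$ A100 GPUs each, i.e. 768 A100 GPUs in total.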

Compared with $C = 6ND$, the effective coefficient of this run is:

$\begin{align} \alpha = \frac{C}{ND} = \frac{6.2 \times 10^{23}}{(100 \times 10^9) \times (1000 \times 10^{9})} = 6.2 \end{align}$

According to the latest Scaling Laws (CLM and MLM), for a compute budget of $6.2 \times 10^{23}$ FLOPs the reasonable model size $N$ and data size $D$ are:

$\begin{align} N &= (1.26 \times 10^{-3}) \times C^{0.578} \\ N &= 1.26 \times 10^{-3} \times (6.2 \times 10^{23})^{0.578} \\ &= 70 \times 10^9 \\ D &= (1.23 \times 10^{2}) \times C^{0.422} \\ D &= 1.23 \times 10^{2} \times (6.2 \times 10^{23})^{0.422} \\ &= 1350 \times 10^9 \\ \\ N &= (6.19 \times 10^{-8}) \times C^{0.776} \\ N &= (6.19 \times 10^{-8}) \times (6.2 \times 10^{23})^{0.776} \\ &= 180 \times 10^9 \\ D &= (2.02 \times 10^{6}) \times C^{0.230} \\ D &= (2.02 \times 10^{6}) \times (6.2 \times 10^{23})^{0.230} \\ &= 600 \times 10^9 \\ \end{align}$

Here the compute (C) tokens are split into 200B for CLM pre-training and 800B for MLM fine-tuning.
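The power-law fits above can be checked numerically. The following is a minimal sketch that only re-evaluates the coefficients quoted in the formulas; the labels "first fit" and "second fit" follow the order of the equations (assumed to correspond to CLM and MLM, in the order the text names them), and the helper name `power_law` is made up for illustration.

```python
# Re-evaluate the fitted power laws quoted above for C = 6.2e23 FLOPs.
C = 6.2e23  # training compute in FLOPs


def power_law(coef: float, exponent: float, compute: float) -> float:
    """Evaluate coef * compute ** exponent."""
    return coef * compute ** exponent


# First pair of fits (assumed CLM): N ~ C^0.578, D ~ C^0.422
n_first = power_law(1.26e-3, 0.578, C)
d_first = power_law(1.23e2, 0.422, C)

# Second pair of fits (assumed MLM): N ~ C^0.776, D ~ C^0.230
n_second = power_law(6.19e-8, 0.776, C)
d_second = power_law(2.02e6, 0.230, C)

# Effective coefficient alpha = C / (N * D) for the actual 100B model / 1000B tokens.
alpha = C / (100e9 * 1000e9)

for name, value in [
    ("N (first fit)", n_first),
    ("D (first fit)", d_first),
    ("N (second fit)", n_second),
    ("D (second fit)", d_second),
]:
    print(f"{name}: {value / 1e9:.0f}B")
print(f"alpha = {alpha:.2f}")
```

The printed sizes should land close to the rounded 70B / 1350B and 180B / 600B values in the equations, so the chosen 100B model sits between the two fitted optima.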

1. Overall Plan

The initial Our-100B PLM model was trained from 2024.1.18 to 2024.6.30, an actual training time of 165 days. Details:

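Server resources: 96 $\times$ DGX-A100 servers (8 $\times$ 80G GPUs each), 768 A100 GPUs in total, i.e. $768 \times 80\text{G} \approx 60\text{T}$ of GPU memory. The training precision is FP16 (later upgraded from FP16 to BF16).
DeepSpeed 3D parallel strategy: 4-way tensor parallel, 8-way pipeline parallel, and 24-way data parallel, i.e. $4 \times 8 \times 24 = 768$, which matches the GPU count. Model training is quoted as occupying about 10T, while one model replica spans $8 \times 4$ GPUs, i.e. $8 \times 4 \times 80\text{G} = 2.56\text{T}$.
Training data: 1T tokens.
Transformer architecture: 72 layers, 80 heads, an embedding dimension of 10240, and a GeGLU feed-forward dimension of 30720. Improvements: GQA, and GeGLU replaced by SwiGLU.
Vocabulary size: 26 tokens, i.e. 20 amino acids + 6 special tokens: the prediction tokens [MASK], [sMASK], [gMASK] and the sentence delimiters <sop>, <eop>, <eos>.
DeepNorm is used to implement Post-LayerNorm. Improvement: LayerNorm replaced by RMSNorm.
Apex O2 mixed-precision strategy: the forward and backward passes use FP16 (BF16), while the optimizer states and master weights use FP32, which reduces GPU memory use and improves training efficiency.
Embedding layer gradient shrink (EGS) strategy: the shrink factor $\alpha$ is set to 0.1 to stabilize training of the 100B model.
AdamW optimizer with $\beta_{1}=0.9$ and $\beta_{2}=0.95$, and a weight decay of 0.01.
Warm-up: the batch size warms up from 240 to 4224; the learning rate (LR) warms up from the order of $10^{-7}$ to the order of $10^{-5}$ and then decays to the order of $10^{-6}$. The warm-up covers the first 3% of the samples, i.e. 30B tokens.
Gradient clipping: clipping by the L2 norm of the gradient (Gradient Clipping by Norm), with threshold $C = 1.0$.
Samples: every sample has a fixed sequence length of 2048. Sequences are joined with the delimiter <eos> into a document, training sequences are sampled from that document, and no padding is used during pre-training. To adapt the model to downstream sequences of different lengths, a mix-length pre-training strategy is used with four context windows: 256, 512, 1024 and 2048. Taking 512 as an example, 4 samples are concatenated to fill the total sequence length of 2048. The context-length ratio is $[256 : 512 : 1024 : 2048 = 0.1 : 0.4 : 0.4 : 0.1]$
Positional encoding: RoPE (rotary position embedding). Reference: Understanding Rotary Position Embedding (RoPE) and Its Advantages over Absolute and Relative Position Encodings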
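The mix-length sample construction described in the list can be sketched as follows. This is an illustration under stated assumptions, not the actual data pipeline: the token id `EOS_ID`, the function names, and the uniform random choice of chunk start positions are all hypothetical; only the 2048 length, the four window sizes, and the 0.1 : 0.4 : 0.4 : 0.1 ratio come from the text.

```python
import random

EOS_ID = 25                      # hypothetical id for <eos>
SEQ_LEN = 2048                   # fixed sample length from the text
WINDOWS = [256, 512, 1024, 2048]  # context windows from the text
WINDOW_PROBS = [0.1, 0.4, 0.4, 0.1]


def pack_document(sequences, eos_id=EOS_ID):
    """Concatenate tokenized sequences into one stream, separated by <eos>."""
    stream = []
    for seq in sequences:
        stream.extend(seq)
        stream.append(eos_id)
    return stream


def sample_training_example(stream, rng):
    """Pick a context window, then fill SEQ_LEN tokens with chunks of that size."""
    window = rng.choices(WINDOWS, weights=WINDOW_PROBS, k=1)[0]
    n_chunks = SEQ_LEN // window  # e.g. 4 chunks for a 512 window
    tokens = []
    for _ in range(n_chunks):
        start = rng.randrange(0, len(stream) - window + 1)
        tokens.extend(stream[start:start + window])
    return window, tokens  # always SEQ_LEN tokens, so no padding is needed


rng = random.Random(0)
stream = pack_document([[1, 2, 3]] * 1000)  # toy "document" of 4000 tokens
window, tokens = sample_training_example(stream, rng)
print(window, len(tokens))
```

Because every window size divides 2048, each sample is completely filled and padding never appears, which matches the "no padding during pre-training" requirement.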

2. Plan Details

The details cover: the 3D parallel strategy, FFN GeGLU (to SwiGLU), the Transformer parameter count, DeepNorm (LayerNorm to RMSNorm), the mixed-precision strategy, embedding layer gradient shrink (EGS), the AdamW optimizer, and gradient norm clipping.

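2.1 3D Parallel Strategy

About the DeepSpeed 3D parallel strategy:

Tensor parallel, 4-way: the model parameters are split across 4 GPUs.
Pipeline parallel, 8-way: the model layers are split across 8 groups of GPUs. Together with tensor parallel this forms model parallel (Model Parallel), with the total GPU memory for the model parameters given as $32 \times 4 \times 80 = 10T$
Data parallel, 24-way: the training data is split into 24 mini-batches that the GPUs process in parallel.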
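The bookkeeping behind these three parallel degrees is simple arithmetic, sketched below. The rank-to-coordinate mapping (tensor parallel as the fastest-varying axis) is an assumption made for illustration; the actual DeepSpeed topology of this run is not described in the text.

```python
TP, PP, DP = 4, 8, 24   # tensor, pipeline, data parallel degrees from the text
GPU_MEM_GB = 80         # A100 80G
N_LAYERS = 72

total_gpus = TP * PP * DP                        # must equal the cluster size
gpus_per_replica = TP * PP                       # GPUs holding one model copy
replica_mem_gb = gpus_per_replica * GPU_MEM_GB   # memory behind one model copy
total_mem_gb = total_gpus * GPU_MEM_GB
layers_per_stage = N_LAYERS // PP                # layers per pipeline stage


def rank_to_coords(rank: int):
    """Map a global rank to (dp, pp, tp), assuming TP varies fastest (hypothetical layout)."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp


print(total_gpus, gpus_per_replica, replica_mem_gb, total_mem_gb, layers_per_stage)
print(rank_to_coords(767))
```

With these degrees, one model replica occupies 32 GPUs and each pipeline stage holds 9 of the 72 layers.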

For reference, the 384-GPU case (figure).

Reference: Distributed Training of Large Language Models

2.2 FFN GeGLU (to SwiGLU)

GeGLU is GeLU + GLU (Gated Linear Unit):
$\begin{align} GeGLU(x_{1}, x_{2}) &= x_{2} \odot \sigma(x_{1}) = x_{2} \odot GeLU(x_{1}) \\ GeLU(x_{1}) &= x_{1}P(X \le x_{1}) = x_{1} \Phi(x_{1}) \\ GeLU(x) &= x \cdot \frac{1+erf(\frac{x}{\sqrt{2}})}{2} \\ GeLU(x) &\approx x \cdot \sigma(1.702x) \end{align}$

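Here $\Phi(x)$ is the cumulative distribution function (CDF) of the standard normal distribution, $erf(x)$ is the Gauss error function, and $\sigma$ in the last approximation is the Sigmoid function.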
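As a quick numerical check of these definitions, the sketch below compares the exact GeLU with its sigmoid approximation and also evaluates SiLU, using only the standard library. The grid of points and the function names are choices made for this illustration.

```python
import math


def gelu_exact(x: float) -> float:
    """GeLU(x) = x * Phi(x), with Phi written via the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_sigmoid(x: float) -> float:
    """Approximation GeLU(x) ~ x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


def silu(x: float) -> float:
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


xs = [i / 100.0 for i in range(-700, 701)]  # grid on [-7, 7]
max_gap = max(abs(gelu_exact(x) - gelu_sigmoid(x)) for x in xs)
gelu_min = min(gelu_exact(x) for x in xs)
silu_min = min(silu(x) for x in xs)

print(f"max |exact - approx| = {max_gap:.4f}")
print(f"min GeLU = {gelu_min:.4f}, min SiLU = {silu_min:.4f}")
```

On this grid the approximation stays within a few hundredths of the exact GeLU, and both activations dip only slightly below zero for negative inputs.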

Likewise, Llama3 uses SwiGLU, i.e. SiLU + GLU:
$\begin{align} SwiGLU(x_{1}, x_{2}) &= x_{2} \odot \sigma(x_{1}) = x_{2} \odot SiLU(x_{1}) \\ SiLU(x_{1}) &= x_{1}Sigmoid(x_{1}) \end{align}$

In large models, this gate is normally built into the FFN (Feed-Forward Network):
$SwiGLU\ FFN=(SiLU(xW_{1}^{\top}) \odot xW_{3}^{\top})W_{2}^{\top}$
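A minimal PyTorch sketch of such a SwiGLU feed-forward block is shown below. The class name `SwiGLUFFN`, the bias-free linear layers, and the toy dimensions are assumptions for illustration; `nn.Linear` computes $xW^{\top}$, which matches the transposes in the formula.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    """Gated feed-forward block: (SiLU(x W1^T) * (x W3^T)) W2^T."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


ffn = SwiGLUFFN(dim=64, hidden_dim=192)   # toy sizes with the same 3x ratio
out = ffn(torch.randn(2, 16, 64))         # (batch, sequence, dim)
n_params = sum(p.numel() for p in ffn.parameters())
print(out.shape, n_params)
```

Note that the gated variant needs three weight matrices instead of two, which is why the FFN term in a parameter count uses a factor of 3.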

The figure shows GeLU and SiLU together with the approximation of GeLU (figure: GeLU).

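Property	GeLU	SiLU
Output range	$[\approx -0.17, +\infty)$	$[\approx -0.28, +\infty)$
Smoothness	Output and derivative are both smooth	Output and derivative are both smooth
Response to negative inputs	Non-zero but small	Non-zero but small
Computational cost	Higher (Gauss error function)	Lower (Sigmoid function)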

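Advantages of GeLU and SiLU:

Outputs are closer to a zero-mean distribution, which helps speed up training;
Smooth derivatives help mitigate vanishing and exploding gradients, which stabilizes training;
They fit data with Gaussian characteristics better and strengthen the modelling of non-linear relations in the input.

Reference: Implementing the LLaMA3 Network and Inference Pipeline from Scratch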

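2.3 Transformer Parameter Count

With 72 layers, 80 heads, a model dimension of 10240, a feed-forward dimension of 30720 (3×), GQA with 4 groups ( $10240/4 = 2560$ ), a vocabulary of 26 tokens, and RMSNorm, the parameter count is:

$\approx 96 \times 10^9 \approx 100B$

The formula is:

$Parameters = Embedding + layers*(Linear_{QKVO} + Linear_{ffn}+RMSNorm) + RMSNorm + Linear$

Reference: Computing the Parameter Count of Large Language (Multimodal) Models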
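The formula can be turned into a small calculator. The sketch below is only an estimate under explicit assumptions that the text does not confirm: no bias terms, an untied output layer, a three-matrix SwiGLU FFN, and two readings of the K/V projection width (2560 for the GQA reading, 10240 for full-width K/V). The function name is made up for this example.

```python
def transformer_params(n_layers: int, dim: int, ffn_dim: int, vocab: int, kv_dim: int) -> int:
    """Parameters = Embedding + layers * (QKVO + FFN + RMSNorm) + RMSNorm + Linear."""
    embedding = vocab * dim
    attn = 2 * dim * dim + 2 * dim * kv_dim  # Q and O at full width, K and V at kv_dim
    ffn = 3 * dim * ffn_dim                  # SwiGLU: W1, W3 and W2 (assumption)
    norms = 2 * dim                          # two RMSNorm weight vectors per layer
    final_norm = dim
    output_linear = vocab * dim              # untied output head (assumption)
    return embedding + n_layers * (attn + ffn + norms) + final_norm + output_linear


# Reading 1: GQA with K/V width 10240 / 4 = 2560
gqa_total = transformer_params(72, 10240, 30720, 26, kv_dim=2560)
# Reading 2: full-width K/V projections
mha_total = transformer_params(72, 10240, 30720, 26, kv_dim=10240)

print(f"GQA reading: {gqa_total / 1e9:.1f}B")
print(f"Full K/V reading: {mha_total / 1e9:.1f}B")
```

Under these assumptions the two readings give roughly 87B and 98B, which bracket the figure of about 96B quoted above; the exact number depends on how the K/V projections and the FFN are counted.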

2.4 DeepNorm (LayerNorm to RMSNorm)

DeepNorm + Post-LN (Post-LayerNorm) is used; note that Llama3 uses RMSNorm (Root Mean Square normalization):

$\begin{align} DeepNorm(x) &= LayerNorm(\alpha \cdot x + Network(x)) \\ LayerNorm(x) &= \gamma(\frac{x-\mu}{\sigma}) + \beta \\ RMSNorm(x) &= \gamma(\frac{x}{RMS(x)}) \\ \mu &= \frac{1}{n}\sum_{i=1}^{n}x_{i} \\ \sigma &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\mu)^{2}} \\ RMS(x) &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}x_{i}^{2}} \end{align}$

Here the scaling factor $\alpha$ is $(2N)^{\frac{1}{2}}$, where $N$ is the number of layers: the deeper the model, the larger the weight on the original input ( $x$ ). For example, $(2 \times 70)^{\frac{1}{2}}=11.83$. Re-weighting the residual connection in this way avoids unstable training in deep networks.

In Llama3, RMSNorm is applied twice inside every layer and once more on the final output:

final_embedding = token_embeddings_unnormalized  # output of the Embedding layer
for layer in tqdm(range(n_layers), "layers"):
    # first RMSNorm: input to QKV
    layer_embedding_norm = rms_norm(final_embedding, model[f"layers.{layer}.attention_norm.weight"])
    # ...
    # residual connection of Self-Attention: layer_embedding_norm -> embedding_delta
    embedding_after_edit = final_embedding + embedding_delta
    # second RMSNorm after QKVO: input to the FFN
    embedding_after_edit_normalized = rms_norm(embedding_after_edit, model[f"layers.{layer}.ffn_norm.weight"])
    # ...
    # residual connection of the FFN: embedding_after_edit_normalized -> output_after_feedforward
    final_embedding = embedding_after_edit + output_after_feedforward
# final RMSNorm on the output
final_embedding = rms_norm(final_embedding, model["norm.weight"])

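RMS gives more weight to the larger values in the data and is never smaller than the absolute value of the arithmetic mean.

Advantages of RMSNorm over LayerNorm:

LayerNorm has a scaling part ( $\sigma$ and $\gamma$ ) and a shifting part ( $\mu$ and $\beta$ ); RMSNorm drops the shifting part and keeps only the scaling part ( $RMS (x)$ and $\gamma$ ).
Studies suggest that the key to LayerNorm's success is re-scaling invariance rather than re-centering invariance.
Compared with LayerNorm, RMSNorm skips computing the mean and the shift coefficient, so training is faster while quality is roughly the same or even slightly better.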
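The normalization formulas of this section can be written out on plain Python lists for clarity. This is a sketch, not a production kernel: scalar `gamma` and `beta`, the small `eps` added for numerical safety, and the function names are assumptions that do not appear in the formulas.

```python
import math


def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    """LayerNorm: subtract the mean, divide by the standard deviation."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    return [gamma * (v - mu) / (sigma + eps) + beta for v in x]


def rms(x):
    """Root mean square of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def rms_norm(x, gamma=1.0, eps=1e-6):
    """RMSNorm: divide by the RMS only, with no mean subtraction and no shift."""
    r = rms(x)
    return [gamma * v / (r + eps) for v in x]


def deepnorm_alpha(n_layers: int) -> float:
    """Residual scaling factor alpha = (2N)^(1/2)."""
    return (2 * n_layers) ** 0.5


def deep_norm(x, sublayer_out, n_layers):
    """DeepNorm(x) = LayerNorm(alpha * x + Network(x))."""
    alpha = deepnorm_alpha(n_layers)
    return layer_norm([alpha * a + b for a, b in zip(x, sublayer_out)])


x = [1.0, 2.0, 3.0, 4.0]
print(layer_norm(x))
print(rms_norm(x))
print(round(deepnorm_alpha(70), 2), deepnorm_alpha(72))
```

The LayerNorm output is centered at zero, while the RMSNorm output keeps the sign and relative offset of the input and only rescales it to unit RMS.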

Reference: Implementing the LLaMA3 Network and Inference Pipeline from Scratch

Paper: DeepNet: Scaling Transformers to 1000 Layers

Figure: DeepNorm

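2.5 Mixed-Precision Strategy

Following the Apex O2 mixed-precision strategy:

Model weights and input data: cast to FP16;
Norm kept in FP32: for accuracy and performance, the weights of the Norm layers stay in FP32;
Master weights: a set of FP32 master weights is maintained, and the optimizer acts directly on them so that gradient updates stay precise;
Dynamic loss scaling: the loss scale is adjusted automatically to avoid gradient underflow.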
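Why loss scaling is needed can be illustrated with the standard library alone, since `struct` supports IEEE half precision through the `"e"` format. The sketch below is not Apex code; the gradient value, the initial scale of 65536, and the growth interval of 2000 steps are illustrative assumptions.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", value))[0]


tiny_grad = 1e-8          # a gradient smaller than the smallest FP16 subnormal
loss_scale = 65536.0      # 2**16, an assumed initial loss scale

unscaled = to_fp16(tiny_grad)               # underflows to zero in FP16
scaled = to_fp16(tiny_grad * loss_scale)    # representable after scaling
recovered = scaled / loss_scale             # unscale in higher precision


def update_loss_scale(scale, found_overflow, good_steps, growth_interval=2000):
    """Halve the scale on overflow; double it after growth_interval clean steps."""
    if found_overflow:
        return scale / 2.0, 0
    good_steps += 1
    if good_steps >= growth_interval:
        return scale * 2.0, 0
    return scale, good_steps


print(unscaled, recovered)
print(update_loss_scale(65536.0, True, 10))
```

The unscaled gradient is lost entirely in FP16, while the scaled one survives and can be divided back by the scale before the FP32 master weights are updated.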


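2.6 Embedding Layer Gradient Shrink (EGS)

Early in LLM training, the gradient norm of the embedding layer is usually several orders of magnitude larger than that of the other layers and fluctuates sharply. Such abnormal gradients can make training collapse: the loss suddenly becomes unstable or even diverges to infinity. The EGS (Embedding Layer Gradient Shrink) strategy suppresses these fluctuations by shrinking the gradient of the embedding layer.

Concretely, the embedding-layer gradient is scaled by a shrink factor $\alpha$ :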

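 word_embedding = word_embedding * alpha + word_embedding.detach() * (1 - alpha)

Here word_embedding.detach() is a copy of the embedding output detached from the computation graph, so no gradient flows through that term: the forward value is unchanged, while the gradient reaching the embedding layer is multiplied by alpha. See GLM.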
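The effect of this one-liner can be verified with a toy example in PyTorch. The embedding table size, the token ids, and the helper `embed` are made up for the illustration; only the shrink expression itself and $\alpha = 0.1$ come from the text.

```python
import torch


def embed(weight: torch.Tensor, token_ids: torch.Tensor, alpha=None) -> torch.Tensor:
    """Look up embeddings; optionally apply embedding gradient shrink."""
    emb = weight[token_ids]
    if alpha is not None:
        # Same forward value, but only the first term carries gradient.
        emb = emb * alpha + emb.detach() * (1 - alpha)
    return emb


torch.manual_seed(0)
weight = torch.nn.Parameter(torch.randn(26, 8))  # toy table: 26 tokens, width 8
ids = torch.tensor([1, 5, 7])

plain = embed(weight, ids)
shrunk = embed(weight, ids, alpha=0.1)

(grad_plain,) = torch.autograd.grad(plain.sum(), weight)
(grad_shrunk,) = torch.autograd.grad(shrunk.sum(), weight)

print(torch.allclose(plain, shrunk))                    # forward pass unchanged
print(torch.allclose(grad_shrunk, 0.1 * grad_plain))    # gradient scaled by alpha
```

The forward activations are identical in both cases, so the rest of the network sees no change; only the gradient flowing back into the embedding table is reduced.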

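2.7 AdamW Optimizer

The optimizer is AdamW. The parameter update (with bias correction omitted for brevity) is:
$\begin{align} \theta_{t+1} &= \theta_{t} - \frac{\alpha}{\sqrt{v_{t}}+\epsilon} m_{t} - \lambda\theta_{t} \\ m_{t} &= \beta_{1}m_{t-1} + (1-\beta_{1}) \nabla L(\theta_{t}) \\ v_{t} &= \beta_{2}v_{t-1} + (1-\beta_{2}) \nabla L(\theta_{t})^{2} \\ \end{align}$
Here $m_{t}$ is the first-moment estimate (mean), $v_{t}$ is the second-moment estimate (uncentered variance), and $\alpha$ is the learning rate.

There are 4 hyperparameters: $\beta_{1}$ is the first-moment decay rate (0.9), $\beta_{2}$ is the second-moment decay rate (0.95), $\epsilon$ is a small constant (on the order of $10^{-8}$ ), and $\lambda$ is the weight decay coefficient (0.01).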
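A single scalar update step following the formulas of this section can be sketched as below. It mirrors the simplified equations rather than any library implementation: the learning rate `lr=1e-4` and `eps=1e-8` are assumed values, and `torch.optim.AdamW` additionally applies bias correction and multiplies the decay term by the learning rate.

```python
import math


def adamw_step(theta, grad, m, v, lr=1e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.01):
    """One simplified AdamW step for a scalar parameter (no bias correction)."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad     # second-moment estimate
    # Adaptive gradient step plus decoupled weight decay.
    theta = theta - lr * m / (math.sqrt(v) + eps) - weight_decay * theta
    return theta, m, v


theta, m, v = adamw_step(theta=1.0, grad=0.5, m=0.0, v=0.0)
print(theta, m, v)
```

The weight decay acts directly on the parameter instead of being added to the gradient, which is the difference between AdamW and Adam with L2 regularization.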

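2.8 Gradient Norm Clipping

Gradient clipping by norm is used: if the norm of the gradient (usually the L2 norm) exceeds a set threshold, the gradient is scaled down proportionally so that its norm equals the threshold.

In PyTorch this can be done with torch.nn.utils.clip_grad_norm_. With the threshold set to $C = 1.0$, the L2 norm of the parameter gradients is capped at 1.0:

$\begin{align} ||G||_{2} &= \sqrt{g^{2}_{1} + g^{2}_{2} + ... + g^{2}_{n}} \\ g' &= \min\left(1, \frac{C}{||G||_{2}}\right) g \end{align}$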
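The clipping rule can be written in a few lines of plain Python. This sketch treats all gradients as one flat list and uses a made-up function name; `torch.nn.utils.clip_grad_norm_` applies the same idea across all parameter tensors of a model.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale gradients so that their global L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)

untouched, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(untouched)
```

Gradients whose norm is already below the threshold pass through unchanged, so clipping only intervenes on the occasional spike.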

--max_grad_norm MAX_GRAD_NORM, --max-grad-norm MAX_GRAD_NORM
    Set the maximum norm for gradient clipping, which is critical for preventing gradients from exploding during backpropagation. Default is 1.0.

Reference: LLM Fine Tuning Parameters

3. Miscellaneous

Source code for plotting SiLU, GeLU, and the GeLU approximation:

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn.functional as F


def draw(func_list, label_list):
    """Plot each activation function on a shared axis."""
    x = np.arange(-10, 10, 0.1)
    x_torch = torch.from_numpy(x)
    for func, label in zip(func_list, label_list):
        # The activations are element-wise, so evaluate the whole tensor at once.
        y = func(x_torch).numpy()
        plt.plot(x, y, label=label)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim(-7, 7)
    plt.ylim(-1, 7)
    plt.grid()
    plt.legend()
    plt.show()


def gelu(x):
    """Sigmoid approximation of GeLU: x * sigmoid(1.702 * x)."""
    return x * torch.sigmoid(1.702 * x)


def main():
    func_list = [F.silu, F.gelu, gelu]
    label_list = ["silu", "gelu", "gelu(sim)"]
    draw(func_list, label_list)


if __name__ == "__main__":
    main()