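Chinese Version
Data Parallelism, Model Parallelism, and Tensor Parallelism: Parallel Computing Strategies in Deep Learning
As deep learning models keep growing, the compute and memory capacity of a single device (for example, a single GPU) becomes the bottleneck that limits training efficiency and model size. To address this, the deep learning community has proposed several parallel computing strategies, including data parallelism, model parallelism, and tensor parallelism.
The core idea of these strategies is to distribute the workload across multiple devices to speed up training. They matter most when a model exceeds what a single device can handle, where they improve both training efficiency and scalability.
In this post we explain these strategies in plain terms, with simple example code and a few mathematical formulas to make the concepts concrete.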
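1. Data Parallelism
Data parallelism is one of the most common parallel training strategies. It speeds up training by splitting the dataset into mini-batches and assigning those mini-batches to different devices (such as different GPUs).
Key ideas of data parallelism:
- Model replication: every device holds a complete copy of the model.
- Data partitioning: the training data is split into mini-batches, and each device processes a different mini-batch.
- Gradient aggregation: the gradients computed on the devices are combined (usually averaged or summed), and the model parameters are updated in sync.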
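Suppose we have a dataset \( D = \{x_1, x_2, \dots, x_n\} \) and split it into \( P \) mini-batches, one for each of \( P \) GPUs. Every GPU runs the same model, computes the gradient for its own mini-batch, and the gradients are then combined to update the model parameters.
Mathematical formulation:
Each device \( i \) computes the loss \( L(\theta_i, x_i) \), where \( x_i \) now denotes the mini-batch assigned to device \( i \):
\[ L(\theta_i, x_i) = \text{Loss}(f(x_i; \theta_i)) \]
Here \( f(x_i; \theta_i) \) is the model output and \( \theta_i \) are the model parameters on device \( i \). Because every device holds the same replica, all \( \theta_i \) are equal before each update.
Once this is done, the gradients \( \nabla_{\theta_i} L \) computed on all devices are aggregated:
\[ \frac{1}{P} \sum_{i=1}^{P} \nabla_{\theta_i} L(\theta_i, x_i) \]
Every device then applies this averaged gradient, so the parameters on all devices stay in sync.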
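The averaging step can be checked numerically. The sketch below is an illustration rather than part of the original post: it assumes a one-parameter model `y = w * x`, a mean squared error loss, and equally sized shards, and it uses plain Python lists instead of PyTorch so it runs without dependencies. The names `grad_mse` and `split_evenly` and the data are hypothetical. Under these assumptions, averaging the per-shard gradients gives the same value as the gradient over the full batch.

```python
def grad_mse(w, xs, ys):
    """Gradient w.r.t. w of the mean squared error of the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def split_evenly(items, num_parts):
    """Split a list into num_parts contiguous shards of equal size."""
    size = len(items) // num_parts
    return [items[i * size:(i + 1) * size] for i in range(num_parts)]


# Hypothetical toy data: 8 samples, split across P = 4 simulated devices
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5
P = 4

# Gradient computed on the whole batch by a single device
full_grad = grad_mse(w, xs, ys)

# Each simulated device computes the gradient on its own shard
shard_grads = [
    grad_mse(w, shard_x, shard_y)
    for shard_x, shard_y in zip(split_evenly(xs, P), split_evenly(ys, P))
]

# Aggregation step: average the per-device gradients
avg_grad = sum(shard_grads) / P

print(full_grad, avg_grad)
```

The two printed values agree up to floating-point rounding. If the shards had different sizes, a plain average would no longer match and a size-weighted average would be needed.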
Example code:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DataParallel

# A simple neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(100, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Assume we have two GPUs
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleNN()

# Wrap the model in DataParallel for data-parallel training
model = DataParallel(model)  # splits each input batch across the visible GPUs
model.to(device)

# Dummy data
input_data = torch.randn(64, 100).to(device)  # one batch of data
target = torch.randn(64, 10).to(device)

# Loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
for epoch in range(10):
    optimizer.zero_grad()
    output = model(input_data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
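2. Model Parallelism
When a model is so large that it does not fit in the memory of a single device, model parallelism can be used. Unlike data parallelism, it places different parts of the model on different devices instead of splitting the data.
Key ideas of model parallelism:
- Model splitting: the model is divided into parts, and each part is placed on a different device.
- Inter-device communication: the parts exchange data through communication between devices.
Suppose the model contains two large layers \( L_1 \) and \( L_2 \). Because their combined parameters are too large to store on a single GPU, we place \( L_1 \) on GPU 1 and \( L_2 \) on GPU 2.
Mathematical formulation:
With two devices \( \text{GPU}_1 \) and \( \text{GPU}_2 \), the computation proceeds as follows:
- On \( \text{GPU}_1 \), compute the output of the first part, \( h_1 = f_1(x) \).
- Transfer \( h_1 \) to \( \text{GPU}_2 \) and compute the second part there, \( h_2 = f_2(h_1) \).
- The final output is \( y = h_2 \).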
Example code:
import torch
import torch.nn as nn

class Part1(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(100, 200)

    def forward(self, x):
        return torch.relu(self.fc1(x))

class Part2(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc2 = nn.Linear(200, 10)

    def forward(self, x):
        return self.fc2(x)

# This example requires at least two GPUs
device1 = torch.device("cuda:0")
device2 = torch.device("cuda:1")

model_part1 = Part1().to(device1)
model_part2 = Part2().to(device2)

# Dummy input data
input_data = torch.randn(64, 100).to(device1)

# Compute the first part on GPU 1
h1 = model_part1(input_data)

# Transfer the activations to GPU 2 and compute the second part
h1 = h1.to(device2)
output = model_part2(h1)
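3. Tensor Parallelism
Tensor parallelism is a fine-grained strategy: it splits the computation on a tensor into parts and runs those parts in parallel on multiple devices. It is usually combined with data parallelism and model parallelism.
Key ideas of tensor parallelism:
- Tensor splitting: large tensors in the model (for example, weight matrices) are split into smaller tensors and assigned to different devices.
- Parallel computation: each device computes its own part in parallel, and the results are then merged.
Suppose a large matrix \( A \) takes part in a matrix multiplication. We can split \( A \) into several smaller matrices, assign each one to a device, and compute in parallel.
Mathematical formulation:
Suppose we want to compute the matrix product \( C = A \cdot B \), where \( A \) is an \( m \times n \) matrix and \( B \) is an \( n \times p \) matrix.
In tensor parallelism we split \( A \) row-wise into submatrices \( A_1, A_2, \dots, A_k \) and assign each \( A_i \) to a different GPU. Each GPU computes \( C_i = A_i \cdot B \), and stacking the \( C_i \) along the row dimension gives \( C \):
\[ C = \begin{bmatrix} A_1 \cdot B \\ A_2 \cdot B \\ \vdots \\ A_k \cdot B \end{bmatrix} \]
The splitting and merging steps are described in detail at the end of this post.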
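Example code:
import torch
import torch.nn as nn

# Use two GPUs when available; otherwise fall back to the CPU so the sketch still runs
two_gpus = torch.cuda.device_count() >= 2
device1 = torch.device("cuda:0" if two_gpus else "cpu")  # GPU1
device2 = torch.device("cuda:1" if two_gpus else "cpu")  # GPU2

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # One logical Linear(100, 10) layer whose weight is split along the input dimension:
        # each device holds the weights for 50 of the 100 input features
        self.fc1 = nn.Linear(50, 10).to(device1)              # first half, on GPU1
        self.fc2 = nn.Linear(50, 10, bias=False).to(device2)  # second half, on GPU2 (bias is added only once)

    def forward(self, x):
        # Simulated tensor parallelism: split the [64, 100] input along the
        # feature dimension into two [64, 50] parts, one per device
        x1, x2 = x.chunk(2, dim=1)
        x1 = x1.to(device1)      # first part goes to GPU1
        x2 = x2.to(device2)      # second part goes to GPU2
        out1 = self.fc1(x1)      # partial result computed on GPU1
        out1 = out1.to(device2)  # move the partial result to GPU2
        out2 = self.fc2(x2)      # partial result computed on GPU2
        # Merge: the partial products over the two feature halves are summed
        out = out1 + out2
        return out

model = Model()  # each part was already placed on its device in __init__

# Simulated input: 64 samples with 100 features each
input_data = torch.randn(64, 100).to(device1)

# Forward pass; the final output lives on GPU2
output_data = model(input_data)
print(output_data)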
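Code walkthrough
- Splitting the input:
x.chunk(2, dim=1) splits the input (64 samples, 100 features each) into two parts. dim=1 means the split runs along the column (feature) dimension, so each sample's 100 features are divided into two halves of 50 features.
x1 and x2 are the two halves, and they are sent to different devices (GPU 1 and GPU 2) for computation.
- Model computation:
The first half x1 is placed on GPU1 and processed by fc1, and the result is then moved to GPU2.
The second half x2 is processed by fc2 directly on GPU2.
- Merging the results:
Here the two outputs are added together (out1 + out2) to form the final result. In practice the merge can be an addition, a concatenation (torch.cat()), and so on, depending on how the model is split.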
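Whether the merge is an addition or a concatenation follows from which dimension is split. As a hedged illustration (plain Python lists, hypothetical values, helper names `matmul` and `add` introduced here), the sketch below splits the inner (feature) dimension of a product `X · W`. In that case each device produces a partial product of the full output shape, and the partial products must be summed.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Hypothetical toy data: 2 samples with 4 features, mapped to 2 outputs
X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
W = [[1, 0],
     [0, 1],
     [2, 1],
     [1, 3]]

half = len(W) // 2
X1 = [row[:half] for row in X]  # first half of the features ("device 1")
X2 = [row[half:] for row in X]  # second half of the features ("device 2")
W1, W2 = W[:half], W[half:]     # matching halves of the weight rows

partial1 = matmul(X1, W1)       # computed on device 1
partial2 = matmul(X2, W2)       # computed on device 2
combined = add(partial1, partial2)

full = matmul(X, W)             # reference: unsplit product
print(combined == full)
```

Splitting along the output dimension instead would give each device a slice of the output columns, and the merge would then be a concatenation.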
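Summary
- Data parallelism: the data is distributed across devices, and each device runs the same model on different data.
- Model parallelism: the model is split into parts placed on different devices, and each device computes one part of the model.
- Tensor parallelism: large tensors are split into smaller tensors that are computed in parallel on different devices.
With these strategies we can address the compute and memory bottlenecks of training large models, improve training efficiency, and support larger models and datasets. We hope this post helps you understand these strategies and pick the right one when optimizing your own training.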
English Version
Data Parallelism, Model Parallelism, and Tensor Parallelism: Parallel Computing Strategies in Deep Learning
As deep learning models continue to grow in size, the computational and memory capacity of a single device (e.g., a single GPU) becomes a bottleneck for training large models. To address these challenges, the deep learning community has introduced various parallel computing strategies, including Data Parallelism, Model Parallelism, and Tensor Parallelism.
These parallel strategies aim to distribute the computation across multiple devices, allowing for faster training, especially when the model size exceeds the memory capacity of a single device. In this blog post, we will explain these parallelism strategies in simple terms, provide sample code for better understanding, and introduce mathematical formulas to clarify the concepts.
1. Data Parallelism
Data Parallelism is one of the most common parallel training strategies. It works by splitting the dataset into multiple smaller batches and distributing these batches across different computational devices (such as different GPUs) to accelerate the training process.
Key Idea of Data Parallelism:
- Model Replication: Each device has a complete replica of the model.
- Data Division: The training data is divided into smaller batches, and each device processes a different mini-batch.
- Gradient Aggregation: The gradients computed by each device are averaged (or summed) and used to update the model parameters.
Given a dataset \( D = \{x_1, x_2, \dots, x_n\} \), we split it into \( P \) mini-batches and assign each mini-batch to a different GPU. Each GPU runs the same model, computes the gradients for its mini-batch, and the gradients are aggregated to update the model parameters.
Mathematical Formula:
For each device \( i \), the loss \( L(\theta_i, x_i) \) is computed, where \( x_i \) now denotes the mini-batch assigned to device \( i \):
\[ L(\theta_i, x_i) = \text{Loss}(f(x_i; \theta_i)) \]
where \( f(x_i; \theta_i) \) is the model's output and \( \theta_i \) are the model parameters on device \( i \). Because every device holds the same replica, all \( \theta_i \) are equal before each update.
Afterward, the gradients \( \nabla_{\theta_i} L \) from all devices are aggregated:
\[ \frac{1}{P} \sum_{i=1}^{P} \nabla_{\theta_i} L(\theta_i, x_i) \]
This aggregated gradient is then used to update the model parameters on all devices, which keeps the replicas in sync.
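To see why the replicas stay identical, the loop can be simulated without any GPU. The sketch below is a toy illustration under stated assumptions: a one-parameter model `y = w * x`, two simulated devices, and a hypothetical `all_reduce_mean` helper standing in for the collective communication a real framework would perform.

```python
def grad_mse(w, xs, ys):
    """Gradient w.r.t. w of the mean squared error of the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every participant receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


# Hypothetical mini-batches, one (xs, ys) pair per simulated device; the data follows y = 2x
shards = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 4.0], [6.0, 8.0])]

replicas = [0.0, 0.0]  # every replica starts from the same parameter value
lr = 0.01

for step in range(100):
    # Each device computes a gradient on its own mini-batch
    local_grads = [grad_mse(w, xs, ys) for w, (xs, ys) in zip(replicas, shards)]
    # All devices receive the same averaged gradient
    synced_grads = all_reduce_mean(local_grads)
    # Each device applies the same update to its own replica
    replicas = [w - lr * g for w, g in zip(replicas, synced_grads)]

print(replicas)
```

Since every replica starts from the same value and applies the same averaged gradient, the two parameters remain equal at every step and both approach 2.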
Example Code:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DataParallel

# Simple neural network model
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(100, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Assume we have two GPUs
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleNN()

# Wrap the model for data parallelism
model = DataParallel(model)  # splits each input batch across the visible GPUs
model.to(device)

# Example data
input_data = torch.randn(64, 100).to(device)  # a batch of data
target = torch.randn(64, 10).to(device)

# Define loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
for epoch in range(10):
    optimizer.zero_grad()
    output = model(input_data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
2. Model Parallelism
When the model is too large to fit into the memory of a single device, Model Parallelism can be used. Unlike data parallelism, model parallelism splits the model itself into parts and assigns each part to a different device for computation.
Key Idea of Model Parallelism:
- Model Splitting: The model is divided into several parts, and each part is placed on a different device.
- Communication Between Devices: The different parts of the model communicate with each other during the forward and backward passes.
Assume we have a model with two large layers \( L_1 \) and \( L_2 \). If the two layers together are too large to fit on a single GPU, we place \( L_1 \) on GPU 1 and \( L_2 \) on GPU 2.
Mathematical Formula:
Assume we have two devices \( \text{GPU}_1 \) and \( \text{GPU}_2 \). The computation flow is as follows:
- On \( \text{GPU}_1 \), compute the output of the first part, \( h_1 = f_1(x) \).
- Transfer \( h_1 \) to \( \text{GPU}_2 \), where the second part is computed, \( h_2 = f_2(h_1) \).
- The final output is \( y = h_2 \).
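The steps above cover the forward pass. During training, communication also runs in the opposite direction: the second device sends the gradient with respect to \( h_1 \) back to the first device. The sketch below illustrates this with a scalar toy model; the stage functions, the squared-error loss, and all values are assumptions made for the example, and the devices are only simulated.

```python
def relu(v):
    return v if v > 0 else 0.0


def forward_loss(w1, w2, x, target):
    """Loss of the two-stage model y = w2 * relu(w1 * x) under squared error."""
    return (w2 * relu(w1 * x) - target) ** 2


x, target = 1.5, 1.0
w1 = 0.8   # parameter of stage 1, held on "GPU 1"
w2 = -0.5  # parameter of stage 2, held on "GPU 2"

# Forward pass
h1 = relu(w1 * x)          # stage 1 on GPU 1; h1 is then sent to GPU 2
y = w2 * h1                # stage 2 on GPU 2
loss = (y - target) ** 2

# Backward pass
dloss_dy = 2.0 * (y - target)
grad_w2 = dloss_dy * h1    # stays on GPU 2
dloss_dh1 = dloss_dy * w2  # sent back from GPU 2 to GPU 1
relu_slope = 1.0 if w1 * x > 0 else 0.0
grad_w1 = dloss_dh1 * relu_slope * x  # finished on GPU 1

print(grad_w1, grad_w2)
```

Each device only needs its own parameters plus the activation or gradient received from its neighbor, which is what makes the split possible.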
Example Code:
import torch
import torch.nn as nn

class Part1(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(100, 200)

    def forward(self, x):
        return torch.relu(self.fc1(x))

class Part2(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc2 = nn.Linear(200, 10)

    def forward(self, x):
        return self.fc2(x)

# This example requires at least two GPUs
device1 = torch.device("cuda:0")
device2 = torch.device("cuda:1")

model_part1 = Part1().to(device1)
model_part2 = Part2().to(device2)

# Example input data
input_data = torch.randn(64, 100).to(device1)

# Compute first part on GPU1
h1 = model_part1(input_data)

# Transfer to GPU2 and compute second part
h1 = h1.to(device2)
output = model_part2(h1)
3. Tensor Parallelism
Tensor Parallelism is a finer-grained parallel strategy that splits a large tensor (such as a weight matrix) into smaller chunks and distributes these chunks across multiple devices for parallel computation. Tensor parallelism is often used in conjunction with data and model parallelism to further enhance performance.
Key Idea of Tensor Parallelism:
- Tensor Splitting: Large tensors (e.g., weight matrices) are split into smaller chunks and assigned to different devices.
- Parallel Computation: Each device computes the part of the tensor assigned to it, and the results are combined.
For example, assume we need to compute a matrix multiplication \( C = A \cdot B \), where \( A \) is an \( m \times n \) matrix and \( B \) is an \( n \times p \) matrix. We can split \( A \) row-wise into smaller matrices \( A_1, A_2, \dots, A_k \) and assign each part to a different GPU for parallel computation.
Mathematical Formula:
Each GPU computes \( C_i = A_i \cdot B \) for its submatrix \( A_i \), and the partial results are stacked along the row dimension to form \( C \):
\[ C = \begin{bmatrix} A_1 \cdot B \\ A_2 \cdot B \\ \vdots \\ A_k \cdot B \end{bmatrix} \]
Example Code:
import torch
import torch.nn as nn

# Use two GPUs when available; otherwise fall back to the CPU so the sketch still runs
two_gpus = torch.cuda.device_count() >= 2
device1 = torch.device("cuda:0" if two_gpus else "cpu")  # GPU1
device2 = torch.device("cuda:1" if two_gpus else "cpu")  # GPU2

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # One logical Linear(100, 10) layer whose weight is split along the input dimension:
        # each device holds the weights for 50 of the 100 input features
        self.fc1 = nn.Linear(50, 10).to(device1)              # first half, on GPU1
        self.fc2 = nn.Linear(50, 10, bias=False).to(device2)  # second half, on GPU2 (bias is added only once)

    def forward(self, x):
        # Simulate tensor parallelism: split the [64, 100] input along the
        # feature dimension into two [64, 50] parts, one per device
        x1, x2 = x.chunk(2, dim=1)
        x1 = x1.to(device1)      # send the first part to GPU1
        x2 = x2.to(device2)      # send the second part to GPU2
        out1 = self.fc1(x1)      # partial result computed on GPU1
        out1 = out1.to(device2)  # move the partial result to GPU2
        out2 = self.fc2(x2)      # partial result computed on GPU2
        # Combine: the partial products over the two feature halves are summed
        out = out1 + out2
        return out

model = Model()  # each part was already placed on its device in __init__

# Simulated input: 64 samples, each with 100 features
input_data = torch.randn(64, 100).to(device1)

# Forward pass; the final output is on GPU2
output_data = model(input_data)
print(output_data)
Summary
- Data Parallelism: Distribute data across devices, each device computes gradients for its batch, and gradients are aggregated to update the model.
- Model Parallelism: Split the model into parts and place each part on a different device for computation.
- Tensor Parallelism: Split large tensors into smaller parts, and distribute the computation of these parts across devices.
By using these parallel strategies, we can effectively address the memory and computation bottlenecks during training, enabling the training of larger models and datasets. We hope this blog helps you understand these parallel computing strategies and choose the right approach to optimize your model training.
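Supplement: Matrix Splitting and Merging in Tensor Parallelism
The correct matrix splitting and computation flow in tensor parallelism
Suppose we have two matrices:
- Matrix \( A \) has shape \( m \times n \)
- Matrix \( B \) has shape \( n \times p \)
We want to compute the matrix product \( C = A \cdot B \), where \( C \) has shape \( m \times p \).
How the split is done in tensor parallelism
In tensor parallelism we split \( A \) into several submatrices and assign each submatrix to a different GPU.
How do we split matrix \( A \)?
If we split \( A \) into \( k \) submatrices \( A_1, A_2, \dots, A_k \) and assign them to \( k \) different GPUs, the usual choice is to split \( A \) along the row dimension. Each submatrix \( A_i \) then has shape \( (m/k) \times n \), assuming \( k \) divides \( m \).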
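The shape \( (m/k) \times n \) assumes that \( k \) divides \( m \). As a small sketch of one possible way to handle the general case (the helper `row_ranges` is hypothetical and not taken from any library), the first `m % k` shards can each take one extra row:

```python
def row_ranges(m, k):
    """Return k half-open (start, stop) row ranges covering rows 0..m as evenly as possible."""
    base, extra = divmod(m, k)
    ranges = []
    start = 0
    for i in range(k):
        # The first `extra` shards receive one additional row
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


print(row_ranges(6, 2))  # [(0, 3), (3, 6)]
print(row_ranges(7, 3))  # [(0, 3), (3, 5), (5, 7)]
```

With `m = 6` and `k = 2` this gives the even split used in the rest of this section.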
Steps of the matrix multiplication
- Shape of each \( A_i \) after the split: if \( A \) is split into \( k \) submatrices, each \( A_i \) has shape \( (m/k) \times n \) and is multiplied by the matrix \( B \) (shape \( n \times p \)).
- Matrix multiplication: multiplying each \( A_i \) (shape \( (m/k) \times n \)) by \( B \) (shape \( n \times p \)) gives a matrix \( C_i \) of shape \( (m/k) \times p \).
The result \( C_i \) computed on each GPU therefore has shape \( (m/k) \times p \).
Merging the submatrix results
Once all submatrix products are done, we have \( k \) matrices \( C_1, C_2, \dots, C_k \), each of shape \( (m/k) \times p \). They are concatenated along the row dimension to give the \( m \times p \) matrix \( C \).
An example
Suppose \( A \) has shape \( 6 \times 4 \) and \( B \) has shape \( 4 \times 3 \). We want to compute \( C = A \cdot B \), and \( C \) should have shape \( 6 \times 3 \).
Step 1: Split matrix \( A \)
Split \( A \) along the row dimension into two submatrices \( A_1 \) and \( A_2 \):
- \( A_1 \) has shape \( 3 \times 4 \)
- \( A_2 \) has shape \( 3 \times 4 \)
Step 2: Compute the product for each submatrix
Each submatrix is multiplied by \( B \), which gives:
- \( C_1 = A_1 \cdot B \), with shape \( 3 \times 3 \)
- \( C_2 = A_2 \cdot B \), with shape \( 3 \times 3 \)
Step 3: Merge the submatrices
Concatenate \( C_1 \) and \( C_2 \) along the row dimension to obtain the final matrix \( C \), with shape \( 6 \times 3 \):
\[ C = \begin{bmatrix} C_1[1, :] \\ C_1[2, :] \\ C_1[3, :] \\ C_2[1, :] \\ C_2[2, :] \\ C_2[3, :] \end{bmatrix} \]
The resulting matrix \( C \) has shape \( 6 \times 3 \), as expected.
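The three steps can be reproduced in a few lines. The sketch below uses plain Python lists and arbitrary integer entries chosen only for illustration; the two "GPUs" are simulated by two separate calls to a small `matmul` helper defined here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Arbitrary illustrative entries: A is 6 x 4, B is 4 x 3
A = [[r * 4 + c for c in range(4)] for r in range(6)]
B = [[(r + c) % 3 for c in range(3)] for r in range(4)]

# Step 1: split A along the row dimension into two 3 x 4 blocks
A1, A2 = A[:3], A[3:]

# Step 2: each simulated GPU multiplies its block by the full B
C1 = matmul(A1, B)  # 3 x 3
C2 = matmul(A2, B)  # 3 x 3

# Step 3: stack the results along the row dimension
# (for lists of rows, + concatenates the two lists)
C = C1 + C2         # 6 x 3

print(len(C), len(C[0]))
print(C == matmul(A, B))
```

The stacked result matches the product computed without any split, and no partial results are added together.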
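Summary
- Splitting matrix \( A \): in tensor parallelism we split \( A \) along the row dimension into several submatrices, each of shape \( (m/k) \times n \).
- Computing each submatrix: each \( A_i \) is multiplied by \( B \) (shape \( n \times p \)), which gives a matrix \( C_i \) of shape \( (m/k) \times p \).
- Merging the results: all partial results are concatenated along the row dimension, giving the matrix \( C \) of shape \( m \times p \).
In this way tensor parallelism distributes one large matrix computation across multiple GPUs, which improves efficiency while keeping the final result in the correct shape.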
Postscript
Completed in Shanghai at 15:33 on November 29, 2024, with the assistance of the GPT-4o large language model.