Torchtune on AMD GPUs How-To Guide: Fine-tuning and Scaling LLMs with Multi-GPU Power
This blog provides a detailed how-to guide on fine-tuning and scaling large language models (LLMs) on AMD GPUs with Torchtune. Torchtune is a PyTorch library designed to let you easily fine-tune and experiment with LLMs. Using Torchtune's flexibility and scalability, we show you how to fine-tune the Llama-3.1-8B model for a summarization task using the EdinburghNLP/xsum dataset. Through LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique, Torchtune enables efficient training while maintaining performance across different numbers of GPUs (2, 4, 6, and 8). This post also highlights Torchtune's distributed training capabilities, showing how users can scale LLM fine-tuning across multiple GPUs, reducing training time while maintaining the quality of the trained model and demonstrating Torchtune's potential and use with ROCm on modern AMD hardware.
For the files related to this blog post, see this GitHub folder.
Requirements

- AMD GPU: See the ROCm documentation page for supported hardware and operating systems.
- ROCm: See the ROCm installation for Linux page for installation instructions.
- Docker: See Install Docker Engine on Ubuntu for installation instructions.
- PyTorch 2.4 and Torchtune: We used Docker Compose to create a custom service for running and serving Torchtune models. For details on how the Torchtune service is created, see the corresponding `./torchtune/docker/docker-compose.yaml` and `./torchtune/docker/Dockerfile` files.
- Hugging Face access token: This blog requires a Hugging Face account and a newly generated User Access Token.
- Access to the Llama-3.1-8B model on Hugging Face. The `Llama-3.1` family of models is a gated model. To request access, see meta-llama/Meta-Llama-3.1-8B.
Following along with this blog

- Clone the repository and `cd` into the blog directory:

```
git clone https://github.com/ROCm/rocm-blogs.git
cd rocm-blogs/blogs/artificial-intelligence/torchtune
```

- Build and start the container. For details on the build process, see `./torchtune/docker/Dockerfile`:

```
cd docker
docker compose build
docker compose up
```

- Open http://localhost:8888/lab/tree/src/torchtune.ipynb and load the `torchtune.ipynb` notebook.

Use the `torchtune.ipynb` notebook to follow along with this blog.
Background: LoRA, Torchtune, and the EdinburghNLP/xsum dataset
In this blog we use Torchtune to fine-tune the Meta-Llama-3.1-8B-Instruct variant of the Llama-3.1 model family for an abstractive summarization task (summarizing content by paraphrasing and distilling its main ideas rather than extracting sentences verbatim). Llama-3.1, a large language model designed for general-purpose text generation, is well suited for abstractive summarization. To make the fine-tuning process more efficient, we use LoRA (Low-Rank Adaptation), which is particularly effective when computational resources are limited. To provide context for this blog, we briefly discuss LoRA, Torchtune, and the EdinburghNLP/xsum dataset that we use later when fine-tuning the Llama-3.1-8B large language model.
LoRA: Low-Rank Adaptation of Large Language Models
LoRA (Low-Rank Adaptation) is a technique commonly used to fine-tune large pre-trained models. The core idea behind LoRA is to reduce the number of parameters that need to be updated during fine-tuning by introducing low-rank matrices into the model architecture. Instead of updating all of a large model's parameters, LoRA inserts trainable, low-rank matrices that approximate the changes fine-tuning requires. Because only a small fraction of the model's parameters are adjusted, this greatly reduces computational cost and memory usage compared with regular fine-tuning. LoRA is especially useful for quickly adapting models to new tasks when computational resources are limited or full retraining is impractical due to time constraints. For more information about LoRA, see the paper LoRA: Low-Rank Adaptation of Large Language Models, as well as the blogs Using LoRA for efficient fine-tuning: Fundamental principles and Fine-tune Llama 2 with LoRA: Customizing a large language model for question answering.
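To make the idea concrete, here is a minimal, self-contained PyTorch sketch of a LoRA layer (illustrative only, not Torchtune's implementation): a frozen linear layer is augmented with a trainable low-rank update scaled by `alpha / rank`, mirroring the `lora_rank: 8` and `lora_alpha: 16` values that appear later in our config.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (B @ A)."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad = False  # the pretrained weight stays frozen
        self.lora_a = nn.Parameter(torch.randn(rank, in_dim) * 0.01)  # "A": rank x in_dim
        self.lora_b = nn.Parameter(torch.zeros(out_dim, rank))        # "B": out_dim x rank, zero-init
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Since B is zero-initialized, the update starts at zero: training begins
        # from the pretrained behavior, and only the low-rank factors move.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(4096, 4096, rank=8, alpha=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"Trainable: {trainable:,} of {total:,} parameters ({100 * trainable / total:.2f}%)")
```

With rank 8 on a 4096×4096 projection, fewer than 0.4% of the layer's parameters are trainable, which is where LoRA's memory and compute savings come from.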
While LoRA optimizes resource usage during fine-tuning, Torchtune helps streamline the overall fine-tuning workflow, keeping model training efficient and scalable.
Torchtune concepts: configs and recipes
Torchtune is a PyTorch library designed for fine-tuning and experimenting with large language models (LLMs). It provides modular, native-PyTorch implementations of popular LLMs, training recipes for different fine-tuning techniques, and integration with Hugging Face datasets for training. It also supports distributed training with Fully Sharded Data Parallel (FSDP) and offers YAML configs for easily setting up training runs, among other features.
Torchtune's main concepts are configs and recipes:
- Configs: YAML files that let users configure training settings, such as the dataset, model, checkpoints, and hyperparameters, without modifying code.
- Recipes: Think of these as end-to-end pipelines for training and (optionally) evaluating LLMs. Each recipe implements a particular training method and includes a set of features that apply to a specific model family.
Torchtune provides a command-line interface (CLI). The Torchtune CLI is designed to simplify LLM fine-tuning. It offers a way to interact with the Torchtune library, letting users download models, manage configs, and run training recipes from the command line. Its main features include easily listing and copying prebuilt fine-tuning recipes, running training jobs with custom configs, and validating that config files are correctly formatted.
Let's use the Torchtune CLI to view all the available built-in recipes. The `tune ls` command prints out all recipes and their corresponding configs.
```
! tune ls
```
```
RECIPE                        CONFIG
full_finetune_single_device   llama2/7B_full_low_memory
                              code_llama2/7B_full_low_memory
                              llama3/8B_full_single_device
                              llama3_1/8B_full_single_device
                              mistral/7B_full_low_memory
                              phi3/mini_full_low_memory
full_finetune_distributed     llama2/7B_full
                              llama2/13B_full
                              llama3/8B_full
                              llama3_1/8B_full
                              llama3/70B_full
                              llama3_1/70B_full
                              mistral/7B_full
                              gemma/2B_full
                              gemma/7B_full
                              phi3/mini_full
lora_finetune_single_device   llama2/7B_lora_single_device
                              llama2/7B_qlora_single_device
                              code_llama2/7B_lora_single_device
                              code_llama2/7B_qlora_single_device
                              llama3/8B_lora_single_device
                              llama3_1/8B_lora_single_device
                              llama3/8B_qlora_single_device
                              llama3_1/8B_qlora_single_device
                              llama2/13B_qlora_single_device
                              mistral/7B_lora_single_device
                              mistral/7B_qlora_single_device
                              gemma/2B_lora_single_device
                              gemma/2B_qlora_single_device
                              gemma/7B_lora_single_device
                              gemma/7B_qlora_single_device
                              phi3/mini_lora_single_device
                              phi3/mini_qlora_single_device
lora_dpo_single_device        llama2/7B_lora_dpo_single_device
lora_dpo_distributed          llama2/7B_lora_dpo
lora_finetune_distributed     llama2/7B_lora
                              llama2/13B_lora
                              llama2/70B_lora
                              llama3/70B_lora
                              llama3_1/70B_lora
                              llama3/8B_lora
                              llama3_1/8B_lora
                              mistral/7B_lora
                              gemma/2B_lora
                              gemma/7B_lora
                              phi3/mini_lora
lora_finetune_fsdp2           llama2/7B_lora
                              llama2/13B_lora
                              llama2/70B_lora
                              llama2/7B_qlora
                              llama2/70B_qlora
generate                      generation
eleuther_eval                 eleuther_evaluation
quantize                      quantization
qat_distributed               llama2/7B_qat_full
                              llama3/8B_qat_full
```
The `tune ls` command lists all the built-in fine-tuning recipes and configs in the Torchtune CLI, producing detailed output with the recipe names and their corresponding configs.
Now that we have covered Torchtune as a fine-tuning framework, the next task is to select a suitable dataset for fine-tuning Llama-3.1-8B on the summarization task.
The EdinburghNLP/xsum dataset and summarization approaches
The EdinburghNLP/xsum dataset, or Extreme Summarization (XSum) Dataset, is a collection of BBC news articles designed specifically for training and evaluating abstractive summarization models. The dataset was introduced in the paper “Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization”. Collected between 2010 and 2017, it contains about 226,000 BBC news articles covering topics such as politics, sports, business, and technology. The dataset is divided into a training set (204,000 articles), a validation set (11,300 articles), and a test set (11,300 articles).
Let's select the first 1% of the training split and explore the first element of the `EdinburghNLP/xsum` dataset:
```python
import datasets

summarization_dataset = datasets.load_dataset('EdinburghNLP/xsum', trust_remote_code=True, split="train[:1%]")
summarization_dataset
```
```
Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 2040
})
```
The dataset consists of three features: `document`, `summary`, and `id`. The `document` feature contains the actual BBC news article, `summary` is its concise summary, and `id` is the instance's identification number. Explore the first training example with the following commands:
```python
print(f"\nDocument:\n{summarization_dataset['document'][0]}")
print(f"\nSummary:\n{summarization_dataset['summary'][0]}")
```
```
Document:
The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed. Repair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water. Trains on the west coast mainline face disruption due to damage at the Lamington Viaduct. Many businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town. First Minister Nicola Sturgeon visited the area to inspect the damage. The waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare. Jeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit. However, she said more preventative work could have been carried out to ensure the retaining wall did not fail. "It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that - but it is almost like we're neglected or forgotten," she said. "That may not be true but it is perhaps my perspective over the last few days. "Why were you not ready to help us a bit more when the warning and the alarm alerts had gone out?" Meanwhile, a flood alert remains in place across the Borders because of the constant rain. Peebles was badly hit by problems, sparking calls to introduce more defences in the area. Scottish Borders Council has put a list on its website of the roads worst affected and drivers have been urged not to ignore closure signs. The Labour Party's deputy Scottish leader Alex Rowley was in Hawick on Monday to see the situation first hand. He said it was important to get the flood protection plan right but backed calls to speed up the process. "I was quite taken aback by the amount of damage that has been done," he said. "Obviously it is heart-breaking for people who have been forced out of their homes and the impact on businesses." He said it was important that "immediate steps" were taken to protect the areas most vulnerable and a clear timetable put in place for flood prevention plans. Have you been affected by flooding in Dumfries and Galloway or the Borders? Tell us about your experience of the situation and how it was handled. Email us on selkirk.news@bbc.co.uk or dumfries@bbc.co.uk.

Summary:
Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.
```
Each article in the dataset is paired with a one-sentence summary, making this dataset well suited for fine-tuning models that generate concise and informative summaries.
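Because the articles are much longer than their summaries, it is worth checking typical lengths before choosing a maximum sequence length for training. A quick, illustrative check on the 1% split loaded above:

```python
# Compare article and summary lengths (in words) on the 1% training split
# loaded earlier; documents are far longer than their one-sentence summaries.
doc_lens = [len(doc.split()) for doc in summarization_dataset['document']]
sum_lens = [len(summ.split()) for summ in summarization_dataset['summary']]

print(f"Average document length: {sum(doc_lens) / len(doc_lens):.0f} words")
print(f"Average summary length:  {sum(sum_lens) / len(sum_lens):.0f} words")
print(f"Longest document:        {max(doc_lens)} words")
```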
When fine-tuning a model for a summarization task, there are two main approaches, extractive summarization and abstractive summarization, the latter being the approach used in this blog (see the short illustration after this list):
- Extractive summarization: This approach creates a summary by selecting key sentences, phrases, or paragraphs directly from the source text. It focuses on the most important parts of the text without altering the original wording.
- Abstractive summarization: In this type of summarization, the model generates new sentences that convey the essence of the original text. It involves paraphrasing or restructuring the information to create a more coherent and concise summary. The goal is to produce shorter summaries that are easier to understand.
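To make the distinction concrete, the snippet below contrasts a trivial extractive baseline (copying the article's first sentence, often called lead-1) with the dataset's abstractive reference summary for the example explored earlier; note how the reference rephrases rather than quotes:

```python
# Extractive baseline: copy the first sentence of the article verbatim.
first_document = summarization_dataset['document'][0]
lead_1 = first_document.split('. ')[0] + '.'
print(f"Extractive (lead-1):     {lead_1}")

# Abstractive reference: a new sentence that rephrases the article's essence.
print(f"Abstractive (reference): {summarization_dataset['summary'][0]}")
```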
In the next section we give an overview of the different Llama-3.1 model variants and walk through the process of fine-tuning the Llama-3.1-8B model with Torchtune. We use the `lora_finetune_distributed` recipe together with the `llama3_1/8B_lora` config.
Fine-tuning Llama-3.1-8B with LoRA and distributed training
Llama-3.1 is one of Meta's families of large language models. It includes models with 8 billion, 70 billion, and 405 billion parameters, focused on balancing efficiency and performance across text generation, reasoning, and multilingual tasks. Based on parameter count, the Llama-3.1 family of models consists of:
- Llama 3.1-8B: An 8-billion-parameter model. It offers improved reasoning capabilities, better benchmark performance, and multilingual support for eight languages. It is a general-purpose model, well suited for tasks that require high accuracy without excessive computational resources.
- Llama 3.1-70B: With 70 billion parameters, this model targets high-end applications such as large-scale content generation and more sophisticated conversational agents. It excels at demanding tasks while remaining efficient enough to run on most server machines.
- Llama 3.1-405B: The largest and most capable model in the Llama family, with 405 billion parameters. It targets applications such as large-scale research, highly complex reasoning, and broad multilingual support. Given its parameter count, it is designed to run on server-grade nodes.
Note
Llama-3.1 is a gated model on Hugging Face. To learn more about gated models, see Gated models, and request access to the meta-llama/Meta-Llama-3.1-8B-Instruct model on Hugging Face.
The following command downloads the `meta-llama/Meta-Llama-3.1-8B-Instruct` model from Hugging Face to the local directory `/tmp/Meta-Llama-3.1-8B-Instruct`. The `Meta-Llama-3.1-8B-Instruct` model is a specialized version of the 8B model, fine-tuned for instruction-following tasks. The command also requires a Hugging Face token to access the model.
```
! tune download meta-llama/Meta-Llama-3.1-8B-Instruct --output-dir /tmp/Meta-Llama-3.1-8B-Instruct --ignore-patterns "original/consolidated.00.pth" --hf-token <YOUR_HF_TOKEN>
```
The `tune run` command is used to launch fine-tuning recipes for large language models. It lets us start training using predefined or custom recipes and configs. For our task of fine-tuning Llama-3.1-8B with LoRA, the appropriate recipe is `lora_finetune_distributed`.
After selecting the recipe, the necessary configuration parameters must be supplied. There are two ways to pass these parameters to the `tune run` command: through a config file or through command-line overrides.
Fine-tuning using a config file
Torchtune primarily uses YAML config files to specify all the parameters needed for fine-tuning. To pass custom parameter values, we can use the `tune cp` command to copy the config file for the Llama-3.1-8B model:
```
! tune cp llama3_1/8B_lora my_llama3_1_custom_config.yaml
```
The `tune cp` command copies the existing `llama3_1/8B_lora` config into the local working directory under the name `my_llama3_1_custom_config.yaml`. Customization is done by modifying the configuration parameters in this copy, such as the model's hyperparameters, the dataset, model settings, or any other configuration the specific fine-tuning task requires.
The contents of the `my_llama3_1_custom_config.yaml` file are as follows:
```yaml
# Config for multi-device LoRA finetuning in lora_finetune_distributed.py
# using a Llama-3.1-8B Instruct model
#
# This config assumes that you've run the following command before launching
# this run:
#   tune download meta-llama/Meta-Llama-3.1-8B-Instruct --output-dir /tmp/Meta-Llama-3.1-8B-Instruct --ignore-patterns "original/consolidated.00.pth"
#
# To launch on 2 devices, run the following command from root:
#   tune run --nproc_per_node 2 lora_finetune_distributed --config llama3_1/8B_lora
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
#   tune run --nproc_per_node 2 lora_finetune_distributed --config llama3_1/8B_lora checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works best when the model is being fine-tuned on 2+ GPUs.
# For single device LoRA finetuning please use 8B_lora_single_device.yaml
# or 8B_qlora_single_device.yaml

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3.1-8B-Instruct/original/tokenizer.model

# Model Arguments
model:
  _component_: torchtune.models.llama3_1.lora_llama3_1_8b
  lora_attn_modules: ['q_proj', 'v_proj']
  apply_lora_to_mlp: False
  apply_lora_to_output: False
  lora_rank: 8
  lora_alpha: 16

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Meta-Llama-3.1-8B-Instruct/
  checkpoint_files: [
    model-00001-of-00004.safetensors,
    model-00002-of-00004.safetensors,
    model-00003-of-00004.safetensors,
    model-00004-of-00004.safetensors
  ]
  recipe_checkpoint: null
  output_dir: /tmp/Meta-Llama-3.1-8B-Instruct/
  model_type: LLAMA3
resume_from_checkpoint: False

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset
seed: null
shuffle: True
batch_size: 2

# Optimizer and Scheduler
optimizer:
  _component_: torch.optim.AdamW
  weight_decay: 0.01
  lr: 3e-4
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
loss:
  _component_: torch.nn.CrossEntropyLoss

# Training
epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 32

# Logging
output_dir: /tmp/lora_finetune_output
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: ${output_dir}
log_every_n_steps: 1
log_peak_memory_stats: False

# Environment
device: cuda
dtype: bf16
enable_activation_checkpointing: False
```
When fine-tuning a large language model with Torchtune, one of the key components is the dataset used in the training process. Torchtune provides the `datasets` module, a set of utilities that help prepare and work with a variety of datasets. For more information on configuring datasets, see Configuring datasets for fine-tuning.
Torchtune offers several built-in datasets for a range of fine-tuning tasks, including instruction following, chat interactions, and text completion. These datasets are designed to integrate with Torchtune's fine-tuning workflows, making it easier to train large language models for specific tasks. In addition, Torchtune integrates with Hugging Face's datasets, providing access to a large collection of existing datasets on the Hugging Face platform. For more information on Torchtune datasets and dataset builders, see torchtune.datasets.
When configuring Torchtune to use a Hugging Face dataset, we need to use one of the generic dataset builders from the Torchtune datasets module (`torchtune.datasets`).
In the specific case of the `EdinburghNLP/xsum` dataset, we need the `instruct_dataset` builder. The `instruct_dataset` builder takes the following parameters (a sketch of the equivalent Python call follows the list):
- `tokenizer`: The tokenizer used by the model.
- `source`: The path string of the dataset. This can be anything supported by Hugging Face's `load_dataset` class.
- `column_map`: An optional argument that maps the placeholder names expected by the template to the column/key names in the dataset samples.
- `split`: A Hugging Face dataset argument that defines the split used during training. It allows selecting a specific percentage or a fixed number of instances from the dataset.
- `max_seq_len`: The maximum sequence length (in tokens) for the input and output sequences. If set, any sequence exceeding this limit is truncated.
- `template`: The template used to format the instructions in the dataset. For more information on the available templates, see the Torchtune instruct templates.
- `train_on_input`: Whether the model is trained on the user prompt. In other words, it controls whether the model learns from the user prompt together with the response, or from the response only. For more information, see the train_on_inputs explanation.
- `trust_remote_code`: A Hugging Face datasets argument that allows executing remote code for certain datasets.
- `epochs`: The number of training epochs.
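For reference, a rough sketch of what an equivalent Python call might look like. This assumes the `instruct_dataset` builder and the `llama3_tokenizer` builder accept these arguments as keywords; check the Torchtune API documentation for the exact signatures of your installed version:

```python
# Illustrative only: building the xsum instruct dataset in Python, mirroring
# the YAML fields used below. Exact signatures may differ across Torchtune versions.
from torchtune.datasets import instruct_dataset
from torchtune.models.llama3 import llama3_tokenizer

tokenizer = llama3_tokenizer("/tmp/Meta-Llama-3.1-8B-Instruct/original/tokenizer.model")

xsum_dataset = instruct_dataset(
    tokenizer=tokenizer,
    source="EdinburghNLP/xsum",
    template="torchtune.data.SummarizeTemplate",
    column_map={"dialogue": "document", "output": "summary"},
    split="train[:2000]",
    max_seq_len=2048,
    train_on_input=False,
    trust_remote_code=True,  # forwarded to Hugging Face load_dataset
)
```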
When the `my_llama3_1_custom_config.yaml` copy is created, most of the parameter values are already configured. Below are the parameters and values we need to modify in the YAML file:
```yaml
dataset:
  _component_: torchtune.datasets.instruct_dataset
  column_map:
    dialogue: document
    output: summary
  source: EdinburghNLP/xsum
  split: train[:2000]
  max_seq_len: 2048
  template: torchtune.data.SummarizeTemplate
  train_on_input: false
  trust_remote_code: true
epochs: 10
```
We have assigned values for the column map, the dataset source label, the portion of the training split, the maximum sequence length, and the number of training epochs. These values are, respectively, `document`, `summary`, `EdinburghNLP/xsum`, `train[:2000]`, `2048`, and `10`. The last three values are for illustration only and can be adjusted as needed to fully fine-tune the model. The complete `my_llama3_1_custom_config.yaml` file is as follows:
```yaml
# Config for multi-device LoRA finetuning in lora_finetune_distributed.py
# using a Llama-3.1-8B Instruct model
#
# This config assumes that you've run the following command before launching
# this run:
#   tune download meta-llama/Meta-Llama-3.1-8B-Instruct --output-dir /tmp/Meta-Llama-3.1-8B-Instruct --ignore-patterns "original/consolidated.00.pth"
#
# To launch on 2 devices, run the following command from root:
#   tune run --nproc_per_node 2 lora_finetune_distributed --config llama3_1/8B_lora
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
#   tune run --nproc_per_node 2 lora_finetune_distributed --config llama3_1/8B_lora checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works best when the model is being fine-tuned on 2+ GPUs.
# For single device LoRA finetuning please use 8B_lora_single_device.yaml
# or 8B_qlora_single_device.yaml

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3.1-8B-Instruct/original/tokenizer.model

# Model Arguments
model:
  _component_: torchtune.models.llama3_1.lora_llama3_1_8b
  lora_attn_modules: ['q_proj', 'v_proj']
  apply_lora_to_mlp: False
  apply_lora_to_output: False
  lora_rank: 8
  lora_alpha: 16

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Meta-Llama-3.1-8B-Instruct/
  checkpoint_files: [
    model-00001-of-00004.safetensors,
    model-00002-of-00004.safetensors,
    model-00003-of-00004.safetensors,
    model-00004-of-00004.safetensors
  ]
  recipe_checkpoint: null
  output_dir: /tmp/Meta-Llama-3.1-8B-Instruct/
  model_type: LLAMA3
resume_from_checkpoint: False

# Dataset and Sampler. We have set up the parameters for the custom EdinburghNLP/xsum dataset
dataset:
  _component_: torchtune.datasets.instruct_dataset
  column_map:
    dialogue: document
    output: summary
  source: EdinburghNLP/xsum
  split: train[:2000]
  max_seq_len: 2048
  template: torchtune.data.SummarizeTemplate
  train_on_input: false
  trust_remote_code: true
seed: null
shuffle: True
batch_size: 2

# Optimizer and Scheduler
optimizer:
  _component_: torch.optim.AdamW
  weight_decay: 0.01
  lr: 3e-4
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
loss:
  _component_: torch.nn.CrossEntropyLoss

# Training. Updated to finetune for 10 epochs
epochs: 10
max_steps_per_epoch: null
gradient_accumulation_steps: 32

# Logging
output_dir: /tmp/lora_finetune_output
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: ${output_dir}
log_every_n_steps: 1
log_peak_memory_stats: False

# Environment
device: cuda
dtype: bf16
enable_activation_checkpointing: False
```
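As an alternative to hand-editing, the dataset and epoch overrides can be applied to the copied config programmatically; a sketch using PyYAML, which we assume is available in the container:

```python
# Illustrative alternative: apply the dataset and epoch overrides to the copied
# config with PyYAML instead of editing the file by hand.
import yaml

with open("my_llama3_1_custom_config.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["dataset"] = {
    "_component_": "torchtune.datasets.instruct_dataset",
    "column_map": {"dialogue": "document", "output": "summary"},
    "source": "EdinburghNLP/xsum",
    "split": "train[:2000]",
    "max_seq_len": 2048,
    "template": "torchtune.data.SummarizeTemplate",
    "train_on_input": False,
    "trust_remote_code": True,
}
cfg["epochs"] = 10

with open("my_llama3_1_custom_config.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```

Note that a plain YAML round-trip drops the comments in the file, so hand-editing is preferable when you want to keep them.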
To launch the fine-tuning process, run the following command:
```
%%time
! tune run --nproc_per_node 8 lora_finetune_distributed --config my_llama3_1_custom_config.yaml
```
The fine-tuning process starts and displays output similar to the following:
```
Running with torchrun...
W0814 16:18:44.070000 140416268592960 torch/distributed/run.py:778]
W0814 16:18:44.070000 140416268592960 torch/distributed/run.py:778] *****************************************
W0814 16:18:44.070000 140416268592960 torch/distributed/run.py:778] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0814 16:18:44.070000 140416268592960 torch/distributed/run.py:778] *****************************************
INFO:torchtune.utils.logging:Running LoRAFinetuneRecipeDistributed with resolved config:
...
INFO:torchtune.utils.logging: Profiler config after instantiation: {'enabled': False}
1|3|Loss: 2.6132729053497314: 100%|██████████████| 3/3 [05:31<00:00, 110.20s/it]
INFO:torchtune.utils.logging:Model checkpoint of size 4.98 GB saved to /tmp/Meta-Llama-3.1-8B-Instruct/hf_model_0001_0.pt
...
10|29|Loss: 1.33621084690094: 100%|██████████████| 3/3 [05:26<00:00, 107.99s/it]
10|30|Loss: 1.2566407918930054: 100%|████████████| 3/3 [05:26<00:00, 107.99s/it]
INFO:torchtune.utils.logging:Model checkpoint of size 4.98 GB saved to /tmp/Meta-Llama-3.1-8B-Instruct/hf_model_0001_9.pt
INFO:torchtune.utils.logging:Model checkpoint of size 5.00 GB saved to /tmp/Meta-Llama-3.1-8B-Instruct/hf_model_0002_9.pt
INFO:torchtune.utils.logging:Model checkpoint of size 4.92 GB saved to /tmp/Meta-Llama-3.1-8B-Instruct/hf_model_0003_9.pt
INFO:torchtune.utils.logging:Model checkpoint of size 1.17 GB saved to /tmp/Meta-Llama-3.1-8B-Instruct/hf_model_0004_9.pt
INFO:torchtune.utils.logging:Adapter checkpoint of size 0.01 GB saved to /tmp/Meta-Llama-3.1-8B-Instruct/adapter_9.pt
INFO:torchtune.utils.logging:Adapter checkpoint of size 0.01 GB saved to /tmp/Meta-Llama-3.1-8B-Instruct/adapter_model.bin
INFO:torchtune.utils.logging:Adapter checkpoint of size 0.00 GB saved to /tmp/Meta-Llama-3.1-8B-Instruct/adapter_config.json
10|30|Loss: 1.2566407918930054: 100%|████████████| 3/3 [07:59<00:00, 159.98s/it]
CPU times: user 24.9 s, sys: 6.56 s, total: 31.5 s
Wall time: 1h 20min 44s
```
In the command above, we passed the recipe label `lora_finetune_distributed` and the config file `my_llama3_1_custom_config.yaml`. We also set the `nproc_per_node` parameter to 8. Because the recipe uses distributed training (`lora_finetune_distributed`), we need to specify the number of GPUs to train on within a single node. Each GPU runs a separate process, enabling parallel and more efficient training, especially when working with large models. The `nproc_per_node` parameter not only sets the number of GPUs used in distributed training but also allows the training process to scale across multiple GPUs. For a dataset with 2,000 training instances and a maximum sequence length of 2,048 tokens, the fine-tuning process took about 1.5 hours.
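As a sanity check on these numbers, the effective global batch size follows directly from the config values and the GPU count; a small illustrative calculation:

```python
# Effective global batch size for this run, derived from the config values
# (batch_size: 2, gradient_accumulation_steps: 32) and --nproc_per_node 8.
batch_size = 2                    # per-GPU micro-batch size
gradient_accumulation_steps = 32  # optimizer step every 32 micro-batches
num_gpus = 8                      # processes launched by --nproc_per_node

effective_batch_size = batch_size * gradient_accumulation_steps * num_gpus
print(f"Effective global batch size: {effective_batch_size}")  # 512 samples per optimizer step
```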
Using a YAML config file is a clear and structured way to define the parameters and settings of a training run, making configurations easier to maintain, share, and reuse across experiments and teams. Sometimes, however, a more flexible approach is needed; this is where command-line overrides come in.
Fine-tuning using command-line overrides
Command-line overrides are useful for making quick configuration changes without modifying the YAML file. This approach is particularly handy during experimentation, for adjusting parameters such as the learning rate, batch size, and number of epochs, or for using a different training dataset.
To reproduce, on the command line, the same scenario handled by the `my_llama3_1_custom_config.yaml` file, we can run the following command:
```
%%time
! tune run --nproc_per_node 8 lora_finetune_distributed --config llama3_1/8B_lora \
  dataset=torchtune.datasets.instruct_dataset \
  dataset.source=EdinburghNLP/xsum \
  dataset.split=train[:2000] \
  dataset.max_seq_len=2048 \
  dataset.template=torchtune.data.SummarizeTemplate \
  dataset.column_map.dialogue=document \
  dataset.column_map.output=summary \
  dataset.trust_remote_code=True \
  epochs=10
```
Here we override the default parameter values by passing them as additional arguments to the `tune run` command.
Inference using the fine-tuned model
To run inference with the fine-tuned Llama-3.1-8B model using Torchtune, we need to modify the default generation config. Similar to the process of creating a custom config file, we create a custom generation config by running the following command:
```
! tune cp generation ./my_llama3_1_custom_generation_config.yaml
```
This command generates a new YAML config file. To use it, we need to update the file with the paths to the model checkpoints and the tokenizer. The required checkpoints are listed at the end of the fine-tuning process:
```
10|30|Loss: 1.2566407918930054: 100%|████████████| 3/3 [05:26<00:00, 107.99s/it]
INFO:torchtune.utils.logging:Model checkpoint of size 4.98 GB saved to /tmp/Meta-Llama-3.1-8B-Instruct/hf_model_0001_9.pt
INFO:torchtune.utils.logging:Model checkpoint of size 5.00 GB saved to /tmp/Meta-Llama-3.1-8B-Instruct/hf_model_0002_9.pt
INFO:torchtune.utils.logging:Model checkpoint of size 4.92 GB saved to /tmp/Meta-Llama-3.1-8B-Instruct/hf_model_0003_9.pt
INFO:torchtune.utils.logging:Model checkpoint of size 1.17 GB saved to /tmp/Meta-Llama-3.1-8B-Instruct/hf_model_0004_9.pt
```
In other words, the `checkpointer` parameter in the `my_llama3_1_custom_generation_config.yaml` file will look like this:
```yaml
checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Meta-Llama-3.1-8B-Instruct/
  checkpoint_files: [
    hf_model_0001_9.pt,
    hf_model_0002_9.pt,
    hf_model_0003_9.pt,
    hf_model_0004_9.pt,
  ]
```
The complete `my_llama3_1_custom_generation_config.yaml` YAML file looks like this:
```yaml
# Config for running the InferenceRecipe in generate.py to generate output from an LLM
#
# To launch, run the following command from root torchtune directory:
#   tune run generate --config generation

# Model arguments
model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Meta-Llama-3.1-8B-Instruct/
  checkpoint_files: [
    hf_model_0001_9.pt,
    hf_model_0002_9.pt,
    hf_model_0003_9.pt,
    hf_model_0004_9.pt,
  ]
  output_dir: /tmp/Meta-Llama-3.1-8B-Instruct/
  model_type: LLAMA3

device: cuda
dtype: bf16

seed: 1234

# Tokenizer arguments
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3.1-8B-Instruct/original/tokenizer.model

# Generation arguments; defaults taken from gpt-fast
prompt: "Summarize this dialogue: The crash happened about 07:20 GMT at the junction of the A127 and Progress Road in Leigh-on-Sea, Essex. The man, who police said is aged in his 20s, was treated at the scene for a head injury and suspected multiple fractures, the ambulance service said. He was airlifted to the Royal London Hospital for further treatment. The Southend-bound carriageway of the A127 was closed for about six hours while police conducted their initial inquiries. A spokeswoman for Essex Police said it was not possible comment to further as this time as the 'investigation is now being conducted by the IPCC'. --- Summary:"
instruct_template: null
chat_format: null
max_new_tokens: 300
temperature: 0.6 # 0.8 and 0.6 are popular values to try
top_k: 300

# It is recommended to set enable_kv_cache=False for long-context models like Llama-3.1
enable_kv_cache: False

quantizer: null
```
The file contains the `prompt` parameter. This parameter has also been updated so that we can run the following prompt:
```yaml
# Generation arguments; defaults taken from gpt-fast
prompt: "Summarize this dialogue: The crash happened about 07:20 GMT at the junction of the A127 and Progress Road in Leigh-on-Sea, Essex. The man, who police said is aged in his 20s, was treated at the scene for a head injury and suspected multiple fractures, the ambulance service said. He was airlifted to the Royal London Hospital for further treatment. The Southend-bound carriageway of the A127 was closed for about six hours while police conducted their initial inquiries. A spokeswoman for Essex Police said it was not possible comment to further as this time as the 'investigation is now being conducted by the IPCC'. --- Summary:"
```
Finally, we can use our fine-tuned Llama-3.1-8B model to perform the summarization task with the following command:
```
! tune run generate --config ./my_llama3_1_custom_generation_config.yaml
```
This produces the following output:
```
INFO:torchtune.utils.logging:Running InferenceRecipe with resolved config:
...
model:
  _component_: torchtune.models.llama3.llama3_8b
prompt: 'Summarize this dialogue: The crash happened about 07:20 GMT at the junction
  of the A127 and Progress Road in Leigh-on-Sea, Essex. The man, who police said is
  aged in his 20s, was treated at the scene for a head injury and suspected multiple
  fractures, the ambulance service said. He was airlifted to the Royal London Hospital
  for further treatment. The Southend-bound carriageway of the A127 was closed for
  about six hours while police conducted their initial inquiries. A spokeswoman for
  Essex Police said it was not possible comment to further as this time as the ''investigation
  is now being conducted by the IPCC''. --- Summary:'
quantizer: null
seed: 1234
temperature: 0.6
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3.1-8B-Instruct/original/tokenizer.model
top_k: 300

DEBUG:torchtune.utils.logging:Setting manual seed to local seed 1234. Local seed is seed + rank = 1234 + 0
INFO:torchtune.utils.logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils.logging: Summarize this dialogue: The crash happened about 07:20 GMT at the junction of the A127 and Progress Road in Leigh-on-Sea, Essex. The man, who police said is aged in his 20s, was treated at the scene for a head injury and suspected multiple fractures, the ambulance service said. He was airlifted to the Royal London Hospital for further treatment. The Southend-bound carriageway of the A127 was closed for about six hours while police conducted their initial inquiries. A spokeswoman for Essex Police said it was not possible comment to further as this time as the 'investigation is now being conducted by the IPCC'. --- Summary: A man has been airlifted to hospital after a car crashed into a tree in Essex.
...
INFO:torchtune.utils.logging:Time for inference: 25.33 sec total, 11.84 tokens/sec
INFO:torchtune.utils.logging:Bandwidth achieved: 191.80 GB/s
INFO:torchtune.utils.logging:Memory used: 16.64 GB
```
Our fine-tuned model does a good job of summarizing the provided text.
Evaluating the scalability of Torchtune distributed training on multiple GPUs
Torchtune provides robust scalability on multiple GPUs. By leveraging distributed training, Torchtune uses hardware resources efficiently, making it possible to scale training from a single-device setup to multi-GPU configurations on a single node. We evaluate the runtime improvements of distributed training with 2, 4, 6, and 8 GPUs on a node with 8 AMD Instinct MI210 GPUs. This evaluation offers insight into how a distributed setup optimizes the training process for large models.
To evaluate the runtime improvements, change the value of the `nproc_per_node` parameter in the `tune run` command. For quick experimentation, fine-tune Llama-3.1-8B using command-line overrides, as shown in the following command:
```
%%time
! tune run --nproc_per_node <NUMBER_OF_GPUs_TO_USE> lora_finetune_distributed --config llama3_1/8B_lora \
  dataset=torchtune.datasets.instruct_dataset \
  dataset.source=EdinburghNLP/xsum \
  dataset.split=train[:2000] \
  dataset.max_seq_len=2048 \
  dataset.template=torchtune.data.SummarizeTemplate \
  dataset.column_map.dialogue=document \
  dataset.column_map.output=summary \
  dataset.trust_remote_code=True \
  dataset.packed=False \
  dataset.train_on_input=False \
  epochs=1
```
For experimentation purposes, we fine-tune for only one epoch. A value for `NUMBER_OF_GPUs_TO_USE` (2, 4, 6, or 8) needs to be specified on each run. The figure below shows the runtime required to complete the fine-tuning task for each GPU count.
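Rather than editing the command by hand for each run, the four measurements can be automated with a small driver script; a sketch, assuming the `tune` CLI is on the container's PATH and reusing the overrides above:

```python
# Illustrative driver: time the one-epoch fine-tuning run for each GPU count.
import subprocess
import time

runtimes = {}
for n_gpus in (2, 4, 6, 8):
    cmd = [
        "tune", "run", "--nproc_per_node", str(n_gpus),
        "lora_finetune_distributed", "--config", "llama3_1/8B_lora",
        "dataset=torchtune.datasets.instruct_dataset",
        "dataset.source=EdinburghNLP/xsum",
        "dataset.split=train[:2000]",
        "dataset.max_seq_len=2048",
        "dataset.template=torchtune.data.SummarizeTemplate",
        "dataset.column_map.dialogue=document",
        "dataset.column_map.output=summary",
        "dataset.trust_remote_code=True",
        "dataset.packed=False",
        "dataset.train_on_input=False",
        "epochs=1",
    ]
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # raises if the training run fails
    runtimes[n_gpus] = time.perf_counter() - start
    print(f"{n_gpus} GPUs: {runtimes[n_gpus] / 60:.1f} min")
```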
Summary
In this blog post we provided a detailed guide to fine-tuning the Llama-3.1-8B model for a summarization task using the Torchtune library with ROCm on a multi-GPU AMD setup. By integrating LoRA for efficient fine-tuning, we showed how Torchtune makes it possible to scale from 2 to 8 GPUs, demonstrating its distributed-training capabilities on AMD hardware.