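How Large Language Models Work (LLM: Starting from Zero)

1. Introduction

2. How LLMs Work

3. Pre-training: The Base Model

4. Fine-tuning: Training the Assistant

5. RLHF: Reinforcement Learning from Human Feedback

6. Prompt Engineering

7. Summary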

1. Introduction

This is the second article in our series on LLMs. In it, we aim to give an accessible explanation of how large language models (LLMs) work.

2. How LLMs Work

Let's first look at how a document-completer model behaves:

User prompt:

>A banana is

Model response:

>an elongated, edible fruit

Next, here is how a document-generator model behaves:

User prompt:

>I want to buy a new car

Model response:

>What kind of car do you want to buy?

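Note the difference between the two.

The first model is just a document completer: it only completes the prompt with whatever it finds most likely to be the next token. This is the model trained on chunks of internet data, and it is called the base model.

The second model is a document generator: it produces a more human-like response to the question in the prompt. This is the ChatGPT model.

The ChatGPT model is an inference model that generates a response to a prompt. I would say it is 99% the base model, plus two extra training steps: a fine-tuning step and a reinforcement learning from human feedback step.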
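To make the "document completer" idea concrete, here is a minimal sketch of a toy completer that always appends the most frequent next word. It is only an illustration: the three-sentence corpus and the function names `train_bigram` and `complete` are made up for this example, and a real base model uses a neural network over tokens rather than word counts.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real base model is trained on internet-scale text.
CORPUS = (
    "a banana is an elongated edible fruit . "
    "the banana is an elongated edible fruit . "
    "an apple is a round fruit ."
)

def train_bigram(text):
    """Count, for every word, which words follow it and how often."""
    follows = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def complete(follows, prompt, max_new_words=5):
    """Greedily append the most frequent next word until '.' or the limit."""
    words = prompt.split()
    for _ in range(max_new_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # the last word was never seen in the corpus
        next_word = candidates.most_common(1)[0][0]
        if next_word == ".":
            break  # end of sentence
        words.append(next_word)
    return " ".join(words)

model = train_bigram(CORPUS)
print(complete(model, "a banana is"))  # a banana is an elongated edible fruit
```

The toy model never answers a question; it only continues the text, which is the behavior of the base model described above.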

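3. Pre-training: The Base Model

This is the core of the AI revolution, and it is where the real magic happens.

Training a model means feeding it large amounts of data and letting it learn from that data.

As described in the GPT-3 paper, the base model is trained on a huge amount of internet data. For people like you and me, that is no easy task: it requires not only acquiring the data but also a great deal of computing power, such as GPUs and TPUs.

But don't worry, we can still learn to train a small GPT model on our own computers. I will show you how in the next topic.

The innovation behind LLM training is the Transformer architecture, which lets the model learn from massive amounts of data while preserving the key contextual relationships between different parts of the input.

By preserving these connections, the model can make inferences from the context it is given, whether that context is a single word, a sentence, a paragraph, or something longer. This capability opens up new opportunities for natural language processing and generation tasks, enabling machines to better understand and respond to human communication.

The Transformer architecture used to train the base model is shown below:

It is a neural-network-based model that combines techniques old and new: tokenization, embedding, positional encoding, feed-forward layers, normalization, softmax, linear transformation, and, most importantly, multi-head attention.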
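As a rough illustration of two items on that list, the sketch below implements softmax and a single head of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the token vectors are made-up numbers, and the learned projection matrices and the parallel heads of real multi-head attention are omitted.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

# Three hypothetical 2-dimensional token vectors (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in attention(tokens, tokens, tokens):
    print([round(x, 3) for x in row])
```

Each output row mixes information from all three tokens, weighted by how similar they are to the query token. This is the mechanism that lets the model keep track of relationships between different parts of the input.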

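This is the part you and I are most interested in. We want a clear understanding of the ideas behind the architecture and of exactly how training is done. So, starting with the next article, we will dig into the papers, code, and math used to train the base model.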

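4. Fine-tuning: Training the Assistant

Fine-tuning is a very clever technique. I think it was first done by OpenAI. The idea is simple but works well: hire human labelers to write a large number of question-and-answer conversation pairs (say, 100k conversations), then feed those pairs to the model and let it learn from them.

This process is called fine-tuning. Do you know what happens once those 100k sample conversations have been trained into the model? It starts responding like a human!

Let's look at some of those labeled conversation examples: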

>Human labeled Q&A
Q: What is your name?
A: My name is John.

>Human labeled Q&A
Q: What's the capital of China?
A: China's capital is Beijing.

>Human labeled Q&A
Q: Summarize the plot of the movie Titanic.
A: The movie Titanic is about a ship that sinks in the ocean.

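Wow, these sample Q&As mimic the way we talk to each other.

Teaching the model these response styles makes a relevant, context-appropriate reply by far the most probable continuation of a user prompt, so that reply is what the model returns. Training on a variety of conversation styles further raises the likelihood that it responds to a prompt in a relevant and contextually appropriate way.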
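One way to picture this is that each labeled pair becomes a short document, and a new question is rendered as the start of such a document so that the most likely continuation is an answer. The sketch below assumes a simple `Q:`/`A:` template; the article does not say what format OpenAI actually uses, and the helper names are hypothetical.

```python
# Labeled pairs taken from the examples above.
PAIRS = [
    ("What is your name?", "My name is John."),
    ("What's the capital of China?", "China's capital is Beijing."),
]

def to_training_text(question, answer):
    """Render one labeled pair as a single training document."""
    return f"Q: {question}\nA: {answer}"

def to_prompt(question):
    """Render a new question so that the model's completion is the answer."""
    return f"Q: {question}\nA:"

for question, answer in PAIRS:
    print(to_training_text(question, answer))
    print("---")

print(to_prompt("Summarize the plot of the movie Titanic."))
```

Because every training document has the shape of the prompt followed by an answer, a model that is still only predicting the next token ends up producing answers.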

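This is why language models can seem so intelligent and human-like: by learning to imitate the rhythm and patterns of real-world conversation, they can convincingly simulate a back-and-forth dialogue with the user.

At this step, we can say we have an assistant model.

The figure below shows some highlights of the path from the pre-trained base model to the fine-tuned assistant model:

(From Andrej Karpathy's "Build a GPT Model from Scratch")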

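5. RLHF: Reinforcement Learning from Human Feedback

In January 2022, OpenAI published its work on aligning language models to follow instructions. In the blog post, they describe how the model was fine-tuned using human feedback.

This one is a bit trickier. The idea is to let the model learn from human feedback. Instead of supplying ~100k labeled Q&A pairs, they collect user prompts together with the model's responses and have humans rank those responses. The ranked conversations serve as the most-desired Q&A samples, which are fed back to the model so that it learns from them and improves its overall performance.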
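The ranking step can be sketched as follows. A human ranking of several responses is expanded into (preferred, rejected) pairs, and a scoring model is penalized whenever it scores the rejected response higher. Note the assumptions: the pairwise loss shown here is a common way to learn from rankings and is not spelled out in this article, and the scores are made-up numbers rather than outputs of a real model.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_responses, 2))

def pairwise_loss(score_preferred, score_rejected):
    """-log(sigmoid(difference)): small when the preferred response scores higher."""
    diff = score_preferred - score_rejected
    return math.log(1.0 + math.exp(-diff))

# A hypothetical ranking produced by a human labeler, best first.
ranking = ["response A", "response B", "response C"]
for preferred, rejected in ranking_to_pairs(ranking):
    print(preferred, ">", rejected)

# Made-up scores: agreeing with the human gives a lower loss than disagreeing.
print(round(pairwise_loss(2.0, 0.0), 4))
print(round(pairwise_loss(0.0, 2.0), 4))
```

A ranking of three responses yields three comparison pairs, so one labeling session produces several training signals.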

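OpenAI describes the process on its blog:

>To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API, our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT-3.

Here is a comparison of base model responses with fine-tuned/RLHF responses:

As you can see, without fine-tuning and RLHF the model is just a document completer.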

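6. Prompt Engineering

Even with fine-tuning and RLHF, the model still needs some help to produce the response we want. This is where prompt engineering comes in.

Simply put, we can design the prompt carefully to get the desired response from the model (sometimes without any fine-tuning at all).

If you don't want to dig too deeply into the math and code, prompt engineering is a good place to focus, because typing better prompts is enough to get more out of an LLM.

Now let's look at an example:

Prompt: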

>The sky is

Output:

>blue.

Let's try to improve it:

Prompt:

>Complete the sentence:
The sky is

Output:

>blue during the day and dark at night.

By including an instruction in the prompt, we tell the model what to do and how to respond.

Let's look at another interesting example:

Prompt:

>When I was 6 my sister was half my age. Now
I’m 70 how old is my sister?

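Output:

>35

The answer is wrong. The correct answer is 67: the sister is 3 years younger, so she is 70 - 3 = 67. It looks like the model understood the question but fell back on a simple calculation (halving 70) instead of logical reasoning.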
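The arithmetic behind the correct answer can be checked in a few lines. This is only an illustrative sketch, and the function name `sister_age` is made up for this example.

```python
def sister_age(my_age_then, my_age_now):
    """The sister was half my age back then; the age gap never changes."""
    sister_age_then = my_age_then // 2        # 6 // 2 = 3
    age_gap = my_age_then - sister_age_then   # 6 - 3 = 3
    return my_age_now - age_gap               # 70 - 3 = 67

print(sister_age(6, 70))  # 67, the correct answer
print(70 // 2)            # 35, the shortcut the model took
```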

Without any fine-tuning or RLHF, we can get the right answer just by adding more worked examples to the prompt:

Prompt:

Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done,
there will be 21 trees. How many trees did the grove workers plant today?
A: We start with 15 trees. Later we have 21 trees. The difference must be the number of trees they planted.
So, they must have planted 21 - 15 = 6 trees. The answer is 6.
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars. The answer is 5.
Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
A: Leah had 32 chocolates and Leah’s sister had 42. That means there were originally 32 + 42 = 74
chocolates. 35 have been eaten. So in total they still have 74 - 35 = 39 chocolates. The answer is 39.
Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops
did Jason give to Denny?
A: Jason had 20 lollipops. Since he only has 12 now, he must have given the rest to Denny. The number of
lollipops he has given to Denny must have been 20 - 12 = 8 lollipops. The answer is 8.
Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does
he have now?
A: He has 5 toys. He got 2 from mom, so after that he has 5 + 2 = 7 toys. Then he got 2 more from dad, so
in total he has 7 + 2 = 9 toys. The answer is 9.
Q: There were nine computers in the server room. Five more computers were installed each day, from
monday to thursday. How many computers are now in the server room?
A: There are 4 days from monday to thursday. 5 computers were added each day. That means in total 4 * 5 =
20 computers were added. There were 9 computers in the beginning, so now there are 9 + 20 = 29 computers.
The answer is 29.
Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many
golf balls did he have at the end of wednesday?
A: Michael initially had 58 balls. He lost 23 on Tuesday, so after that he has 58 - 23 = 35 balls. On
Wednesday he lost 2 more so now he has 35 - 2 = 33 balls. The answer is 33.
Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
A: She bought 5 bagels for $3 each. This means she spent $15. She has $8 left.
Q: When I was 6 my sister was half my age. Now I’m 70 how old is my sister?
A:

Output:

>When I was 6 my sister was half my age, so she was 3. Now I am 70, so she is 70 - 3 = 67. The answer is 67.

Output 2:

>When the narrator was 6, his sister was half his age, which is 3. Now that the narrator is 70, his sister would be 70 - 3 = 67 years old. The answer is 67.

Both answers are correct! All we did was add some examples to the prompt that spell out the reasoning, and then ask the same question again. The model can now understand the question and answer it correctly.
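A prompt like the one above is tedious to write by hand, so it is usually assembled from a list of worked examples. The following is a minimal sketch of that assembly step, assuming the same `Q:`/`A:` layout as the prompt above; the helper name `build_few_shot_prompt` is hypothetical, and sending the prompt to a model is left out.

```python
# Two of the worked examples from the prompt above; add more as needed.
EXAMPLES = [
    (
        "If there are 3 cars in the parking lot and 2 more cars arrive, "
        "how many cars are in the parking lot?",
        "There are 3 cars in the parking lot already. 2 more arrive. "
        "Now there are 3 + 2 = 5 cars. The answer is 5.",
    ),
    (
        "Olivia has $23. She bought five bagels for $3 each. "
        "How much money does she have left?",
        "She bought 5 bagels for $3 each. This means she spent $15. "
        "She has $8 left.",
    ),
]

def build_few_shot_prompt(examples, question):
    """Join worked examples and the new question into one prompt string."""
    lines = []
    for example_question, example_answer in examples:
        lines.append(f"Q: {example_question}")
        lines.append(f"A: {example_answer}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # left open so the model writes the reasoning
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    EXAMPLES,
    "When I was 6 my sister was half my age. Now I'm 70 how old is my sister?",
)
print(prompt)
```

Ending the prompt with an open `A:` invites the model to continue in the same step-by-step style as the examples.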

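Strong prompts can guide the model through complex tasks such as solving math problems or summarizing text. So prompt engineering also plays a very important role in the LLM ecosystem.

For more on prompt engineering, here is a good prompt guide tutorial.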