LangChain Output Parsers

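Large language models (LLMs) generate text, but when you build an application you often need structured data rather than a plain string. LangChain provides output parsers that help with exactly that.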

In this post we will walk through the Pydantic (JSON) parser that LangChain provides.

1. Why parse data?

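The answer may be obvious, but let's spell it out anyway. Parsing converts raw text into a structured format that is easier for both people and programs to work with, which improves the overall quality of the data.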

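Suppose you want to do simple arithmetic on two integers that arrive as text: you first need to convert the strings to integers. Clean, structured data has other benefits too. The most obvious is how easily it fits into your existing models and databases.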
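
To make the arithmetic point concrete, here is a minimal sketch. The variable names are illustrative only and do not come from the original post.

```python
# Values that arrive as text, for example from a model response.
a_text = "2"
b_text = "3"

# Without parsing, "+" concatenates the two strings.
concatenated = a_text + b_text

# After parsing to integers, "+" performs arithmetic.
total = int(a_text) + int(b_text)

print(concatenated)  # 23
print(total)         # 5
```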

2. The restaurant reservation use case

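Back in May 2023, I published a post about interacting with computers in natural language, in which I asked ChatGPT to make a fictional restaurant reservation for me and to respond with a JSON object instead of plain old text.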

Long story short, here is the output from that post:

{
  "intent": "book_reservation",
  "parameters": {
    "date": "2023-05-05",
    "time": "18:00:00",
    "party_size": 2,
    "cuisine": "any"
  }
}
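
As a rough illustration of why a JSON response is convenient, the sketch below loads the object shown above with Python's standard json module and converts the date and time strings into typed values. The variable names are my own, and the conversion assumes the ISO-style formats that appear in this particular response.

```python
import json
from datetime import date, time

# The JSON response shown above, as the raw string a model would return.
raw_response = '''
{
  "intent": "book_reservation",
  "parameters": {
    "date": "2023-05-05",
    "time": "18:00:00",
    "party_size": 2,
    "cuisine": "any"
  }
}
'''

data = json.loads(raw_response)
params = data["parameters"]

# Convert the ISO-formatted strings into typed values.
reservation_date = date.fromisoformat(params["date"])
reservation_time = time.fromisoformat(params["time"])
party_size = params["party_size"]  # already an int after json.loads

print(data["intent"], reservation_date, reservation_time, party_size)
```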

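While we can ask the LLM to return JSON and spell out the format explicitly (as we did in the previous post), it is important to recognize that this may not work in some cases, because the model may hallucinate and return output that does not match the requested format.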
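
One way to guard against such responses by hand is to treat the model output as untrusted and check it before use. The sketch below is only an illustration: try_parse_reservation and REQUIRED_KEYS are hypothetical names, and the check covers only the keys from the example response above.

```python
import json
from typing import Optional

REQUIRED_KEYS = {"date", "time", "party_size", "cuisine"}


def try_parse_reservation(text: str) -> Optional[dict]:
    """Return the reservation parameters, or None if the text is unusable."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None  # the model did not return valid JSON
    if not isinstance(data, dict):
        return None
    params = data.get("parameters")
    if not isinstance(params, dict) or not REQUIRED_KEYS.issubset(params):
        return None  # valid JSON, but not the structure we asked for
    return params


good = (
    '{"intent": "book_reservation", "parameters": {"date": "2023-05-05", '
    '"time": "18:00:00", "party_size": 2, "cuisine": "any"}}'
)
bad = "Sure! I booked a table for two on Friday at 6 PM."

print(try_parse_reservation(good))
print(try_parse_reservation(bad))
```

Writing this kind of check for every response shape gets tedious quickly, which is the gap an output parser is meant to fill.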

3. Preparing our query template

OK, let's use the same query as in the previous post, but instead of asking for the response in JSON format, we will add a {format_instructions} placeholder, like this:

reservation_template = '''Book us a nice table for two this Friday at 6:00 PM.
Choose any cuisine, it doesn't matter. Send the confirmation by email.

Our location is: {query}

Format instructions:
{format_instructions}
'''

Good, we now have a nice query. Next, we will see how LangChain fills in the {format_instructions} placeholder automatically.

4. The Pydantic (JSON) parser

To tell LangChain that we want the text converted into a Pydantic object, we first need to define a Reservation object. So, let's dive right in:

from pydantic import BaseModel, Field

class Reservation(BaseModel):
    date: str = Field(description="reservation date")
    time: str = Field(description="reservation time")
    party_size: int = Field(description="number of people")
    cuisine: str = Field(description="preferred cuisine")

Great, we now have a Reservation object and its fields. Let's tell LangChain that we need a parser that converts a given input into a Reservation object:

from langchain.output_parsers import PydanticOutputParser

parser = PydanticOutputParser(pydantic_object=Reservation)
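
To get a feel for what the Pydantic model contributes on its own, independent of LangChain, here is a small sketch that builds a Reservation directly from JSON text. It assumes Pydantic's default (non-strict) validation, under which a numeric string such as "2" is coerced to an int; the sample payloads are made up for illustration.

```python
import json

from pydantic import BaseModel, Field, ValidationError


class Reservation(BaseModel):
    date: str = Field(description="reservation date")
    time: str = Field(description="reservation time")
    party_size: int = Field(description="number of people")
    cuisine: str = Field(description="preferred cuisine")


# A well-formed payload: the numeric string "2" is coerced to the int 2.
ok = Reservation(**json.loads(
    '{"date": "Friday", "time": "6:00 PM", "party_size": "2", "cuisine": "any"}'
))
print(ok.party_size, type(ok.party_size))

# A payload that cannot satisfy the model raises ValidationError.
try:
    Reservation(**json.loads('{"date": "Friday", "party_size": "two"}'))
    error_count = 0
except ValidationError as exc:
    error_count = len(exc.errors())
    print(f"{error_count} validation error(s)")
```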

5. Setting up the prompt template

Now we set up the prompt template:

prompt = PromptTemplate(
    template=reservation_template,
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

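Notice the partial_variables={"format_instructions": parser.get_format_instructions()} line? It tells LangChain to replace the format_instructions variable in the template above with formatting instructions generated from the Pydantic Reservation object we created, including its JSON schema.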
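
The idea of a partial variable, one value bound up front while the rest are supplied later, can be sketched in plain Python with functools.partial. This is only an analogy and not how LangChain implements PromptTemplate; the shortened template and the stand-in instruction text are assumptions for illustration.

```python
from functools import partial

template = (
    "Our location is: {query}\n"
    "Format instructions:\n"
    "{format_instructions}\n"
)

# Bind format_instructions once, up front (stand-in text, not LangChain's).
fill = partial(template.format, format_instructions="Return a JSON object.")

# Later, only the remaining variable has to be supplied.
print(fill(query="San Francisco, CA"))
```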

Let's add the location query and see what LangChain does behind the scenes to our original query, and what the result looks like.

_input = prompt.format_prompt(query="San Francisco, CA")

Let's see what our query looks like now:

print(_input.to_string())

>> Book us a nice table for two this Friday at 6:00 PM. 
>> Choose any cuisine, it doesn't matter. Send the confirmation by email.
>> 
>> Our location is: San Francisco, CA
>> 
>> Format instructions:
>> The output should be formatted as a JSON instance that conforms to the JSON schema below.
>> 
>> As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
>> the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
>> 
>> Here is the output schema:
>> ```
>> {"properties": {"date": {"title": "Date", "description": "reservation date", "type": "string"}, "time": {"title": "Time", "description": "reservation time", "type": "string"}, "party_size": {"title": "Party Size", "description": "number of people", "type": "integer"}, "cuisine": {"title": "Cuisine", "description": "preferred cuisine", "type": "string"}}, "required": ["date", "time", "party_size", "cuisine"]}
>> ```

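Awesome. As you can see, LangChain did a lot of work for us: it automatically converted the Pydantic object we created into a string that defines the structure of the LLM's response.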
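
The format instructions above contrast a well-formatted instance with one that merely mirrors the schema. The sketch below makes that distinction executable for the Reservation schema. It is a deliberately minimal check of required keys and basic types, not a full JSON Schema validator, and conforms and PYTHON_TYPES are hypothetical names.

```python
# The schema printed above, reduced to the parts this sketch checks.
schema = {
    "properties": {
        "date": {"type": "string"},
        "time": {"type": "string"},
        "party_size": {"type": "integer"},
        "cuisine": {"type": "string"},
    },
    "required": ["date", "time", "party_size", "cuisine"],
}

# Map the JSON Schema type names used here to Python types.
PYTHON_TYPES = {"string": str, "integer": int}


def conforms(instance: dict, schema: dict) -> bool:
    """Check required keys and basic types only (not full JSON Schema)."""
    for key in schema["required"]:
        if key not in instance:
            return False
    for key, value in instance.items():
        spec = schema["properties"].get(key)
        if spec is None:
            continue  # extra keys are tolerated in this sketch
        if not isinstance(value, PYTHON_TYPES[spec["type"]]):
            return False
    return True


well_formed = {"date": "Friday", "time": "6:00 PM", "party_size": 2, "cuisine": "Any"}
# Wrapping the values in "properties" mirrors the schema instead of filling it in.
not_well_formed = {"properties": well_formed}

print(conforms(well_formed, schema))
print(conforms(not_well_formed, schema))
```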

But that's not all. After querying the model, we can use LangChain's parser to automatically convert the text response into a Reservation object.

Here is how:

# We query the model first
output = model(_input.to_string())

# We parse the output
reservation = parser.parse(output)

Great, let's print the reservation fields (and their data types) by iterating over each one:

for parameter in reservation.__fields__:
    print(f"{parameter}: {reservation.__dict__[parameter]},  {type(reservation.__dict__[parameter])}")

Here is the output:

>> date: Friday,  <class 'str'>
>> time: 6:00 PM,  <class 'str'>
>> party_size: 2,  <class 'int'>
>> cuisine: Any,  <class 'str'>

Note that party_size is now of type int. We can also access the party_size attribute directly, like this: reservation.party_size.

6. Other output parsers

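As mentioned at the start of the post, LangChain provides more output parsers, and you can pick one based on your specific use case. The same behind-the-scenes logic applies to most of them.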

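Two interesting ones are the retry and auto-fixing parsers. The retry parser re-queries the model to get an answer that fits what the parser expects. The auto-fixing parser wraps another output parser and kicks in when that parser fails, attempting to fix the output.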
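
The retry idea can be sketched without any LLM at all. In the example below, parse_reservation, parse_with_retry, and fake_model are all hypothetical names, and the loop is a simplified illustration of the concept rather than LangChain's actual implementation: when parsing fails, the model is asked again with the failed answer and the error attached.

```python
import json
from typing import Callable


def parse_reservation(text: str) -> dict:
    """Parse model output; raises ValueError if it is not usable JSON."""
    data = json.loads(text)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict) or "party_size" not in data:
        raise ValueError("missing party_size")
    return data


def parse_with_retry(model: Callable[[str], str], prompt: str, max_retries: int = 2) -> dict:
    """Query the model, re-asking with the error attached when parsing fails."""
    current_prompt = prompt
    last_error = None
    for _ in range(max_retries + 1):
        output = model(current_prompt)
        try:
            return parse_reservation(output)
        except ValueError as exc:
            last_error = exc
            current_prompt = (
                f"{prompt}\n\nYour previous answer was:\n{output}\n"
                f"It could not be parsed ({exc}). Reply with valid JSON only."
            )
    raise ValueError(f"giving up after {max_retries + 1} attempts: {last_error}")


# A stand-in model: answers in prose first, then with JSON once corrected.
def fake_model(prompt: str) -> str:
    if "could not be parsed" in prompt:
        return '{"date": "Friday", "time": "6:00 PM", "party_size": 2, "cuisine": "Any"}'
    return "Sure, I booked a table for two!"


result = parse_with_retry(fake_model, "Book a table for two this Friday at 6:00 PM.")
print(result["party_size"])
```

Capping the number of attempts matters in practice, since every retry is another model call.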

Here are some of the other output parsers LangChain provides:

XML parser
Datetime parser
Enum parser
Retry parser
Auto-fixing parser
Structured output parser

Here is the complete code:

from typing import List

from dotenv import load_dotenv
from langchain.llms import OpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel, Field

load_dotenv()

model_name = "text-davinci-003"
temperature = 0.0
model = OpenAI(model_name=model_name, temperature=temperature)


class Reservation(BaseModel):
    date: str = Field(description="reservation date")
    time: str = Field(description="reservation time")
    party_size: int = Field(description="number of people")
    cuisine: str = Field(description="preferred cuisine")


parser = PydanticOutputParser(pydantic_object=Reservation)

reservation_template = '''Book us a nice table for two this Friday at 6:00 PM.
Choose any cuisine, it doesn't matter. Send the confirmation by email.

Our location is: {query}

Format instructions:
{format_instructions}
'''

prompt = PromptTemplate(
    template=reservation_template,
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

_input = prompt.format_prompt(query="San Francisco, CA")
output = model(_input.to_string())
reservation = parser.parse(output)

print(_input.to_string())

for parameter in reservation.__fields__:
    print(f"{parameter}: {reservation.__dict__[parameter]},  {type(reservation.__dict__[parameter])}")