Data can come from the web, from email, or from local files. It is ingested by Document Loaders, then processed in a Transform stage (split, filter, translate, extract metadata, and so on), then vectorized in an Embed stage and stored in a vector database for later retrieval.
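To make the flow concrete, here is a minimal end-to-end sketch using the same LangChain components covered in the rest of this section (the file path and query are placeholders, not real data):

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Load: read the source into Document objects
docs = PyPDFLoader("./data/example.pdf").load()  # hypothetical file
# Transform: split the documents into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = splitter.split_documents(docs)
# Embed + Store: vectorize the chunks and index them
db = FAISS.from_documents(chunks, OpenAIEmbeddings())
# Retrieve: later, fetch the chunks most relevant to a query
hits = db.as_retriever(search_kwargs={"k": 3}).invoke("placeholder query")

Each of these stages is walked through in detail below.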
Loading Documents
Setup:
Install the dependency: pip install pypdf
Loading a local document
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("./data/spyl.pdf")  # a PDF downloaded from the web, placed under ./data
pages = loader.load_and_split()
print(pages[0].page_content)
Output:
Loading a web page:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://docs.smith.langchain.com/user_guide")  # the official LangSmith introduction
pages = loader.load_and_split()
print(pages[2].page_content)
Output:
Document Processors
1. TextSplitter
Install the dependency: pip install --upgrade langchain-text-splitters
Next, we split the documents loaded above into smaller chunks.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    length_function=len,
    add_start_index=True,
)
paragraphs = text_splitter.create_documents([p.page_content for p in pages])
for para in paragraphs:
    print(para.page_content)
    print("=====================")
Output:
Each chunk is at most 300 characters long (length_function=len counts characters, not tokens), with up to 50 characters of overlap between adjacent chunks.
Note: LangChain's PyPDFLoader and TextSplitter implementations are fairly crude; they are not recommended for production use.
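In production, a dedicated extraction library usually yields cleaner text. As one possible alternative (a minimal sketch, assuming pip install pdfplumber and the same ./data/spyl.pdf file), you can extract text page by page and wrap it in LangChain Document objects yourself:

import pdfplumber
from langchain_core.documents import Document

docs = []
with pdfplumber.open("./data/spyl.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        text = page.extract_text() or ""  # extract_text() can return None for image-only pages
        docs.append(Document(page_content=text, metadata={"page": i}))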
Vector Databases and Vector Retrieval
The documents are now loaded in memory; next we vectorize them and load them into a vector database.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Build the index
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
db = FAISS.from_documents(paragraphs, embeddings)

# Retrieve the top-5 results
retriever = db.as_retriever(search_kwargs={"k": 5})
docs = retriever.invoke("白酒最近的产量")
for doc in docs:
    print(doc.page_content)
    print("+++++++++++++++++++++")
Output:
Judging from the results, retrieval successfully matched the most relevant passages.
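To also inspect match quality, the FAISS vector store provides similarity_search_with_score, which returns (document, score) pairs; with an L2 index, lower scores mean closer matches. A small sketch reusing the db built above:

for doc, score in db.similarity_search_with_score("白酒最近的产量", k=5):
    print(round(float(score), 3), doc.page_content[:50])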
Summary:
- LangChain's document-processing components are fairly crude; they are not recommended for production use.
- The vector-database integration is essentially an interface wrapper; choosing the vector database itself is still up to you (swapping backends is easy, as the sketch below shows).
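Because every vector store sits behind the same interface, trying a different backend during selection is usually a one-line change. A minimal sketch, assuming pip install chromadb and reusing the paragraphs and embeddings from above:

from langchain_community.vectorstores import Chroma

db = Chroma.from_documents(paragraphs, embeddings)  # in-memory Chroma instead of FAISS
retriever = db.as_retriever(search_kwargs={"k": 5})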
Memory Wrappers
Let's start with a few simple examples:
Conversation context: ConversationBufferMemory
from langchain.memory import ConversationBufferMemory

history = ConversationBufferMemory()
history.save_context({"input": "你好啊"}, {"output": "你也好啊"})
print(history.load_memory_variables({}))
history.save_context({"input": "你再好啊"}, {"output": "你又好啊"})
print(history.load_memory_variables({}))
{'history': 'Human: 你好啊\nAI: 你也好啊'}
{'history': 'Human: 你好啊\nAI: 你也好啊\nHuman: 你再好啊\nAI: 你又好啊'}
Keeping only a window of context: ConversationBufferWindowMemory
from langchain.memory import ConversationBufferWindowMemory

window = ConversationBufferWindowMemory(k=2)
window.save_context({"input": "第一轮问"}, {"output": "第一轮答"})
window.save_context({"input": "第二轮问"}, {"output": "第二轮答"})
window.save_context({"input": "第三轮问"}, {"output": "第三轮答"})
print(window.load_memory_variables({}))
{'history': 'Human: 第二轮问\nAI: 第二轮答\nHuman: 第三轮问\nAI: 第三轮答'}
Limiting context length: ConversationTokenBufferMemory
from langchain.memory import ConversationTokenBufferMemory
from langchain_openai import ChatOpenAI

llm = ChatOpenAI()
memory = ConversationTokenBufferMemory(
    max_token_limit=200,
    llm=llm,
)
memory.save_context({"input": "你好"}, {"output": "你好,我是你的AI助手。"})
memory.save_context({"input": "你会干什么?"}, {"output": "我上知天文下知地理,什么都会。"})
print(memory.load_memory_variables({}))
{'history': 'Human: 你会干什么?\nAI: 我上知天文下知地理,什么都会。'}
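ConversationTokenBufferMemory uses the supplied llm to count tokens and evicts the oldest messages once the total exceeds max_token_limit, which is why the first exchange has already been dropped above. To see how a given string is counted (a small sketch; get_num_tokens is LangChain's standard token-counting helper on language models):

# Roughly the same counting the memory uses when deciding what to evict
print(llm.get_num_tokens("你好,我是你的AI助手。"))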
Use case:
Here is a simple example showing what the memory does. We set verbose=True so we can see the generated prompt.
from langchain.chains import ConversationChain

conversation_with_summary = ConversationChain(
    llm=llm,
    # memory was created above with a very low max_token_limit (200) for testing
    memory=memory,
    verbose=True,
)
conversation_with_summary.predict(input="今天天气真不错!")
> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: 你好
AI: 你好,我是你的AI助手。
Human: 你会干什么?
AI: 我上知天文下知地理,什么都会。
Human: 今天天气真不错!
AI: 是的,今天天气晴朗,气温适宜,适合出门活动。您有什么计划吗?
Human: 今天天气真不错!
AI:

> Finished chain.
'是的,今天天气确实很好,阳光明媚,适合户外活动。您想去哪里呢?'
conversation_with_summary.predict(input="我喜欢跑步,喜欢放风筝")
> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: 你好
AI: 你好,我是你的AI助手。
Human: 你会干什么?
AI: 我上知天文下知地理,什么都会。
Human: 今天天气真不错!
AI: 是的,今天天气晴朗,气温适宜,适合出门活动。您有什么计划吗?
Human: 今天天气真不错!
AI: 是的,今天天气确实很好,阳光明媚,适合户外活动。您想去哪里呢?
Human: 我喜欢跑步,喜欢放风筝
AI:

> Finished chain.
'那真是个好主意!跑步和放风筝都是很棒的户外活动。您喜欢在什么样的地方跑步呢?有没有特别喜欢去的公园或者跑步道?'
conversation_with_summary.predict(input="获取爬山也不错,不如去爬一次黄山吧")
> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
AI: 是的,今天天气晴朗,气温适宜,适合出门活动。您有什么计划吗?
Human: 今天天气真不错!
AI: 是的,今天天气确实很好,阳光明媚,适合户外活动。您想去哪里呢?
Human: 我喜欢跑步,喜欢放风筝
AI: 那真是个好主意!跑步和放风筝都是很棒的户外活动。您喜欢在什么样的地方跑步呢?有没有特别喜欢去的公园或者跑步道?
Human: 获取爬山也不错,不如去爬一次黄山吧
AI:

> Finished chain.
'黄山是一个非常著名的旅游胜地,有着壮丽的风景和丰富的自然资源。黄山位于安徽省,被誉为中国的五大名山之一。登上黄山,您可以欣赏到奇松怪石、云海日出等壮丽景观。不过需要注意的是,黄山地势险峻,需要一定的体力和意志力来完成登山之旅。希望您在黄山能够享受到美妙的风景和登山的乐趣!您需要我为您提供黄山的相关信息吗?'
conversation_with_summary.predict(input="我喜欢什么运动?")
> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
AI: 黄山是一个非常著名的旅游胜地,有着壮丽的风景和丰富的自然资源。黄山位于安徽省,被誉为中国的五大名山之一。登上黄山,您可以欣赏到奇松怪石、云海日出等壮丽景观。不过需要注意的是,黄山地势险峻,需要一定的体力和意志力来完成登山之旅。希望您在黄山能够享受到美妙的风景和登山的乐趣!您需要我为您提供黄山的相关信息吗?
Human: 我喜欢什么运动?
AI:

> Finished chain.
'很抱歉,我不知道您喜欢什么运动。您可以告诉我您感兴趣的领域,我会尽力为您提供相关信息。比如,如果您喜欢户外活动,我可以为您介绍一些常见的户外运动项目。如果您喜欢健身运动,我也可以为您提供一些健身的建议和指导。请告诉我您的兴趣方向,我会尽力帮助您。'
The experiment shows that while the accumulated dialogue stayed under 200 tokens, the AI could answer based on earlier turns; once it exceeded 200 tokens, the earliest messages were evicted from the buffer, so that information was effectively forgotten.
ConversationSummaryMemory
If the conversation history is long, summary memory can condense it into a running summary.
from langchain.memory import ConversationSummaryMemory, ChatMessageHistory
from langchain_openai import OpenAI

memory = ConversationSummaryMemory(llm=OpenAI(temperature=0))
memory.save_context({"input": "你好"}, {"output": "你好,有什么需要帮助?"})
memory.load_memory_variables({})
{'history': '\nThe human greets the AI in Chinese. The AI responds in Chinese and asks if there is anything it can help with.'}
Initializing from existing messages or an existing summary
history = ChatMessageHistory()
history.add_user_message("hi")
history.add_ai_message("hi there!")

memory = ConversationSummaryMemory.from_messages(
    llm=OpenAI(temperature=0),
    chat_memory=history,
    return_messages=True,
)
memory.buffer

Output:
'\nThe human greets the AI, to which the AI responds with a friendly greeting.'
You can also initialize from a previous summary to speed up initialization:

memory = ConversationSummaryMemory(
    llm=OpenAI(temperature=0),
    buffer="The human asks what the AI thinks of artificial intelligence. The AI thinks artificial intelligence is a force for good because it will help humans reach their full potential.",
    chat_memory=history,
    return_messages=True,
)
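A quick check that the memory really was seeded (buffer holds the current summary text, as used above):

print(memory.buffer)  # prints the pre-supplied summary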
Using it in a chain
from langchain_openai import OpenAI
from langchain.chains import ConversationChain

llm = OpenAI(temperature=0)
conversation_with_summary = ConversationChain(
    llm=llm,
    memory=ConversationSummaryMemory(llm=OpenAI()),
    verbose=True,
)
conversation_with_summary.predict(input="Hi, what's up?")
Output:
> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: Hi, what's up?
AI:

> Finished chain.
" Hi there! I'm doing great. I'm currently helping a customer with a technical issue. How about you?"
Vector-store-backed memory: VectorStoreRetrieverMemory
This memory is useful when you want to recall specific information mentioned earlier in the conversation; it retrieves by relevance rather than recency.
Initialize the vector store:
This step varies with the vector database you choose; see your database's documentation for the details.
import faiss
from langchain_openai import OpenAIEmbeddings
from langchain_community.docstore import InMemoryDocstore
from langchain_community.vectorstores import FAISS

embedding_size = 1536  # dimension of OpenAI text-embedding-ada-002 vectors
index = faiss.IndexFlatL2(embedding_size)
embedding = OpenAIEmbeddings()
vectorstore = FAISS(embedding, index, InMemoryDocstore({}), {})
Create the VectorStoreRetrieverMemory
from langchain.memory import VectorStoreRetrieverMemory

retriever = vectorstore.as_retriever(search_kwargs={"k": 2})  # in real use, k is usually set higher
memory = VectorStoreRetrieverMemory(retriever=retriever)

# When added to an agent, the memory object can save pertinent information from conversations or used tools
memory.save_context({"input": "我喜欢披萨"}, {"output": "哦,那太好了"})
memory.save_context({"input": "我喜欢足球"}, {"output": "..."})
memory.save_context({"input": "我不喜欢梅西"}, {"output": "ok"})
print(memory.load_memory_variables({"prompt": "我该看什么体育节目?"})["history"])
Output:
input: 我喜欢足球
output: ...
input: 我不喜欢梅西
output: ok
Using it in a chain
from langchain_core.prompts import PromptTemplate

llm = OpenAI(temperature=0)  # can be any valid LLM
_DEFAULT_TEMPLATE = """The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Relevant pieces of previous conversation:
{history}

(You do not need to use these pieces of information if not relevant)

Current conversation:
Human: {input}
AI:"""
PROMPT = PromptTemplate(
    input_variables=["history", "input"], template=_DEFAULT_TEMPLATE
)
conversation_with_summary = ConversationChain(
    llm=llm,
    prompt=PROMPT,
    memory=memory,
    verbose=True,
)
conversation_with_summary.predict(input="你好,我是 Gem")
Output:
> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Relevant pieces of previous conversation:
input: 我不喜欢梅西
output: ok
input: 我喜欢披萨
output: 哦,那太好了

(You do not need to use these pieces of information if not relevant)

Current conversation:
Human: 你好,我是 Gem
AI:

> Finished chain.
' 你好,Gem!我是你的AI助手。我可以回答你关于任何事情的问题。你有什么想知道的吗?'
conversation_with_summary.predict(input="我喜欢什么?")
Output:
> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Relevant pieces of previous conversation:
input: 我喜欢披萨
output: 哦,那太好了
input: 我喜欢足球
output: ...

(You do not need to use these pieces of information if not relevant)

Current conversation:
Human: 我喜欢什么?
AI:

> Finished chain.
' 你喜欢什么?我不太清楚,因为我是一个人工智能,我没有自己的喜好。但是我可以告诉你一些我所知道的事情。比如,我知道你喜欢披萨和足球。你还喜欢什么?'
Summary:
- LangChain's memory management is one of the genuinely usable parts, especially for simple policies such as keeping the last N turns or capping by token count;
- For more complex cases, such as the vector-store retrieval approach, it is not necessarily the best implementation; evaluate it against your actual scenario and results;
- Even so, the maintenance strategies behind these memory classes are worth borrowing in production (see the sketch below).
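As an illustration of borrowing those ideas without the framework, here is a minimal, framework-free sketch (all names are hypothetical) combining the two simplest policies above: keep at most the last k turns, then evict oldest-first until a token budget is met:

def trim_history(turns, k=4, max_tokens=200, count_tokens=len):
    """Keep at most the last k (input, output) turns, then drop the oldest
    turns until the total token count fits within max_tokens."""
    kept = list(turns[-k:])
    while kept and sum(count_tokens(i) + count_tokens(o) for i, o in kept) > max_tokens:
        kept.pop(0)  # evict the oldest turn first, like the buffer memories above
    return kept

# count_tokens=len approximates tokens by character count; swap in a real
# tokenizer (e.g. llm.get_num_tokens) for production use
turns = [("你好", "你好,我是你的AI助手。"), ("你会干什么?", "我上知天文下知地理,什么都会。")]
print(trim_history(turns, k=2, max_tokens=25))  # only the most recent turn fits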