【LangChain】Retrievers: Contextual Compression

LangChain study notes

  • 【LangChain】Retrievers
  • 【LangChain】Retrievers: MultiQueryRetriever
  • 【LangChain】Retrievers: Contextual Compression

Contextual compression

    • LangChain study notes
  • Overview
  • Content
  • Using a vanilla vector store retriever
  • Adding contextual compression with an LLMChainExtractor
  • More built-in compressors: filters
    • LLMChainFilter
    • EmbeddingsFilter
  • Stringing compressors and document transformers together
  • Summary

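Overview

One challenge with retrieval is that we usually don't know, at the time data is ingested, which specific queries the document storage system will face.

This means the information most relevant to a query may be buried in a document that is mostly irrelevant text.

Passing that full document through the application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this.

The idea is simple: instead of immediately returning retrieved documents as-is, we compress them using the context of the given query, so that only the relevant information is returned.

"Compressing" here refers both to compressing the contents of an individual document and to filtering out documents wholesale.

To use the contextual compression retriever, we need:

  • a base retriever
  • a document compressor

The contextual compression retriever passes the query to the base retriever, takes the initial documents, and passes them through the document compressor. The document compressor takes a list of documents and shortens it, either by reducing the contents of documents or by dropping documents altogether.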
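
The flow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not LangChain's implementation: `Doc`, `toy_base_retriever`, `toy_compressor`, and `compressed_retrieve` are hypothetical names, and simple keyword overlap stands in for the real compressor's relevance judgment.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document (hypothetical)."""
    page_content: str


def toy_base_retriever(query: str) -> List[Doc]:
    """Pretend vector search: returns fixed chunks regardless of the query."""
    return [
        Doc("The weather was mild. I nominated Judge Ketanji Brown Jackson. Taxes were discussed."),
        Doc("We will increase Pell Grants."),
    ]


def toy_compressor(docs: List[Doc], query: str) -> List[Doc]:
    """Keep only sentences sharing a longer word with the query; drop emptied docs."""
    query_words = {w for w in re.findall(r"[a-z]+", query.lower()) if len(w) > 3}
    compressed = []
    for doc in docs:
        kept = [
            sentence.strip(". ")
            for sentence in doc.page_content.split(". ")
            if query_words & set(re.findall(r"[a-z]+", sentence.lower()))
        ]
        if kept:  # a document with nothing relevant is removed entirely
            compressed.append(Doc(". ".join(kept)))
    return compressed


def compressed_retrieve(
    query: str,
    retriever: Callable[[str], List[Doc]],
    compressor: Callable[[List[Doc], str], List[Doc]],
) -> List[Doc]:
    """Query the base retriever first, then hand its documents to the compressor."""
    initial_docs = retriever(query)
    return compressor(initial_docs, query)


result = compressed_retrieve(
    "What did the president say about Ketanji Brown Jackson",
    toy_base_retriever,
    toy_compressor,
)
for doc in result:
    print(doc.page_content)
```

Both kinds of shortening show up here: the first document is trimmed to its one matching sentence, and the second is dropped because nothing in it matches.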


Content

# Helper function for printing documents
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

Using a vanilla vector store retriever

Let's start by initializing a simple vector store retriever and storing the 2022 State of the Union speech in it (in chunks). Given an example question, the retriever returns one or two relevant documents and a few irrelevant ones. Even the relevant documents contain a lot of irrelevant information.

from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import TextLoader
from langchain.vectorstores import FAISS
# Load the document
documents = TextLoader('../../../state_of_the_union.txt').load()
# Create the splitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
# Split the document into chunks
texts = text_splitter.split_documents(documents)
# Build the index and wrap it as a retriever
retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever()
# Run the query
docs = retriever.get_relevant_documents("What did the president say about Ketanji Brown Jackson")
# Pretty-print the results
pretty_print_docs(docs)

Result:

    Document 1:

    Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
    ----------------------------------------------------------------------------------------------------
    Document 2:

    A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.  We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.  We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster.
    We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.
    ----------------------------------------------------------------------------------------------------
    Document 3:

    And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong. As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice. And soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things. So tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together.  First, beat the opioid epidemic.
    ----------------------------------------------------------------------------------------------------
    Document 4:

    Tonight, I’m announcing a crackdown on these companies overcharging American businesses and consumers. And as Wall Street firms take over more nursing homes, quality in those homes has gone down and costs have gone up.  That ends on my watch. Medicare is going to set higher standards for nursing homes and make sure your loved ones get the care they deserve and expect. We’ll also cut costs and keep the economy going strong by giving workers a fair shot, provide more training and apprenticeships, hire them based on their skills not degrees. Let’s pass the Paycheck Fairness Act and paid leave.  Raise the minimum wage to $15 an hour and extend the Child Tax Credit, so no one has to raise a family in poverty.
    Let’s increase Pell Grants and increase our historic support of HBCUs, and invest in what Jill—our First Lady who teaches full-time—calls America’s best-kept secret: community colleges.

Adding contextual compression with an LLMChainExtractor

Now let's wrap our base retriever with a ContextualCompressionRetriever. We'll add an LLMChainExtractor, which iterates over the initially returned documents and extracts from each one only the content that is relevant to the query.

from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
# Create the LLM
llm = OpenAI(temperature=0)
# Build an LLMChainExtractor from the LLM
compressor = LLMChainExtractor.from_llm(llm)
# Build the contextual compression retriever
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)
# Run the query
compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Jackson Brown")
# Pretty-print the results
pretty_print_docs(compressed_docs)

Result:

    Document 1:

    "One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence."
    ----------------------------------------------------------------------------------------------------
    Document 2:

    "A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans."

More built-in compressors: filters

LLMChainFilter

The LLMChainFilter is a slightly simpler but more robust compressor. It uses an LLM chain to decide which of the initially retrieved documents to filter out and which to return, without manipulating the document contents.

from langchain.retrievers.document_compressors import LLMChainFilter
# Build the LLMChainFilter
_filter = LLMChainFilter.from_llm(llm)
# Build the contextual compression retriever
compression_retriever = ContextualCompressionRetriever(base_compressor=_filter, base_retriever=retriever)
# Run the query
compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Jackson Brown")
# Pretty-print the results
pretty_print_docs(compressed_docs)

Result:

    Document 1:

    Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
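
To make the difference from extraction concrete, here is a minimal sketch of filter-style compression. It assumes a hypothetical `keyword_judge` function standing in for the LLM's yes/no relevance decision; it is not how LLMChainFilter is implemented. A kept document comes back with its text unchanged.

```python
from typing import Callable, List


def keyword_judge(query: str, doc: str) -> bool:
    """Hypothetical stand-in for the LLM's yes/no relevance judgment."""
    doc_lower = doc.lower()
    return any(word in doc_lower for word in query.lower().split() if len(word) > 5)


def filter_documents(
    docs: List[str], query: str, is_relevant: Callable[[str, str], bool]
) -> List[str]:
    """Return the documents judged relevant, leaving their text untouched."""
    return [doc for doc in docs if is_relevant(query, doc)]


docs = [
    "I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation's top legal minds.",
    "Let's increase Pell Grants and invest in community colleges.",
]
kept = filter_documents(docs, "What did the president say about Ketanji Brown Jackson", keyword_judge)
print(kept)
```

Because whole documents are either kept or dropped, irrelevant sentences inside a kept document survive, which matches the single unmodified document in the result above.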

EmbeddingsFilter

Making an extra LLM call for every retrieved document is expensive and slow. The EmbeddingsFilter offers a cheaper and faster option: it embeds the documents and the query and returns only those documents whose embeddings are sufficiently similar to the query's.

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.document_compressors import EmbeddingsFilter
# Create the embeddings model
embeddings = OpenAIEmbeddings()
# Build the EmbeddingsFilter
embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
# Build the contextual compression retriever
compression_retriever = ContextualCompressionRetriever(base_compressor=embeddings_filter, base_retriever=retriever)
# Run the query
compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Jackson Brown")
# Pretty-print the results
pretty_print_docs(compressed_docs)

Result:

    Document 1:

    Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
    ----------------------------------------------------------------------------------------------------
    Document 2:

    A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.  We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.  We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster.
    We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.
    ----------------------------------------------------------------------------------------------------
    Document 3:

    And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong. As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice. And soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things. So tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together.  First, beat the opioid epidemic.
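
The thresholding idea behind an embeddings filter can be sketched without any model calls. The vectors below are made-up two-dimensional "embeddings", and `filter_by_similarity` is a hypothetical helper; whether the real filter compares with `>` or `>=` is not something this sketch claims, so treat the `>=` here as an assumption.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    threshold: float,
) -> List[str]:
    """Keep the names of documents whose similarity to the query meets the threshold."""
    return [
        name
        for name, vec in doc_vecs.items()
        if cosine_similarity(query_vec, vec) >= threshold  # assumption: inclusive bound
    ]


# Toy vectors standing in for real embeddings (illustrative values only)
query_vec = [1.0, 0.0]
doc_vecs = {
    "doc_a": [0.9, 0.1],  # points almost the same way as the query
    "doc_b": [0.5, 0.5],  # 45 degrees away, similarity about 0.707
    "doc_c": [0.0, 1.0],  # orthogonal, similarity 0
}
kept = filter_by_similarity(query_vec, doc_vecs, threshold=0.76)
print(kept)
```

With the same 0.76 threshold used above, only `doc_a` passes. Lowering the threshold admits more documents, which is one plausible reason the real run above still returns a document that has little to do with the question.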

Stringing compressors and document transformers together

With the DocumentCompressorPipeline we can also easily combine multiple compressors in sequence. Along with compressors, we can add BaseDocumentTransformers to the pipeline. These do not perform any contextual compression; they simply apply some transformation to a set of documents.

For example, TextSplitters can be used as document transformers to split documents into smaller pieces, and the EmbeddingsRedundantFilter can be used to filter out redundant documents based on the embedding similarity between documents.

Below we create a compressor pipeline that first splits the documents into smaller chunks, then removes redundant documents, and finally filters by relevance to the query.

from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.text_splitter import CharacterTextSplitter
# Create the splitter
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0, separator=". ")
# Build the EmbeddingsRedundantFilter
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
# Build the relevance filter (an EmbeddingsFilter)
relevant_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
# Build the document pipeline; the stages run in the order listed
pipeline_compressor = DocumentCompressorPipeline(
    transformers=[splitter, redundant_filter, relevant_filter]
)
# Build the contextual compression retriever
compression_retriever = ContextualCompressionRetriever(base_compressor=pipeline_compressor, base_retriever=retriever)
# Run the query
compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Jackson Brown")
# Pretty-print the results
pretty_print_docs(compressed_docs)

Result:

    Document 1:

    One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson
    ----------------------------------------------------------------------------------------------------
    Document 2:

    As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year
    ----------------------------------------------------------------------------------------------------
    Document 3:

    A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder
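
The split, de-duplicate, filter ordering can be sketched as a chain of plain functions. Everything here is a hypothetical stand-in: exact-match de-duplication replaces embedding-based redundancy detection, and keyword matching replaces the embedding relevance filter. Only the sequencing is the point.

```python
from typing import Callable, List

# A stage takes the current documents plus the query and returns new documents
Stage = Callable[[List[str], str], List[str]]


def split_sentences(docs: List[str], query: str) -> List[str]:
    """Transformer stage: break each document into sentence-sized pieces."""
    return [part.strip(". ") for doc in docs for part in doc.split(". ") if part.strip(". ")]


def drop_duplicates(docs: List[str], query: str) -> List[str]:
    """Redundancy stage (stand-in): remove exact repeats, keeping first occurrences."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.lower()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def keep_matching(docs: List[str], query: str) -> List[str]:
    """Relevance stage (stand-in): keep pieces containing a longer query word."""
    terms = {w for w in query.lower().split() if len(w) > 5}
    return [doc for doc in docs if any(t in doc.lower() for t in terms)]


def run_pipeline(stages: List[Stage], docs: List[str], query: str) -> List[str]:
    """Apply each stage in order, feeding its output to the next one."""
    for stage in stages:
        docs = stage(docs, query)
    return docs


docs = [
    "I nominated Judge Ketanji Brown Jackson. We can do both.",
    "We can do both. I nominated Judge Ketanji Brown Jackson.",
]
query = "What did the president say about Ketanji Brown Jackson"
result = run_pipeline([split_sentences, drop_duplicates, keep_matching], docs, query)
print(result)
```

Splitting first matters in this sketch: the two input documents differ as wholes, so de-duplication only finds the repeated sentences once they have been separated.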

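Summary

When we search over documents, only a small share of what comes back is actually relevant; most of it is not.
A contextual compression retriever lets us return only the relevant part.

Main steps:

  1. Build a plain retriever: retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever()
  2. Build a contextual compression retriever: ContextualCompressionRetriever(base_compressor=embeddings_filter, base_retriever=retriever)

The second step deserves attention: the first argument, base_compressor, can take several forms:
① LLMChainExtractor: extracts and condenses the relevant content of each document
② LLMChainFilter: keeps or drops whole documents based on an LLM's judgment
③ EmbeddingsFilter: keeps or drops whole documents based on embedding similarity
④ DocumentCompressorPipeline: a document pipeline that chains multiple compressors and transformers together.

Reference:

https://python.langchain.com/docs/modules/data_connection/retrievers/how_to/contextual_compression/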
