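Elasticsearch: Are synonyms important in RAG?

By Jeffrey Rengifo and Tomás Murúa, Elastic

Exploring what Elasticsearch synonyms can do in RAG applications.

Synonyms let us search documents using different words that share the same meaning, so users find what they are looking for regardless of the exact words they use. You might think that, because RAG applications rely on semantic/vector search, part of what synonyms do is already covered by semantic search (since synonyms are, by definition, semantically related words).

Is that true? Can semantic search really replace synonyms? In this article, we analyze the impact of using synonyms in a RAG application.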

Steps

  • Configure the inference endpoint
  • Configure synonyms
  • Index documents
  • Hybrid search
  • Synonyms and RAG

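Configuring the inference endpoint

For this example, we will implement a RAG (Retrieval-Augmented Generation) system in an HR setting, with and without synonyms. We will index different documents that use variants of the term PTO (Paid Time Off), such as "vacation" or "holiday". Then we will configure synonyms to show how these relationships improve search relevance and accuracy.

First, let's create an endpoint for the ELSER model with the inference API by running the following command in Kibana DevTools: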

PUT _inference/sparse_embedding/code-wave_inference
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}

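Configuring synonyms

What are synonyms in Elasticsearch?

In Elasticsearch, synonyms are words or phrases with the same or similar meaning. They are stored as synonym sets, which can be managed as files or through an API. They let users find relevant information even when they use different terms for the same concept.

For example, if we create a synonym set in which "holiday" and "vacation" are synonyms of "Paid Time Off", an employee who searches for any of these terms will find documents related to all of them.

You can read more about them in the article linked from the original post.

Let's create a synonym set with the synonyms API: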

PUT _synonyms/code-wave_synonyms
{
  "synonyms_set": [
    {
      "synonyms": "holidays, paid time off"
    }
  ]
}

Note that a synonym set must be configured before it can be applied to an index.
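To make the idea of search-time synonyms concrete, here is a minimal Python sketch of query expansion. It only illustrates the concept and is not how Elasticsearch implements its synonym filters; the names SYNONYM_RULES, parse_rules and expand_query are hypothetical, and the matching is a naive substring match.

```python
# Toy illustration of search-time synonym expansion (NOT Elasticsearch's implementation).
# Each rule mirrors the "a, b" format used in the synonym set above.
SYNONYM_RULES = ["holidays, paid time off"]


def parse_rules(rules):
    """Turn 'a, b, c' rules into lists of lowercase phrases."""
    return [[phrase.strip().lower() for phrase in rule.split(",")] for rule in rules]


def expand_query(query, rules=SYNONYM_RULES):
    """Return the lowercased query plus one variant per equivalent phrase found in it.

    Assumption: naive substring matching, with no tokenization or analysis.
    """
    normalized = query.lower()
    variants = [normalized]
    for group in parse_rules(rules):
        for phrase in group:
            if phrase in normalized:
                for other in group:
                    candidate = normalized.replace(phrase, other)
                    if candidate not in variants:
                        variants.append(candidate)
    return variants


print(expand_query("Holidays policy"))
```

A query that contains neither phrase comes back unchanged, which is the case where only semantic search can help with recall.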

Now, let's define the settings and mappings for our data:

PUT /code-wave_index
{
  "settings": {
    "analysis": {
      "filter": {
        "synonyms_filter": {
          "type": "synonym_graph",
          "synonyms_set": "code-wave_synonyms",
          "updateable": true
        }
      },
      "analyzer": {
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "synonyms_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text_field": {
        "type": "text",
        "analyzer": "standard",
        "copy_to": "semantic_field",
        "fields": {
          "synonyms": {
            "type": "text",
            "analyzer": "standard",
            "search_analyzer": "my_search_analyzer"
          }
        }
      },
      "semantic_field": {
        "type": "semantic_text",
        "inference_id": "code-wave_inference"
      }
    }
  }
}

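We will use the semantic_text field for semantic search, and the synonym_graph token filter to handle multi-word synonyms.

We also created two versions of the text field, text_field and text_field.synonyms (both of type text, and each can be searched separately), to better control whether a query takes synonyms into account.

Finally, we use copy_to to copy the value of text_field into the semantic_text version of the field, enabling both full-text and semantic queries.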

Indexing documents

Now we will index our documents with the bulk API:

POST _bulk
{"index":{"_index":"code-wave_index","_id":"1"}}
{"semantic_field":"Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones.","text_field":"Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones."}
{"index":{"_index":"code-wave_index","_id":"2"}}
{"semantic_field":"Holidays: Paid public holidays recognized each calendar year.","text_field":"Holidays: Paid public holidays recognized each calendar year."}
{"index":{"_index":"code-wave_index","_id":"3"}}
{"semantic_field":"Sick leave: Paid sick leave of up to 15 days per year.","text_field":"Sick leave: Paid sick leave of up to 15 days per year."}
{"index":{"_index":"code-wave_index","_id":"4"}}
{"semantic_field":"Holidays sale: Enjoy discounts up to 50% during our exclusive holidays sale event!","text_field":"Holidays sale: Enjoy discounts up to 50% during our exclusive holidays sale event!"}
{"index":{"_index":"code-wave_index","_id":"5"}}
{"semantic_field":"Holidays recipes: Try our top 10 holidays dessert recipes, perfect for family gatherings and celebrations.","text_field":"Holidays recipes: Try our top 10 holidays dessert recipes, perfect for family gatherings and celebrations."}
{"index":{"_index":"code-wave_index","_id":"6"}}
{"semantic_field":"Holidays travel: Find the best deals for your holidays flights and accommodations this season.","text_field":"Holidays travel: Find the best deals for your holidays flights and accommodations this season."}
{"index":{"_index":"code-wave_index","_id":"7"}}
{"semantic_field":"Holidays music: Stream your favorite holidays classics and discover new seasonal hits.","text_field":"Holidays music: Stream your favorite holidays classics and discover new seasonal hits."}
{"index":{"_index":"code-wave_index","_id":"8"}}
{"semantic_field":"Holidays decorations: Our store offers a wide range of holidays decorations to make your home festive.","text_field":"Holidays decorations: Our store offers a wide range of holidays decorations to make your home festive."}
{"index":{"_index":"code-wave_index","_id":"9"}}
{"semantic_field":"Holidays movies: Check out our list of must-watch holidays movies for cozy winter nights.","text_field":"Holidays movies: Check out our list of must-watch holidays movies for cozy winter nights."}
{"index":{"_index":"code-wave_index","_id":"10"}}
{"semantic_field":"Holidays festival: Join us at the city's annual holidays festival featuring lights, music, and local food.","text_field":"Holidays festival: Join us at the city's annual holidays festival featuring lights, music, and local food."}
{"index":{"_index":"code-wave_index","_id":"11"}}
{"semantic_field":"Holidays weather: Stay updated with our holidays weather forecast to plan your activities.","text_field":"Holidays weather: Stay updated with our holidays weather forecast to plan your activities."}
{"index":{"_index":"code-wave_index","_id":"12"}}
{"semantic_field":"Holidays gift guide: Browse our ultimate holidays gift guide for everyone on your list.","text_field":"Holidays gift guide: Browse our ultimate holidays gift guide for everyone on your list."}
{"index":{"_index":"code-wave_index","_id":"13"}}
{"semantic_field":"Holidays traditions: Explore unique holidays traditions celebrated around the world.","text_field":"Holidays traditions: Explore unique holidays traditions celebrated around the world."}

We are now ready to search! But first, let's make sure the synonyms work by searching for holidays:

GET code-wave_index/_search
{
  "_source": {
    "excludes": ["*embeddings", "*chunks"]
  },
  "query": {
    "multi_match": {
      "query": "holidays",
      "fields": ["text_field^10", "text_field.synonyms^0.6"]
    }
  }
}

We adjust the boosts so that matches through synonyms score lower than matches on the original word.
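As a rough sketch of the mechanism: multi_match defaults to the best_fields type, which takes the highest boosted field score for each document (plus tie_breaker times the remaining fields, with tie_breaker defaulting to 0.0). The Python below mimics that combination; the function name and the raw per-field scores are made up for illustration, and real BM25 scores depend on term statistics, so a rare multi-word synonym can still score close to a literal match.

```python
# Sketch of how multi_match (default type "best_fields") combines boosted fields.
# The raw per-field scores passed in below are made-up numbers for illustration only.
BOOSTS = {"text_field": 10.0, "text_field.synonyms": 0.6}


def best_fields_score(raw_scores, boosts=BOOSTS, tie_breaker=0.0):
    """Highest boosted field score, plus tie_breaker times the other boosted scores."""
    boosted = [score * boosts.get(field, 1.0) for field, score in raw_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    return best + tie_breaker * (sum(boosted) - best)


# A document that matches the literal word in both fields...
literal_match = best_fields_score({"text_field": 0.3, "text_field.synonyms": 0.3})
# ...versus a document that matches only through the synonym expansion.
synonym_only = best_fields_score({"text_field.synonyms": 0.5})
print(literal_match, synonym_only)
```

With equal raw scores, the 10 versus 0.6 boosts keep a synonym-only match well below a literal one; the boosts are the knob to turn if synonym matches should weigh more.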

Check the response:

{
  "took": 3,
  "timed_out": false,
  "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
  "hits": {
    "total": {"value": 12, "relation": "eq"},
    "max_score": 5.2014494,
    "hits": [
      {
        "_index": "code-wave_index",
        "_id": "2",
        "_score": 3.0596757,
        "_source": {
          "text_field": "Holidays: Paid public holidays recognized each calendar year.",
          "semantic_field": {
            "inference": {"inference_id": "code-wave_inference", "model_settings": {"task_type": "sparse_embedding"}},
            "text": "Holidays: Paid public holidays recognized each calendar year."
          }
        }
      },
      {
        "_index": "code-wave_index",
        "_id": "1",
        "_score": 3.023004,
        "_source": {
          "text_field": "Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones.",
          "semantic_field": {
            "inference": {"inference_id": "code-wave_inference", "model_settings": {"task_type": "sparse_embedding"}},
            "text": "Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones."
          }
        }
      },
      {
        "_index": "code-wave_index",
        "_id": "13",
        "_score": 2.9230676,
        "_source": {
          "text_field": "Holidays traditions: Explore unique holidays traditions celebrated around the world.",
          "semantic_field": {
            "inference": {"inference_id": "code-wave_inference", "model_settings": {"task_type": "sparse_embedding"}},
            "text": "Holidays traditions: Explore unique holidays traditions celebrated around the world."
          }
        }
      },
      ...
    ]
  }
}

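We can see that when we search for "holidays", the second result contains the synonym "Paid Time Off".

Hybrid search

Hybrid search lets us combine the results of full-text and semantic queries into a single normalized result set, using RRF (Reciprocal Rank Fusion) to balance the scores coming from the different retrievers.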

GET code-wave_index/_search
{
  "_source": "text_field",
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "nested": {
                "path": "semantic_field.inference.chunks",
                "query": {
                  "sparse_vector": {
                    "inference_id": "code-wave_inference",
                    "field": "semantic_field.inference.chunks.embeddings",
                    "query": "holidays"
                  }
                }
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "holidays",
                "fields": ["text_field.synonyms"]
              }
            }
          }
        }
      ]
    }
  }
}

Response:

{
  "took": 11,
  "timed_out": false,
  "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
  "hits": {
    "total": {"value": 13, "relation": "eq"},
    "max_score": 0.03175403,
    "hits": [
      {"_index": "code-wave_index", "_id": "7", "_score": 0.03175403, "_source": {"text_field": "Holidays music: Stream your favorite holidays classics and discover new seasonal hits."}},
      {"_index": "code-wave_index", "_id": "13", "_score": 0.031257633, "_source": {"text_field": "Holidays traditions: Explore unique holidays traditions celebrated around the world."}},
      {"_index": "code-wave_index", "_id": "4", "_score": 0.031009614, "_source": {"text_field": "Holidays sale: Enjoy discounts up to 50% during our exclusive holidays sale event!"}},
      {"_index": "code-wave_index", "_id": "2", "_score": 0.030834913, "_source": {"text_field": "Holidays: Paid public holidays recognized each calendar year."}},
      {"_index": "code-wave_index", "_id": "6", "_score": 0.03079839, "_source": {"text_field": "Holidays travel: Find the best deals for your holidays flights and accommodations this season."}},
      {"_index": "code-wave_index", "_id": "11", "_score": 0.02964427, "_source": {"text_field": "Holidays weather: Stay updated with our holidays weather forecast to plan your activities."}},
      {"_index": "code-wave_index", "_id": "5", "_score": 0.029418126, "_source": {"text_field": "Holidays recipes: Try our top 10 holidays dessert recipes, perfect for family gatherings and celebrations."}},
      {"_index": "code-wave_index", "_id": "12", "_score": 0.028991597, "_source": {"text_field": "Holidays gift guide: Browse our ultimate holidays gift guide for everyone on your list."}},
      {"_index": "code-wave_index", "_id": "1", "_score": 0.016393442, "_source": {"text_field": "Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones."}},
      {"_index": "code-wave_index", "_id": "10", "_score": 0.016393442, "_source": {"text_field": "Holidays festival: Join us at the city's annual holidays festival featuring lights, music, and local food."}}
    ]
  }
}

This query returns documents that are relevant both semantically and textually.
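To see where the small scores in the response come from, here is a minimal Python sketch of Reciprocal Rank Fusion. It assumes the Elasticsearch default rank_constant of 60; the function name rrf and the two ranked lists are hypothetical and are not the actual per-retriever rankings behind the response above.

```python
# Reciprocal Rank Fusion sketch: each retriever contributes 1 / (rank_constant + rank).
# rank_constant=60 is the Elasticsearch default; the ranked lists below are hypothetical.
def rrf(rankings, rank_constant=60):
    """Fuse several ranked lists of document ids into (doc_id, score) pairs, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


semantic = ["a", "b", "c"]  # hypothetical ranking from the sparse_vector retriever
lexical = ["d", "b", "a"]   # hypothetical ranking from the multi_match retriever
for doc_id, score in rrf([semantic, lexical]):
    print(doc_id, round(score, 6))
```

A document ranked first by only one retriever gets 1/61 ≈ 0.016393, which is the same value as the 0.016393442 score of documents 1 and 10 in the response. That is consistent with each of them being ranked first by one retriever and missing from the other's ranking window.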

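Synonyms and RAG

In this section, we evaluate how synonyms and semantic search improve queries in a RAG system. We will use a common question about days off as our example: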

How many vacation days are provided for holidays?

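For this question, we are interested in the information in document 1. Document 2 is close to what we want but not exact, and it is the result we get when we search without synonyms. Let's look at their contents: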

  • [1] Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones.
  • [2] Holidays: Paid public holidays recognized each calendar year.

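Both documents contain information about days off, but only document 2 uses the specific term "holidays", so we can test how synonyms and semantic search behave in Playground.

You can access Playground from Search > Playground. From there, configure the LLM you want to use and select the index we created to be sent as context. More information about Playground and its configuration is linked from the original post.

Once Playground is configured, clicking the query button shows that synonyms are deactivated.

For each question, we send the top three results of the query to the LLM as context.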
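The effect of this top-three cutoff can be sketched in a few lines of Python. This is not Playground's actual code: build_context, build_prompt, the prompt wording, and the order of the sample hits are all hypothetical, and only the document texts come from the index above.

```python
# Sketch of building an LLM prompt from the top-k hits of a search response.
# build_context, build_prompt and the prompt wording are hypothetical, not Playground's code.
def build_context(hits, k=3):
    """Concatenate the text_field of the first k hits, each tagged with its _id."""
    return "\n".join(f"[{hit['_id']}] {hit['_source']['text_field']}" for hit in hits[:k])


def build_prompt(question, hits, k=3):
    """Assemble a prompt that restricts the LLM to the retrieved context."""
    return (
        "Answer using only the context below.\n"
        f"Context:\n{build_context(hits, k)}\n"
        f"Question: {question}"
    )


# Hypothetical ranking in which the PTO document lands in fourth place.
sample_hits = [
    {"_id": "2", "_source": {"text_field": "Holidays: Paid public holidays recognized each calendar year."}},
    {"_id": "6", "_source": {"text_field": "Holidays travel: Find the best deals for your holidays flights and accommodations this season."}},
    {"_id": "13", "_source": {"text_field": "Holidays traditions: Explore unique holidays traditions celebrated around the world."}},
    {"_id": "1", "_source": {"text_field": "Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones."}},
]
print(build_prompt("How many vacation days are provided for holidays?", sample_hits))
```

With k=3, a document ranked fourth never reaches the LLM, so the model cannot answer from it however relevant it is. Retrieval order decides what the model gets to see.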

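Now, let's ask Playground the question and check the result with synonyms deactivated.

The LLM cannot answer the question, because none of the top three search results is the document stating how many vacation days employees get per year. In this case, the closest result is document [2].

Note: by clicking "Snippet", we can see the exact content in Elasticsearch that the answer is based on.

Let's clear the chat history, activate synonyms, and ask the same question again.

Note that when you enable both the semantic_text field and the text field, Playground automatically generates a hybrid search query.

Let's repeat the question, now with synonyms activated.

This time the answer does include the document we were looking for, because synonyms allow document [1] to be sent to the LLM.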

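Conclusion

In this article, we found that synonyms are a fundamental component of search systems, and that semantic search does not necessarily cover what synonyms provide.

Synonyms let us control which documents to boost depending on the use case, and improve precision by tuning relevance. Semantic search, on the other hand, helps recall: it can bring in potentially relevant results without requiring us to add a synonym for every related term.

With hybrid search, we can combine synonyms and semantic search and get the best of both worlds. In Playground, if we select a combination of semantic and text fields as the search fields, it builds the hybrid query for us automatically.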


Original post: Are synonyms important in RAG? - Elasticsearch Labs
