Elasticsearch: Introducing kNN query, an expert way to do kNN search

Authors: Mayya Sharipova and Benjamin Trent (Elastic)

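Current state: kNN search as a top-level section

kNN search in Elasticsearch is organized as a top-level section of a search request. We designed it this way so that:

  • It always returns the global k nearest neighbors, regardless of the number of shards
  • These global k results can be combined with the results of other queries to form a hybrid search
  • These global k results are passed to aggregations to form facets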

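Below is a simplified diagram of how kNN search is executed internally (some phases are omitted):

Figure 1: The steps of top-level kNN search are:

  1. The user submits a search request
  2. The coordinator node sends the kNN search part of the request to the data nodes in the DFS phase
  3. Each data node runs the kNN search and sends its local top-k results back to the coordinator
  4. The coordinator merges all local results to form the global top k nearest neighbors
  5. The coordinator sends the global k nearest neighbors back to the data nodes, together with any additional queries
  6. Each data node runs the additional queries and sends its local top results (up to size hits) back to the coordinator
  7. The coordinator merges all local results and sends the response to the user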

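We first run the kNN search in the DFS phase to obtain the global top k results. These global k results are then passed to the other parts of the search request, such as other queries or aggregations. Even though the execution looks complex, from the user's perspective the model is simple: the user can always be sure that kNN search returns the global k results.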
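The merge in steps 3 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an exact brute-force scan stands in for the approximate per-shard search, and the shard contents, document ids, and function names are hypothetical rather than Elasticsearch internals.

```python
import heapq


def squared_l2(query, vector):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((q - v) ** 2 for q, v in zip(query, vector))


def shard_top_k(shard, query, k):
    """Step 3: one data node returns its k closest documents as (distance, id)."""
    scored = [(squared_l2(query, vec), doc_id) for doc_id, vec in shard.items()]
    return heapq.nsmallest(k, scored)


def global_top_k(shards, query, k):
    """Step 4: the coordinator merges the local lists into the global top k."""
    local_hits = [hit for shard in shards for hit in shard_top_k(shard, query, k)]
    return [doc_id for _, doc_id in heapq.nsmallest(k, local_hits)]


# Hypothetical data: the same five documents laid out on two shards or on one.
two_shards = [
    {"a": [0.0, 0.0], "b": [5.0, 5.0]},
    {"c": [1.0, 0.0], "d": [0.0, 2.0], "e": [9.0, 9.0]},
]
one_shard = [{**two_shards[0], **two_shards[1]}]

query = [0.0, 0.0]
print(global_top_k(two_shards, query, k=2))  # ['a', 'c']
print(global_top_k(one_shard, query, k=2))   # ['a', 'c']
```

Because every shard contributes its own top k before the merge, the final answer is the same however the documents are distributed, which is the guarantee the top-level design aims for.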

The request format looks like this:

GET collection-with-embeddings/_search
{"knn": {"field": "text_embedding.predicted_value","query_vector_builder": {"text_embedding": {"model_id": "sentence-transformers__msmarco-distilbert-base-tas-b","model_text": "How is the weather in Jamaica?"}},"k": 10,"num_candidates": 100},"_source": ["id","text"]
}

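Introducing the kNN query

Over time, we realized that kNN search also needs to be expressible as a query. Queries are the core component of a search request in Elasticsearch, and representing kNN search as a query makes it possible to combine it flexibly with other queries to address more complex requests.

Unlike the top-level kNN search, the kNN query has no k parameter. As with other queries, the number of results (nearest neighbors) returned is defined by the size parameter. As in kNN search, the num_candidates parameter defines how many candidates are considered on each shard while executing the kNN search.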

GET products/_search
{"size" : 3,"query": {"knn": {"field": "embedding","query_vector": [2,2,2,0],"num_candidates": 10}}
}

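The kNN query is executed differently from the top-level kNN search. Below is a simplified diagram of how a kNN query is executed internally (some phases are omitted):

Figure 2: The steps of query-based kNN search are:

  • The user submits a search request
  • The coordinator sends the kNN search query, together with any additional queries, to the data nodes
  • Each data node runs the query and sends its local top results (up to size hits) back to the coordinator node
  • The coordinator node merges all local results and sends the response to the user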

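We run the kNN search on a shard to get num_candidates results; these results are passed to the other queries and aggregations on that shard to produce the shard's top size results. Because we do not collect the global k nearest neighbors first, in this model the number of nearest neighbors that are collected and visible to other queries and aggregations depends on the number of shards.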
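A small counting sketch makes the shard dependence concrete. It assumes, as described above, that each shard hands up to num_candidates neighbors to its local queries and aggregations; the function names and shard sizes are hypothetical.

```python
def visible_neighbors_knn_query(shard_doc_counts, num_candidates):
    """kNN query: every shard exposes up to num_candidates neighbors locally."""
    return sum(min(num_candidates, docs) for docs in shard_doc_counts)


def visible_neighbors_top_level(shard_doc_counts, k):
    """Top-level kNN search: only the global top k are passed on."""
    return min(k, sum(shard_doc_counts))


one_shard = [1000]       # hypothetical index with a single shard
five_shards = [200] * 5  # the same 1000 documents spread over five shards

print(visible_neighbors_knn_query(one_shard, num_candidates=10))    # 10
print(visible_neighbors_knn_query(five_shards, num_candidates=10))  # 50
print(visible_neighbors_top_level(one_shard, k=10))                 # 10
print(visible_neighbors_top_level(five_shards, k=10))               # 10
```

With the kNN query, the same index can therefore feed five times as many neighbors into aggregations once it is split into five shards, while the top-level search stays at k.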

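kNN query API examples

Let's look at API examples that demonstrate the differences between the top-level kNN search and the kNN query.

We create a products index and index some documents: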

PUT products
{"mappings": {"dynamic": "strict","properties": {"department": {"type": "keyword"},"brand": {"type": "keyword"},"description": {"type": "text"},"embedding": {"type": "dense_vector","index": true,"similarity": "l2_norm"},"price": {"type": "float"}}}
}
POST products/_bulk?refresh=true
{"index":{"_id":1}}
{"department":"women","brand": "Levi's", "description":"high-rise red jeans","embedding":[1,1,1,1],"price":100}
{"index":{"_id":2}}
{"department":"women","brand": "Calvin Klein","description":"high-rise beautiful jeans","embedding":[1,1,1,1],"price":250}
{"index":{"_id":3}}
{"department":"women","brand": "Gap","description":"every day jeans","embedding":[1,1,1,1],"price":50}
{"index":{"_id":4}}
{"department":"women","brand": "Levi's","description":"jeans","embedding":[2,2,2,0],"price":75}
{"index":{"_id":5}}
{"department":"women","brand": "Levi's","description":"luxury jeans","embedding":[2,2,2,0],"price":150}
{"index":{"_id":6}}
{"department":"men","brand": "Levi's", "description":"jeans","embedding":[2,2,2,0],"price":50}
{"index":{"_id":7}}
{"department":"women","brand": "Levi's", "description":"jeans 2023","embedding":[2,2,2,0],"price":150}

Like the top-level kNN search, the kNN query has a num_candidates parameter and an internal filter parameter that acts as a pre-filter.

GET products/_search?filter_path=**.hits
{"size" : 3,"query": {"knn": {"field": "embedding","query_vector": [2,2,2,0],"num_candidates": 10,"filter" : {"term" : {"department" : "women"}}}}
}

Response:
{"hits": {"hits": [{"_index": "products","_id": "4","_score": 1,"_source": {"department": "women","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 75}},{"_index": "products","_id": "5","_score": 1,"_source": {"department": "women","brand": "Levi's","description": "luxury jeans","embedding": [2,2,2,0],"price": 150}},{"_index": "products","_id": "7","_score": 1,"_source": {"department": "women","brand": "Levi's","description": "jeans 2023","embedding": [2,2,2,0],"price": 150}}]}
}

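With collapsing and aggregations, a kNN query can return more diverse results than kNN search. For the kNN query below, we run a kNN search on each shard to get the 10 nearest neighbors, which are then passed to collapsing to get the top 3 results. As a result, we get 3 diverse hits in the response.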

GET products/_search?filter_path=**.hits
{"size" : 3,"query": {"knn": {"field": "embedding","query_vector": [2,2,2,0],"num_candidates": 10,"filter" : {"term" : {"department" : "women"}}}},"collapse": {"field": "brand"}
}

Response:
{"hits": {"hits": [{"_index": "products","_id": "4","_score": 1,"_source": {"department": "women","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 75},"fields": {"brand": ["Levi's"]}},{"_index": "products","_id": "2","_score": 0.2,"_source": {"department": "women","brand": "Calvin Klein","description": "high-rise beautiful jeans","embedding": [1,1,1,1],"price": 250},"fields": {"brand": ["Calvin Klein"]}},{"_index": "products","_id": "3","_score": 0.2,"_source": {"department": "women","brand": "Gap","description": "every day jeans","embedding": [1,1,1,1],"price": 50},"fields": {"brand": ["Gap"]}}]}
}
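The _score values of 1 and 0.2 in these responses follow from the l2_norm similarity configured in the mapping, which Elasticsearch documents as 1 / (1 + squared L2 distance). A minimal check in Python (the helper name is hypothetical):

```python
def l2_norm_score(query, vector):
    """Score for l2_norm similarity: 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + squared_distance)


query_vector = [2, 2, 2, 0]
print(l2_norm_score(query_vector, [2, 2, 2, 0]))  # 1.0 (documents 4, 5, 6, 7)
print(l2_norm_score(query_vector, [1, 1, 1, 1]))  # 0.2 (documents 1, 2, 3)
```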

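The top-level kNN search first gets the global top 3 results in the DFS phase and then passes them to collapse in the query phase. We get only 1 hit in the response, because the 3 global nearest neighbors all happen to be from the same brand.

The same holds for aggregations: the kNN query gives us 3 distinct buckets, while kNN search gives only 1.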

GET products/_search?filter_path=aggregations
{
"size": 0,
"query": {"knn": {"field": "embedding","query_vector": [2,2,2,0],"num_candidates": 10,"filter" : {"term" : {"department" : "women"}}}},"aggs": {"brands": {"terms": {"field": "brand"}}}
}

Response:
{"aggregations": {"brands": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,"buckets": [{"key": "Levi's","doc_count": 4},{"key": "Calvin Klein","doc_count": 1},{"key": "Gap","doc_count": 1}]}}
}

And this is the top-level kNN search:

GET products/_search?filter_path=aggregations
{"size": 0,"knn": {"field": "embedding","query_vector": [2,2,2,0],"k": 3,"num_candidates": 10,"filter": {"term": {"department": "women"}}},"aggs": {"brands": {"terms": {"field": "brand"}}}
}

Response:
{"aggregations": {"brands": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,"buckets": [{"key": "Levi's","doc_count": 3}]}}
}
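The collapse and aggregation differences can be reproduced offline with a rough Python simulation of the seven sample documents on a single shard. It is a sketch, not how Elasticsearch executes these requests: the search is exact, ties are broken by document id (an assumption of this sketch), and all function names are hypothetical.

```python
from collections import Counter

QUERY = [2, 2, 2, 0]

# (id, department, brand, embedding) copied from the bulk request above.
PRODUCTS = [
    (1, "women", "Levi's", [1, 1, 1, 1]),
    (2, "women", "Calvin Klein", [1, 1, 1, 1]),
    (3, "women", "Gap", [1, 1, 1, 1]),
    (4, "women", "Levi's", [2, 2, 2, 0]),
    (5, "women", "Levi's", [2, 2, 2, 0]),
    (6, "men", "Levi's", [2, 2, 2, 0]),
    (7, "women", "Levi's", [2, 2, 2, 0]),
]


def nearest_women(limit):
    """Pre-filter on department, then keep the `limit` closest documents."""
    hits = []
    for doc_id, department, brand, embedding in PRODUCTS:
        if department == "women":
            distance = sum((q - v) ** 2 for q, v in zip(QUERY, embedding))
            hits.append((distance, doc_id, brand))
    return sorted(hits)[:limit]  # ties broken by id: an assumption of this sketch


def collapse_by_brand(hits, size):
    """Keep the best-ranked hit per brand, then cut to `size`."""
    seen, collapsed = set(), []
    for _, doc_id, brand in hits:
        if brand not in seen:
            seen.add(brand)
            collapsed.append(doc_id)
    return collapsed[:size]


def brand_buckets(hits):
    """Terms aggregation on brand over the hits that are visible to it."""
    return dict(Counter(brand for _, _, brand in hits))


knn_query_hits = nearest_women(10)  # kNN query: num_candidates neighbors on the shard
top_level_hits = nearest_women(3)   # top-level search: global k = 3

print(collapse_by_brand(knn_query_hits, 3))  # [4, 2, 3]
print(collapse_by_brand(top_level_hits, 3))  # [4]
print(brand_buckets(knn_query_hits))  # {"Levi's": 4, 'Calvin Klein': 1, 'Gap': 1}
print(brand_buckets(top_level_hits))  # {"Levi's": 3}
```

The only difference between the two runs is how many neighbors reach the collapse and aggregation steps: up to num_candidates for the kNN query, and exactly k for the top-level search.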

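Now let's look at other examples that show the flexibility of the kNN query, specifically how it can be combined with other queries.

A kNN query can be part of a boolean query (note that all external query filters are applied as post-filters for the kNN search). We can use the _name parameter of the kNN query to enrich the results with extra information that tells whether the kNN query matched and what it contributed to the score.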

GET products/_search?include_named_queries_score
{"size": 3,"query": {"bool": {"should": [{"knn": {"field": "embedding","query_vector": [2,2,2,0],"num_candidates": 10,"_name": "knn_query"}},{"match": {"description": {"query": "luxury","_name": "bm25query"}}}]}}
}

Response:
{"took": 2,"timed_out": false,"_shards": {"total": 1,"successful": 1,"skipped": 0,"failed": 0},"hits": {"total": {"value": 7,"relation": "eq"},"max_score": 2.8042283,"hits": [{"_index": "products","_id": "5","_score": 2.8042283,"_source": {"department": "women","brand": "Levi's","description": "luxury jeans","embedding": [2,2,2,0],"price": 150},"matched_queries": {"knn_query": 1,"bm25query": 1.8042282}},{"_index": "products","_id": "4","_score": 1,"_source": {"department": "women","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 75},"matched_queries": {"knn_query": 1}},{"_index": "products","_id": "6","_score": 1,"_source": {"department": "men","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 50},"matched_queries": {"knn_query": 1}}]}
}
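The top hit's score can be read directly off its matched_queries entry: a bool query adds up the scores of its matching should clauses. A minimal sketch (the function name is hypothetical):

```python
def bool_should_score(clause_scores):
    """Sum the scores of the should clauses that matched a document."""
    return sum(clause_scores.values())


# Named-query scores reported for document 5 in the response above.
doc_5 = {"knn_query": 1.0, "bm25query": 1.8042282}
print(round(bool_should_score(doc_5), 6))  # 2.804228
```

The response reports 2.8042283 rather than 2.8042282; the last digit most likely differs because scores are single-precision floats.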

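A kNN query can also be part of complex queries, such as a pinned query. This is useful when we want to display the nearest results but also want to promote a selected number of other results.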

GET products/_search
{"size": 3,"query": {"pinned": {"ids": [ "1", "2" ],"organic": {"knn": {"field": "embedding","query_vector": [2,2,2,0],"num_candidates": 10,"_name": "knn_query"}}}}
}

Response:
{"took": 9,"timed_out": false,"_shards": {"total": 1,"successful": 1,"skipped": 0,"failed": 0},"hits": {"total": {"value": 7,"relation": "eq"},"max_score": 1.7014124e+38,"hits": [{"_index": "products","_id": "1","_score": 1.7014124e+38,"_source": {"department": "women","brand": "Levi's","description": "high-rise red jeans","embedding": [1,1,1,1],"price": 100},"matched_queries": ["knn_query"]},{"_index": "products","_id": "2","_score": 1.7014122e+38,"_source": {"department": "women","brand": "Calvin Klein","description": "high-rise beautiful jeans","embedding": [1,1,1,1],"price": 250},"matched_queries": ["knn_query"]},{"_index": "products","_id": "4","_score": 1,"_source": {"department": "women","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 75},"matched_queries": ["knn_query"]}]}
}

We can even make the kNN query part of a function_score query. This is useful when we need to define custom scores for the results returned by the kNN query:

GET products/_search
{"size": 3,"query": {"function_score": {"query": {"knn": {"field": "embedding","query_vector": [2,2,2,0],"num_candidates": 10,"_name": "knn_query"}},"functions": [{"filter": { "match": { "department": "men" } },"weight": 100},{"filter": { "match": { "department": "women" } },"weight": 50}]}}
}

Response:
{"took": 3,"timed_out": false,"_shards": {"total": 1,"successful": 1,"skipped": 0,"failed": 0},"hits": {"total": {"value": 7,"relation": "eq"},"max_score": 100,"hits": [{"_index": "products","_id": "6","_score": 100,"_source": {"department": "men","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 50},"matched_queries": ["knn_query"]},{"_index": "products","_id": "4","_score": 50,"_source": {"department": "women","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 75},"matched_queries": ["knn_query"]},{"_index": "products","_id": "5","_score": 50,"_source": {"department": "women","brand": "Levi's","description": "luxury jeans","embedding": [2,2,2,0],"price": 150},"matched_queries": ["knn_query"]}]}
}
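The scores of 100 and 50 are consistent with function_score's default behavior, in which the function result is multiplied with the query score. A simplified sketch, assuming each document matches at most one filter (true for this data set) and treating a document with no matching filter as weight 1; the names are hypothetical:

```python
def function_score(query_score, department, weights):
    """Multiply the kNN query score by the weight of the matching filter."""
    return query_score * weights.get(department, 1.0)


WEIGHTS = {"men": 100.0, "women": 50.0}  # from the request above

print(function_score(1.0, "men", WEIGHTS))    # 100.0 (document 6)
print(function_score(1.0, "women", WEIGHTS))  # 50.0  (documents 4 and 5)
```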

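A kNN query as part of a dis_max query is useful when we want to combine the results of kNN search with other queries, so that a document's score comes from the highest-ranked clause, with a tie-breaking increment for any additional clauses.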

GET products/_search
{"size": 5,"query": {"dis_max": {"queries": [{"knn": {"field": "embedding","query_vector": [2,2, 2,0],"num_candidates": 3,"_name": "knn_query"}},{"match": {"description": "high-rise jeans"}}],"tie_breaker": 0.8}}
}

Response:
{"took": 1,"timed_out": false,"_shards": {"total": 1,"successful": 1,"skipped": 0,"failed": 0},"hits": {"total": {"value": 7,"relation": "eq"},"max_score": 1.890432,"hits": [{"_index": "products","_id": "1","_score": 1.890432,"_source": {"department": "women","brand": "Levi's","description": "high-rise red jeans","embedding": [1,1,1,1],"price": 100}},{"_index": "products","_id": "2","_score": 1.890432,"_source": {"department": "women","brand": "Calvin Klein","description": "high-rise beautiful jeans","embedding": [1,1,1,1],"price": 250}},{"_index": "products","_id": "4","_score": 1.0679927,"_source": {"department": "women","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 75},"matched_queries": ["knn_query"]},{"_index": "products","_id": "6","_score": 1.0679927,"_source": {"department": "men","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 50},"matched_queries": ["knn_query"]},{"_index": "products","_id": "5","_score": 1.0556482,"_source": {"department": "women","brand": "Levi's","description": "luxury jeans","embedding": [2,2,2,0],"price": 150},"matched_queries": ["knn_query"]}]}
}
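The dis_max scoring rule is the best clause score plus tie_breaker times the sum of the remaining clause scores. The sketch below applies it with tie_breaker 0.8 from the request. The match-clause score of 0.085 is an illustrative value, roughly consistent with the 1.0679927 reported for document 4, and is not a figure taken from the response; the function name is hypothetical.

```python
def dis_max_score(clause_scores, tie_breaker):
    """Best matching clause plus tie_breaker times the other matching clauses."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)


# Document 1 matched only the match clause, so its score is that clause's score.
print(dis_max_score([1.890432], tie_breaker=0.8))  # 1.890432

# A document matching both clauses: kNN score 1.0 plus a hypothetical 0.085.
print(round(dis_max_score([1.0, 0.085], tie_breaker=0.8), 3))  # 1.068
```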

kNN search as a query was introduced in version 8.12. Please try it out; we would appreciate any feedback.

Original article: Introducing kNN query, an expert way to do kNN search (Elastic Search Labs)
