Document Intelligence: OCR + RocketQA + LayoutXLM <RocketQA>

This post works through RocketQA. In my view, the paper is about improvements to passage-retrieval training; it says little about the overall framework of coarse retrieval + re-ranking (the dual-encoder architecture), which is covered in other papers.

I previously skimmed this paper while working on the document-intelligence feature.

Document Intelligence: OCR + RocketQA + LayoutXLM <LayoutLMv2>

Recently I have been reading about RAG, which touches on retrieval and ranking, so I am revisiting this paper. If anything is lacking or wrong, corrections are welcome.

My notes are as follows:

RocketQA is an optimized training approach for dense passage retrieval (DPR), built to support open-domain question answering (ODQA) systems.

1. Abstract & Introduction


It is difficult to effectively train a dual-encoder for dense passage retrieval due to the following three major challenges:

First, there exists the discrepancy between training and inference for the dual-encoder retriever.
During inference, the retriever needs to identify positive (or relevant) passages for each question from a large collection containing millions of candidates.
However, during training, the model is learned to estimate the probabilities of positive passages in a small candidate set for each question, due to the limited memory of a single GPU (or other device).

To reduce such a discrepancy, previous work tried to design specific mechanisms for selecting a few hard negatives from the top-k retrieved candidates. However, it suffers from the false negative issue due to the following challenge.

Second, there might be a large number of unlabeled positives.

Third, it is expensive to acquire large-scale training data for open-domain QA.


It adopts a series of optimization strategies: cross-batch negatives, denoised hard negatives, and data augmentation.

These target two problems during training: too few negative examples, and the presence of many false negatives.


First, RocketQA introduces cross-batch negatives. Compared with in-batch negatives, it increases the number of available negatives for each question during training, and alleviates the discrepancy between training and inference.

Second, RocketQA introduces denoised hard negatives. It aims to remove false negatives from the top-ranked results retrieved by a retriever, and derive more reliable hard negatives.

Third, RocketQA leverages large-scale unsupervised data “labeled” by a cross-encoder (as shown in Figure 1b) for data augmentation.

Though inefficient, the cross-encoder architecture has been found to be more capable than the dual-encoder architecture in both theory and practice.

Therefore, we utilize a cross-encoder to generate high quality pseudo labels for unlabeled data which are used to train the dual-encoder retriever.

[Figure 1: (a) the dual-encoder architecture; (b) the cross-encoder architecture]


2. Related work

2.1 Passage retrieval for open-domain QA

Recently, researchers have utilized deep learning to improve traditional passage retrievers, including:

  • document expansions,
  • question expansions,
  • term weight estimation.

Different from the above term-based approaches, dense passage retrieval has been proposed to represent both questions and documents as dense vectors (i.e., embeddings), typically in a dual-encoder architecture (as shown in Figure 1a).


Existing approaches can be divided into two categories:
(1) self-supervised pre-training for retrieval.
(2) fine-tuning pre-trained language models on labeled data.

Our work follows the second class of approaches, which show better performance with less cost.

2.2 Passage re-ranking for open-domain QA

Based on the retrieved passages from a first-stage retriever, BERT-based rerankers have recently been applied to retrieval-based question answering and search-related tasks, and yield substantial improvements over the traditional methods.

Although effective to some extent, these re-rankers employ the cross-encoder architecture (as shown in Figure 1b), which is impractical to apply to every passage in a corpus for each question.

Re-rankers with lightweight interaction built on the representations of dense retrievers have also been studied. However, these techniques still rely on a separate retriever to provide candidates and representations.

As a comparison, we focus on developing dual-encoder based retrievers.

3. Approach

3.1 Task Description

The task of open-domain QA is described as follows.
Given a natural language question, a system is required to answer it based on a large collection of documents.

Let $C$ denote the corpus, consisting of $N$ documents.

We split the $N$ documents into $M$ passages, denoted by $p_1, p_2, \ldots, p_M$,

where each passage $p_i$ can be viewed as an $l$-length sequence of tokens $p_i^{(1)}, p_i^{(2)}, \ldots, p_i^{(l)}$.

Given a question $q$, the task is to find a passage $p_i$ among the $M$ candidates,

and extract a span $p_i^{(s)}, p_i^{(s+1)}, \ldots, p_i^{(e)}$ from $p_i$ that can answer the question.

In this paper, we mainly focus on developing a dense retriever to retrieve the passages that contain the answer.


Is the passage length $l$ the same value for every passage?

See Section 4.1.3:

4.1.3 Implementation Details

1. Maximal length

We set the maximum length of questions and passages as 32 and 128, respectively.


3.2 The Dual-Encoder Architecture

We develop our passage retriever based on the typical dual-encoder architecture, as illustrated in Figure 1a.

First, a dense passage retriever uses an encoder $E_p(\cdot)$ to obtain the $d$-dimensional real-valued vectors (a.k.a. embeddings) of passages.

Then, an index of passage embeddings is built for retrieval.

At query time, another encoder $E_q(\cdot)$ is applied to embed the input question to a $d$-dimensional real-valued vector, and $k$ passages whose embeddings are the closest to the question's will be retrieved.

The similarity between the question $q$ and a candidate passage $p$ can be computed as the dot product of their vectors:

$$\mathrm{sim}(q, p) = E_q(q)^{\top} \cdot E_p(p) \tag{1}$$

In practice, the separation of question encoding and passage encoding is desirable, so that the dense representations of all passages can be precomputed for efficient retrieval.

Here, we adopt two independent neural networks initialized from pre-trained LMs for the two encoders $E_q(\cdot)$ and $E_p(\cdot)$ separately, and take the representations at the first token (e.g., the [CLS] symbol in BERT) as the output for encoding.


Why use the [CLS] token's representation as the encoding output? Briefly: BERT is built on the Transformer, and the [CLS] token at the start of the sequence can attend to, and thus summarize, the meaning of the whole input.

For more detail, see:
https://blog.csdn.net/sdsasaAAS/article/details/142926242
https://blog.csdn.net/weixin_45947938/article/details/144232649
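
To make this concrete, here is a minimal sketch of dual-encoder encoding and scoring with Hugging Face `transformers`. It is an illustration, not RocketQA's actual code: the checkpoint name is a placeholder (the paper initializes from ERNIE 2.0, see Section 4.1.3), while the 32/128 length caps do come from Section 4.1.3.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint; RocketQA itself initializes from ERNIE 2.0 base.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
q_encoder = AutoModel.from_pretrained("bert-base-uncased")  # E_q(.)
p_encoder = AutoModel.from_pretrained("bert-base-uncased")  # E_p(.), independent weights

def encode(encoder, texts, max_len):
    # Tokenize, run the encoder, and keep the first-token ([CLS]) vector.
    batch = tok(texts, padding=True, truncation=True,
                max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]  # (batch, d)

q = encode(q_encoder, ["who wrote hamlet"], max_len=32)    # question cap: 32
p = encode(p_encoder, ["Hamlet is a tragedy written by William Shakespeare."],
           max_len=128)                                    # passage cap: 128
score = (q * p).sum(dim=-1)  # dot-product similarity, Equation 1
```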


3.2.1 Training

Formally, given a question $q_i$ together with its positive passage $p_i^{+}$ and $m$ negative passages $\{p_{i,j}^{-}\}_{j=1}^{m}$, we minimize the loss function:

$$\mathcal{L}\left(q_i, p_i^{+}, \{p_{i,j}^{-}\}_{j=1}^{m}\right) = -\log \frac{e^{\mathrm{sim}(q_i,\, p_i^{+})}}{e^{\mathrm{sim}(q_i,\, p_i^{+})} + \sum_{j=1}^{m} e^{\mathrm{sim}(q_i,\, p_{i,j}^{-})}} \tag{2}$$

where we aim to optimize the negative log likelihood of the positive passage against a set of $m$ negative passages.

Ideally, we should take all the negative passages in the whole collection into consideration in Equation 2.

However, it is computationally infeasible to consider a large number of negative samples for a question, and hence $m$ is practically set to a small number that is far less than $M$.

As will be discussed later, both the number and the quality of negatives affect the final performance of passage retrieval.
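
Here is a minimal PyTorch sketch of Equation 2, assuming the similarity is the dot product of Equation 1 and that the $m$ negatives per question are already stacked into one tensor (all shapes are made up for illustration):

```python
import torch
import torch.nn.functional as F

def retrieval_nll(q_emb, pos_emb, neg_emb):
    # q_emb:   (B, d)    question embeddings
    # pos_emb: (B, d)    one positive passage per question
    # neg_emb: (B, m, d) m negative passages per question
    pos_scores = (q_emb * pos_emb).sum(-1, keepdim=True)     # (B, 1)
    neg_scores = torch.einsum("bd,bmd->bm", q_emb, neg_emb)  # (B, m)
    logits = torch.cat([pos_scores, neg_scores], dim=1)      # (B, 1 + m)
    # The positive sits at index 0, so cross-entropy over the 1 + m
    # candidates is exactly the negative log likelihood of Equation 2.
    labels = torch.zeros(q_emb.size(0), dtype=torch.long, device=q_emb.device)
    return F.cross_entropy(logits, labels)
```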

3.2.2 Inference

In our implementation, we use FAISS (Facebook AI Similarity Search) to index the dense representations of all passages.

Specifically, we use IndexFlatIP for indexing and exact maximum inner product search for querying.

  • FAISS is a library for efficient similarity search and clustering of dense vectors, particularly suited to fast similarity search over large datasets.

  • IndexFlatIP is a flat FAISS index class: it stores all vectors directly and, at query time, computes the inner product between the query vector and every stored vector. IP stands for Inner Product, so IndexFlatIP suits applications whose similarity measure is inner-product based (e.g., cosine similarity on normalized vectors).

  • Maximum inner product search finds, for a given query vector, the stored vectors with the largest inner product. It is especially useful in information retrieval, recommendation, and other fields that revolve around computing similarity between vectors.

Combining IndexFlatIP with maximum inner product search makes it possible to efficiently find the passages most similar to a given query in a large text collection.

For larger datasets, FAISS offers more efficient index types, such as cluster-based indexes (e.g., IndexIVFPQ) or graph-based indexes (e.g., IndexHNSW), which trade a little search quality for much faster search.

I don't fully understand this part; I have never used FAISS.
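
For reference, a minimal sketch of what the paper describes: building an IndexFlatIP over passage embeddings and running exact maximum inner product search. The dimension and data are made up; it assumes the `faiss-cpu` (or `faiss-gpu`) and `numpy` packages.

```python
import numpy as np
import faiss

d = 768                                    # embedding dimension (made up)
passage_embs = np.random.rand(10_000, d).astype("float32")

index = faiss.IndexFlatIP(d)               # flat index with inner-product metric
index.add(passage_embs)                    # store all passage vectors as-is

query_emb = np.random.rand(1, d).astype("float32")
scores, ids = index.search(query_emb, 5)   # exact MIPS: top-5 passages by dot product
print(ids[0], scores[0])
```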

3.3 Optimized Training Approach

There are three major challenges in training the dual-encoder-based retriever:

  • the training and inference discrepancy,
  • the existence of unlabeled positives,
  • limited training data.

3.3.1 Cross-batch Negatives

Assume that there are $B$ questions in a mini-batch on a single GPU, and each question has one positive passage.

Figure 2: The comparison of traditional in-batch negatives and our cross-batch negatives when trained on multiple GPUs, where $A$ is the number of GPUs, and $B$ is the number of questions in each mini-batch.

With $A$ GPUs (or mini-batches), we can indeed obtain $A \times B - 1$ negatives for a given question, which is approximately $A$ times as many as the original number of in-batch negatives.

In this way, we can use more negatives in the training objective of Equation 2, so that the results are expected to be improved.
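
A sketch of the idea in PyTorch, assuming `torch.distributed` is already initialized across $A$ GPUs. Note this is only the scoring side: plain `all_gather` does not propagate gradients through the gathered embeddings, which a production implementation would need to handle.

```python
import torch
import torch.distributed as dist

def cross_batch_logits(q_emb, p_emb):
    # q_emb, p_emb: (B, d) local question / positive-passage embeddings.
    # Gather the passage embeddings from all A GPUs so that each question
    # is scored against A*B passages: its positive plus A*B - 1 negatives.
    gathered = [torch.zeros_like(p_emb) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, p_emb)
    all_p = torch.cat(gathered, dim=0)  # (A*B, d)
    return q_emb @ all_p.t()            # (B, A*B) similarity matrix
```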

3.3.2 Denoised Hard Negatives

Because manual labels are limited, a large number of correct answers remain unlabeled. Hence the previous practice:

To obtain hard negatives, a straightforward method is to select the top-ranked passages (excluding the labeled positive passages) as negative samples.

This method is prone to false negatives;

To address this:

We first train a cross-encoder.

Then, when sampling hard negatives from the top-ranked passages retrieved by a dense retriever, we select only the passages that are predicted as negatives by the cross-encoder with high confidence scores.

The selected top-retrieved passages can be considered as denoised samples that are more reliable to be used as hard negatives.
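
A sketch of the denoising step; the 0.1 confidence threshold is taken from Section 4.1.3 below, and the `cross_encoder.score` interface and passage objects are placeholders, not a real API:

```python
def denoised_hard_negatives(question, top_retrieved, labeled_positives,
                            cross_encoder, neg_threshold=0.1):
    # Keep only top-retrieved passages that the cross-encoder confidently
    # predicts as negative; likely unlabeled positives are filtered out.
    hard_negatives = []
    for passage in top_retrieved:
        if passage in labeled_positives:
            continue
        if cross_encoder.score(question, passage) < neg_threshold:
            hard_negatives.append(passage)
    return hard_negatives
```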

3.3.3 Data Augmentation

The third strategy aims to alleviate the issue of limited training data.

Since the cross-encoder is more powerful in measuring the similarity between questions and passages, we utilize it to annotate unlabeled questions for data augmentation.

Specifically, we incorporate a new collection of unlabeled questions, while reusing the passage collection.

Then, we use the learned cross-encoder to predict the passage labels for the new questions.

To ensure the quality of the automatically labeled data, we only select the predicted positive and negative passages with high confidence scores estimated by the cross-encoder.

Finally, the automatically labeled data is used as augmented training data to learn the dual encoder.
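
A sketch of this augmentation step, reusing the 0.9/0.1 confidence thresholds from Section 4.1.3; `retriever.retrieve` and `cross_encoder.score` are placeholder interfaces, and the top-k value is made up:

```python
def augment(unlabeled_questions, retriever, cross_encoder,
            k=50, pos_threshold=0.9, neg_threshold=0.1):
    # Pseudo-label the top-k retrieved passages with the cross-encoder and
    # keep only high-confidence pairs as extra training data.
    augmented = []
    for q in unlabeled_questions:
        for p in retriever.retrieve(q, k):
            s = cross_encoder.score(q, p)
            if s > pos_threshold:
                augmented.append((q, p, 1))   # pseudo-positive
            elif s < neg_threshold:
                augmented.append((q, p, 0))   # pseudo-negative
    return augmented
```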

3.4 The Training Procedure

[Figure: the optimized training procedure of RocketQA, summarized in the four steps below]
Require:
Let $C$ denote a collection of passages.
$Q_L$ is a set of questions that have corresponding labeled passages in $C$,
$Q_U$ is a set of questions that have no corresponding labeled passages.
$D_L$ is a dataset consisting of $C$ and $Q_L$,
$D_U$ is a dataset consisting of $C$ and $Q_U$.

STEP 1:
Train a dual-encoder $M_D^{(0)}$ by using cross-batch negatives on $D_L$.

STEP 2:
Train a cross-encoder $M_C$ on $D_L$.

  • The positives used for training the cross-encoder are from the original training set $D_L$,
  • while the negatives are randomly sampled from the top-k passages (excluding the labeled positive passages) retrieved by $M_D^{(0)}$ from $C$ for each question $q \in D_L$.

This design is to let the cross-encoder adjust to the distribution of the results retrieved by the dual-encoder, since the cross-encoder will be used in the following two steps for optimizing the dual-encoder.

STEP 3:
Train a dual-encoder $M_D^{(1)}$ by further introducing denoised hard negative sampling on $D_L$.

For each question $q \in D_L$, the hard negatives are sampled from the top passages retrieved by $M_D^{(0)}$ from $C$,

and only the passages that are predicted as negatives by the cross-encoder $M_C$ with high confidence scores will be selected.

STEP 4:
Construct pseudo training data $D_U$ by using $M_C$ to label the top-k passages retrieved by $M_D^{(1)}$ from $C$ for each question $q \in D_U$,

and then train a dual-encoder $M_D^{(2)}$ on both the manually labeled training data $D_L$ and the automatically augmented training data $D_U$.
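
The four steps, condensed into pseudocode; every function here is a placeholder for the step described above, not a real API:

```python
def rocketqa_pipeline(D_L, Q_U, C):
    # STEP 1: dual-encoder trained with cross-batch negatives.
    M_D0 = train_dual_encoder(D_L, negatives="cross_batch")
    # STEP 2: cross-encoder; negatives sampled from M_D0's top-k retrievals.
    M_C = train_cross_encoder(D_L, negatives=sample_top_k(M_D0, C, D_L))
    # STEP 3: dual-encoder retrained with cross-encoder-denoised hard negatives.
    M_D1 = train_dual_encoder(D_L, negatives=denoise(M_D0, M_C, C, D_L))
    # STEP 4: pseudo-label unlabeled questions, then train the final model.
    D_U = pseudo_label(Q_U, M_D1, M_C, C)
    M_D2 = train_dual_encoder(D_L + D_U, negatives="denoised_hard")
    return M_D2
```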


My own understanding:

First, train a retrieval model, the dual-encoder $M_D^{(0)}$, on the manually labeled dataset $D_L$.

Next, train a classification model, the cross-encoder $M_C$, which ultimately gives a binary positive/negative judgment. Its positives come from $D_L$, and its negatives come from the top-k passages returned by $M_D^{(0)}$ (excluding the labeled positive passages).

Then, train the dual-encoder $M_D^{(1)}$. Its additional hard negatives still come from the top-k passages returned by $M_D^{(0)}$ (excluding the labeled positive passages), but after filtering: only those that the cross-encoder from step 2 predicted as negatives are kept.

This screens out unlabeled positives that would otherwise slip in when directly using the top-k passages (excluding the labeled positives) returned by $M_D^{(0)}$.

After that, feed $D_U$ to $M_D^{(1)}$ to get the top-k passages, and feed those to $M_C$ to output labels.

Finally, train another retrieval model, the dual-encoder $M_D^{(2)}$, on the manually labeled $D_L$ together with the pseudo-labeled $D_U$.


Describing $M_C$ as a binary classification model is not quite accurate; judging from Section 4.1.3, it is also a retrieval-style model that scores question-passage pairs:

4.1 Experimental Setup

4.1.3 Implementation Details

1. Pre-trained LMs

The dual-encoder is initialized with the parameters of ERNIE 2.0 base, and the cross-encoder is initialized with ERNIE 2.0 large.

2. Denoised hard negatives and data augmentation

We use the cross-encoder for both denoising hard negatives and data augmentation.

Specifically, we select the top retrieved passages with scores less than 0.1 as negatives and those with scores higher than 0.9 as positives.

We manually evaluated the selected data, and the accuracy was higher than 90%.

3. The number of positives and negatives

When training the cross-encoders, the ratios of the number of positives to the number of negatives are 1:4 and 1:1 on MSMARCO and NQ, respectively.

The negatives used for training cross-encoders are randomly sampled from the top-1000 and top-100 passages retrieved by the dual-encoder $M_D^{(0)}$ on MSMARCO and NQ, respectively.

When training the dual-encoders in the last two steps ($M_D^{(1)}$ and $M_D^{(2)}$), we set the ratios of the number of positives to the number of hard negatives as 1:4 and 1:1 on MSMARCO and NQ, respectively.

