Does Negative Sampling Matter? A Review with Insights into its Theory and Applications
Zhen Yang, Ming Ding, Tinglin Huang, Yukuo Cen, Junshuai Song, Bin Xu⋆, Yuxiao Dong, and Jie Tang⋆
Zhen Yang and Ming Ding are with the Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China. E-mail: {yangz21, dm18}@mails.tsinghua.edu.cn
Tinglin Huang is with the Department of Computer Science, Yale University, New Haven, CT 06511, USA. E-mail: tinglin.huang@yale.edu
Work was done during his visit to Tsinghua University.
Yukuo Cen is with Zhipu AI, Beijing, China, 100084. E-mail: yukuo.cen@zhipuai.cn
Junshuai Song is with the Technology and Engineering Group, Tencent, Beijing, China, 100080. E-mail:jasonjssong@tencent.com
Bin Xu, Yuxiao Dong, and Jie Tang are with the Department of Computer Science and Technology of Tsinghua University, Beijing, China, 100084. E-mail:{xubin, yuxiaod, jietang}@tsinghua.edu.cn
⋆Corresponding authors.
Abstract
Negative sampling has swiftly risen to prominence as a focal point of research, with wide-ranging applications spanning machine learning, computer vision, natural language processing, data mining, and recommender systems. This growing interest raises several critical questions: Does negative sampling really matter? Is there a general framework that can incorporate all existing negative sampling methods? In what fields is it applied? Addressing these questions, we propose a general framework that leverages negative sampling. Delving into the history of negative sampling, we trace its development through five evolutionary paths. We dissect and categorize the strategies used to select negative sample candidates, detailing global, local, mini-batch, hop, and memory-based approaches. Our review categorizes current negative sampling methods into five types: static, hard, GAN-based, auxiliary-based, and in-batch methods, providing a clear structure for understanding negative sampling. Beyond this detailed categorization, we highlight the application of negative sampling in various areas, offering insights into its practical benefits. Finally, we briefly discuss open problems and future directions for negative sampling.
Index Terms:
Negative Sampling Framework; Negative Sampling Algorithms; Negative Sampling Applications
1 Introduction
Does negative sampling matter? Negative sampling (NS) is a critical technique used in machine learning, designed to enhance the efficiency of models by selecting a small subset of negative samples from a vast pool of possible negative samples (i.e., non-positive samples). A prime example of NS in action is found in word2vec [1], particularly within its Skip-Gram architecture. The Skip-Gram model is designed to predict the likelihood of occurrence of nearby context words given a target word within a sequence. Formally, for a given sequence of training words $w_1, w_2, \cdots, w_T$, the objective is to maximize the average log probability of context words within a specified window on either side of the target word, which can be presented as:
$$\min \; -\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\, j \ne 0}\log p(w_{t+j} \mid w_t)$$
where $T$ is the length of the word sequence, $c$ is the size of the training context (that is, a window of words around the target word), $w_t$ is the target word, and $w_{t+j}$ are the context words within the window around $w_t$.
The probability $p(w_{t+j} \mid w_t)$ represents the likelihood of encountering a context word $w_{t+j}$ given a target word $w_t$, defined by the softmax function:
$$p(w_{t+j} \mid w_t) = \frac{\exp\!\big({v'_{w_{t+j}}}^{\top} v_{w_t}\big)}{\sum_{w=1}^{W}\exp\!\big({v'_{w}}^{\top} v_{w_t}\big)}$$
where $v_w$ and $v'_w$ represent the “input” and “output” vector embeddings of a word $w$, and $W$ denotes the total number of words in the vocabulary.
The objective of the training process is to optimize these word vector embeddings to maximize the probability for true context words while minimizing it for all other words. However, implementing the Skip-Gram model with a standard softmax function poses significant computational challenges, particularly with large vocabularies. Each training step requires calculating and normalizing the probabilities of all words in the vocabulary relative to a given target word, which is computationally intensive.
Negative sampling is leveraged to simplify this. Rather than considering all words in the vocabulary, the model only considers a small subset of negative words (not present in the current context) along with the actual positive context words. This substantially reduces the computational burden. Besides, negative sampling changes the training objective. Instead of predicting the probability distribution across the entire vocabulary for a given input word, it redefines the problem as a binary classification task. For each pair of words, the model predicts whether they are likely to appear in the same context (positive samples) or not (negative samples). The objective of the Skip-Gram model with negative sampling can be represented as:
$$\min \; -\log\sigma\!\big({v'_{w_{t+j}}}^{\top} v_{w_t}\big) - \sum_{i=1}^{k}\mathbb{E}_{w_i \sim P_n(w)}\Big[\log\sigma\!\big(-{v'_{w_i}}^{\top} v_{w_t}\big)\Big]$$
where $k$ is the number of negative samples, $w_i$ are the negative words sampled for the target word $w_t$, $\sigma(\cdot)$ is the sigmoid function, and $P_n(w)$ is the negative sampling distribution, designed as the 3/4 power of the word frequencies.
This example demonstrates the importance of negative sampling. Without it, the model would need to compute the co-occurrence probability with every other word for each training instance, which is computationally expensive, particularly for large vocabularies. By employing negative sampling, word2vec significantly reduces the computational burden and achieves great performance.
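To make this concrete, the sketch below implements an SGNS-style loss in PyTorch, drawing negatives from the 3/4-power unigram distribution $P_n(w)$ described above. It is a minimal illustration: the vocabulary size, embedding dimension, number of negatives, and placeholder word counts are assumptions for demonstration rather than settings from word2vec.

```python
import numpy as np
import torch
import torch.nn.functional as F

# Illustrative hyperparameters (assumptions, not word2vec's defaults).
vocab_size, embed_dim, num_neg = 10_000, 128, 5
in_embed = torch.nn.Embedding(vocab_size, embed_dim)   # "input" vectors  v_w
out_embed = torch.nn.Embedding(vocab_size, embed_dim)  # "output" vectors v'_w

# Negative sampling distribution P_n(w): word frequency raised to the 3/4 power.
word_freq = np.random.randint(1, 1000, size=vocab_size).astype(np.float64)  # placeholder counts
p_n = word_freq ** 0.75
p_n /= p_n.sum()

def sgns_loss(target_ids, context_ids):
    """Binary objective: score (target, context) pairs against num_neg sampled negatives."""
    v_t = in_embed(target_ids)                                  # (B, d)
    v_c = out_embed(context_ids)                                # (B, d)
    pos_score = (v_t * v_c).sum(-1)                             # (B,)
    neg_ids = torch.from_numpy(
        np.random.choice(vocab_size, size=(target_ids.shape[0], num_neg), p=p_n))
    v_n = out_embed(neg_ids)                                    # (B, k, d)
    neg_score = torch.bmm(v_n, v_t.unsqueeze(-1)).squeeze(-1)   # (B, k)
    # -log sigma(pos) - sum_i log sigma(-neg_i), averaged over the batch
    return -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(-1)).mean()

# Example usage with two (target, context) pairs:
loss = sgns_loss(torch.tensor([1, 2]), torch.tensor([5, 7]))
```

Only the $k$ sampled negatives and the single positive are touched per pair, so the per-step cost no longer depends on the vocabulary size $W$.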
In a broader perspective, negative sampling is a critical technique employed across a variety of fields, such as recommendation systems (RS) [2], graph representation learning (GRL) [3, 4, 5, 6, 7], knowledge graph learning (KGE) [8, 9], natural language processing (NLP) [1, 10, 11, 12], and computer vision (CV) [13, 14, 15, 16, 17, 18]. The core principle of negative sampling – selecting a representative subset of negative samples to improve the efficiency and effectiveness of the learning process – remains consistent across different domains.
To clearly verify the improvements, we compare a basic method with other NS methods in terms of performance and convergence speed in the recommender system domain. As shown in Figure 1, compared with the basic RNS, the method without negative sampling (Non-NS) achieves comparable performance but needs a much longer training time since it uses all negatives for model training. Compared to RNS, other advanced NS methods can reduce the computational burden, accelerate training convergence (up to ∼48× with Cache-NS), reduce performance bias, and boost performance (a 14% improvement with Mix-NS). Compared with RNS, PNS (0.75) converges 1.6× faster but reduces performance by 24%. Compared with Mix-NS, DNS (64) shows faster convergence but slightly degrades performance.
Despite the widespread application of NS, its implementation can differ greatly across domains. This raises the question: Is there a general framework that can incorporate all the NS methods and be applicable to different domains? Our survey aims to answer this by formalizing negative sampling and introducing a general framework tailored for model training (see Figure 2). Under this framework, we categorize different NS methods to make them more accessible and comparable. In addition, we summarize three critical aspects to be taken into account when designing a better NS method: (1) where to sample from? (2) how to sample? and (3) in which field should it be applied?
Contributions. The main contributions of this survey are as follows:
- We highlight the importance of negative sampling and propose a general framework that can incorporate existing NS methods from various domains.
- We identify three aspects that should be considered when designing negative sampling: negative sample candidates, negative sampling distributions, and negative sampling applications. We summarize five selection methods for negative sample candidates and categorize existing negative sampling methods into five groups.
- We report the characteristics of negative sampling methods in various domains and demonstrate the pros and cons of each type of negative sampling method. We summarize several open problems and discuss future directions for negative sampling.
Figure 1: Performance and convergence speed comparison in the RS domain between Non-NS and other methods that use negative sampling, on the Yelp2018 dataset with LightGCN as the encoder. PNS($\alpha$) denotes popularity-based negative sampling with power parameter $\alpha$, and DNS($K$) denotes dynamic negative sampling with $K$ negative sample candidates. The details of the negative sampling methods can be found in Section 2.
Survey Organization. The rest of this survey is organized as follows. In Section 2, we provide the overview of negative sampling, first briefly tracing back the development history and then giving a general framework. Section 3 reviews existing NS algorithms, as well as the pros and cons of each category. Section 4 further explores applications of NS in various domains. Finally, we discuss the open problems, future directions, and our conclusions in Section 5 and Section 6.
Variables Definitions. Here, we elucidate the meanings of the variables employed in our survey. $x$, $x^+$, and $x^-$ denote an anchor sample, a positive sample, and the selected negative sample, respectively. $p_n$ represents a designed negative sampling distribution. $x'$ denotes a negative candidate within the pool of negative sample candidates $\mathcal{C}$. These candidates are the potential selections for the negative sample $x^-$, as determined by the negative sampling strategy. $s(\cdot)$ is a function used to measure the similarity between samples, such as dot product [19, 6] or L1 and L2 norms [20, 21], depending on the specific requirements of the learning task. $f(\cdot)$ represents the model that maps input samples into their respective embeddings.
Figure 2: An illustration of a general framework that uses negative sampling. Positive and negative pairs are sampled implicitly or explicitly by positive and negative samplers respectively, and together they compose the training data. An encoder is applied for latent representation learning in various domains. In contrastive learning, positive pairs (i.e., Pos-view Pairs) are derived from data augmentations of the same instance or different perspectives of the same entity, while in other domains, such as metric learning, positive pairs (Pos-coexist Pairs) are pairs of instances that co-exist in the dataset.
2 Negative Sampling
In this section, we provide a comprehensive overview of negative sampling, detailing its formalization and the framework that uses negative sampling for model training, tracing its historical development along five development lines, and elaborating on its important role in machine learning.
2.1 Formalization and Framework
Negative sampling aims to select a subset of negative samples from a larger pool, which is a technique used to improve the efficiency and effectiveness of training in machine learning models. The formalization of negative sampling can be encapsulated in a general loss function, which is designed to maximize the similarity of positive pairs and minimize it between negative pairs. For an anchor sample $x$ and a specific positive sample $x^+$, negative sampling selects a negative sample $x^-$ from a candidate pool $\mathcal{C}$ based on a negative sampling distribution $p_n(x^-)$ for model training.
$$\mathcal{L} = \ell(x, x^+, x^-), \quad x^- \sim p_n(x^-)$$
where $\ell(\cdot)$ is a specific loss function that varies across domains.
In different application domains, the specific loss function may vary based on various tasks and datasets.
- Bayesian Personalized Ranking (BPR) Loss for Pair-based Sampling. For a pair of positive and negative samples (pos, neg), BPR loss aims to ensure that the log-likelihood of a positive sample surpasses that of a negative sample.
- Hinge Loss for Triplet-based Sampling. For a triplet of an anchor, a positive sample, and a negative sample (anchor, pos, neg), hinge loss seeks to maintain a minimum margin between the similarity of an anchor-positive pair and that of an anchor-negative pair.
- Cross-Entropy Loss for a Single Positive with Multiple Negatives. In cases with one positive and multiple negatives, cross-entropy loss discriminates the positive sample from the negatives.
- InfoNCE Loss in Contrastive Learning. InfoNCE loss, common in contrastive learning frameworks, is designed to contrast positive pairs against negative pairs.
The core principle underlying the various loss functions associated with negative sampling is to enhance the model’s ability to distinguish between positive and negative pairs, effectively pulling positive samples closer and pushing negative samples apart in the representation space. Detailed formulations of these functions can be found in Table I. Based on the formalization of negative sampling, we propose a general framework that uses negative sampling for model training, which contains a positive sampler, a negative sampler, and a trainable encoder. The overall framework is illustrated in Figure 2. The positive sampler is applied to generate positive training pairs. For example, positive pairs in recommendation are sampled explicitly from the observed user-item interactions, while those in contrastive learning are generated implicitly by data augmentations. Negative training pairs are sampled by different NS strategies via a negative sampler. The sampled pairs are then processed by a trainable encoder, which is tailored to the specific application domain. The trainable encoder could be graph neural networks (GNNs) [22, 23, 24] in graph representation learning, ResNet [25] in unsupervised visual representation learning, BERT [26] and RoBERTa [27] in unsupervised sentence embedding learning, Skip-Gram [1] in word embedding, or TransE [20] and TransH [28] in knowledge graph embedding.
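To illustrate how the three components interact, the sketch below wires a positive sampler, a negative sampler, and a trainable encoder into a single training step. The sampler and encoder interfaces (`positive_sampler`, `negative_sampler`, `encoder`, `loss_fn`) are hypothetical placeholders standing in for whatever domain-specific components are plugged into the framework.

```python
import torch

def train_step(anchors, encoder, positive_sampler, negative_sampler, loss_fn, optimizer):
    """One generic training step of the framework in Figure 2 (illustrative sketch)."""
    positives = positive_sampler(anchors)           # explicit (e.g., observed interactions) or implicit (augmentations)
    negatives = negative_sampler(anchors, encoder)  # any NS strategy; may consult the current encoder state
    z_a, z_p, z_n = encoder(anchors), encoder(positives), encoder(negatives)
    loss = loss_fn(z_a, z_p, z_n)                   # e.g., BPR, hinge, cross-entropy, or InfoNCE (Table I)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The same loop applies whether the encoder is a GNN, ResNet, BERT, or a knowledge graph embedding model; only the samplers and the loss change.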
Figure 3: The five development lines of negative sampling. Each development line addresses different challenges and has been adopted and adapted to suit the specific needs of individual domains, such as RS, NLP, GRL, KGE, and CV.

TABLE I: An overview of loss functions used in a general framework that uses negative sampling for model training. $\tau$ represents a temperature parameter that scales the similarity scores. $B$ denotes the batch size.
| Loss | Formulation |
|------|-------------|
| BPR | $\mathcal{L}_{\mathrm{BPR}} = -\ln \sigma\big(f(x^+) - f(x^-)\big)$ |
| Hinge | $\mathcal{L}_{\mathrm{Hinge}} = \max\big(0,\; s(f(x), f(x^+)) - s(f(x), f(x^-)) + \gamma\big)$ |
| Cross-Entropy | $\mathcal{L}_{\mathrm{CE}} = -\big[\log \sigma(s(f(x), f(x^+))) + \log \sigma(-s(f(x), f(x^-)))\big]$ |
| InfoNCE | $\mathcal{L}_{\mathrm{InfoNCE}} = -\log \dfrac{\exp(s(f(x), f(x^+))/\tau)}{\exp(s(f(x), f(x^+))/\tau) + \sum_{i=1}^{B-1} \exp(s(f(x), f(x_i^-))/\tau)}$ |
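As a concrete reference for two of the formulations in Table I, the snippet below sketches the BPR and InfoNCE losses. Tensor names and shapes are illustrative assumptions; `f_x`, `f_pos`, and `f_negs` stand for encoder outputs $f(x)$, $f(x^+)$, and the sampled negatives' embeddings.

```python
import torch
import torch.nn.functional as F

def bpr_loss(score_pos, score_neg):
    # L_BPR = -ln sigma(f(x+) - f(x-)) for paired positive/negative scores of shape (B,)
    return -F.logsigmoid(score_pos - score_neg).mean()

def infonce_loss(f_x, f_pos, f_negs, tau=0.1):
    # f_x: (B, d) anchors; f_pos: (B, d) positives; f_negs: (B, K, d) negatives
    pos = torch.exp((f_x * f_pos).sum(-1) / tau)                            # (B,)
    neg = torch.exp(torch.einsum('bd,bkd->bk', f_x, f_negs) / tau).sum(-1)  # (B,)
    return -torch.log(pos / (pos + neg)).mean()
```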
2.2 Negative Sampling History
The history of negative sampling traces a fascinating development line through the evolution of machine learning techniques. Here, we delve into the historical development of negative sampling, highlighting the motivations behind various negative sampling methods. As shown in Figure 3, the earliest implementation of negative sampling is random negative sampling (RNS). The basic idea of RNS is to randomly select a subset of negative samples from the large pool, thereby reducing the computational requirements of the training process. Notably, in 2008, Pan et al. [29] utilized RNS to prevent the model from developing a bias towards the majority of negative samples in one-class collaborative filtering (OCCF) for recommender systems, which effectively balanced the data. Rendle et al. [2] proposed a pair-wise learning method, which leveraged a pair-wise loss and RNS to address the lack of negative data in implicit feedback. Such an RNS method is also applied to graph embedding [20, 30] and GNN-based recommendation [31, 32].
Over time, as machine learning datasets expanded and model complexities increased, the limitations of RNS became evident. These randomly selected negative samples might not always be relevant for the model training. This realization leads to the development of popularity-based negative sampling (PNS) [1, 3, 5, 4, 24], which selects negative samples based on their occurrence frequency, popularity, or connectivity within the dataset (i.e., degree). This method, predicated on the hypothesis that frequently occurring negatives are more likely to represent true negatives, attempts to refine the relevance of negative samples to the dataset. The adoption of a distribution proportional to the 3/4 power of sample frequency became a hallmark of PNS, proving its efficacy in domains such as NLP [1] and graph embedding learning [3, 5, 24].
Nonetheless, both RNS and PNS are static in nature and fail to adapt as model training progresses, which can lead to suboptimal performance. This challenge paves the way for hard negative sampling (Hard NS) [33, 34, 35, 19], a strategy designed to identify and leverage negatives that are difficult for the model to distinguish accurately. These hard negatives provide a more effective learning signal and align the negative sampling more closely with the model’s current state, enhancing training efficiency and effectiveness. Additionally, these hard negatives contribute more information to the gradients in model optimization, which can accelerate convergence and boost performance. Hard NS has been applied to word embedding [36], answer selection [37], knowledge graph embedding [38], graph embedding learning [6], GNN-based recommendation [39, 7, 40], and object detection [41].
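A common instantiation of this idea is dynamic hard negative sampling in the style of DNS [19]: draw a handful of random candidates and keep the one the current model scores highest. The sketch below shows that selection rule; the dot-product scoring and tensor shapes are assumptions for illustration.

```python
import torch

def dynamic_hard_negative(anchor_emb, candidate_embs):
    """Pick the hardest negative among randomly drawn candidates.
    anchor_emb: (d,) current anchor embedding; candidate_embs: (M, d) candidate embeddings."""
    scores = candidate_embs @ anchor_emb      # scores under the current model state
    return candidate_embs[scores.argmax()]    # highest-scored (hardest) negative
```

Because the scores come from the model being trained, the selected negatives automatically become harder as training progresses.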
The inspiration drawn from Generative Adversarial Networks (GANs) [42] leads to the development of GAN-based negative sampling (GAN-based NS) [43, 9, 44, 45, 46, 47], a technique where adversarial training and adversarial examples are leveraged to generate negative samples that closely mirror the representations of true negative samples not easily found within the dataset. In this framework, a generator serves as a negative sampler while the training model acts as the discriminator. Such methods rely on a generator that adaptively approximates the underlying negative sampling distribution. GAN-based NS is a general strategy that can be applied to recommendation, graph learning, knowledge graph learning, and unsupervised visual representation learning.
Recently, contrastive learning, particularly within unsupervised visual representation learning, adopts in-batch negative sampling (In-batch NS) as a strategy to capitalize on the efficiency of using other samples within the same batch as negatives. This approach obviates the need for external steps to generate or select negatives, accelerating the training process. For example, SimCLR [48] utilizes other samples in the current mini-batch as negatives. MoCo [49] maintains a memory bank to store the past several batches as negatives for model training, thereby enriching the pool of negatives. However, the effectiveness of In-batch NS is constrained by the batch size, where smaller batches might not offer a diverse array of negatives, potentially limiting the model’s learning capacity. To address this, several works [16, 17, 18, 50, 51, 52] have focused on introducing hard negatives into the contrastive learning paradigm.
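To show how in-batch negatives avoid any explicit sampling step, the simplified sketch below computes an InfoNCE-style loss in which each anchor's negatives are simply the other samples in the same batch. This is an illustrative variant rather than the exact SimCLR or MoCo implementation.

```python
import torch
import torch.nn.functional as F

def in_batch_infonce(z1, z2, tau=0.1):
    """z1, z2: (B, d) L2-normalized embeddings of two views of the same batch.
    Row i of z2 is the positive for row i of z1; the other B-1 rows act as negatives."""
    logits = z1 @ z2.t() / tau               # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))        # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```

The number of negatives per anchor is tied to the batch size $B$, which is exactly why small batches can limit the diversity of negatives noted above.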
2.3 The Importance of Negative Sampling
Here, we explicitly elaborate on the importance of negative sampling from three aspects.
1. Computational Efficiency. By selecting a representative subset of negative samples, negative sampling eliminates the need for the model to consider all possible negative samples during training. Without negative sampling, the model computes the probability distribution for a sample against the entire dataset, which is computationally intensive, especially for large datasets [29, 1]. Negative sampling simplifies this by transforming the task into a binary classification problem, which allows the model to differentiate between positive samples and a small number of selected negative samples. Thus, negative sampling significantly reduces the computational burden and accelerates the training process.
2. Handling Class Imbalance. Real-world datasets often exhibit a significant imbalance between positive and negative samples, resulting in performance bias. Negative sampling addresses this issue by curating a more balanced training dataset, selecting a representative subset of negatives. This method prevents the model from developing a bias towards the majority class, thereby improving the model's ability to accurately predict less frequent classes. One-class application scenarios, which contain only positive samples and inherently lack explicit negative feedback, often indiscriminately treat all non-positive samples as negatives. By judiciously choosing negatives, negative sampling ensures more equitable data for model training. Thus, negative sampling is a key technique in addressing performance bias in models dealing with real-world data.
3. Improved Model Performance. At the heart of negative sampling's value is its ability to improve model performance through the selection of negatives, particularly those that are closely aligned with positive samples in the embedding space. By focusing on a subset of more informative negative samples, the model can better capture the subtle distinctions between different samples. As highlighted in several studies [53, 6, 54, 14, 13], negative samples, particularly more informative or "hard" negatives, contribute significantly to the gradients during the training process. Hard negatives refer to samples that closely resemble positive samples in the feature space, making them difficult for the model to distinguish from the positives. Training with these hard negative samples forces the model to learn finer distinctions because these negatives contribute more significantly to the gradients, leading to a more effective optimization process and improving the model's ability to distinguish between positive and negative samples (see the short derivation after this list). Thus, negative sampling, especially when it involves the strategic use of hard negatives, is an essential technique for improving the performance of machine learning models.
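To make the gradient argument concrete, consider the cross-entropy term for a single negative from Table I and write its score as $s^- = s(f(x), f(x^-))$; the following is a minimal illustration under that notation:

$$\frac{\partial}{\partial s^-}\Big[-\log \sigma(-s^-)\Big] = \sigma(s^-),$$

so a negative that the model currently scores highly (a hard negative close to the positives) yields a gradient magnitude near 1, whereas an easy, low-scoring negative yields a gradient near 0 and contributes little to optimization.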
3 Algorithms
TABLE II: An overview of negative sampling methods collected from various domains. For the acronyms used, “S” represents Static NS; “H” refers to Hard NS; “G” means GAN-based NS; “A” means Auxiliary-based NS; “B” represents In-batch NS.
| Cate | Subcategory | Model Recipe | Candidates |
|------|-------------|--------------|------------|
| S | PNS | Word2Vec [1], Deepwalk [3], LINE [4], Node2vec [5] | Global |
| S | RNS | BPR [2], LightGCN [32], TransE [20], DISTMULT [55], RUBER [56], USR [57] | Global |
| H | DNS | FaceNet [13], Max Sampling [37], PinSage [39], [41], [58], [59], [60] | Global |
| H | DNS | SGA [36], DNS [19], AOBPR [61], BootEA [38], Dual-AMN [62] | Local |
| H | DNS | NSCaching [63], ESimCSE [12], MoCoSE [64], MoCoRing [18] | Memory-based |
| H | Mix-NS | MoChi [16] | Cache |
| H | Mix-NS | MixGCF [7], MixKG [65] | Global |
| G | Mining | IRGAN [43], SeqGAN [66], ACE [67], KBGAN [9], IGAN [68], NMRN [45], GraphGAN [44], ProGAN [69] | Global |
| G | Generation | CFGAN [46], AdvIR [47], HeGAN [70], SAN [71], NDA-GAN [72], AdCo [50], CLAE [73], NEGCUT [74], DAML [75] | Global |
| A | Graph | MCNS [6], SANS [76] | Hop |
| A | Graph | GNEG [11], SamWalker [77], KGPolicy [40], DSKReG [78], MixGCF [7] | Global |
| A | Extra | SBPR [79], PRFMC [80], MF-BPR [81], View-aware NS [82], ReinforcedNS [83], RecNS [84] | Local |
| A | Cache | Unsupervised Feature Learning [85], NSCaching [63], SRNS [86], MoCo [49], MoCo-V2 [87], MoCoRing [18], GCC [88], ESimCSE [12], MoCoSE [64] | Cache |
| B | Basic | N.S. [89], S³-Rec [90], SGL [91], MHCN [92], DHCN [93], SimCLR [48], MVGRL [94], GRACE [95], GraphCL [96], SimCSE [97], InfoWord [98] | Mini-batch |
| B | Debiased | DCL [15], GDCL [99] | Mini-batch |
| B | Hard | MoCoRing [18], CuCo [100], ProGCL [101], VaSCL [102], SNCSE [103], BatchSampler [52] | Mini-batch |
In this section, we first summarize five categories of selection methods to form negative sample candidates (i.e., where to sample from?). Next, we summarize a variety of negative sampling algorithms into five categories (i.e., how to sample?).
3.1 Negative Sample Candidates Construction
The process of constructing negative sample candidates is a critical first step in the negative sampling pipeline, determining the pool $\mathcal{C}$ from which negatives are drawn for model training. Here, we answer the first question: where should we sample negative examples from? In terms of the composition of the negative sample candidates, we summarize the selection methods into the following five categories (a minimal code sketch of several of these strategies follows the list).
- Global Selection is one of the most common methods for negative sample candidate selection, where the pool of negative samples is composed of all possible negatives from the entire dataset. It ensures diversity in the negative samples but may include less relevant negatives, which could impact learning efficiency. For example, word2vec [1] uses the whole vocabulary as the pool of possible negative samples, and LightGCN [32] uses all unobserved items as the pool.
- Local Selection focuses on sampling a specific subset of the total available negatives as the pool of negative samples. This method is more selective than Global Selection, aiming to construct a pool that is more relevant or challenging for a specific query or anchor. For example, ANCE [104] selects the top-k negative samples based on the query as the pool of negative samples. Moreover, using a specific subset reduces the computational complexity that would be involved in handling the entire set of available negatives; for example, DNS [19] randomly samples a subset as the pool of negatives for probability calculations.
- Mini-batch Selection uses the other samples in the current mini-batch as the pool, without an additional selection process. It leverages already-loaded batch data, making it computationally efficient. For example, SimCLR [48] and SimCSE [97] use the other samples in the same batch as the pool of negatives.
- Hop Selection is a selection method tailored to graph-structured data, which selects $k$-hop neighbors as negative sample candidates. This method exploits the graph structure, for which the information propagation mechanism provides theoretical support. However, the matrix operations required to obtain $k$-hop neighbors are impractical for web-scale datasets, so a path obtained from a random walk or DFS is often used to form the negative sample candidates instead. For example, RecNS [84] selects the intermediate region (i.e., k-hop neighbors) as the pool of negatives for graph-based recommendation, and SANS [76] uses k-hop neighbors as the pool for knowledge graph embedding.
- Memory-based Selection maintains a memory bank or a cache as the pool of negative sample candidates. This bank retains a large number of negatives from past iterations or batches, which are used in subsequent training steps. The memory bank is typically updated continuously, with new negatives being added and the oldest ones removed, ensuring a fresh and diverse set of negative samples. Memory-based selection is particularly advantageous for models that benefit from contrasting a wide array of negatives against positives, such as in contrastive learning scenarios. For example, MoCo [49] uses a queue to store the pool of negative samples.
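The sketch below contrasts three of these pool-construction strategies (global, local, and memory-based) in a few lines; the item counts, scores, and queue size are illustrative assumptions, and hop selection is omitted because it depends on the specific graph structure.

```python
import torch

num_items, queue_size = 10_000, 4_096  # illustrative sizes

def global_pool(positive_ids):
    """Global selection: every item except the positives is a candidate."""
    mask = torch.ones(num_items, dtype=torch.bool)
    mask[positive_ids] = False
    return mask.nonzero(as_tuple=True)[0]

def local_pool(scores, positive_ids, k=200):
    """Local selection: keep only the top-k highest-scored non-positive items for this anchor."""
    scores = scores.clone()
    scores[positive_ids] = float('-inf')
    return torch.topk(scores, k).indices

class MemoryPool:
    """Memory-based selection: a FIFO cache of negative embeddings from past batches."""
    def __init__(self, dim):
        self.bank = torch.zeros(0, dim)
    def update(self, new_embeddings):
        self.bank = torch.cat([self.bank, new_embeddings.detach()])[-queue_size:]
```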
3.2 Negative Sampling Algorithms
Once the negative candidates are constructed, the next step involves negative sampling algorithms, which aim to design a negative distribution (i.e., a specific probability distribution) or a sophisticated negative sampling strategy to sample negative samples from a constructed pool of candidates for model training.
The design of negative sampling algorithms can be simple, such as random sampling, or it can be more sophisticated, taking into account factors like the current state of the model, the difficulty level of the negatives, or their frequency of occurrence in the dataset. Here, we briefly present an overview of negative sampling algorithms. Table II summarizes existing negative sampling algorithms. The general categorization of negative sampling algorithms and the abbreviated description of each category can be illustrated as follows:
负采样算法的设计可以很简单,例如随机采样,也可以更复杂,考虑到模型的当前状态,负的难度水平或它们在数据集中的出现频率等因素。在这里,我们简要介绍了负采样算法的概述。表二总结了现有的负采样算法。负采样算法的一般分类和每个类别的简要描述可以说明如下: