[Paper Digest] 2025 Week 06 (Robotics / Embodied AI / LLM)

Table of Contents

  • SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model
  • OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
  • s1: Simple test-time scaling
  • The Differences Between Direct Alignment Algorithms are a Blur
  • VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
  • LIMO: Less is More for Reasoning
  • Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
  • Process Reinforcement through Implicit Rewards
  • Demystifying Long Chain-of-Thought Reasoning in LLMs
  • Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2
  • Preference Leakage: A Contamination Problem in LLM-as-a-judge
  • AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
  • Reward-Guided Speculative Decoding for Efficient LLM Reasoning
  • TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets
  • ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features
  • MatAnyone: Stable Video Matting with Consistent Memory Propagation
  • Great Models Think Alike and this Undermines AI Oversight
  • Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
  • DynVFX: Augmenting Real Videos with Dynamic Content
  • SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model
  • ACECODER: Acing Coder RL via Automated Test-Case Synthesis
  • Inverse Bridge Matching Distillation
  • Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
  • SliderSpace: Decomposing the Visual Capabilities of Diffusion Models
  • Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models
  • BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation
  • Scaling Embedding Layers in Language Models
  • DeepRAG: Thinking to Retrieval Step by Step for Large Language Models
  • MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models

SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model

  • Authors: Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf

  • Date: 2025-02-04

  • Paper link: https://arxiv.org/pdf/2502.02737

Abstract

While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art “small” (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations as well as a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 as well as all of the datasets we prepared in the course of this project.


OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

  • Authors: Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang

  • Date: 2025-02-03

  • Paper link: https://arxiv.org/pdf/2502.01061

Abstract

End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in the recent few years. However, existing methods still struggle to scale up as large general video generation models, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven and combined driving signals). Video samples are provided on the project page (https://omnihuman-lab.github.io).


s1: Simple test-time scaling

  • Authors: Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto

  • Date: 2025-01-31

  • Paper link: https://arxiv.org/pdf/2501.19393

Abstract

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI’s o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.
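
To make "budget forcing" concrete, below is a minimal Python sketch of the decode loop described above: thinking is force-terminated once a token budget is reached, and extended by appending "Wait" when the model tries to stop too early. The `next_token` stub, the `</think>` stop marker, and all constants are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of budget forcing: cap the model's thinking at a token
# budget, or extend it by appending "Wait" when the model tries to stop
# early. `next_token` is a stand-in for a real LM; the stop marker is an
# assumed convention.
END_OF_THINKING = "</think>"

def budget_forced_decode(next_token, prompt, min_tokens=0, max_tokens=512,
                         max_waits=2):
    tokens, waits = list(prompt), 0
    while True:
        tok = next_token(tokens)
        if tok == END_OF_THINKING:
            # Too short: suppress the stop token and append "Wait" so the
            # model keeps reasoning (often double-checking its answer).
            if len(tokens) < min_tokens and waits < max_waits:
                tokens.append("Wait")
                waits += 1
                continue
            break
        tokens.append(tok)
        if len(tokens) >= max_tokens:  # budget exhausted: force-terminate
            break
    return tokens

# Toy model: "thinks" for five steps, then emits the stop token.
def toy_model(tokens):
    return "step" if sum(t == "step" for t in tokens) < 5 else END_OF_THINKING

print(budget_forced_decode(toy_model, ["Q:"], min_tokens=10, max_tokens=20))
```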


The Differences Between Direct Alignment Algorithms are a Blur

  • Authors: Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daniil Gavrilov

  • Date: 2025-02-03

  • Paper link: https://arxiv.org/pdf/2502.01237

Abstract

Direct Alignment Algorithms (DAAs) simplify language model alignment by replacing reinforcement learning (RL) and reward modeling (RM) in Reinforcement Learning from Human Feedback (RLHF) with direct policy optimization. DAAs can be classified by their ranking losses (pairwise vs. pointwise), by the rewards used in those losses (e.g., likelihood ratios of policy and reference policy, or odds ratios), or by whether a Supervised Fine-Tuning (SFT) phase is required (two-stage vs. one-stage). We first show that one-stage methods underperform two-stage methods. To address this, we incorporate an explicit SFT phase and introduce the beta parameter, controlling the strength of preference optimization, into single-stage ORPO and ASFT. These modifications improve their performance in Alpaca Eval 2 by +3.46 (ORPO) and +8.27 (ASFT), matching two-stage methods like DPO. Further analysis reveals that the key factor is whether the approach uses pairwise or pointwise objectives, rather than the specific implicit reward or loss function. These results highlight the importance of careful evaluation to avoid premature claims of performance gains or overall superiority in alignment algorithms.
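
As a toy illustration of the two axes the paper identifies, the numpy sketch below contrasts a pairwise (DPO-style) and a pointwise ranking loss over policy/reference log-ratios, with a beta parameter scaling preference-optimization strength and an explicit SFT term added on top. The numbers and exact loss forms are simplified stand-ins, not the paper's ORPO/ASFT objectives.

```python
# Toy contrast of pairwise vs. pointwise ranking losses with a beta
# parameter and an explicit SFT term. Values are fabricated; this is an
# illustration of the design axes, not the authors' implementation.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Per-sequence log-ratios: log pi(y|x) - log pi_ref(y|x)
r_chosen, r_rejected = 0.8, -0.3

def pairwise_loss(beta):   # DPO-style: rank chosen above rejected
    return -np.log(sigmoid(beta * (r_chosen - r_rejected)))

def pointwise_loss(beta):  # pointwise: score each completion on its own
    return (-np.log(sigmoid(beta * r_chosen))
            - np.log(1 - sigmoid(beta * r_rejected)))

sft_nll = 1.2              # explicit SFT term (NLL of the chosen answer)
for beta in (0.05, 0.1, 1.0):
    print(f"beta={beta}: pairwise={pairwise_loss(beta):.3f} "
          f"pointwise={pointwise_loss(beta):.3f} "
          f"total(pairwise+SFT)={sft_nll + pairwise_loss(beta):.3f}")
```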


VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

  • Authors: Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, Shelly Sheynin

  • Date: 2025-02-04

  • Paper link: https://arxiv.org/pdf/2502.02492

Abstract

Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior to video generators, by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model’s own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation. Project website: https://hila-chefer.github.io/videojam-paper.github.io/


LIMO: Less is More for Reasoning

  • Authors: Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, Pengfei Liu

  • Date: 2025-02-05

  • Paper link: https://arxiv.org/pdf/2502.03387

Abstract

We present a fundamental discovery that challenges our understanding of how complex reasoning emerges in large language models. While conventional wisdom suggests that sophisticated reasoning tasks demand extensive training data (>100,000 examples), we demonstrate that complex mathematical reasoning abilities can be effectively elicited with surprisingly few examples. Through comprehensive experiments, our proposed model LIMO demonstrates unprecedented performance in mathematical reasoning. With merely 817 curated training samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, improving from previous SFT-based models’ 6.5% and 59.2% respectively, while only using 1% of the training data required by previous approaches. LIMO demonstrates exceptional out-of-distribution generalization, achieving 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data, challenging the notion that SFT leads to memorization rather than generalization. Based on these results, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes. This hypothesis posits that the elicitation threshold for complex reasoning is determined by two key factors: (1) the completeness of the model’s encoded knowledge foundation during pre-training, and (2) the effectiveness of post-training examples as “cognitive templates” that show the model how to utilize its knowledge base to solve complex reasoning tasks. To facilitate reproducibility and future research in data-efficient reasoning, we release LIMO as a comprehensive open-source suite at https://github.com/GAIR-NLP/LIMO.


Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

  • Authors: Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov

  • Date: 2025-02-05

  • Paper link: https://arxiv.org/pdf/2502.03032

Abstract

We introduce a new approach to systematically map features discovered by sparse autoencoder across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
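
A minimal sketch of the data-free matching step, under the assumption that features are compared via cosine similarity between SAE decoder directions in adjacent layers; random matrices stand in for trained SAE weights.

```python
# Sketch of data-free feature matching: for each SAE feature in layer L+1,
# find its most cosine-similar decoder direction in layer L. Features with
# no good predecessor are treated as newly appearing at this layer.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_feat = 64, 128
W_dec_L  = rng.standard_normal((n_feat, d_model))   # layer L decoder rows
W_dec_L1 = rng.standard_normal((n_feat, d_model))   # layer L+1 decoder rows

def normalize(W):
    return W / np.linalg.norm(W, axis=1, keepdims=True)

sim = normalize(W_dec_L1) @ normalize(W_dec_L).T    # (n_feat, n_feat) cosines
best_match = sim.argmax(axis=1)                     # predecessor feature ids
best_score = sim.max(axis=1)

threshold = 0.5  # assumed cutoff for "has a clear predecessor"
new_features = np.where(best_score < threshold)[0]
print(f"{len(new_features)} of {n_feat} features look newly introduced")
```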


Process Reinforcement through Implicit Rewards

  • Authors: Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding

  • Date: 2025-02-03

  • Paper link: https://arxiv.org/pdf/2502.01456

Abstract

Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward-model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME’s effectiveness on competitional math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
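
The numpy toy below sketches what an implicit process reward looks like: each reasoning step is scored by beta times the policy-to-reference log-likelihood ratio of its tokens, so no separately trained PRM is needed. All numbers are fabricated for illustration; this is not the PRIME codebase.

```python
# Toy implicit process reward: the reward for each reasoning step is
# beta * (log pi - log pi_ref) summed over that step's tokens, combined
# with a verifiable outcome label at the end. Numbers are made up.
import numpy as np

beta = 0.05
# Per-token log-probs under policy and frozen reference, grouped by step.
logp_policy = [np.array([-1.2, -0.7]), np.array([-0.4, -0.9, -0.3])]
logp_ref    = [np.array([-1.5, -1.0]), np.array([-0.8, -1.1, -0.2])]

step_rewards = [beta * (lp - lr).sum() for lp, lr in zip(logp_policy, logp_ref)]
outcome_reward = 1.0  # e.g., final answer verified correct

# Suffix sums give a per-step return including the outcome reward.
returns = np.cumsum(step_rewards[::-1])[::-1] + outcome_reward
print("per-step implicit rewards:", [round(r, 3) for r in step_rewards])
print("per-step returns:", returns.round(3))
```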


Demystifying Long Chain-of-Thought Reasoning in LLMs

  • Authors: Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue

  • Date: 2025-02-05

  • Paper link: https://arxiv.org/pdf/2502.03373

Abstract

Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.


Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2

  • Authors: Yuri Chervonyi, Trieu H. Trinh, Miroslav Olšák, Xiaomeng Yang, Hoang Nguyen, Marcelo Menegali, Junehyuk Jung, Vikas Verma, Quoc V. Le, Thang Luong

  • Date: 2025-02-05

  • Paper link: https://arxiv.org/pdf/2502.03544

Abstract

We present AlphaGeometry2, a significantly improved version of AlphaGeometry introduced in Trinh et al. (2024), which has now surpassed an average gold medalist in solving Olympiad geometry problems. To achieve this, we first extend the original AlphaGeometry language to tackle harder problems involving movements of objects, and problems containing linear equations of angles, ratios, and distances. This, together with other additions, has markedly improved the coverage rate of the AlphaGeometry language on International Math Olympiads (IMO) 2000-2024 geometry problems from 66% to 88%. The search process of AlphaGeometry2 has also been greatly improved through the use of Gemini architecture for better language modeling, and a novel knowledge-sharing mechanism that combines multiple search trees. Together with further enhancements to the symbolic engine and synthetic data generation, we have significantly boosted the overall solving rate of AlphaGeometry2 to 84% for all geometry problems over the last 25 years, compared to 54% previously. AlphaGeometry2 was also part of the system that achieved silver-medal standard at IMO 2024 https://dpmd.ai/imo-silver. Last but not least, we report progress towards using AlphaGeometry2 as a part of a fully automated system that reliably solves geometry problems directly from natural language input.


Preference Leakage: A Contamination Problem in LLM-as-a-judge

  • Authors: Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu

  • Date: 2025-02-03

  • Paper link: https://arxiv.org/pdf/2502.01534

Abstract

Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common relatednesses between data generator LLM and judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive issue that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all codes and data at: https://github.com/David-Li0406/Preference-Leakage.


AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding

  • Authors: Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar

  • Date: 2025-02-03

  • Paper link: https://arxiv.org/pdf/2502.01341

Abstract

Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), often produce out-of-distribution or noisy inputs, leading to misalignment between the modalities. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where scanned document images must be accurately mapped to their textual content. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods. We provide further analysis demonstrating improved vision-text feature alignment and robustness to noise.
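
A minimal numpy sketch of the central idea, assuming visual features are mapped to a softmax-weighted average of the LLM's frozen token embeddings so they stay on the language manifold; dimensions and the learnable projection are illustrative.

```python
# Sketch: instead of projecting a visual feature into free space, map it
# to a convex combination (softmax-weighted average) of the LLM's own
# text embeddings. Shapes and weights are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)
vocab, d_llm, d_vis = 1000, 256, 128
E = rng.standard_normal((vocab, d_llm))         # frozen LLM token embeddings
W = rng.standard_normal((d_vis, vocab)) * 0.02  # learnable vision->vocab map

def align(v_feat):
    logits = v_feat @ W                  # score each vocabulary embedding
    p = np.exp(logits - logits.max())
    p /= p.sum()                         # softmax weights over the vocab
    return p @ E                         # weighted average: stays in LLM space

v = rng.standard_normal(d_vis)
print(align(v).shape)                    # (256,) -- a vector in LLM space
```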


Reward-Guided Speculative Decoding for Efficient LLM Reasoning

  • Authors: Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, Caiming Xiong

  • Date: 2025-01-31

  • Paper link: https://arxiv.org/pdf/2501.19324

Abstract

We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD employs a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We theoretically demonstrate that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains over decoding with the target model only (up to 4.4x fewer FLOPs), while achieving significantly better accuracy than parallel decoding methods on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios.
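
A sketch of the threshold-based mixture in plain Python: each step is drafted by the cheap model, scored by a process reward model, and the target model is invoked only when the score falls below a threshold. All three model calls are stubs; the threshold value and step granularity are assumptions.

```python
# Sketch of reward-guided speculative decoding: keep a draft step when the
# PRM scores it above a threshold tau; otherwise fall back to the target
# model. `draft_step`, `target_step`, and `prm_score` are stubs.
def rsd_decode(draft_step, target_step, prm_score, prompt, n_steps, tau=0.7):
    steps, target_calls = [], 0
    for _ in range(n_steps):
        cand = draft_step(prompt, steps)
        if prm_score(prompt, steps, cand) >= tau:
            steps.append(cand)                        # high-reward draft: keep
        else:
            steps.append(target_step(prompt, steps))  # low reward: use target
            target_calls += 1
    return steps, target_calls

# Toy stubs: the draft happens to be "good" on even steps only.
out, calls = rsd_decode(
    draft_step=lambda p, s: f"draft-{len(s)}",
    target_step=lambda p, s: f"target-{len(s)}",
    prm_score=lambda p, s, c: 1.0 if len(s) % 2 == 0 else 0.0,
    prompt="Q", n_steps=4)
print(out, f"target calls: {calls}/4")
```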


TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets

  • Authors: Yuzhe Yang, Yifei Zhang, Minghao Wu, Kaidi Zhang, Yunmiao Zhang, Honghai Yu, Yan Hu, Benyou Wang

  • Date: 2025-02-03

  • Paper link: https://arxiv.org/pdf/2502.01506v2.pdf

Abstract

The study of social emergence has long been a central focus in social science. Traditional modeling approaches, such as rule-based Agent-Based Models (ABMs), struggle to capture the diversity and complexity of human behavior, particularly the irrational factors emphasized in behavioral economics. Recently, large language model (LLM) agents have gained traction as simulation tools for modeling human behavior in social science and role-playing applications. Studies suggest that LLMs can account for cognitive biases, emotional fluctuations, and other non-rational influences, enabling more realistic simulations of socio-economic dynamics. In this work, we introduce TwinMarket, a novel multi-agent framework that leverages LLMs to simulate socio-economic systems. Specifically, we examine how individual behaviors, through interactions and feedback mechanisms, give rise to collective dynamics and emergent phenomena. Through experiments in a simulated stock market environment, we demonstrate how individual actions can trigger group behaviors, leading to emergent outcomes such as financial bubbles and recessions. Our approach provides valuable insights into the complex interplay between individual decision-making and collective socio-economic patterns.


ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

  • Authors: Alec Helbling, Tuna Han Salih Meral, Ben Hoover, Pinar Yanardag, Duen Horng Chau

  • Date: 2025-02-06

  • Paper link: https://arxiv.org/pdf/2502.04320

Abstract

Do the rich representations of multi-modal diffusion transformers (DiTs) exhibit unique properties that enhance their interpretability? We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. Without requiring additional training, ConceptAttention repurposes the parameters of DiT attention layers to produce highly contextualized concept embeddings, contributing the major discovery that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps compared to commonly used cross-attention mechanisms. Remarkably, ConceptAttention even achieves state-of-the-art performance on zero-shot image segmentation benchmarks, outperforming 11 other zero-shot interpretability methods on the ImageNet-Segmentation dataset and on a single-class subset of PascalVOC. Our work contributes the first evidence that the representations of multi-modal DiT models like Flux are highly transferable to vision tasks like segmentation, even outperforming multi-modal foundation models like CLIP.
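
To illustrate the mechanism, the sketch below dot-products concept vectors against image-token vectors in an attention output space and reshapes the scores into spatial saliency maps, with a per-pixel argmax giving a zero-shot segmentation. Random tensors stand in for real DiT activations and learned concept embeddings.

```python
# Sketch of saliency via linear projection in an attention output space:
# score each image token against each concept embedding, reshape to a map,
# and take a per-pixel argmax for zero-shot segmentation. Tensors are toys.
import numpy as np

rng = np.random.default_rng(0)
h, w, d = 16, 16, 64
img_tokens = rng.standard_normal((h * w, d))   # attention-layer outputs
concepts = {"dragon": rng.standard_normal(d),  # concept embeddings (toys)
            "sky": rng.standard_normal(d)}

saliency = {name: (img_tokens @ c).reshape(h, w) for name, c in concepts.items()}
stack = np.stack(list(saliency.values()))      # (n_concepts, h, w)
segmentation = stack.argmax(axis=0)            # per-pixel winning concept
print(segmentation.shape, np.bincount(segmentation.ravel()))
```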


MatAnyone: Stable Video Matting with Consistent Memory Propagation

  • Authors: Peiqing Yang, Shangchen Zhou, Jixin Zhao, Qingyi Tao, Chen Change Loy

  • Date: 2025-01-24

  • Paper link: https://arxiv.org/pdf/2501.14677

  • Project page: https://pq-yang.github.io/projects/MatAnyone/

Abstract

Auxiliary-free human video matting methods, which rely solely on input frames, often struggle with complex or ambiguous backgrounds. To address this, we propose MatAnyone, a robust framework tailored for target-assigned video matting. Specifically, building on a memory-based paradigm, we introduce a consistent memory propagation module via region-adaptive memory fusion, which adaptively integrates memory from the previous frame. This ensures semantic stability in core regions while preserving fine-grained details along object boundaries. For robust training, we present a larger, high-quality, and diverse dataset for video matting. Additionally, we incorporate a novel training strategy that efficiently leverages large-scale segmentation data, boosting matting stability. With this new network design, dataset, and training strategy, MatAnyone delivers robust and accurate video matting results in diverse real-world scenarios, outperforming existing methods.


Great Models Think Alike and this Undermines AI Oversight

  • Authors: Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping

  • Date: 2025-02-06

  • Paper link: https://arxiv.org/pdf/2502.04313

Abstract

As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as “AI Oversight”. We study how model similarity affects both aspects of AI oversight by proposing a probabilistic metric for LM similarity based on overlap in model mistakes. Using this metric, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from “weak-to-strong generalization”. As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend – model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.
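
In the spirit of the paper's similarity metric (not its exact definition), here is a chance-adjusted error-overlap score: how often two models err on the same items, corrected for what their individual error rates alone would predict.

```python
# Chance-adjusted error-overlap similarity (Cohen-kappa-style correction):
# agreement on per-item wrong/right labels, minus the agreement expected
# from the two marginal error rates. Data below is synthetic.
import numpy as np

def error_overlap_similarity(wrong_a, wrong_b):
    wrong_a, wrong_b = np.asarray(wrong_a, bool), np.asarray(wrong_b, bool)
    agree = np.mean(wrong_a == wrong_b)            # observed agreement
    pa, pb = wrong_a.mean(), wrong_b.mean()
    chance = pa * pb + (1 - pa) * (1 - pb)         # expected by chance
    return (agree - chance) / (1 - chance + 1e-12)

rng = np.random.default_rng(0)
base = rng.random(1000) < 0.3          # shared failure modes
a = base | (rng.random(1000) < 0.05)   # model A's extra mistakes
b = base | (rng.random(1000) < 0.05)   # model B's extra mistakes
print(round(error_overlap_similarity(a, b), 3))  # high: correlated failures
```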


Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment

  • Authors: Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao

  • Date: 2025-02-06

  • Paper link: https://arxiv.org/pdf/2502.04328

Abstract

Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, there is still a notable lag behind specialized single-modality models in performance. In this paper, we present Ola, an Omni-modal language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts. The core design of Ola lies in its progressive modality alignment strategy that extends the supporting modality of the language model progressively. Our training pipeline begins with the most distinct modalities: image and text, then gradually expands the skill sets of the model using speech data that connects language and audio knowledge, and video data that connects all modalities. The progressive learning pipeline also enables us to maintain a relatively small size of the cross-modal alignment data, making developing omni-modal from existing vision-language models easy and less costly. Moreover, to unlock an advanced interactive experience like GPT-4o, we further design a sentence-wise decoding solution for streaming speech generation. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.


DynVFX: Augmenting Real Videos with Dynamic Content

  • Authors: Danah Yatim, Rafail Fridman, Omer Bar-Tal, Tali Dekel

  • Date: 2025-02-05

  • Paper link: https://arxiv.org/pdf/2502.03621

Abstract

We present a method for augmenting real-world videos with newly generated dynamic content. Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects that naturally interact with the existing scene over time. The position, appearance, and motion of the new content are seamlessly integrated into the original footage while accounting for camera motion, occlusions, and interactions with other dynamic objects in the scene, resulting in a cohesive and realistic output video. We achieve this via a zero-shot, training-free framework that harnesses a pre-trained text-to-video diffusion transformer to synthesize the new content and a pre-trained Vision Language Model to envision the augmented scene in detail. Specifically, we introduce a novel inference-based method that manipulates features within the attention mechanism, enabling accurate localization and seamless integration of the new content while preserving the integrity of the original scene. Our method is fully automated, requiring only a simple user instruction. We demonstrate its effectiveness on a wide range of edits applied to real-world videos, encompassing diverse objects and scenarios involving both camera and object motion.


SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model

  • Authors: Xun Liang, Simin Niu, Zhiyu Li, Sensen Zhang, Hanyu Wang, Feiyu Xiong, Jason Zhaoxin Fan, Bo Tang, Shichao Song, Mengwei Wang, Jiawei Yang

  • Date: 2025-01-28

  • Paper link: https://arxiv.org/pdf/2501.18636

Abstract

The indexing-retrieval-generation paradigm of retrieval-augmented generation (RAG) has been highly successful in solving knowledge-intensive tasks by integrating external knowledge into large language models (LLMs). However, the incorporation of external and unverified knowledge increases the vulnerability of LLMs because attackers can perform attack tasks by manipulating knowledge. In this paper, we introduce a benchmark named SafeRAG designed to evaluate the RAG security. First, we classify attack tasks into silver noise, inter-context conflict, soft ad, and white Denial-of-Service. Next, we construct RAG security evaluation dataset (i.e., SafeRAG dataset) primarily manually for each task. We then utilize the SafeRAG dataset to simulate various attack scenarios that RAG may encounter. Experiments conducted on 14 representative RAG components demonstrate that RAG exhibits significant vulnerability to all attack tasks and even the most apparent attack task can easily bypass existing retrievers, filters, or advanced LLMs, resulting in the degradation of RAG service quality. Code is available at: https://github.com/IAAR-Shanghai/SafeRAG.


ACECODER: Acing Coder RL via Automated Test-Case Synthesis

  • Authors: Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, Wenhu Chen

  • Date: 2025-02-03

  • Paper link: https://arxiv.org/pdf/2502.01718

Abstract

Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data/model in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. Specifically, we design a pipeline that generates extensive (question, test-cases) pairs from existing code data. Using these test cases, we construct preference pairs based on pass rates over sampled programs to train reward models with Bradley-Terry loss. It shows an average of 10-point improvement for Llama-3.1-8B-Ins and 5-point improvement for Qwen2.5-Coder-7B-Ins through best-of-32 sampling, making the 7B model on par with 236B DeepSeek-V2.5. Furthermore, we conduct reinforcement learning with both reward models and test-case pass rewards, leading to consistent improvements across HumanEval, MBPP, BigCodeBench, and LiveCodeBench (V4). Notably, we follow the R1-style training to start from Qwen2.5-Coder-base directly and show that our RL training can improve model on HumanEval-plus by over 25% and MBPP-plus by 6% for merely 80 optimization steps. We believe our results highlight the huge potential of reinforcement learning in coder models.
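
The sketch below illustrates the pipeline's core step under simplified assumptions: sampled programs are executed against synthesized test cases, pass rates define preference pairs, and a Bradley-Terry loss trains the reward model on those pairs. `run_tests`, the margin, and the toy programs are all illustrative.

```python
# Sketch: turn test-case pass rates into Bradley-Terry preference pairs.
# `run_tests` stands in for sandboxed program execution.
import math

def pass_rate(program, tests, run_tests):
    return sum(run_tests(program, t) for t in tests) / len(tests)

def make_pairs(programs, tests, run_tests, margin=0.4):
    scored = [(p, pass_rate(p, tests, run_tests)) for p in programs]
    # Prefer a program whose pass rate beats another's by at least `margin`.
    return [(a, b) for a, ra in scored for b, rb in scored if ra - rb >= margin]

def bradley_terry_loss(score_chosen, score_rejected):
    # -log sigmoid(s_chosen - s_rejected): the usual reward-model objective
    return -math.log(1 / (1 + math.exp(-(score_chosen - score_rejected))))

# Toy setup: tests are (input, expected) pairs; programs are callables.
tests = [(2, 4), (3, 9), (-1, 1)]
progs = [lambda x: x * x, lambda x: x + x]
run = lambda prog, t: prog(t[0]) == t[1]
pairs = make_pairs(progs, tests, run)
print(len(pairs), "preference pair(s);", bradley_terry_loss(1.3, -0.2))
```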


Inverse Bridge Matching Distillation

  • Authors: Nikita Gushchin, David Li, Daniil Selikhanovych, Evgeny Burnaev, Dmitry Baranchuk, Alexander Korotin

  • Date: 2025-02-03

  • Paper link: https://arxiv.org/pdf/2502.01362

Abstract

Learning diffusion bridge models is easy; making them fast and practical is an art. Diffusion bridge models (DBMs) are a promising extension of diffusion models for applications in image-to-image translation. However, like many modern diffusion and flow models, DBMs suffer from the problem of slow inference. To address it, we propose a novel distillation technique based on the inverse bridge matching formulation and derive the tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional types of DBMs, distill models in a one-step generator, and use only the corrupted images for training. We evaluate our approach for both conditional and unconditional types of bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique allows us to accelerate the inference of DBMs from 4x to 100x and even provide better generation quality than used teacher model depending on particular setup.


Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

  • Authors: Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi DAI, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, Wei Xue

  • Date: 2025-02-06

  • Paper link: https://arxiv.org/pdf/2502.04128

Abstract

Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose a simple framework Llasa for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we released the checkpoint and training code for our TTS model (1B, 3B, 8B) and codec model publicly available.


SliderSpace: Decomposing the Visual Capabilities of Diffusion Models

  • Authors: Rohit Gandikota, Zongze Wu, Richard Zhang, David Bau, Eli Shechtman, Nick Kolkin

  • Date: 2025-02-03

  • Paper link: https://arxiv.org/pdf/2502.01639

  • Project page: https://sliderspace.baulab.info

Abstract

We present SliderSpace, a framework for automatically decomposing the visual capabilities of diffusion models into controllable and human-understandable directions. Unlike existing control methods that require a user to specify attributes for each edit direction individually, SliderSpace discovers multiple interpretable and diverse directions simultaneously from a single text prompt. Each direction is trained as a low-rank adaptor, enabling compositional control and the discovery of surprising possibilities in the model’s latent space. Through extensive experiments on state-of-the-art diffusion models, we demonstrate SliderSpace’s effectiveness across three applications: concept decomposition, artistic style exploration, and diversity enhancement. Our quantitative evaluation shows that SliderSpace-discovered directions decompose the visual structure of model’s knowledge effectively, offering insights into the latent capabilities encoded within diffusion models. User studies further validate that our method produces more diverse and useful variations compared to baselines. Our code, data and trained weights are available at https://sliderspace.baulab.info


Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models

  • Authors: Qika Lin, Tianzhe Zhao, Kai He, Zhen Peng, Fangzhi Xu, Ling Huang, Jingying Ma, Mengling Feng

  • Date: 2025-01-30

  • Paper link: https://arxiv.org/pdf/2501.18119

Abstract

Due to the presence of the natural gap between Knowledge Graph (KG) structures and the natural language, the effective integration of holistic structural information of KGs with Large Language Models (LLMs) has emerged as a significant question. To this end, we propose a two-stage framework to learn and apply quantized codes for each entity, aiming for the seamless integration of KGs with LLMs. Firstly, a self-supervised quantized representation (SSQR) method is proposed to compress both KG structural and semantic knowledge into discrete codes (i.e., tokens) that align the format of language sentences. We further design KG instruction-following data by viewing these learned codes as features to directly input to LLMs, thereby achieving seamless integration. The experiment results demonstrate that SSQR outperforms existing unsupervised quantized methods, producing more distinguishable codes. Further, the fine-tuned LLaMA2 and LLaMA3.1 also have superior performance on KG link prediction and triple classification tasks, utilizing only 16 tokens per entity instead of thousands in conventional prompting methods.


BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation

  • Authors: Bo Pang, Hanze Dong, Jiacheng Xu, Silvio Savarese, Yingbo Zhou, Caiming Xiong

  • Date: 2025-02-06

  • Paper link: https://arxiv.org/pdf/2502.03860

Abstract

Large language models (LLMs), such as o1 from OpenAI, have demonstrated remarkable reasoning capabilities. o1 generates a long chain-of-thought (LongCoT) before answering a question. LongCoT allows LLMs to analyze problems, devise plans, reflect, and backtrack effectively. These actions empower LLM to solve complex problems. After the release of o1, many teams have attempted to replicate its LongCoT and reasoning capabilities. In terms of methods, they primarily rely on knowledge distillation with data from existing models with LongCoT capacities (e.g., OpenAI-o1, Qwen-QwQ, DeepSeek-R1-Preview), leaving significant uncertainties on systematically developing such reasoning abilities. In terms of data domains, these works focus narrowly on math while a few others include coding, limiting their generalizability. This paper introduces a novel approach to enable LLM’s LongCoT capacity without distillation from o1-like models or expensive human annotations, where we bootstrap LongCoT (BOLT) from a standard instruct model. BOLT involves three stages: 1) LongCoT data bootstrapping with in-context learning on a standard instruct model; 2) LongCoT supervised finetuning; 3) online training to further refine LongCoT capacities. In BOLT, only a few in-context examples need to be constructed during the bootstrapping stage; in our experiments, we created 10 examples, demonstrating the feasibility of this approach. We use Llama-3.1-70B-Instruct to bootstrap LongCoT and apply our method to various model scales (7B, 8B, 70B). We achieve impressive performance on a variety of benchmarks, Arena-Hard, MT-Bench, WildBench, ZebraLogic, MATH500, which evaluate diverse task-solving and reasoning capabilities.


Scaling Embedding Layers in Language Models

  • Authors: Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, Chiyuan Zhang

  • Date: 2025-02-03

  • Paper link: https://arxiv.org/pdf/2502.01637

Abstract

We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a method for extending input embedding layers to enhance language model performance as layer size scales. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide contextualized representation for each input token and are learned with a separate model during training. During inference, they are precomputed and stored in off-accelerator memory with minimal impact on inference speed. SCONE enables two new scaling strategies: increasing the number of cached n-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects allows SCONE to outperform a 1.9B parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
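
A minimal sketch of the inference-time path, assuming a longest-match lookup of frequent n-grams into a precomputed embedding table (held in ordinary host memory) that is added to each token's input embedding; the vocabulary, table, and dimensions are toy stand-ins, not the SCONE implementation.

```python
# Sketch of n-gram embedding lookup at inference time: for each position,
# find the longest cached n-gram ending there and add its precomputed
# embedding to the token's own embedding. All tables are toys.
import numpy as np

rng = np.random.default_rng(0)
d = 8
tok_emb = {w: rng.standard_normal(d) for w in "the cat sat on mat".split()}
# Precomputed embeddings for a small set of frequent n-grams (the cache).
ngram_emb = {("the", "cat"): rng.standard_normal(d),
             ("the", "cat", "sat"): rng.standard_normal(d)}

def embed(tokens, max_n=3):
    out = []
    for i, tok in enumerate(tokens):
        e = tok_emb[tok].copy()
        for n in range(max_n, 1, -1):            # longest match first
            gram = tuple(tokens[max(0, i - n + 1):i + 1])
            if gram in ngram_emb:
                e += ngram_emb[gram]             # contextualized boost
                break
        out.append(e)
    return np.stack(out)

print(embed("the cat sat on the mat".split()).shape)   # (6, 8)
```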


DeepRAG: Thinking to Retrieval Step by Step for Large Language Models

  • Authors: Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Jie Zhou

  • Date: 2025-02-03

  • Paper link: https://arxiv.org/pdf/2502.01142

Abstract

Large Language Models (LLMs) have shown remarkable potential in reasoning while they still suffer from severe factual hallucinations due to timeliness, accuracy, and coverage of parametric knowledge. Meanwhile, integrating reasoning with retrieval-augmented generation (RAG) remains challenging due to ineffective task decomposition and redundant retrieval, which can introduce noise and degrade response quality. In this paper, we propose DeepRAG, a framework that models retrieval-augmented reasoning as a Markov Decision Process (MDP), enabling strategic and adaptive retrieval. By iteratively decomposing queries, DeepRAG dynamically determines whether to retrieve external knowledge or rely on parametric reasoning at each step. Experiments show that DeepRAG improves retrieval efficiency while improving answer accuracy by 21.99%, demonstrating its effectiveness in optimizing retrieval-augmented reasoning.
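
A plain-Python sketch of the retrieve-or-reason loop viewed as an MDP: each subquery is a state, and a policy decides whether to answer from parametric knowledge or retrieve external evidence first. All components are stubs, not the DeepRAG implementation.

```python
# Sketch of step-by-step retrieval as an MDP: at each subquery (state), a
# policy picks an action -- answer parametrically or retrieve first.
# `decompose`, `decide`, `retrieve`, and `answer` are all stubs.
def deep_rag(question, decompose, decide, retrieve, answer, max_steps=5):
    context, trace = [], []
    for sub_q in decompose(question)[:max_steps]:
        if decide(sub_q, context) == "retrieve":
            context.append(retrieve(sub_q))            # external knowledge
        trace.append((sub_q, answer(sub_q, context)))  # intermediate answer
    return trace

trace = deep_rag(
    "Who directed the film that won Best Picture in 2020?",
    decompose=lambda q: ["Which film won Best Picture in 2020?",
                         "Who directed that film?"],
    decide=lambda q, ctx: "retrieve" if "won" in q else "parametric",
    retrieve=lambda q: f"[doc about: {q}]",
    answer=lambda q, ctx: f"answer({q!r}, docs={len(ctx)})")
for step in trace:
    print(step)
```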


MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models

  • Authors: Huanqia Cai, Yijun Yang, Winston Hu

  • Date: 2025-02-02

  • Paper link: https://arxiv.org/pdf/2502.00698

Abstract

IQ testing has served as a foundational methodology for evaluating human cognitive capabilities, deliberately decoupling assessment from linguistic background, language proficiency, or domain-specific knowledge to isolate core competencies in abstraction and reasoning. Yet, artificial intelligence research currently lacks systematic benchmarks to quantify these critical cognitive dimensions in multimodal systems. To address this critical gap, we propose MM-IQ, a comprehensive evaluation framework comprising 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms. Through systematic evaluation of leading open-source and proprietary multimodal models, our benchmark reveals striking limitations: even state-of-the-art architectures achieve only marginally superior performance to random chance (27.49% vs. 25% baseline accuracy). This substantial performance chasm highlights the inadequacy of current multimodal systems in approximating fundamental human reasoning capacities, underscoring the need for paradigm-shifting advancements to bridge this cognitive divide.

