专属领域论文订阅
关注{晓理紫|小李子},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持
如果你感觉对你有所帮助,请关注我,每日准时为你推送最新论文。
为了答谢各位网友的支持,从今日起免费为300名读者提供订阅主题论文服务:只需VX关注公众号并回复{邮箱+论文主题}(如:123456@xx.com + chatgpt@large language model @LLM),主题必须属于同一领域,最多三个关键词。解释权归博主所有。
分类:
- 大语言模型LLM
- 视觉模型VLM
- 扩散模型
- 视觉语言导航VLN
- 强化学习 RL
- 模仿学习 IL
- 机器人
- 开放词汇检测与分割
[晓理紫]每日论文分享(有中文摘要,源码或项目地址)
== LLM ==
标题: SymbolicAI: A framework for logic-based approaches combining generative models and solvers
作者: Marius-Constantin Dinu, Claudiu Leoveanu-Condrei, Markus Holzleitner
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2402.00854v1
GitHub: https://github.com/ExtensityAI/symbolicai|https://github.com/ExtensityAI/benchmark|
中文摘要: 我们介绍SymbolicAI,这是一个通用的模块化框架,采用基于逻辑的方法来进行生成过程中的概念学习和流程管理。SymbolicAI将大型语言模型(LLMs)视为基于自然语言和形式语言指令执行任务的语义解析器,实现了生成模型与各种求解器的无缝集成,从而弥合了符号推理和生成式人工智能之间的差距。我们利用概率编程原理来处理复杂任务,并结合可微分编程与经典编程范式各自的优势。该框架为数据流操作引入了一组多态、可组合和自引用的操作,使LLM输出与用户目标保持一致。因此,我们可以在具备零样本和少样本学习能力的各种基础模型,与擅长解决特定问题的专门微调模型或求解器之间自由切换。该框架也由此便于创建和评估可解释的计算图。最后,我们引入了一个质量度量及其用于评估这些计算图的经验分数,并提出了一个基准,在一组复杂的工作流中比较各种最先进的LLM。我们将该经验分数称为“通过交叉相似性进行关系轨迹评估的向量嵌入”(Vector Embedding for Relational Trajectory Evaluation through Cross-similarity),简称VERTEX分数。框架代码库和基准链接如下。
摘要: We introduce SymbolicAI, a versatile and modular framework employing a logic-based approach to concept learning and flow management in generative processes. SymbolicAI enables the seamless integration of generative models with a diverse range of solvers by treating large language models (LLMs) as semantic parsers that execute tasks based on both natural and formal language instructions, thus bridging the gap between symbolic reasoning and generative AI. We leverage probabilistic programming principles to tackle complex tasks, and utilize differentiable and classical programming paradigms with their respective strengths. The framework introduces a set of polymorphic, compositional, and self-referential operations for data stream manipulation, aligning LLM outputs with user objectives. As a result, we can transition between the capabilities of various foundation models endowed with zero- and few-shot learning capabilities and specialized, fine-tuned models or solvers proficient in addressing specific problems. In turn, the framework facilitates the creation and evaluation of explainable computational graphs. We conclude by introducing a quality measure and its empirical score for evaluating these computational graphs, and propose a benchmark that compares various state-of-the-art LLMs across a set of complex workflows. We refer to the empirical score as the “Vector Embedding for Relational Trajectory Evaluation through Cross-similarity”, or VERTEX score for short. The framework codebase and benchmark are linked below.
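下面给出一个极简的示意性代码,说明“把LLM当作语义解析器、再把形式化结果交给经典求解器”这一思路:其中 llm_parse 为假设的占位函数,求解器用 sympy 代替,并非 SymbolicAI 的真实 API。

```python
# 示意性代码:LLM作为语义解析器,把自然语言指令解析为形式表达式,
# 再交给符号求解器(这里用 sympy)求解。llm_parse 为假设的占位函数。
from sympy import symbols, Eq, solve, sympify

def llm_parse(instruction: str) -> dict:
    """占位:真实系统中由LLM把自然语言解析为结构化的形式语言描述。"""
    # 这里直接返回硬编码结果,仅用于演示数据流。
    return {"variable": "x", "lhs": "2*x + 3", "rhs": "11"}

def solve_with_solver(parsed: dict):
    """把LLM产出的形式化描述交给经典求解器,体现“生成模型+求解器”的衔接。"""
    x = symbols(parsed["variable"])
    equation = Eq(sympify(parsed["lhs"]), sympify(parsed["rhs"]))
    return solve(equation, x)

print(solve_with_solver(llm_parse("求解 2x + 3 = 11")))  # 输出: [4]
```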
标题: ALISON: Fast and Effective Stylometric Authorship Obfuscation
作者: Eric Xing, Saranya Venkatraman, Thai Le
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2402.00835v1
GitHub: https://github.com/EricX003/ALISON|
中文摘要: 作者归属(AA)和作者混淆(AO)是隐私研究中两个日益重要、相互对抗的任务。现代AA利用作者一致的写作风格,使用AA分类器将文本与其作者进行匹配。AO是相应的对抗性任务,旨在以保留语义的方式修改文本,使AA模型无法正确推断其作者身份。为了解决最先进(SOTA)的AA方法引发的隐私问题,人们提出了新的AO方法,但由于其训练和混淆速度过慢(常常需要数小时),这些方法在很大程度上仍不实用。针对这一挑战,我们提出了一种实用的AO方法ALISON,它(1)显著减少了训练/混淆时间,混淆速度比SOTA AO方法快10倍以上;(2)通过在两个基准数据集上攻击三种基于Transformer的AA方法,实现了更好的混淆效果,通常比竞争方法好15%;(3)在混淆过程中不需要来自目标AA分类器的直接信号;(4)利用独特的文体特征,支持合理的模型解释,实现可解释的混淆。我们还证明了ALISON可以有效防止四种SOTA AA方法准确判断ChatGPT生成文本的作者身份,同时尽量少地改变原文语义。为确保研究结果可复现,我们的代码和数据可在以下网址获取:https://github.com/EricX003/ALISON。
摘要: Authorship Attribution (AA) and Authorship Obfuscation (AO) are two competing tasks of increasing importance in privacy research. Modern AA leverages an author’s consistent writing style to match a text to its author using an AA classifier. AO is the corresponding adversarial task, aiming to modify a text in such a way that its semantics are preserved, yet an AA model cannot correctly infer its authorship. To address privacy concerns raised by state-of-the-art (SOTA) AA methods, new AO methods have been proposed but remain largely impractical to use due to their prohibitively slow training and obfuscation speed, often taking hours. To this challenge, we propose a practical AO method, ALISON, that (1) dramatically reduces training/obfuscation time, demonstrating more than 10x faster obfuscation than SOTA AO methods, (2) achieves better obfuscation success through attacking three transformer-based AA methods on two benchmark datasets, typically performing 15% better than competing methods, (3) does not require direct signals from a target AA classifier during obfuscation, and (4) utilizes unique stylometric features, allowing sound model interpretation for explainable obfuscation. We also demonstrate that ALISON can effectively prevent four SOTA AA methods from accurately determining the authorship of ChatGPT-generated texts, all while minimally changing the original text semantics. To ensure the reproducibility of our findings, our code and data are available at: https://github.com/EricX003/ALISON.
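为帮助理解“风格特征+分类器”的作者归属(AA)基本流程,下面给出一个基于字符 n-gram 的极简示意(使用 sklearn,数据为虚构示例),它只说明AA的一般做法,并非 ALISON 的实现。

```python
# 示意性代码:用字符 n-gram 风格特征训练一个简单的作者归属(AA)分类器。
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["I reckon the weather will turn soon.",
         "The experimental results are summarized below.",
         "Honestly, I reckon we should just go.",
         "We summarize the results of the experiments."]
authors = ["A", "B", "A", "B"]

# 字符级 n-gram 能捕捉标点、词缀等写作风格线索
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, authors)
print(clf.predict(["I reckon this is fine."]))  # 输出预测的作者
```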
标题: Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents
作者: Zelong Li, Wenyue Hua, Hao Wang
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2402.00798v1
GitHub: https://github.com/agiresearch/Formal-LLM|
中文摘要: 大型语言模型(LLMs)的最新进展使人工智能代理能够自动生成和执行多步计划来解决复杂的任务。然而,由于LLM的内容生成过程难以控制,当前基于LLM的代理经常生成无效或不可执行的计划,这损害了所生成计划的性能,并破坏了用户对基于LLM的代理的信任。为此,本文通过整合自然语言的表达性和形式语言的精确性,为基于LLM的代理提出了一个新的“Formal-LLM”框架。具体来说,该框架允许人类用户以自动机的形式表达他们对规划过程的需求或约束。然后在自动机的监督下进行基于堆栈的LLM计划生成过程,以确保生成的计划满足约束,从而使规划过程可控。我们在基准任务和实际生活任务上进行了实验,我们的框架实现了超过50%的整体性能提升,这验证了采用Formal-LLM来指导代理的计划生成、防止代理生成无效和不成功计划的可行性和有效性。此外,可控性更强的基于LLM的代理可以促进LLM在对规划有效性要求很高的应用场景中的更广泛应用。这项工作已开源:https://github.com/agiresearch/Formal-LLM。
摘要: Recent advancements on Large Language Models (LLMs) enable AI Agents to automatically generate and execute multi-step plans to solve complex tasks. However, since LLM’s content generation process is hardly controllable, current LLM-based agents frequently generate invalid or non-executable plans, which jeopardizes the performance of the generated plans and corrupts users’ trust in LLM-based agents. In response, this paper proposes a novel ``Formal-LLM’’ framework for LLM-based agents by integrating the expressiveness of natural language and the precision of formal language. Specifically, the framework allows human users to express their requirements or constraints for the planning process as an automaton. A stack-based LLM plan generation process is then conducted under the supervision of the automaton to ensure that the generated plan satisfies the constraints, making the planning process controllable. We conduct experiments on both benchmark tasks and practical real-life tasks, and our framework achieves over 50% overall performance increase, which validates the feasibility and effectiveness of employing Formal-LLM to guide the plan generation of agents, preventing the agents from generating invalid and unsuccessful plans. Further, more controllable LLM-based agents can facilitate the broader utilization of LLM in application scenarios where high validity of planning is essential. The work is open-sourced at https://github.com/agiresearch/Formal-LLM.
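摘要中“在自动机监督下生成计划”的思想可以用下面的极简草图来说明:每一步只允许LLM在当前状态的合法动作中做选择。其中 choose_action 与状态转移表均为假设的示例,并非 Formal-LLM 的官方实现。

```python
# 示意性代码:用有限自动机约束计划生成,每步只能选择当前状态下合法的动作。
transitions = {
    "start":      {"search_web": "collected", "ask_user": "start"},
    "collected":  {"summarize": "summarized"},
    "summarized": {"write_report": "done"},
}

def choose_action(state: str, candidates: list[str]) -> str:
    """占位:真实系统中由LLM根据任务描述在合法候选动作中选择。"""
    return candidates[0]

def generate_plan(start: str = "start", goal: str = "done", max_steps: int = 10):
    state, plan = start, []
    for _ in range(max_steps):
        if state == goal:
            break
        candidates = list(transitions.get(state, {}))
        if not candidates:          # 无合法动作,提前终止
            break
        action = choose_action(state, candidates)
        plan.append(action)
        state = transitions[state][action]
    return plan, state

print(generate_plan())  # (['search_web', 'summarize', 'write_report'], 'done')
```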
标题: Generative quantum machine learning via denoising diffusion probabilistic models
作者: Bingzhi Zhang, Peng Xu, Xiaohui Chen
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2310.05866v3
GitHub: https://github.com/francis-hsu/quantgenmdl|
中文摘要: 深度生成模型是计算机视觉、文本生成和大型语言模型的关键使能技术。去噪扩散概率模型(DDPMs)最近受到了很多关注,因为它们能够在许多计算机视觉任务中生成多样化和高质量的样本,并且结合了灵活的模型架构和相对简单的训练方案。由纠缠和叠加增强的量子生成模型为学习经典和量子数据带来了新的见解。受经典对应模型的启发,我们提出了量子去噪扩散概率模型(QuDDPM),以实现量子数据的高效可训练生成学习。QuDDPM采用足够多的电路层来保证表达能力,同时引入多个中间训练任务作为目标分布和噪声之间的插值,以避免贫瘠高原(barren plateau)并保证高效训练。我们给出了学习误差的界限,并证明了QuDDPM在学习相关量子噪声模型、量子多体相和量子数据拓扑结构方面的能力。这些结果为多功能和高效的量子生成学习提供了一个范例。
摘要: Deep generative models are key-enabling technology to computer vision, text generation and large language models. Denoising diffusion probabilistic models (DDPMs) have recently gained much attention due to their ability to generate diverse and high-quality samples in many computer vision tasks, as well as to incorporate flexible model architectures and relatively simple training scheme. Quantum generative models, empowered by entanglement and superposition, have brought new insight to learning classical and quantum data. Inspired by the classical counterpart, we propose the \emph{quantum denoising diffusion probabilistic model} (QuDDPM) to enable efficiently trainable generative learning of quantum data. QuDDPM adopts sufficient layers of circuits to guarantee expressivity, while introduces multiple intermediate training tasks as interpolation between the target distribution and noise to avoid barren plateau and guarantee efficient training. We provide bounds on the learning error and demonstrate QuDDPM’s capability in learning correlated quantum noise model, quantum many-body phases and topological structure of quantum data. The results provide a paradigm for versatile and efficient quantum generative learning.
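为便于理解“在目标分布与噪声之间构造一系列中间任务”的插值思想,下面给出经典 DDPM 前向加噪过程的数值示意(纯经典、与量子电路无关,噪声调度为假设取值)。

```python
# 示意性代码:经典 DDPM 的前向加噪过程,在目标分布与纯噪声之间构造一系列中间分布。
import numpy as np

rng = np.random.default_rng(0)
T = 10
betas = np.linspace(1e-3, 0.2, T)           # 噪声调度(假设取值)
alphas_bar = np.cumprod(1.0 - betas)        # 累积保留系数

x0 = rng.normal(loc=2.0, scale=0.5, size=1000)   # “目标分布”样本

def noisy_sample(x0, t):
    """返回第 t 步的加噪样本:sqrt(alpha_bar_t)*x0 + sqrt(1-alpha_bar_t)*eps"""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

for t in [0, T // 2, T - 1]:
    xt = noisy_sample(x0, t)
    print(f"t={t}: mean={xt.mean():.2f}, std={xt.std():.2f}")
```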
标题: Health-LLM: Personalized Retrieval-Augmented Disease Prediction Model
作者: Mingyu Jin, Qinkai Yu, Chong Zhang
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2402.00746v1
GitHub: https://github.com/jmyissb/HealthLLM|
中文摘要: 医疗保健领域的人工智能(AI)显著推进了智能医疗。然而,传统的智能医疗受限于静态数据和统一标准,无法与个体情况充分结合,还面临其他挑战。因此,需要开发更专业、更细致的智能医疗方法。为此,我们提出了一个创新的框架Health-LLM,它结合了大规模特征提取和医学知识权衡评分。与传统的健康管理方法相比,我们的方法有三个主要优势。首先,我们的方法将健康报告集成到一个大模型中,以提供详细的任务信息。第二,使用专业的医学知识来调整健康特征的加权得分。第三,我们使用半自动特征提取框架来增强语言模型的分析能力,并融入专家见解,以提高疾病预测的准确性。我们在大量健康报告上进行了疾病预测实验,以评估Health-LLM的有效性。实验结果表明,该方法优于传统方法,具有革新疾病预测和个性化健康管理的潜力。代码可在 https://github.com/jmyissb/HealthLLM 获取。
摘要: Artificial intelligence (AI) in healthcare has significantly advanced intelligent medical treatment. However, traditional intelligent healthcare is limited by static data and unified standards, preventing full integration with individual situations and other challenges. Hence, a more professional and detailed intelligent healthcare method is needed for development. To this end, we propose an innovative framework named Heath-LLM, which combines large-scale feature extraction and medical knowledge trade-off scoring. Compared to traditional health management methods, our approach has three main advantages. First, our method integrates health reports into a large model to provide detailed task information. Second, professional medical expertise is used to adjust the weighted scores of health characteristics. Third, we use a semi-automated feature extraction framework to enhance the analytical power of language models and incorporate expert insights to improve the accuracy of disease prediction. We have conducted disease prediction experiments on a large number of health reports to assess the effectiveness of Health-LLM. The results of the experiments indicate that the proposed method surpasses traditional methods and has the potential to revolutionize disease prediction and personalized health management. The code is available at https://github.com/jmyissb/HealthLLM.
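下面是一个“检索增强的疾病预测”流程的极简示意:先检索最相关的医学知识片段,再拼接成提示交给LLM。其中知识片段、报告内容与 call_llm 均为虚构的占位,并非 Health-LLM 的真实实现。

```python
# 示意性代码:检索增强的提示构造(TF-IDF 检索 + 提示拼接 + 占位LLM调用)。
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge = [
    "空腹血糖持续高于7.0 mmol/L提示糖尿病风险。",
    "低密度脂蛋白偏高与动脉粥样硬化相关。",
    "血压持续高于140/90 mmHg提示高血压。",
]
report = "体检报告:空腹血糖8.1 mmol/L,其余指标正常。"

# 用字符 n-gram 的 TF-IDF 做一个极简检索器(中文无需分词)
vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
mat = vec.fit_transform(knowledge + [report])
sims = cosine_similarity(mat[len(knowledge):], mat[:len(knowledge)]).ravel()
top_snippet = knowledge[int(sims.argmax())]

prompt = f"已知医学知识:{top_snippet}\n健康报告:{report}\n请评估可能的疾病风险。"

def call_llm(prompt: str) -> str:
    """占位:实际系统中此处调用大语言模型得到预测结果。"""
    return "(LLM 输出略)"

print(prompt)
print(call_llm(prompt))
```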
标题: Augmenting Math Word Problems via Iterative Question Composing
作者: Haoxiong Liu, Yifan Zhang, Yifan Luo
PubTime: 2024-01-30
Downlink: http://arxiv.org/abs/2401.09003v3
Project: https://huggingface.co/datasets/Vivacem/MMIQC|
GitHub: https://github.com/iiis-ai/IterativeQuestionComposing|
中文摘要: 尽管用于数学推理的大型语言模型(LLMs)取得了进步,但解决竞赛级别的数学问题仍然是一个重大挑战,特别是对于没有外部工具的开源LLM。我们引入了MMIQC数据集,它由处理过的网页数据和合成的问题-回答对混合组成,旨在增强基础语言模型的数学推理能力。在MMIQC上微调的模型在MATH基准上的表现在各种模型规模下都始终超过对应的基线模型。值得注意的是,Qwen-72B-MMIQC实现了45.0%的准确率,比之前的开源最先进水平高出8.2%,并超过了2023年发布的初始版本GPT-4。在匈牙利高中期末考试上的广泛评估结果表明,这种改进可以推广到未见过的数据。我们对MMIQC的消融研究表明,很大一部分改进可以归因于我们提出的新增广方法,即迭代问题合成(IQC),它使用一个LLM从种子问题迭代合成新问题,并通过另一个LLM进行拒绝采样。MMIQC数据集已发布在HuggingFace Hub:https://huggingface.co/datasets/Vivacem/MMIQC。我们的代码见 https://github.com/iiis-ai/IterativeQuestionComposing。
摘要: Despite the advancements in large language models (LLMs) for mathematical reasoning, solving competition-level math problems remains a significant challenge, especially for open-source LLMs without external tools. We introduce the MMIQC dataset, comprising a mixture of processed web data and synthetic question-response pairs, aimed at enhancing the mathematical reasoning capabilities of base language models. Models fine-tuned on MMIQC consistently surpass their counterparts in performance on the MATH benchmark across various model sizes. Notably, Qwen-72B-MMIQC achieves a 45.0% accuracy, exceeding the previous open-source state-of-the-art by 8.2% and outperforming the initial version GPT-4 released in 2023. Extensive evaluation results on Hungarian high school finals suggest that such improvement can generalize to unseen data. Our ablation study on MMIQC reveals that a large part of the improvement can be attributed to our novel augmentation method, Iterative Question Composing (IQC), which involves iteratively composing new questions from seed problems using an LLM and applying rejection sampling through another LLM. The MMIQC dataset is available on the HuggingFace hub at https://huggingface.co/datasets/Vivacem/MMIQC. Our code is available at https://github.com/iiis-ai/IterativeQuestionComposing.
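迭代问题合成(IQC)“用一个LLM合成新题、再用另一个LLM做拒绝采样”的流程可用如下草图表示;compose_llm 与 verify_llm 均为假设的占位函数,仅示意数据增广的控制流。

```python
# 示意性代码:迭代问题合成(IQC)的基本流程:合成新题 + 拒绝采样。
def compose_llm(seed_question: str) -> dict:
    """占位:由LLM基于种子题目改写/加深,返回新题及参考答案。"""
    return {"question": seed_question + "(变体)", "answer": "42"}

def verify_llm(question: str, answer: str) -> bool:
    """占位:由另一个LLM独立求解并检查答案是否一致,用于拒绝采样。"""
    return True

def iterative_question_composing(seed: str, rounds: int = 3) -> list[dict]:
    dataset, current = [], seed
    for _ in range(rounds):
        sample = compose_llm(current)
        if verify_llm(sample["question"], sample["answer"]):  # 通过校验才保留
            dataset.append(sample)
            current = sample["question"]   # 以新题作为下一轮的种子
    return dataset

print(iterative_question_composing("求 1+2+...+100 的值。"))
```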
== VLM ==
标题: Revisiting the Role of Language Priors in Vision-Language Models
作者: Zhiqiu Lin, Xinyue Chen, Deepak Pathak
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2306.01879v3
Project: https://linzhiqiu.github.io/papers/visual_gpt_score/|
中文摘要: 视觉语言模型(VLM)之所以有影响力,部分是因为它们可以以零样本的方式应用于各种视觉理解任务,无需任何微调。我们研究生成式VLM(generative VLMs),即被训练用于在给定图像条件下生成下一个词的模型。我们在8个流行的视觉语言基准上探索了它们在图像-文本检索这一代表性任务上的零样本性能。我们的第一个观察是,只需计算在给定图像条件下生成特定文本串的匹配分数,它们就可以被重新用于判别式任务(例如图像-文本检索)。我们将这一概率分数称为视觉生成式预训练分数(Visual Generative Pre-Training Score,VisualGPTScore)。虽然VisualGPTScore在某些检索基准上取得了近乎完美的准确率,但在其他基准上表现较差。我们从概率的角度分析了这种行为,指出一些基准通过构造对抗性但不太可能出现的文本描述,无意中引入了不自然的语言分布。事实上,我们证明了即使是忽略任何图像证据的“盲”语言模型,有时也能胜过所有已有方法,这让人想起多年前视觉问答(VQA)社区面临的类似挑战。我们推导了一个概率后处理方案,可以在测试时控制生成式VLM中语言偏差的强度,而无需重新训练或微调模型。我们表明,经过适当去偏的VisualGPTScore是视觉语言理解的一个强有力的零样本基线,常常可以取得最先进的准确率。
摘要: Vision-language models (VLMs) are impactful in part because they can be applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We study generative VLMs that are trained for next-word generation given an image. We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks. Our first observation is that they can be repurposed for discriminative tasks (such as image-text retrieval) by simply computing the match score of generating a particular text string given an image. We call this probabilistic score the Visual Generative Pre-Training Score (VisualGPTScore). While the VisualGPTScore produces near-perfect accuracy on some retrieval benchmarks, it yields poor accuracy on others. We analyze this behavior through a probabilistic lens, pointing out that some benchmarks inadvertently capture unnatural language distributions by creating adversarial but unlikely text captions. In fact, we demonstrate that even a “blind” language model that ignores any image evidence can sometimes outperform all prior art, reminiscent of similar challenges faced by the visual-question answering (VQA) community many years ago. We derive a probabilistic post-processing scheme that controls for the amount of linguistic bias in generative VLMs at test time without having to retrain or fine-tune the model. We show that the VisualGPTScore, when appropriately debiased, is a strong zero-shot baseline for vision-language understanding, oftentimes producing state-of-the-art accuracy.
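摘要中“控制语言偏差的概率后处理”可以用一种常见的去偏打分形式来直观理解:score = log P(t|i) - α·log P(t)。下面的草图中两个对数似然函数与 α 的取值均为假设示意,并非论文的精确公式。

```python
# 示意性代码:生成式VLM图文匹配打分与语言先验去偏的一种常见思路。
def log_p_text_given_image(text: str, image) -> float:
    """占位:生成式VLM在给定图像条件下对文本的对数似然(真实值应由模型计算)。"""
    return {"a dog chasing a ball": -5.0, "a ball chasing a dog": -6.5}[text]

def log_p_text(text: str) -> float:
    """占位:“盲”语言模型(不看图像)对文本的对数似然。"""
    return {"a dog chasing a ball": -7.0, "a ball chasing a dog": -8.0}[text]

def debiased_score(text: str, image, alpha: float = 1.0) -> float:
    # 去偏:从条件对数似然中减去语言先验,削弱纯语言偏差的影响
    return log_p_text_given_image(text, image) - alpha * log_p_text(text)

for caption in ["a dog chasing a ball", "a ball chasing a dog"]:
    print(caption, debiased_score(caption, image=None))
```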
标题: AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning
作者: Fu-Yun Wang, Zhaoyang Huang, Xiaoyu Shi
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2402.00769v1
Project: https://animatelcm.github.io/|
GitHub: https://github.com/G-U-N/AnimateLCM|
中文摘要: 视频扩散模型因其能够生成连贯且高保真的视频而受到越来越多的关注。然而,迭代去噪过程使其计算密集且耗时,限制了其应用。一致性模型(CM)通过蒸馏预训练图像扩散模型来以极少步数加速采样,其在条件图像生成上又被成功扩展为潜在一致性模型(LCM);受此启发,我们提出了AnimateLCM,能够以极少的步数生成高保真视频。我们没有直接在原始视频数据集上进行一致性学习,而是提出了一种解耦一致性学习策略,将图像生成先验和运动生成先验的蒸馏解耦,从而提高了训练效率并增强了生成的视觉质量。此外,为了能够组合Stable Diffusion社区中的即插即用适配器以实现各种功能(例如用于可控生成的ControlNet),我们提出了一种高效的策略,将现有适配器适配到我们蒸馏得到的文本条件视频一致性模型上,或者在不损害采样速度的前提下从头训练适配器。我们在图像条件视频生成和布局条件视频生成中验证了所提出的策略,均取得了最佳性能。实验结果验证了该方法的有效性。代码和权重将会公开。更多详情请访问 https://github.com/G-U-N/AnimateLCM。
摘要: Video diffusion models has been gaining increasing attention for its ability to produce videos that are both coherent and of high fidelity. However, the iterative denoising process makes it computationally intensive and time-consuming, thus limiting its applications. Inspired by the Consistency Model (CM) that distills pretrained image diffusion models to accelerate the sampling with minimal steps and its successful extension Latent Consistency Model (LCM) on conditional image generation, we propose AnimateLCM, allowing for high-fidelity video generation within minimal steps. Instead of directly conducting consistency learning on the raw video dataset, we propose a decoupled consistency learning strategy that decouples the distillation of image generation priors and motion generation priors, which improves the training efficiency and enhance the generation visual quality. Additionally, to enable the combination of plug-and-play adapters in stable diffusion community to achieve various functions (e.g., ControlNet for controllable generation). we propose an efficient strategy to adapt existing adapters to our distilled text-conditioned video consistency model or train adapters from scratch without harming the sampling speed. We validate the proposed strategy in image-conditioned video generation and layout-conditioned video generation, all achieving top-performing results. Experimental results validate the effectiveness of our proposed method. Code and weights will be made public. More details are available at https://github.com/G-U-N/AnimateLCM.
标题: CapHuman: Capture Your Moments in Parallel Universes
作者: Chao Liang, Fan Ma, Linchao Zhu
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2402.00627v1
Project: https://caphuman.github.io/|
GitHub: https://github.com/VamosC/CapHuman|
中文摘要: 我们专注于一项新颖的以人为中心的图像合成任务:仅给定一张参考人脸照片,就要在不同场景中生成该特定个体在不同头部位置、姿态和面部表情下的图像。为了实现这一目标,我们认为我们的生成模型应具备以下有利特性:(1)对我们的世界和人类社会有很强的视觉与语义理解,以支撑基本物体和人物图像的生成;(2)可泛化的身份保持能力;(3)灵活细粒度的头部控制。最近,大型预训练文本到图像扩散模型显示出显著的效果,可作为强大的生成基础。以此为基础,我们旨在释放预训练模型的上述能力。在这项工作中,我们提出了一个名为CapHuman的新框架。我们采用“先编码、再学习对齐”的范式,无需在推理时进行繁琐的调参,即可为新个体实现可泛化的身份保持。CapHuman对身份特征进行编码,然后学习将其对齐到潜在空间中。此外,我们引入3D人脸先验,使模型能够以灵活且3D一致的方式控制人类头部。广泛的定性和定量分析表明,我们的CapHuman可以生成身份保持良好、照片级逼真且高保真的肖像,具有内容丰富的表现和多样的头部姿态,优于既有基线。代码和检查点将在 https://github.com/VamosC/CapHuman 发布。
摘要: We concentrate on a novel human-centric image synthesis task, that is, given only one reference facial photograph, it is expected to generate specific individual images with diverse head positions, poses, and facial expressions in different contexts. To accomplish this goal, we argue that our generative model should be capable of the following favorable characteristics: (1) a strong visual and semantic understanding of our world and human society for basic object and human image generation. (2) generalizable identity preservation ability. (3) flexible and fine-grained head control. Recently, large pre-trained text-to-image diffusion models have shown remarkable results, serving as a powerful generative foundation. As a basis, we aim to unleash the above two capabilities of the pre-trained model. In this work, we present a new framework named CapHuman. We embrace the “encode then learn to align” paradigm, which enables generalizable identity preservation for new individuals without cumbersome tuning at inference. CapHuman encodes identity features and then learns to align them into the latent space. Moreover, we introduce the 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner. Extensive qualitative and quantitative analyses demonstrate our CapHuman can produce well-identity-preserved, photo-realistic, and high-fidelity portraits with content-rich representations and various head renditions, superior to established baselines. Code and checkpoint will be released at https://github.com/VamosC/CapHuman.
标题: Parrot Captions Teach CLIP to Spot Text
作者: Yiqi Lin, Conghui He, Alex Jinpeng Wang
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2312.14232v3
Project: https://linyq17.github.io/CLIP-Parrot-Bias/|
中文摘要: 尽管CLIP是众多视觉语言应用的基础模型,但它存在严重的文本检测(text spotting)偏差。这种偏差导致CLIP模型“鹦鹉学舌”式地复述嵌入在图像中的视觉文本,而忽略真实的视觉语义。我们发现,在最流行的图像-文本数据集LAION-2B中,标题也在大量照搬(拼写出)图像中嵌入的文字。我们的分析表明,大约50%的图像嵌入了视觉文本内容,而大约30%的标题词出现在这些嵌入的视觉文本中。基于这一观察,我们仔细检查了不同发布版本的CLIP模型,并验证了对这些模型而言,视觉文本是决定LAION风格图像-文本相似度的主导因素。为了检验这些鹦鹉式标题是否造成了文本检测偏差,我们用按不同鹦鹉式标题筛选标准构建的LAION子集训练了一系列CLIP模型。我们表明,用鹦鹉式标题训练很容易形成这种偏差,但会损害CLIP模型本应学到的视觉语言表征。这表明,迫切需要重新审视类CLIP模型的设计,或现有的基于CLIP分数过滤的图像-文本数据集构建流程。
摘要: Despite CLIP being the foundation model in numerous vision-language applications, the CLIP suffers from a severe text spotting bias. Such bias causes CLIP models to `Parrot’ the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in the most popular image-text dataset LAION-2B, the captions also densely parrot (spell) the text embedded in images. Our analysis shows that around 50% of images are embedded with visual text content, and around 30% of captions words are in these embedded visual content. Based on such observation, we thoroughly inspect the different released versions of CLIP models and verify that the visual text is the dominant factor in measuring the LAION-style image-text similarity for these models. To examine whether these parrot captions shape the text spotting bias, we train a series of CLIP models with LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions easily shapes such bias but harms the expected visual-language representation learning in CLIP models. This suggests that it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP score filtering.
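文中对“CLIP更偏向图像中嵌入的文字而非视觉语义”的检验,本质上就是比较不同标题与图像的CLIP相似度。下面用 HuggingFace transformers 的CLIP接口给出一个示意(图像路径与两条候选标题均为假设的例子)。

```python
# 示意性代码:计算一张图像与多条候选标题的 CLIP 相似度。
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("sign.jpg")  # 假设:一张包含大段文字的招牌照片
captions = ["a red stop sign", "the word STOP written in white"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # 图像对各标题的相对匹配概率
print(dict(zip(captions, probs[0].tolist())))
```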
标题: StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding
作者: Renqiu Xia, Bo Zhang, Haoyang Peng
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2309.11268v3
GitHub: https://github.com/UniModal4Reasoning/SimChart9K|
中文摘要: 图表在不同科学领域的文献中很常见,以读者易于获取的方式传达丰富信息。当前与图表相关的任务要么聚焦于图表感知,即从可视化图表中提取信息;要么是在给定已提取数据(如表格形式)的情况下进行推理。在本文中,我们旨在为感知与推理的联合任务建立一个统一且标注高效的学习范式,它可以普遍适用于不同的下游任务,而不仅仅是同类工作中专门研究的问答任务。具体来说,StructChart首先将图表信息从流行的表格形式(特别是线性化的CSV)重新表述为我们提出的结构化三元组表示(STR),由于对图表进行了结构化信息提取,这种表示更有利于缩小图表感知与推理之间的任务差距。然后,我们提出了面向图表的结构化表示度量(SCRM),用于定量评估图表感知任务的性能。为了丰富训练数据集,我们进一步探索了利用大型语言模型(LLM)的可能性,在图表视觉风格及其统计信息两方面增强图表的多样性。我们在各种与图表相关的任务上进行了广泛的实验,证明了统一的图表感知-推理范式在推动图表理解前沿方面的有效性和潜力。
摘要: Charts are common in literature across different scientific fields, conveying rich information easily accessible to readers. Current chart-related tasks focus on either chart perception which refers to extracting information from the visual charts, or performing reasoning given the extracted data, e.g. in a tabular form. In this paper, we aim to establish a unified and label-efficient learning paradigm for joint perception and reasoning tasks, which can be generally applicable to different downstream tasks, beyond the question-answering task as specifically studied in peer works. Specifically, StructChart first reformulates the chart information from the popular tubular form (specifically linearized CSV) to the proposed Structured Triplet Representations (STR), which is more friendly for reducing the task gap between chart perception and reasoning due to the employed structured information extraction for charts. We then propose a Structuring Chart-oriented Representation Metric (SCRM) to quantitatively evaluate the performance for the chart perception task. To enrich the dataset for training, we further explore the possibility of leveraging the Large Language Model (LLM), enhancing the chart diversity in terms of both chart visual style and its statistical information. Extensive experiments are conducted on various chart-related tasks, demonstrating the effectiveness and promising potential for a unified chart perception-reasoning paradigm to push the frontier of chart understanding.
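把线性化CSV重组为结构化三元组的想法可用如下小例子说明;三元组的具体定义(STR)以论文为准,这里只是一个简化的示意。

```python
# 示意性代码:把线性化的图表CSV重组为(实体, 属性, 数值)三元组。
import csv
import io

linearized_csv = "Year,Sales,Profit\n2021,120,30\n2022,150,45\n"

def csv_to_triplets(text: str):
    rows = list(csv.reader(io.StringIO(text)))
    header, triplets = rows[0], []
    for row in rows[1:]:
        entity = row[0]                       # 第一列作为实体(如年份)
        for attr, value in zip(header[1:], row[1:]):
            triplets.append((entity, attr, value))
    return triplets

for t in csv_to_triplets(linearized_csv):
    print(t)   # ('2021', 'Sales', '120') 等
```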
标题: Language-Conditioned Robotic Manipulation with Fast and Slow Thinking
作者: Minjie Zhu, Yichen Zhu, Jinming Li
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2401.04181v2
Project: https://jlm-z.github.io/RSFT/|
中文摘要: 语言条件机器人操作旨在将自然语言指令转化为可执行的动作,从简单的抓取放置到需要意图识别和视觉推理的任务。受认知科学中双过程理论(该理论认为人类决策中存在快思考与慢思考两个并行系统)的启发,我们提出了快慢思考机器人(RFST),这是一个模仿人类认知架构、根据指令类型对任务进行分类并在两个系统上做出决策的框架。我们的RFST由两个关键组件组成:1)指令判别器,用于根据当前用户指令确定应该激活哪个系统;2)慢思考系统,它由与策略网络对齐的微调视觉语言模型组成,使机器人能够识别用户意图或执行推理任务。为了评估我们的方法,我们构建了一个包含真实世界轨迹的数据集,涵盖从即时反应到需要深思熟虑的各类任务行为。我们在仿真和真实场景中的结果都证实,我们的方法能够很好地处理需要意图识别和推理的复杂任务。项目主页:https://jlm-z.github.io/RSFT/。
摘要: The language-conditioned robotic manipulation aims to transfer natural language instructions into executable actions, from simple pick-and-place to tasks requiring intent recognition and visual reasoning. Inspired by the dual process theory in cognitive science, which suggests two parallel systems of fast and slow thinking in human decision-making, we introduce Robotics with Fast and Slow Thinking (RFST), a framework that mimics human cognitive architecture to classify tasks and makes decisions on two systems based on instruction types. Our RFST consists of two key components: 1) an instruction discriminator to determine which system should be activated based on the current user instruction, and 2) a slow-thinking system that is comprised of a fine-tuned vision language model aligned with the policy networks, which allows the robot to recognize user intention or perform reasoning tasks. To assess our methodology, we built a dataset featuring real-world trajectories, capturing actions ranging from spontaneous impulses to tasks requiring deliberate contemplation. Our results, both in simulation and real-world scenarios, confirm that our approach adeptly manages intricate tasks that demand intent recognition and reasoning. The project is available at https://jlm-z.github.io/RSFT/
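“指令判别器决定走快思考还是慢思考分支”的调度逻辑大致如下;判别规则与两个分支函数均为假设的占位实现,真实系统中由训练好的判别器与VLM策略完成。

```python
# 示意性代码:按指令是否需要推理,在快/慢两个系统之间做调度。
REASONING_CUES = ("为什么", "哪一个", "如果", "先", "然后")

def needs_slow_thinking(instruction: str) -> bool:
    """占位:真实系统中由指令判别器(模型)判断。"""
    return any(cue in instruction for cue in REASONING_CUES)

def fast_policy(instruction: str) -> str:
    return f"[快速系统] 直接执行: {instruction}"

def slow_policy(instruction: str) -> str:
    return f"[慢速系统] 先用VLM推理意图,再执行: {instruction}"

for cmd in ["把红色方块放到盒子里", "如果杯子是空的,就把它放到水槽里"]:
    handler = slow_policy if needs_slow_thinking(cmd) else fast_policy
    print(handler(cmd))
```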
== diffusion model ==
标题: ViCA-NeRF: View-Consistency-Aware 3D Editing of Neural Radiance Fields
作者: Jiahua Dong, Yu-Xiong Wang
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2402.00864v1
GitHub: https://github.com/Dongjiahua/VICA-NeRF|
中文摘要: 我们介绍ViCA-NeRF,这是第一个使用文本指令进行3D编辑的视图一致性感知方法。除了隐式神经辐射场(NeRF)建模之外,我们的关键见解是利用两种正则化来源,在不同视图之间显式传播编辑信息,从而确保多视图一致性。对于几何正则化,我们利用从NeRF获得的深度信息来建立不同视图之间的图像对应。对于学习到的正则化,我们在2D扩散模型中对齐已编辑与未编辑图像的潜在编码,从而可以编辑关键视图并将更新传播到整个场景。结合这两种策略,我们的ViCA-NeRF分两个阶段运行。在初始阶段,我们混合来自不同视图的编辑来创建初步的3D编辑结果;随后的第二阶段进行NeRF训练,进一步细化场景外观。实验结果表明,与现有技术相比,ViCA-NeRF提供了更灵活、更高效(快3倍)的编辑,且一致性和细节水平更高。我们的代码是公开的。
摘要: We introduce ViCA-NeRF, the first view-consistency-aware method for 3D editing with text instructions. In addition to the implicit neural radiance field (NeRF) modeling, our key insight is to exploit two sources of regularization that explicitly propagate the editing information across different views, thus ensuring multi-view consistency. For geometric regularization, we leverage the depth information derived from NeRF to establish image correspondences between different views. For learned regularization, we align the latent codes in the 2D diffusion model between edited and unedited images, enabling us to edit key views and propagate the update throughout the entire scene. Incorporating these two strategies, our ViCA-NeRF operates in two stages. In the initial stage, we blend edits from different views to create a preliminary 3D edit. This is followed by a second stage of NeRF training, dedicated to further refining the scene’s appearance. Experimental results demonstrate that ViCA-NeRF provides more flexible, efficient (3 times faster) editing with higher levels of consistency and details, compared with the state of the art. Our code is publicly available.
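几何正则化中“利用深度建立跨视图像素对应”的核心计算如下:用深度把视图A的像素反投影到3D,再投影到视图B。相机内参与相对位姿为虚构示例,仅说明流程,并非 ViCA-NeRF 的完整实现。

```python
# 示意性代码:基于深度的跨视图像素对应(反投影 + 位姿变换 + 重投影)。
import numpy as np

K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])          # 两个视图共享的内参(假设)
T_ab = np.eye(4)
T_ab[0, 3] = 0.1                      # 视图A到视图B的相对位姿:沿x平移0.1m(假设)

def project_a_to_b(u, v, depth):
    p_a = depth * np.linalg.inv(K) @ np.array([u, v, 1.0])   # 反投影到A相机坐标系
    p_b = (T_ab @ np.append(p_a, 1.0))[:3]                   # 变换到B相机坐标系
    uvw = K @ p_b
    return uvw[:2] / uvw[2]                                  # 投影回B的像素坐标

print(project_a_to_b(u=320, v=240, depth=2.0))  # 约为 [345. 240.]
```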
标题: AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning
作者: Fu-Yun Wang, Zhaoyang Huang, Xiaoyu Shi
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2402.00769v1
Project: https://animatelcm.github.io/|
GitHub: https://github.com/G-U-N/AnimateLCM|
中文摘要: 视频扩散模型因其能够生成连贯且高保真的视频而受到越来越多的关注。然而,迭代去噪过程使其计算密集且耗时,限制了其应用。一致性模型(CM)通过蒸馏预训练图像扩散模型来以极少步数加速采样,其在条件图像生成上又被成功扩展为潜在一致性模型(LCM);受此启发,我们提出了AnimateLCM,能够以极少的步数生成高保真视频。我们没有直接在原始视频数据集上进行一致性学习,而是提出了一种解耦一致性学习策略,将图像生成先验和运动生成先验的蒸馏解耦,从而提高了训练效率并增强了生成的视觉质量。此外,为了能够组合Stable Diffusion社区中的即插即用适配器以实现各种功能(例如用于可控生成的ControlNet),我们提出了一种高效的策略,将现有适配器适配到我们蒸馏得到的文本条件视频一致性模型上,或者在不损害采样速度的前提下从头训练适配器。我们在图像条件视频生成和布局条件视频生成中验证了所提出的策略,均取得了最佳性能。实验结果验证了该方法的有效性。代码和权重将会公开。更多详情请访问 https://github.com/G-U-N/AnimateLCM。
摘要: Video diffusion models has been gaining increasing attention for its ability to produce videos that are both coherent and of high fidelity. However, the iterative denoising process makes it computationally intensive and time-consuming, thus limiting its applications. Inspired by the Consistency Model (CM) that distills pretrained image diffusion models to accelerate the sampling with minimal steps and its successful extension Latent Consistency Model (LCM) on conditional image generation, we propose AnimateLCM, allowing for high-fidelity video generation within minimal steps. Instead of directly conducting consistency learning on the raw video dataset, we propose a decoupled consistency learning strategy that decouples the distillation of image generation priors and motion generation priors, which improves the training efficiency and enhance the generation visual quality. Additionally, to enable the combination of plug-and-play adapters in stable diffusion community to achieve various functions (e.g., ControlNet for controllable generation). we propose an efficient strategy to adapt existing adapters to our distilled text-conditioned video consistency model or train adapters from scratch without harming the sampling speed. We validate the proposed strategy in image-conditioned video generation and layout-conditioned video generation, all achieving top-performing results. Experimental results validate the effectiveness of our proposed method. Code and weights will be made public. More details are available at https://github.com/G-U-N/AnimateLCM.
标题: Generative quantum machine learning via denoising diffusion probabilistic models
作者: Bingzhi Zhang, Peng Xu, Xiaohui Chen
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2310.05866v3
GitHub: https://github.com/francis-hsu/quantgenmdl|
中文摘要: 深度生成模型是计算机视觉、文本生成和大型语言模型的关键使能技术。去噪扩散概率模型(DDPMs)最近受到了很多关注,因为它们能够在许多计算机视觉任务中生成多样化和高质量的样本,并且结合了灵活的模型架构和相对简单的训练方案。由纠缠和叠加增强的量子生成模型为学习经典和量子数据带来了新的见解。受经典对应模型的启发,我们提出了量子去噪扩散概率模型(QuDDPM),以实现量子数据的高效可训练生成学习。QuDDPM采用足够多的电路层来保证表达能力,同时引入多个中间训练任务作为目标分布和噪声之间的插值,以避免贫瘠高原(barren plateau)并保证高效训练。我们给出了学习误差的界限,并证明了QuDDPM在学习相关量子噪声模型、量子多体相和量子数据拓扑结构方面的能力。这些结果为多功能和高效的量子生成学习提供了一个范例。
摘要: Deep generative models are key-enabling technology to computer vision, text generation and large language models. Denoising diffusion probabilistic models (DDPMs) have recently gained much attention due to their ability to generate diverse and high-quality samples in many computer vision tasks, as well as to incorporate flexible model architectures and relatively simple training scheme. Quantum generative models, empowered by entanglement and superposition, have brought new insight to learning classical and quantum data. Inspired by the classical counterpart, we propose the \emph{quantum denoising diffusion probabilistic model} (QuDDPM) to enable efficiently trainable generative learning of quantum data. QuDDPM adopts sufficient layers of circuits to guarantee expressivity, while introduces multiple intermediate training tasks as interpolation between the target distribution and noise to avoid barren plateau and guarantee efficient training. We provide bounds on the learning error and demonstrate QuDDPM’s capability in learning correlated quantum noise model, quantum many-body phases and topological structure of quantum data. The results provide a paradigm for versatile and efficient quantum generative learning.
标题: CapHuman: Capture Your Moments in Parallel Universes
作者: Chao Liang, Fan Ma, Linchao Zhu
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2402.00627v1
Project: https://caphuman.github.io/|
GitHub: https://github.com/VamosC/CapHuman|
中文摘要: 我们专注于一项新颖的以人为中心的图像合成任务:仅给定一张参考人脸照片,就要在不同场景中生成该特定个体在不同头部位置、姿态和面部表情下的图像。为了实现这一目标,我们认为我们的生成模型应具备以下有利特性:(1)对我们的世界和人类社会有很强的视觉与语义理解,以支撑基本物体和人物图像的生成;(2)可泛化的身份保持能力;(3)灵活细粒度的头部控制。最近,大型预训练文本到图像扩散模型显示出显著的效果,可作为强大的生成基础。以此为基础,我们旨在释放预训练模型的上述能力。在这项工作中,我们提出了一个名为CapHuman的新框架。我们采用“先编码、再学习对齐”的范式,无需在推理时进行繁琐的调参,即可为新个体实现可泛化的身份保持。CapHuman对身份特征进行编码,然后学习将其对齐到潜在空间中。此外,我们引入3D人脸先验,使模型能够以灵活且3D一致的方式控制人类头部。广泛的定性和定量分析表明,我们的CapHuman可以生成身份保持良好、照片级逼真且高保真的肖像,具有内容丰富的表现和多样的头部姿态,优于既有基线。代码和检查点将在 https://github.com/VamosC/CapHuman 发布。
摘要: We concentrate on a novel human-centric image synthesis task, that is, given only one reference facial photograph, it is expected to generate specific individual images with diverse head positions, poses, and facial expressions in different contexts. To accomplish this goal, we argue that our generative model should be capable of the following favorable characteristics: (1) a strong visual and semantic understanding of our world and human society for basic object and human image generation. (2) generalizable identity preservation ability. (3) flexible and fine-grained head control. Recently, large pre-trained text-to-image diffusion models have shown remarkable results, serving as a powerful generative foundation. As a basis, we aim to unleash the above two capabilities of the pre-trained model. In this work, we present a new framework named CapHuman. We embrace the “encode then learn to align” paradigm, which enables generalizable identity preservation for new individuals without cumbersome tuning at inference. CapHuman encodes identity features and then learns to align them into the latent space. Moreover, we introduce the 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner. Extensive qualitative and quantitative analyses demonstrate our CapHuman can produce well-identity-preserved, photo-realistic, and high-fidelity portraits with content-rich representations and various head renditions, superior to established baselines. Code and checkpoint will be released at https://github.com/VamosC/CapHuman.
标题: Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach
作者: Eloi Moliner, Filip Elvander, Vesa Välimäki
PubTime: 2024-01-30
Downlink: http://arxiv.org/abs/2306.01433v2
Project: http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/|
中文摘要: 音频带宽扩展旨在从带限观测中真实地重建高频频谱。在低通退化未知的情况下,例如修复历史音频录音时,这就成为一个盲问题。本文介绍了一种称为BABE(盲音频带宽扩展)的新方法,它利用预训练的无条件扩散模型的生成先验,在零样本设置下解决这一盲问题。在推断过程中,BABE使用扩散后验采样的一个广义版本,其中退化算子未知,但被参数化并通过迭代方式推断。我们使用客观和主观指标评估了所提方法的性能,结果表明,在合成数据上测试时,BABE超过了最先进的盲带宽扩展基线,并取得了与已知退化信息的方法相比具有竞争力的性能。此外,BABE在增强真实历史录音时表现出强大的泛化能力,能够有效重建缺失的高频内容,同时保持与原始录音的一致性。主观偏好测试证实,BABE显著提高了历史音乐录音的音质。使用所提方法修复的历史录音示例可在配套网页查看:(http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/)
摘要: Audio bandwidth extension involves the realistic reconstruction of high-frequency spectra from bandlimited observations. In cases where the lowpass degradation is unknown, such as in restoring historical audio recordings, this becomes a blind problem. This paper introduces a novel method called BABE (Blind Audio Bandwidth Extension) that addresses the blind problem in a zero-shot setting, leveraging the generative priors of a pre-trained unconditional diffusion model. During the inference process, BABE utilizes a generalized version of diffusion posterior sampling, where the degradation operator is unknown but parametrized and inferred iteratively. The performance of the proposed method is evaluated using objective and subjective metrics, and the results show that BABE surpasses state-of-the-art blind bandwidth extension baselines and achieves competitive performance compared to informed methods when tested with synthetic data. Moreover, BABE exhibits robust generalization capabilities when enhancing real historical recordings, effectively reconstructing the missing high-frequency content while maintaining coherence with the original recording. Subjective preference tests confirm that BABE significantly improves the audio quality of historical music recordings. Examples of historical recordings restored with the proposed method are available on the companion webpage: (http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/)
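BABE 推断的是“参数化的低通退化算子”的未知参数(如截止频率)。下面用 scipy 演示这样一个以截止频率为参数的退化算子如何作用于音频信号;滤波器形式与参数均为假设,并非论文的精确算子。

```python
# 示意性代码:以截止频率为参数的低通退化算子作用于一段合成音频。
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 44100
t = np.arange(fs) / fs
audio = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 8000 * t)  # 虚构信号

def lowpass_degradation(x, cutoff_hz, fs, order=6):
    """参数化的低通退化:真实系统中 cutoff_hz 是需要推断的未知量。"""
    sos = butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

degraded = lowpass_degradation(audio, cutoff_hz=3000, fs=fs)  # 8 kHz 分量被大幅衰减
print(audio.std(), degraded.std())
```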
标题: Repositioning the Subject within Image
作者: Yikai Wang, Chenjie Cao, Qiaole Dong
PubTime: 2024-01-30
Downlink: http://arxiv.org/abs/2401.16861v1
Project: https://yikai-wang.github.io/seele/|
GitHub: https://github.com/Yikai-Wang/ReS|
摘要: Current image manipulation primarily centers on static manipulation, such as replacing specific regions within an image or altering its overall style. In this paper, we introduce an innovative dynamic manipulation task, subject repositioning. This task involves relocating a user-specified subject to a desired position while preserving the image’s fidelity. Our research reveals that the fundamental sub-tasks of subject repositioning, which include filling the void left by the repositioned subject, reconstructing obscured portions of the subject and blending the subject to be consistent with surrounding areas, can be effectively reformulated as a unified, prompt-guided inpainting task. Consequently, we can employ a single diffusion generative model to address these sub-tasks using various task prompts learned through our proposed task inversion technique. Additionally, we integrate pre-processing and post-processing techniques to further enhance the quality of subject repositioning. These elements together form our SEgment-gEnerate-and-bLEnd (SEELE) framework. To assess SEELE’s effectiveness in subject repositioning, we assemble a real-world subject repositioning dataset called ReS. Our results on ReS demonstrate the quality of repositioned image generation.
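“用同一个扩散修复模型、靠不同任务提示完成各子任务”的做法,可以用 diffusers 的修复管线来直观示意;模型名、图像/掩码路径与文字提示均为假设示例,论文中的任务提示是通过 task inversion 学到的,而非手写文字。

```python
# 示意性代码:提示引导的扩散修复(inpainting),SEELE 把主体重定位的各子任务统一为这类调用。
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("scene.png").convert("RGB").resize((512, 512))          # 假设:原始图像
mask = Image.open("moved_subject_hole.png").convert("L").resize((512, 512))  # 假设:主体移走后留下的空洞

filled = pipe(prompt="seamless background, same lighting",
              image=image, mask_image=mask).images[0]
filled.save("filled.png")
```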
== VLN ==
标题: Test-time Adaptive Vision-and-Language Navigation
作者: Junyu Gao, Xuan Yao, Changsheng Xu
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2311.13209v2
中文摘要: 视觉和语言导航(VLN)近年来取得了重大进展,这在很大程度上归功于精心策划的数据集和训练良好的模型。然而,当在不同的环境中测试时,训练好的模型不可避免地会遇到数据分布的显著变化,这表明仅仅依靠预训练且固定的导航模型是不够的。为了增强模型的泛化能力,测试时间自适应(TTA)通过利用未标注的测试样本进行模型更新,在计算机视觉领域显示出巨大潜力。然而,简单地将现有的TTA方法应用于VLN任务,并不能很好地处理VLN模型的适应性-稳定性困境:频繁的更新会导致模型参数的剧烈变化,而偶尔的更新又会使模型难以应对动态变化的环境。因此,我们提出了一种用于VLN的快-慢测试时间自适应(FSTTA)方法,在统一的框架中对梯度和参数进行分解-累积分析。具体来说,在快速更新阶段,最近多步导航过程中产生的梯度被分解为具有不同一致性水平的分量,然后自适应地累积这些分量,以确定用于快速模型自适应的一致方向。在缓慢更新阶段,收集历史记录的参数,并进行类似的分解-累积分析,以将模型恢复到稳定状态。大量实验表明,我们的方法在四个流行的基准上获得了令人瞩目的性能提升。
摘要: Vision-and-Language Navigation (VLN) has witnessed significant advancements in recent years, largely attributed to meticulously curated datasets and proficiently trained models. Nevertheless, when tested in diverse environments, the trained models inevitably encounter significant shifts in data distribution, highlighting that relying solely on pre-trained and fixed navigation models is insufficient. To enhance models’ generalization ability, test-time adaptation (TTA) demonstrates significant potential in the computer vision field by leveraging unlabeled test samples for model updates. However, simply applying existing TTA methods to the VLN task cannot well handle the adaptability-stability dilemma of VLN models, i.e., frequent updates can result in drastic changes in model parameters, while occasional updates can make the models ill-equipped to handle dynamically changing environments. Therefore, we propose a Fast-Slow Test-Time Adaptation (FSTTA) approach for VLN by performing decomposition-accumulation analysis for both gradients and parameters in a unified framework. Specifically, in the fast update phase, gradients generated during the recent multi-step navigation process are decomposed into components with varying levels of consistency. Then, these components are adaptively accumulated to pinpoint a concordant direction for fast model adaptation. In the slow update phase, historically recorded parameters are gathered, and a similar decomposition-accumulation analysis is conducted to revert the model to a stable state. Extensive experiments show that our method obtains impressive performance gains on four popular benchmarks.
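“对最近几步的梯度做分解-累积、只保留方向一致的分量”的思路可用如下数值草图说明;一致性的度量方式与阈值均为假设,仅作示意,并非 FSTTA 的精确算法。

```python
# 示意性代码:对最近若干步的梯度按符号一致率做掩码,再累积得到稳定的更新方向。
import numpy as np

grads = np.array([[ 0.8, -0.1,  0.3],
                  [ 0.6,  0.2,  0.4],
                  [ 0.7, -0.3,  0.5]])     # 最近3步的梯度(虚构)

sign_consistency = np.abs(np.sign(grads).mean(axis=0))   # 每个维度的符号一致率
mask = sign_consistency >= 0.99                           # 只保留方向完全一致的维度(假设阈值)
accumulated = grads.mean(axis=0) * mask

print("一致率:", sign_consistency)      # [1.    0.333 1.   ]
print("累积后的更新方向:", accumulated)  # [0.7 0.  0.4]
```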