Domain-specific paper subscription
Follow {晓理紫} on WeChat (VX) for daily paper updates. If you find this useful, please share it with others who may need it. Thank you for your support.
If this digest helps you, please follow the account; the latest papers are pushed on schedule every day.
To thank readers for their support, a free topic-based subscription is now offered to 300 readers: follow the official account on WeChat and reply with {email + paper topic} (e.g., 123456@xx.com + chatgpt@large language model @LLM). The topics must belong to a single field, with at most three keywords. The blogger reserves the right of final interpretation.
Categories:
- Large language models (LLM)
- Vision models (VLM)
- Diffusion models
- Vision-and-language navigation (VLN)
- Reinforcement learning (RL)
- Imitation learning (IL)
- Robotics
- Open vocabulary, detection and segmentation
== LLM ==
Title: AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls
Authors: Yu Du, Fangyun Wei, Hongyang Zhang
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2402.04253v1
GitHub: https://github.com/dyabel/AnyTool
Abstract: We introduce AnyTool, a large language model agent designed to revolutionize the utilization of a vast array of tools in addressing user queries. We utilize over 16,000 APIs from Rapid API, operating under the assumption that a subset of these APIs could potentially resolve the queries. AnyTool primarily incorporates three elements: an API retriever with a hierarchical structure, a solver aimed at resolving user queries using a selected set of API candidates, and a self-reflection mechanism, which re-activates AnyTool if the initial solution proves impracticable. AnyTool is powered by the function calling feature of GPT-4, eliminating the need for training external modules. We also revisit the evaluation protocol introduced by previous works and identify a limitation in this protocol that leads to an artificially high pass rate. By revising the evaluation protocol to better reflect practical application scenarios, we introduce an additional benchmark, termed AnyToolBench. Experiments across various datasets demonstrate the superiority of our AnyTool over strong baselines such as ToolLLM and a GPT-4 variant tailored for tool utilization. For instance, AnyTool outperforms ToolLLM by +35.4% in terms of average pass rate on ToolBench. Code will be available at https://github.com/dyabel/AnyTool.
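To make the retrieve-solve-reflect loop described in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation; `retrieve`, `solve`, and `reflect` are hypothetical callables standing in for AnyTool's hierarchical API retriever, GPT-4 function-calling solver, and self-reflection step.

```python
from typing import Callable, List, Optional, Tuple

def self_reflective_agent(
    query: str,
    retrieve: Callable[[str, Optional[str]], List[str]],   # hierarchical API retriever (placeholder)
    solve: Callable[[str, List[str]], Tuple[str, bool]],    # function-calling solver (placeholder)
    reflect: Callable[[str, List[str], str], str],          # self-reflection step (placeholder)
    max_rounds: int = 3,
) -> Optional[str]:
    """Illustrative retrieve-solve-reflect loop; not the official AnyTool code."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidates = retrieve(query, feedback)        # narrow 16k+ APIs to a small candidate set
        answer, solved = solve(query, candidates)     # try to resolve the query with those APIs
        if solved:
            return answer
        feedback = reflect(query, candidates, answer) # diagnose the failure, then re-activate
    return None
```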
Title: HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Authors: Mantas Mazeika, Long Phan, Xuwang Yin
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2402.04249v1
Project: https://www.harmbench.org
GitHub: https://github.com/centerforaisafety/HarmBench
Abstract: Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties previously unaccounted for in red teaming evaluations and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, yielding novel insights. We also introduce a highly efficient adversarial training method that greatly enhances LLM robustness across a wide range of attacks, demonstrating how HarmBench enables co-development of attacks and defenses. We open-source HarmBench at https://github.com/centerforaisafety/HarmBench.
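The evaluation loop behind a standardized red-teaming comparison can be pictured as: an attack proposes a test case for each harmful behavior, the target LLM responds, and a judge decides whether the behavior was elicited, giving an attack success rate. The sketch below only illustrates that generic loop; the callables are placeholders, not HarmBench's real interfaces (see the repository for those).

```python
from typing import Callable, Iterable

def attack_success_rate(
    behaviors: Iterable[str],
    attack: Callable[[str], str],       # red-teaming method: behavior -> adversarial prompt
    target_llm: Callable[[str], str],   # model under evaluation: prompt -> completion
    judge: Callable[[str, str], bool],  # classifier: (behavior, completion) -> elicited or not
) -> float:
    """Generic ASR computation; a sketch of the evaluation loop, not HarmBench's real code."""
    behaviors = list(behaviors)
    successes = 0
    for behavior in behaviors:
        prompt = attack(behavior)        # e.g., an automated jailbreak method
        completion = target_llm(prompt)
        if judge(behavior, completion):  # did the completion actually exhibit the behavior?
            successes += 1
    return successes / max(len(behaviors), 1)
```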
Title: DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation
Authors: Susung Hong, Junyoung Seo, Heeseong Shin
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2305.14330v3
GitHub: https://github.com/KU-CVLAB/DirecT2V
Abstract: In the paradigm of AI-generated content (AIGC), there has been increasing attention to transferring knowledge from pre-trained text-to-image (T2I) models to text-to-video (T2V) generation. Despite their effectiveness, these frameworks face challenges in maintaining consistent narratives and handling shifts in scene composition or object placement from a single abstract user prompt. Exploring the ability of large language models (LLMs) to generate time-dependent, frame-by-frame prompts, this paper introduces a new framework, dubbed DirecT2V. DirecT2V leverages instruction-tuned LLMs as directors, enabling the inclusion of time-varying content and facilitating consistent video generation. To maintain temporal consistency and prevent mapping the value to a different object, we equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training. The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos from abstract user prompts, successfully addressing the challenges of zero-shot video generation.
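The frame-level directing idea can be pictured as one prompting step that expands a single abstract prompt into per-frame prompts for the T2I model. The sketch below assumes a generic `llm` callable; DirecT2V's actual directing prompt and output parsing may differ.

```python
from typing import Callable, List

def frame_level_prompts(user_prompt: str, num_frames: int, llm: Callable[[str], str]) -> List[str]:
    """Ask an instruction-tuned LLM to act as a director and script one prompt per frame.

    Illustrative only; the real DirecT2V directing prompt and parsing may differ.
    """
    instruction = (
        f"You are directing a {num_frames}-frame video for the prompt: '{user_prompt}'.\n"
        f"Write exactly {num_frames} lines, one concise scene description per frame, "
        "keeping characters and scene layout consistent over time."
    )
    lines = [line.strip() for line in llm(instruction).splitlines() if line.strip()]
    return lines[:num_frames]  # each line conditions the T2I diffusion model for one frame
```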
Title: CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations
Authors: Ji Qi, Ming Ding, Weihan Wang
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2402.04236v1
GitHub: https://github.com/THUDM/CogCoM
Abstract: Vision-Language Models (VLMs) have demonstrated their widespread viability thanks to extensive training in aligning visual instructions to answers. However, this conclusive alignment leads models to ignore critical visual reasoning, and further results in failures on meticulous visual problems and unfaithful responses. In this paper, we propose Chain of Manipulations, a mechanism that enables VLMs to solve problems with a series of manipulations, where each manipulation refers to an operation on the visual input, either from intrinsic abilities (e.g., grounding) acquired through prior training or from imitating human-like behaviors (e.g., zoom in). This mechanism encourages VLMs to generate faithful responses with evidential visual reasoning, and permits users to trace error causes in the interpretable paths. We thus train CogCoM, a general 17B VLM with a memory-based compatible architecture endowed with this reasoning mechanism. Experiments show that our model achieves state-of-the-art performance across 8 benchmarks from 3 categories, and reaches competitive performance after only a limited number of training steps on the data. The code and data are publicly available at https://github.com/THUDM/CogCoM.
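A chain of manipulations can be read as a loop in which the VLM either emits a manipulation (for example grounding a box or zooming in) or a final answer. The toy loop below is illustrative only; `vlm_step`, the action names, and the image interface are assumptions, not CogCoM's real manipulation set or control flow.

```python
from typing import Callable, List, Tuple

def chain_of_manipulations(
    image,                        # current visual evidence, e.g., a PIL image
    question: str,
    vlm_step: Callable[[object, str, List[str]], Tuple[str, object]],  # returns (action, argument)
    max_steps: int = 5,
) -> str:
    """Toy evidential-reasoning loop in the spirit of CogCoM; not the official implementation."""
    trace: List[str] = []
    for _ in range(max_steps):
        action, arg = vlm_step(image, question, trace)  # model chooses the next manipulation
        if action == "answer":
            return arg                                   # final, evidence-grounded answer
        if action == "zoom_in":                          # arg is a bounding box; inspect details
            image = image.crop(arg)
        trace.append(f"{action}({arg})")                 # keep an interpretable reasoning path
    return "unresolved"
```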
Title: Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction
Authors: Jiaming Ji, Boyuan Chen, Hantao Lou
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2402.02416v2
Project: https://aligner2024.github.io
Abstract: Efforts to align Large Language Models (LLMs) are mainly conducted via Reinforcement Learning from Human Feedback (RLHF) methods. However, RLHF encounters major challenges including training reward models, actor-critic engineering, and importantly, it requires access to LLM parameters. Here we introduce Aligner, a new efficient alignment paradigm that bypasses the whole RLHF process by learning the correctional residuals between the aligned and the unaligned answers. Our Aligner offers several key advantages. Firstly, it is an autoregressive seq2seq model that is trained on the query-answer-correction dataset via supervised learning; this offers a parameter-efficient alignment solution with minimal resources. Secondly, the Aligner facilitates weak-to-strong generalization; finetuning large pretrained models with Aligner’s supervisory signals demonstrates a strong performance boost. Thirdly, Aligner functions as a model-agnostic plug-and-play module, allowing for its direct application on different open-source and API-based models. Remarkably, Aligner-7B improves 11 different LLMs by 21.9% in helpfulness and 23.8% in harmlessness on average (GPT-4 by 17.5% and 26.9%). When finetuning (strong) Llama2-70B with (weak) Aligner-13B’s supervision, we can improve Llama2 by 8.2% in helpfulness and 61.6% in harmlessness. See our dataset and code at https://aligner2024.github.io
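Conceptually, Aligner is a plug-and-play corrector sitting on top of an arbitrary upstream model. A minimal sketch of that usage pattern follows; `base_model` and `aligner` are placeholder callables (any seq2seq model trained on query-answer-correction triples), not the released Aligner API.

```python
from typing import Callable

def aligned_answer(
    query: str,
    base_model: Callable[[str], str],    # any open-source or API-based LLM
    aligner: Callable[[str, str], str],  # seq2seq corrector trained on (query, answer, correction)
) -> str:
    """Plug-and-play correction: the aligner learns the residual between raw and aligned answers.

    Sketch only; the real model and prompt format are described at https://aligner2024.github.io.
    """
    raw_answer = base_model(query)       # upstream model output, possibly unaligned
    return aligner(query, raw_answer)    # corrected, more helpful and harmless answer
```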
Title: OceanGPT: A Large Language Model for Ocean Science Tasks
Authors: Zhen Bi, Ningyu Zhang, Yida Xue
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2310.02031v5
Project: https://zjunlp.github.io/project/OceanGPT/
GitHub: https://github.com/zjunlp/KnowLM
Abstract: Ocean science, which delves into the oceans that are reservoirs of life and biodiversity, is of great significance given that oceans cover over 70% of our planet’s surface. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in science. Despite the success in other domains, current LLMs often fall short in catering to the needs of domain experts like oceanographers, and the potential of LLMs for ocean science is under-explored. The intrinsic reason may be the immense and intricate nature of ocean data as well as the necessity for higher granularity and richness in knowledge. To alleviate these issues, we introduce OceanGPT, the first-ever LLM in the ocean domain, which is an expert in various ocean science tasks. We propose DoInstruct, a novel framework to automatically obtain a large volume of ocean domain instruction data, which generates instructions based on multi-agent collaboration. Additionally, we construct the first oceanography benchmark, OceanBench, to evaluate the capabilities of LLMs in the ocean domain. Through comprehensive experiments, OceanGPT not only shows a higher level of knowledge expertise for ocean science tasks but also gains preliminary embodied intelligence capabilities in ocean technology. Code, data, and checkpoints will soon be available at https://github.com/zjunlp/KnowLM.
== CLIP@ViT @ VLM @ visual model ==
标题: "Task Success" is not Enough: Investigating the Use of Video-Language Models as Behavior Critics for Catching Undesirable Agent Behaviors
作者: Lin Guan, Yifan Zhou, Denis Liu
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2402.04210v1
Project: https://guansuns.github.io/pages/vlm-critic
Abstract: Large-scale generative models are shown to be useful for sampling meaningful candidate solutions, yet they often overlook task constraints and user preferences. Their full power is better harnessed when the models are coupled with external verifiers and the final solutions are derived iteratively or progressively according to the verification feedback. In the context of embodied AI, verification often solely involves assessing whether goal conditions specified in the instructions have been met. Nonetheless, for these agents to be seamlessly integrated into daily life, it is crucial to account for a broader range of constraints and preferences beyond bare task success (e.g., a robot should grasp bread with care to avoid significant deformations). However, given the unbounded scope of robot tasks, it is infeasible to construct scripted verifiers akin to those used for explicit-knowledge tasks like the game of Go and theorem proving. This begs the question: when no sound verifier is available, can we use large vision and language models (VLMs), which are approximately omniscient, as scalable Behavior Critics to catch undesirable robot behaviors in videos? To answer this, we first construct a benchmark that contains diverse cases of goal-reaching yet undesirable robot policies. Then, we comprehensively evaluate VLM critics to gain a deeper understanding of their strengths and failure modes. Based on the evaluation, we provide guidelines on how to effectively utilize VLM critiques and showcase a practical way to integrate the feedback into an iterative process of policy refinement. The dataset and codebase are released at: https://guansuns.github.io/pages/vlm-critic.
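Using a VLM as a behavior critic amounts to showing it rollout frames and asking whether anything undesirable happened even though the goal was reached. The sketch below is a rough illustration under that reading; the `vlm` callable and the prompt wording are hypothetical, not the paper's benchmark protocol.

```python
from typing import Callable, Sequence

def critique_rollout(
    frames: Sequence,                     # sampled video frames of the robot execution
    task: str,
    vlm: Callable[[Sequence, str], str],  # placeholder multi-frame VLM interface
) -> str:
    """Ask a VLM critic for undesirable behaviors beyond bare task success (illustrative sketch)."""
    prompt = (
        f"The robot was asked to: {task}. The task goal appears satisfied. "
        "List any undesirable behaviors visible in these frames (e.g., deforming objects, "
        "unsafe motions, collisions), or reply 'none'."
    )
    return vlm(frames, prompt)            # free-form critique to feed back into policy refinement
```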
Title: A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation
Authors: Zhengbo Wang, Jian Liang, Lijun Sheng
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2402.04087v1
GitHub: https://github.com/mrflogs/ICLR24
Abstract: Contrastive Language-Image Pretraining (CLIP) has gained popularity for its remarkable zero-shot capacity. Recent research has focused on developing efficient fine-tuning methods, such as prompt learning and adapters, to enhance CLIP’s performance in downstream tasks. However, these methods still require additional training time and computational resources, which is undesirable for devices with limited resources. In this paper, we revisit a classical algorithm, Gaussian Discriminant Analysis (GDA), and apply it to the downstream classification of CLIP. Typically, GDA assumes that features of each class follow Gaussian distributions with identical covariance. By leveraging Bayes’ formula, the classifier can be expressed in terms of the class means and covariance, which can be estimated from the data without the need for training. To integrate knowledge from both visual and textual modalities, we ensemble it with the original zero-shot classifier within CLIP. Extensive results on 17 datasets validate that our method surpasses or achieves comparable results with state-of-the-art methods on few-shot classification, imbalanced learning, and out-of-distribution generalization. In addition, we extend our method to base-to-new generalization and unsupervised learning, once again demonstrating its superiority over competing approaches. Our code is publicly available at \url{https://github.com/mrflogs/ICLR24}.
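The training-free classifier described above follows directly from classical GDA with a shared covariance: each class is scored by a linear function built from its class mean and the inverse of the shared covariance. A minimal NumPy sketch is given below, assuming pre-extracted CLIP image features and uniform class priors; the paper's exact covariance estimator and ensembling weight may differ.

```python
import numpy as np

def gda_classifier(features: np.ndarray, labels: np.ndarray, eps: float = 1e-4):
    """Training-free GDA head on top of (e.g., CLIP) image features.

    Assumes one shared covariance across classes and uniform priors, as in classical GDA.
    Sketch only; the paper's exact estimator may differ (see https://github.com/mrflogs/ICLR24).
    """
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])      # class means
    centered = features - means[np.searchsorted(classes, labels)]                # shared covariance
    cov = centered.T @ centered / len(features) + eps * np.eye(features.shape[1])
    precision = np.linalg.inv(cov)
    W = precision @ means.T                                                      # [dim, num_classes]
    b = -0.5 * np.sum(means @ precision * means, axis=1)                         # per-class bias
    return W, b

def ensemble_logits(image_feats, W, b, zero_shot_logits, alpha: float = 1.0):
    """Combine the GDA head with CLIP's original zero-shot logits (the weighting is an assumption)."""
    return zero_shot_logits + alpha * (image_feats @ W + b)
```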
Title: An Open-source Benchmark of Deep Learning Models for Audio-visual Apparent and Self-reported Personality Recognition
Authors: Rongfan Liao, Siyang Song, Hatice Gunes
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2210.09138v2
GitHub: https://github.com/liaorongfan/DeepPersonality
Abstract: Personality determines a wide variety of human daily and working behaviours, and is crucial for understanding human internal and external states. In recent years, a large number of automatic personality computing approaches have been developed to predict either the apparent personality or self-reported personality of the subject based on non-verbal audio-visual behaviours. However, the majority of them suffer from complex and dataset-specific pre-processing steps and model training tricks. In the absence of a standardized benchmark with consistent experimental settings, it is not only impossible to fairly compare the real performances of these personality computing models, but also difficult to reproduce them. In this paper, we present the first reproducible audio-visual benchmarking framework to provide a fair and consistent evaluation of eight existing personality computing models (e.g., audio, visual and audio-visual) and seven standard deep learning models on both self-reported and apparent personality recognition tasks. Building upon a set of benchmarked models, we also investigate the impact of two previously-used long-term modelling strategies for summarising short-term/frame-level predictions on personality computing results. The results conclude that: (i) apparent personality traits, inferred from facial behaviours by most benchmarked deep learning models, show more reliability than self-reported ones; (ii) visual models frequently achieve superior performance to audio models on personality recognition; (iii) non-verbal behaviours contribute differently in predicting different personality traits; and (iv) our reproduced personality computing models generally achieve worse performance than their originally reported results. Our benchmark is publicly available at \url{https://github.com/liaorongfan/DeepPersonality}.
Title: Human-Like Geometric Abstraction in Large Pre-trained Neural Networks
Authors: Declan Campbell, Sreejan Kumar, Tyler Giallanza
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2402.04203v1
Abstract: Humans possess a remarkable capacity to recognize and manipulate abstract structure, which is especially apparent in the domain of geometry. Recent research in cognitive science suggests neural networks do not share this capacity, concluding that human geometric abilities come from discrete symbolic structure in human mental representations. However, progress in artificial intelligence (AI) suggests that neural networks begin to demonstrate more human-like reasoning after scaling up standard architectures in both model size and amount of training data. In this study, we revisit empirical results in cognitive science on geometric visual processing and identify three key biases in geometric visual processing: a sensitivity towards complexity, regularity, and the perception of parts and relations. We test tasks from the literature that probe these biases in humans and find that large pre-trained neural network models used in AI demonstrate more human-like abstract geometric processing.
Title: ViT-DD: Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection
Authors: Yunsheng Ma, Ziran Wang
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2209.09178v4
Abstract: Ensuring traffic safety and mitigating accidents in modern driving is of paramount importance, and computer vision technologies have the potential to significantly contribute to this goal. This paper presents a multi-modal Vision Transformer for Driver Distraction Detection (termed ViT-DD), which incorporates inductive information from training signals related to both distraction detection and driver emotion recognition. Additionally, a self-learning algorithm is developed, allowing for the seamless integration of driver data without emotion labels into the multi-task training process of ViT-DD. Experimental results reveal that the proposed ViT-DD surpasses existing state-of-the-art methods for driver distraction detection by 6.5% and 0.9% on the SFDDD and AUCDD datasets, respectively.
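The multi-task setup (one shared backbone, separate heads for distraction and emotion, with pseudo-labels filling in missing emotion annotations) can be sketched as follows. The module structure and loss weighting are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Shared-feature, two-head setup in the spirit of ViT-DD (illustrative, not the official model)."""

    def __init__(self, backbone: nn.Module, feat_dim: int, n_distraction: int, n_emotion: int):
        super().__init__()
        self.backbone = backbone                      # e.g., a ViT returning [B, feat_dim] features
        self.distraction_head = nn.Linear(feat_dim, n_distraction)
        self.emotion_head = nn.Linear(feat_dim, n_emotion)

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)
        return self.distraction_head(feats), self.emotion_head(feats)

def multitask_loss(d_logits, e_logits, d_labels, e_labels, emotion_weight: float = 0.5):
    """Emotion labels may be pseudo-labels produced by a pretrained recognizer (self-learning step)."""
    ce = nn.functional.cross_entropy
    return ce(d_logits, d_labels) + emotion_weight * ce(e_logits, e_labels)
```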
Title: Self-supervised visual learning for analyzing firearms trafficking activities on the Web
Authors: Sotirios Konstantakos, Despina Ioanna Chalkiadaki, Ioannis Mademlis
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2310.07975v2
Abstract: Automated visual firearms classification from RGB images is an important real-world task with applications in public space security, intelligence gathering and law enforcement investigations. When applied to images massively crawled from the World Wide Web (including social media and dark Web sites), it can serve as an important component of systems that attempt to identify criminal firearms trafficking networks, by analyzing Big Data from open-source intelligence. Deep Neural Networks (DNN) are the state-of-the-art methodology for achieving this, with Convolutional Neural Networks (CNN) being typically employed. The common transfer learning approach consists of pretraining on a large-scale, generic annotated dataset for whole-image classification, such as ImageNet-1k, and then finetuning the DNN on a smaller, annotated, task-specific, downstream dataset for visual firearms classification. Neither Visual Transformer (ViT) neural architectures nor Self-Supervised Learning (SSL) approaches have been so far evaluated on this critical task…
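The transfer-learning recipe mentioned above (generic ImageNet-1k pretraining followed by fine-tuning on a small task-specific dataset) is standard practice; a minimal torchvision-style sketch is shown below. The backbone choice, class count, and training loop are placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_finetune_model(num_classes: int) -> nn.Module:
    """ImageNet-pretrained CNN with a new classification head for the downstream classes (sketch)."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)  # generic pretraining
    model.fc = nn.Linear(model.fc.in_features, num_classes)                 # task-specific head
    return model

def finetune_step(model, images, labels, optimizer):
    """One supervised fine-tuning step on the downstream dataset."""
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```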
== diffusion policy@diffusion formulation@diffusion model ==
Title: Entropy-regularized Diffusion Policy with Q-Ensembles for Offline Reinforcement Learning
Authors: Ruoqi Zhang, Ziwei Luo, Jens Sjölund
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2402.04080v1
GitHub: https://github.com/ruoqizzz/Entropy-Regularized-Diffusion-Policy-with-QEnsemble
Abstract: This paper presents advanced techniques of training diffusion policies for offline reinforcement learning (RL). At the core is a mean-reverting stochastic differential equation (SDE) that transfers a complex action distribution into a standard Gaussian and then samples actions conditioned on the environment state with a corresponding reverse-time SDE, like a typical diffusion policy. We show that such an SDE has a solution that we can use to calculate the log probability of the policy, yielding an entropy regularizer that improves the exploration of offline datasets. To mitigate the impact of inaccurate value functions from out-of-distribution data points, we further propose to learn the lower confidence bound of Q-ensembles for more robust policy improvement. By combining the entropy-regularized diffusion policy with Q-ensembles in offline RL, our method achieves state-of-the-art performance on most tasks in D4RL benchmarks. Code is available at https://github.com/ruoqizzz/Entropy-Regularized-Diffusion-Policy-with-QEnsemble.
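The lower confidence bound over a Q-ensemble is a standard pessimism device: average the critics and subtract a multiple of their disagreement. The sketch below shows that computation; the penalty coefficient and ensemble handling are illustrative, not the paper's exact settings.

```python
import torch

def q_ensemble_lcb(q_values: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Lower confidence bound over an ensemble of critics.

    q_values: [num_critics, batch] tensor of Q(s, a) estimates for the same state-action pairs.
    Returns a pessimistic value used as the policy-improvement signal (sketch, not the paper's code).
    """
    mean = q_values.mean(dim=0)
    std = q_values.std(dim=0)
    return mean - beta * std  # penalize disagreement, which tends to be large for OOD actions
```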
Title: Polyp-DDPM: Diffusion-Based Semantic Polyp Synthesis for Enhanced Segmentation
Authors: Zolnamar Dorjsembe, Hsing-Kuo Pao, Furen Xiao
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2402.04031v1
GitHub: https://github.com/mobaidoctor/polyp-ddpm
Abstract: This study introduces Polyp-DDPM, a diffusion-based method for generating realistic images of polyps conditioned on masks, aimed at enhancing the segmentation of gastrointestinal (GI) tract polyps. Our approach addresses the challenges of data limitations, high annotation costs, and privacy concerns associated with medical images. By conditioning the diffusion model on segmentation masks (binary masks that represent abnormal areas), Polyp-DDPM outperforms state-of-the-art methods in terms of image quality (achieving a Frechet Inception Distance (FID) score of 78.47, compared to scores above 83.79) and segmentation performance (achieving an Intersection over Union (IoU) of 0.7156, versus less than 0.6694 for synthetic images from baseline models and 0.7067 for real data). Our method generates a high-quality, diverse synthetic dataset for training, thereby enhancing polyp segmentation models to be comparable with real images and offering greater data augmentation capabilities to improve segmentation models. The source code and pretrained weights for Polyp-DDPM are made publicly available at https://github.com/mobaidoctor/polyp-ddpm.
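Conditioning a DDPM on a segmentation mask is commonly implemented by concatenating the mask to the noisy image at every denoising step. The training step below illustrates that pattern; the noise schedule, UNet interface, and channel layout are assumptions, not necessarily Polyp-DDPM's exact design.

```python
import torch
import torch.nn.functional as F

def mask_conditioned_ddpm_step(unet, images, masks, alphas_cumprod, optimizer):
    """One denoising-score-matching step with mask conditioning (illustrative sketch).

    images: [B, 3, H, W] polyp images in [-1, 1]; masks: [B, 1, H, W] binary masks.
    alphas_cumprod: [T] cumulative noise-schedule products; unet maps (x_cat, t) -> predicted noise.
    """
    b = images.size(0)
    t = torch.randint(0, len(alphas_cumprod), (b,), device=images.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(images)
    noisy = a_bar.sqrt() * images + (1 - a_bar).sqrt() * noise  # forward diffusion q(x_t | x_0)
    x_cat = torch.cat([noisy, masks], dim=1)                    # condition by channel concatenation
    loss = F.mse_loss(unet(x_cat, t), noise)                    # predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```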
Title: EscherNet: A Generative Model for Scalable View Synthesis
Authors: Xin Kong, Shikun Liu, Xiaoyang Lyu
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2402.03908v1
Project: https://kxhit.github.io/EscherNet
Abstract: We introduce EscherNet, a multi-view conditioned diffusion model for view synthesis. EscherNet learns implicit and generative 3D representations coupled with a specialised camera positional encoding, allowing precise and continuous relative control of the camera transformation between an arbitrary number of reference and target views. EscherNet offers exceptional generality, flexibility, and scalability in view synthesis: it can generate more than 100 consistent target views simultaneously on a single consumer-grade GPU, despite being trained with a fixed number of 3 reference views to 3 target views. As a result, EscherNet not only addresses zero-shot novel view synthesis, but also naturally unifies single- and multi-image 3D reconstruction, combining these diverse tasks into a single, cohesive framework. Our extensive experiments demonstrate that EscherNet achieves state-of-the-art performance in multiple benchmarks, even when compared to methods specifically tailored for each individual problem. This remarkable versatility opens up new directions for designing scalable neural architectures for 3D vision. Project page: \url{https://kxhit.github.io/EscherNet}.
Title: SDEMG: Score-based Diffusion Model for Surface Electromyographic Signal Denoising
Authors: Yu-Tung Liu, Kuan-Chen Wang, Kai-Chun Liu
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2402.03808v1
GitHub: https://github.com/tonyliu0910/SDEMG
Abstract: Surface electromyography (sEMG) recordings can be influenced by electrocardiogram (ECG) signals when the muscle being monitored is close to the heart. Several existing methods use signal-processing-based approaches, such as high-pass filters and template subtraction, while some derive mapping functions to restore clean sEMG signals from noisy sEMG (sEMG with ECG interference). Recently, the score-based diffusion model, a renowned generative model, has been introduced to generate high-quality and accurate samples with noisy input data. In this study, we propose a novel approach, termed SDEMG, a score-based diffusion model for sEMG signal denoising. To evaluate the proposed SDEMG approach, we conduct experiments to reduce noise in sEMG signals, employing data from an openly accessible source, the Non-Invasive Adaptive Prosthetics database, along with ECG signals from the MIT-BIH Normal Sinus Rhythm Database. The experimental results indicate that SDEMG outperforms comparative methods and produces high-quality sEMG samples. The source code of the SDEMG framework is available at: https://github.com/tonyliu0910/SDEMG
Title: Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation
Authors: Junyoung Seo, Wooseok Jang, Min-Seop Kwak
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2303.07937v4
Project: https://ku-cvlab.github.io/3DFuse/
Abstract: Text-to-3D generation has recently shown rapid progress with the advent of score distillation, a methodology of using pretrained text-to-2D diffusion models to optimize a neural radiance field (NeRF) in the zero-shot setting. However, the lack of 3D awareness in the 2D diffusion models destabilizes score distillation-based methods from reconstructing a plausible 3D scene. To address this issue, we propose 3DFuse, a novel framework that incorporates 3D awareness into pretrained 2D diffusion models, enhancing the robustness and 3D consistency of score distillation-based methods. We realize this by first constructing a coarse 3D structure of a given text prompt and then utilizing a projected, view-specific depth map as a condition for the diffusion model. Additionally, we introduce a training strategy that enables the 2D diffusion model to learn to handle the errors and sparsity within the coarse 3D structure for robust generation, as well as a method for ensuring semantic consistency throughout all viewpoints of the scene. Our framework surpasses the limitations of prior art, and has significant implications for 3D-consistent generation with 2D diffusion models.
Title: Improving and Unifying Discrete&Continuous-time Discrete Denoising Diffusion
Authors: Lingxiao Zhao, Xueying Ding, Lijun Yu
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2402.03701v1
GitHub: https://github.com/LingxiaoShawn/USD3
Abstract: Discrete diffusion models have seen a surge of attention with applications on naturally discrete data such as language and graphs. Although discrete-time discrete diffusion has been established for a while, only recently did Campbell et al. (2022) introduce the first framework for continuous-time discrete diffusion. However, their training and sampling processes differ significantly from the discrete-time version, necessitating nontrivial approximations for tractability. In this paper, we first present a series of mathematical simplifications of the variational lower bound that enable more accurate and easy-to-optimize training for discrete diffusion. In addition, we derive a simple formulation for backward denoising that enables exact and accelerated sampling, and importantly, an elegant unification of discrete-time and continuous-time discrete diffusion. Thanks to simpler analytical formulations, both forward and now also backward probabilities can flexibly accommodate any noise distribution, including different noise distributions for multi-element objects. Experiments show that our proposed USD3 (for Unified Simplified Discrete Denoising Diffusion) outperforms all SOTA baselines on established datasets. We open-source our unified code at https://github.com/LingxiaoShawn/USD3.
== Visual Navigation@VLN @ Visual Language Navigation ==
Title: SubPipe: A Submarine Pipeline Inspection Dataset for Segmentation and Visual-inertial Localization
Authors: Olaya Álvarez-Tuñón, Luiza Ribeiro Marnet, László Antal
PubTime: 2024-02-06
Downlink: http://arxiv.org/abs/2401.17907v2
GitHub: https://github.com/remaro-network/SubPipe-dataset
Abstract: This paper presents SubPipe, an underwater dataset for SLAM, object detection, and image segmentation. SubPipe has been recorded using a light autonomous underwater vehicle (LAUV) operated by OceanScan MST, carrying a sensor suite that includes two cameras, a side-scan sonar, and an inertial navigation system, among other sensors. The AUV has been deployed in a pipeline inspection environment with a submarine pipe partially covered by sand. The AUV’s pose ground truth is estimated from the navigation sensors. The side-scan sonar and RGB images include object detection and segmentation annotations, respectively. State-of-the-art segmentation, object detection, and SLAM methods are benchmarked on SubPipe to demonstrate the dataset’s challenges and opportunities for leveraging computer vision algorithms. To the authors’ knowledge, this is the first annotated underwater dataset providing a real pipeline inspection scenario. The dataset and experiments are publicly available online at https://github.com/remaro-network/SubPipe-dataset
Title: VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation
Authors: Jialu Li, Aishwarya Padmakumar, Gaurav Sukhatme
PubTime: 2024-02-05
Downlink: http://arxiv.org/abs/2402.03561v1
Abstract: Outdoor Vision-and-Language Navigation (VLN) requires an agent to navigate through realistic 3D outdoor environments based on natural language instructions. The performance of existing VLN methods is limited by insufficient diversity in navigation environments and limited training data. To address these issues, we propose VLN-Video, which utilizes the diverse outdoor environments present in driving videos in multiple cities in the U.S., augmented with automatically generated navigation instructions and actions, to improve outdoor VLN performance. VLN-Video combines the best of intuitive classical approaches and modern deep learning techniques, using template infilling to generate grounded navigation instructions, combined with an image-rotation-similarity-based navigation action predictor, to obtain VLN-style data from driving videos for pretraining deep learning VLN models. We pre-train the model on the Touchdown dataset and our video-augmented dataset created from driving videos with three proxy tasks: Masked Language Modeling, Instruction and Trajectory Matching, and Next Action Prediction, so as to learn temporally-aware and visually-aligned instruction representations. The learned instruction representation is adapted to the state-of-the-art navigator when fine-tuning on the Touchdown dataset. Empirical results demonstrate that VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in task completion rate, achieving a new state-of-the-art on the Touchdown dataset.
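Template infilling for grounded instruction generation can be pictured as filling slot-based sentence templates with entities and actions detected in the driving video. The templates and slot names below are invented for illustration; the paper's actual templates and detectors differ.

```python
import random
from typing import Dict, List

TEMPLATES = [
    "Go {action} on {street} and stop near the {landmark}.",
    "Head {action} until you pass the {landmark}, then continue along {street}.",
]

def fill_instruction(slots: Dict[str, str], templates: List[str] = TEMPLATES) -> str:
    """Generate a grounded instruction by filling a template with per-frame detections (sketch)."""
    return random.choice(templates).format(**slots)

# Example usage with hypothetical detections extracted from a driving video frame:
print(fill_instruction({"action": "straight", "street": "Main Street", "landmark": "red awning"}))
```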