Classifier-free guidance has become a staple for conditional generation with denoising diffusion models. However, a comprehensive understanding of classifier-free guidance is still missing. In this work, we carry out an empirical study to provide a fresh perspective on classifier-free guidance. Concretely, instead of solely focusing on classifier-free guidance, we trace back to its root, i.e., classifier guidance, pinpoint the key assumption behind the derivation, and conduct a systematic study to understand the role of the classifier. We find that both classifier guidance and classifier-free guidance achieve conditional generation by pushing the denoising diffusion trajectories away from decision boundaries, i.e., areas where conditional information is usually entangled and hard to learn. Based on this classifier-centric understanding, we propose a generic postprocessing step built upon flow-matching to shrink the gap between the learned distribution of a pre-trained denoising diffusion model and the real data distribution, primarily around the decision boundaries. Experiments on various datasets verify the effectiveness of the proposed approach.
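For context, the guidance discussed above combines a conditional and an unconditional score estimate at every denoising step. The snippet below is a minimal sketch of that standard classifier-free guidance update; the `eps_model` interface and guidance scale are illustrative, not tied to this paper's code.

```python
def cfg_noise_estimate(eps_model, x_t, t, cond, guidance_scale=3.0):
    """Standard classifier-free guidance: blend the unconditional and
    conditional noise predictions at a single denoising step."""
    eps_uncond = eps_model(x_t, t, None)   # unconditional branch (condition dropped)
    eps_cond = eps_model(x_t, t, cond)     # conditional branch
    # Scales > 1 push the trajectory further toward the conditional mode,
    # i.e., away from regions where classes are entangled (decision boundaries).
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```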
https://arxiv.org/abs/2503.10638
Despite promising performance on open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against black-box commercial LVLMs. Analyzing failed adversarial perturbations reveals that the learned perturbations typically originate from a uniform distribution and lack clear semantic details, resulting in unintended responses. This critical absence of semantic information leads commercial LVLMs to either ignore the perturbation entirely or misinterpret its embedded semantics, thereby causing the attack to fail. To overcome these issues, we notice that identifying core semantic objects is a key objective for models trained with various datasets and methodologies. This insight motivates our approach, which refines semantic clarity by encoding explicit semantic details within local regions, thus ensuring interoperability and capturing finer-grained features, and by concentrating modifications on semantically rich areas rather than applying them uniformly. To achieve this, we propose a simple yet highly effective solution: at each optimization step, the adversarial image is randomly cropped with a controlled aspect ratio and scale, resized, and then aligned with the target image in the embedding space. Experimental results confirm our hypothesis. Our adversarial examples crafted with local-aggregated perturbations focused on crucial regions exhibit surprisingly good transferability to commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and even reasoning models like o1, Claude-3.7-thinking and Gemini-2.0-flash-thinking. Our approach achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly outperforming all prior state-of-the-art attack methods. Our optimized adversarial examples under different configurations and training code are available at this https URL.
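A rough sketch of the optimization loop described above, assuming a CLIP-style image `encoder` and a PGD-style update; the hyperparameters and helper names are illustrative assumptions, not the released code.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

def craft_adversarial(encoder, source_img, target_img, steps=300,
                      eps=16 / 255, step_size=1 / 255,
                      scale=(0.5, 1.0), ratio=(0.75, 1.33)):
    """At each step, a random crop (controlled scale/aspect ratio) of the
    adversarial image is resized and pulled toward the target image in the
    encoder's embedding space, concentrating perturbations on local regions."""
    crop = transforms.RandomResizedCrop(source_img.shape[-1], scale=scale, ratio=ratio)
    delta = torch.zeros_like(source_img, requires_grad=True)
    with torch.no_grad():
        target_emb = F.normalize(encoder(target_img), dim=-1)
    for _ in range(steps):
        view = crop(source_img + delta)                       # random local view
        emb = F.normalize(encoder(view), dim=-1)
        loss = 1 - (emb * target_emb).sum(dim=-1).mean()      # cosine distance to target
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()            # signed-gradient step
            delta.clamp_(-eps, eps)                           # L_inf budget
            delta.grad.zero_()
    return (source_img + delta).clamp(0, 1).detach()
```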
https://arxiv.org/abs/2503.10635
Expressing confidence is challenging for embodied agents navigating dynamic multimodal environments, where uncertainty arises from both perception and decision-making processes. We present the first work investigating embodied confidence elicitation in open-ended multimodal environments. We introduce Elicitation Policies, which structure confidence assessment across inductive, deductive, and abductive reasoning, along with Execution Policies, which enhance confidence calibration through scenario reinterpretation, action sampling, and hypothetical reasoning. Evaluating agents in calibration and failure prediction tasks within the Minecraft environment, we show that structured reasoning approaches, such as Chain-of-Thoughts, improve confidence calibration. However, our findings also reveal persistent challenges in distinguishing uncertainty, particularly under abductive settings, underscoring the need for more sophisticated embodied confidence elicitation methods.
https://arxiv.org/abs/2503.10628
The rapid advancement of Large Multi-modal Models (LMMs) has enabled their application in scientific problem-solving, yet their fine-grained capabilities remain under-explored. In this paper, we introduce SciVerse, a multi-modal scientific evaluation benchmark to thoroughly assess LMMs across 5,735 test instances in five distinct versions. We aim to investigate three key dimensions of LMMs: scientific knowledge comprehension, multi-modal content interpretation, and Chain-of-Thought (CoT) reasoning. To unveil whether LMMs possess sufficient scientific expertise, we first transform each problem into three versions containing different levels of knowledge required for solving, i.e., Knowledge-free, -lite, and -rich. Then, to explore how LMMs interpret multi-modal scientific content, we annotate another two versions, i.e., Vision-rich and -only, moving more of the question information from text into diagrams. Comparing the results of different versions, SciVerse systematically examines the professional knowledge stock and visual perception skills of LMMs in scientific domains. In addition, to rigorously assess CoT reasoning, we propose a new scientific CoT evaluation strategy, conducting a step-wise assessment of knowledge and logical errors in model outputs. Our extensive evaluation of different LMMs on SciVerse reveals critical limitations in their scientific proficiency and provides new insights into future developments. Project page: this https URL
https://arxiv.org/abs/2503.10627
Acquiring physically plausible motor skills across diverse and unconventional morphologies, including humanoid robots, quadrupeds, and animals, is essential for advancing character simulation and robotics. Traditional methods, such as reinforcement learning (RL), are task- and body-specific, require extensive reward function engineering, and do not generalize well. Imitation learning offers an alternative but relies heavily on high-quality expert demonstrations, which are difficult to obtain for non-human morphologies. Video diffusion models, on the other hand, are capable of generating realistic videos of various morphologies, from humans to ants. Leveraging this capability, we propose a data-independent approach for skill acquisition that learns 3D motor skills from 2D-generated videos, with generalization capability to unconventional and non-human forms. Specifically, we guide the imitation learning process with vision transformers, performing video-based comparisons by computing pairwise distances between video embeddings. Along with the video-encoding distance, we also use a computed similarity between segmented video frames as a guidance reward. We validate our method on locomotion tasks involving unique body configurations. In humanoid robot locomotion tasks, we demonstrate that 'No-data Imitation Learning' (NIL) outperforms baselines trained on 3D motion-capture data. Our results highlight the potential of leveraging generative video models for physically plausible skill learning with diverse morphologies, effectively replacing data collection with data generation for imitation learning.
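The guidance reward sketched below illustrates the two terms mentioned above, a video-embedding distance from a vision transformer plus a similarity between segmented frames; the `vit_encoder` and `segmenter` interfaces are assumptions for illustration, not the paper's implementation.

```python
import torch.nn.functional as F

def guidance_reward(vit_encoder, segmenter, agent_frames, generated_frames):
    """Reward the agent for matching the generated (reference) video:
    one term compares whole-video embeddings, the other compares
    segmentation masks frame by frame."""
    agent_emb = F.normalize(vit_encoder(agent_frames).mean(dim=0), dim=-1)
    ref_emb = F.normalize(vit_encoder(generated_frames).mean(dim=0), dim=-1)
    embedding_term = (agent_emb * ref_emb).sum()                  # video-level similarity
    agent_masks = segmenter(agent_frames).float().flatten(1)
    ref_masks = segmenter(generated_frames).float().flatten(1)
    seg_term = F.cosine_similarity(agent_masks, ref_masks, dim=-1).mean()
    return embedding_term + seg_term
```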
https://arxiv.org/abs/2503.10626
Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation. Recent advances in 3D human reconstruction mainly focus on static human modeling, and their reliance on synthetic 3D scans for training limits their generalization ability. Conversely, optimization-based video methods achieve higher fidelity but demand controlled capture conditions and computationally intensive refinement processes. Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feed-forward pass. Our model leverages a multimodal transformer architecture to effectively encode the human body positional features and image features with an attention mechanism, enabling detailed preservation of clothing geometry and texture. To further boost face identity preservation and fine-detail recovery, we propose a head feature pyramid encoding scheme to aggregate multi-scale features of the head regions. Extensive experiments demonstrate that our LHM generates plausible animatable humans in seconds without post-processing for the face and hands, outperforming existing methods in both reconstruction accuracy and generalization ability.
https://arxiv.org/abs/2503.10625
Fitting a body to a 3D clothed human point cloud is a common yet challenging task. Traditional optimization-based approaches use multi-stage pipelines that are sensitive to pose initialization, while recent learning-based methods often struggle with generalization across diverse poses and garment types. We propose Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline that estimates cloth-to-body surface mapping through locally approximate SE(3) equivariance, encoding tightness as displacement vectors from the cloth surface to the underlying body. Following this mapping, pose-invariant body features regress sparse body markers, simplifying clothed human fitting into an inner-body marker fitting task. Extensive experiments on CAPE and 4D-Dress show that ETCH significantly outperforms state-of-the-art methods -- both tightness-agnostic and tightness-aware -- in body fitting accuracy on loose clothing (16.7% ~ 69.5%) and in shape accuracy (average 49.9%). Our equivariant tightness design can even reduce directional errors by 67.2% ~ 89.8% in one-shot (or out-of-distribution) settings. Qualitative results demonstrate strong generalization of ETCH across challenging poses, unseen shapes, loose clothing, and non-rigid dynamics. We will release the code and models soon for research purposes at this https URL.
https://arxiv.org/abs/2503.10624
Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, $S$-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.
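A minimal PyTorch sketch of the operation defined above; the learnable affine weight and bias mirror what a normalization layer would provide and are an assumption beyond the formula stated in the abstract.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: DyT(x) = tanh(alpha * x), applied element-wise as a
    drop-in replacement for a normalization layer in a Transformer block."""
    def __init__(self, dim, alpha_init=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))   # learnable scale inside tanh
        self.weight = nn.Parameter(torch.ones(dim))                # affine terms, as LayerNorm has
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.weight * torch.tanh(self.alpha * x) + self.bias
```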
https://arxiv.org/abs/2503.10622
We introduce Siege, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety through a tree search perspective. Unlike single-turn jailbreaks that rely on one meticulously engineered prompt, Siege expands the conversation at each turn in a breadth-first fashion, branching out multiple adversarial prompts that exploit partial compliance from previous responses. By tracking these incremental policy leaks and re-injecting them into subsequent queries, Siege reveals how minor concessions can accumulate into fully disallowed outputs. Evaluations on the JailbreakBench dataset show that Siege achieves a 100% success rate on GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries than baselines such as Crescendo or GOAT. This tree search methodology offers an in-depth view of how model safeguards degrade over successive dialogue turns, underscoring the urgency of robust multi-turn testing procedures for language models.
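To make the breadth-first expansion concrete, here is a schematic sketch of the search loop under assumed interfaces; `chat`, `gen_attacks`, and `scorer` are placeholders for the target model, the attacker prompt generator, and the partial-compliance judge, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    history: list = field(default_factory=list)   # conversation so far
    leaked: list = field(default_factory=list)    # partial-compliance fragments collected

def siege_search(chat, gen_attacks, scorer, goal, width=4, depth=5):
    """Breadth-first multi-turn search: branch several adversarial prompts per node,
    re-inject earlier partial leaks, and stop when a fully disallowed output appears."""
    frontier = [Node()]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for prompt in gen_attacks(goal, node.history, node.leaked, n=width):
                reply = chat(node.history + [prompt])
                score, fragments = scorer(reply, goal)        # degree of policy leakage
                if score >= 1.0:                              # full jailbreak reached
                    return node.history + [prompt, reply]
                next_frontier.append(Node(node.history + [prompt, reply],
                                          node.leaked + fragments))
        # Keep only the most promising branches (pruned breadth-first expansion).
        frontier = sorted(next_frontier, key=lambda n: len(n.leaked), reverse=True)[:width]
    return None
```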
https://arxiv.org/abs/2503.10619
Adapting large language models to multiple tasks can cause cross-skill interference, where improvements for one skill degrade another. While methods such as LoRA impose orthogonality constraints at the weight level, they do not fully address interference in hidden-state representations. We propose Compositional Subspace Representation Fine-tuning (CS-ReFT), a novel representation-based approach that learns multiple orthonormal subspace transformations, each specializing in a distinct skill, and composes them via a lightweight router. By isolating these subspace edits in the hidden state, rather than weight matrices, CS-ReFT prevents cross-task conflicts more effectively. On the AlpacaEval benchmark, applying CS-ReFT to Llama-2-7B achieves a 93.94% win rate, surpassing GPT-3.5 Turbo (86.30%) while requiring only 0.0098% of model parameters. These findings show that specialized representation edits, composed via a simple router, significantly enhance multi-task instruction following with minimal overhead.
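A hypothetical sketch of the hidden-state intervention described above: each skill gets a low-rank subspace edit and a lightweight router gates their composition. The sigmoid router, the shapes, and the absence of an explicit orthonormality constraint on `R` are simplifying assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CSReFTLayer(nn.Module):
    """Compose several ReFT-style subspace edits of the hidden state,
    one per skill, weighted by a lightweight router."""
    def __init__(self, hidden_dim, rank, num_skills):
        super().__init__()
        # In the paper the subspaces are orthonormal; random init is a simplification here.
        self.R = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, hidden_dim) * 0.02) for _ in range(num_skills)])
        self.W = nn.ModuleList(
            [nn.Linear(hidden_dim, rank) for _ in range(num_skills)])
        self.router = nn.Linear(hidden_dim, num_skills)

    def forward(self, h):                                  # h: (..., hidden_dim)
        gates = torch.sigmoid(self.router(h))              # per-skill gate in [0, 1]
        out = h
        for i, (R, W) in enumerate(zip(self.R, self.W)):
            # Edit only the rank-dimensional subspace spanned by R: R^T (Wh + b - Rh).
            edit = (W(h) - h @ R.t()) @ R
            out = out + gates[..., i:i + 1] * edit
        return out
```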
https://arxiv.org/abs/2503.10617
Emotional Mimicry Intensity (EMI) estimation serves as a critical technology for understanding human social behavior and enhancing human-computer interaction experiences, where the core challenge lies in dynamic correlation modeling and robust fusion of multimodal temporal signals. To address the limitations of existing methods in insufficient exploitation of modal synergistic effects, noise sensitivity, and limited fine-grained alignment capabilities, this paper proposes a dual-stage cross-modal alignment framework. First, we construct vision-text and audio-text contrastive learning networks based on an improved CLIP architecture, achieving preliminary alignment in the feature space through modality-decoupled pre-training. Subsequently, we design a temporal-aware dynamic fusion module that combines Temporal Convolutional Networks (TCN) and gated bidirectional LSTM to respectively capture the macro-evolution patterns of facial expressions and local dynamics of acoustic features. Innovatively, we introduce a quality-guided modality fusion strategy that enables modality compensation under occlusion and noisy scenarios through differentiable weight allocation. Experimental results on the Hume-Vidmimic2 dataset demonstrate that our method achieves an average Pearson correlation coefficient of 0.35 across six emotion dimensions, outperforming the best baseline by 40%. Ablation studies further validate the effectiveness of the dual-stage training strategy and dynamic fusion mechanism, providing a novel technical pathway for fine-grained emotion analysis in open environments.
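As an illustration of the quality-guided fusion strategy, the sketch below scores each modality feature and turns the scores into differentiable fusion weights, so an occluded or noisy modality can be down-weighted and compensated by the others; module names and shapes are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class QualityGuidedFusion(nn.Module):
    """Score each modality's feature, softmax the scores into fusion weights,
    and return the weighted sum; degraded modalities receive lower weight."""
    def __init__(self, dim, num_modalities):
        super().__init__()
        self.scorers = nn.ModuleList([nn.Linear(dim, 1) for _ in range(num_modalities)])

    def forward(self, feats):                        # feats: list of (batch, dim) tensors
        scores = torch.cat([s(f) for s, f in zip(self.scorers, feats)], dim=-1)
        weights = torch.softmax(scores, dim=-1)      # (batch, num_modalities), differentiable
        stacked = torch.stack(feats, dim=1)          # (batch, num_modalities, dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)
```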
https://arxiv.org/abs/2503.10603
Object Hallucination (OH) has been acknowledged as one of the major trustworthy challenges in Large Vision-Language Models (LVLMs). Recent advancements in Large Language Models (LLMs) indicate that internal states, such as hidden states, encode the "overall truthfulness" of generated responses. However, it remains under-explored how internal states in LVLMs function and whether they could serve as "per-token" hallucination indicators, which is essential for mitigating OH. In this paper, we first conduct an in-depth exploration of LVLM internal states in relation to OH issues and discover that (1) LVLM internal states are high-specificity per-token indicators of hallucination behaviors. Moreover, (2) different LVLMs encode universal patterns of hallucinations in common latent subspaces, indicating that there exist "generic truthful directions" shared by various LVLMs. Based on these discoveries, we propose Truthful-Guided Pre-Intervention (TruthPrInt) that first learns the truthful direction of LVLM decoding and then applies truthful-guided inference-time intervention during LVLM decoding. We further propose ComnHallu to enhance both cross-LVLM and cross-data hallucination detection transferability by constructing and aligning hallucination latent subspaces. We evaluate TruthPrInt in extensive experimental settings, including in-domain and out-of-domain scenarios, over popular LVLMs and OH benchmarks. Experimental results indicate that TruthPrInt significantly outperforms state-of-the-art methods. Codes will be available at this https URL.
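The core mechanism, learning a truthful direction in hidden-state space and steering decoding along it, can be sketched as below; the difference-of-means direction and the additive intervention are assumed simplifications for illustration, not the paper's exact procedure.

```python
import torch

def learn_truthful_direction(truthful_hiddens, hallucinated_hiddens):
    """Estimate a steering direction from hidden states collected on truthful
    vs. hallucinated tokens (difference of class means, then normalized)."""
    direction = truthful_hiddens.mean(dim=0) - hallucinated_hiddens.mean(dim=0)
    return direction / direction.norm()

def intervene(hidden_state, direction, strength=2.0):
    """Inference-time intervention: shift the per-token hidden state along
    the truthful direction before it is passed on during decoding."""
    return hidden_state + strength * direction
```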
https://arxiv.org/abs/2503.10602
Despite classical statistical theory predicting severe overfitting, modern massively overparameterized neural networks still generalize well. This unexpected property is attributed to the network's so-called implicit bias, which describes its propensity to converge to solutions that generalize effectively, among the many possible that correctly label the training data. The aim of our research is to explore this bias from a new perspective, focusing on how non-linear activation functions contribute to shaping it. First, we introduce a reparameterization which removes a continuous weight rescaling symmetry. Second, in the kernel regime, we leverage this reparameterization to generalize recent findings that relate shallow Neural Networks to the Radon transform, deriving an explicit formula for the implicit bias induced by a broad class of activation functions. Specifically, by utilizing the connection between the Radon transform and the Fourier transform, we interpret the kernel regime's inductive bias as minimizing a spectral seminorm that penalizes high-frequency components, in a manner dependent on the activation function. Finally, in the adaptive regime, we demonstrate the existence of local dynamical attractors that facilitate the formation of clusters of hyperplanes where the input to a neuron's activation function is zero, yielding alignment between many neurons' response functions. We confirm these theoretical results with simulations. All together, our work provides a deeper understanding of the mechanisms underlying the generalization capabilities of overparameterized neural networks and its relation with the implicit bias, offering potential pathways for designing more efficient and robust models.
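For readers who want a concrete shape in mind, a spectral seminorm that "penalizes high-frequency components in an activation-dependent way" can be written schematically as below; the weight $w_\sigma$ is a placeholder standing in for the activation-dependent factor, not the explicit formula derived in the paper.

```latex
\|f\|_{\sigma} \;=\; \int_{\mathbb{R}^d} \big|\hat{f}(\xi)\big|^{2}\, w_{\sigma}\!\big(\lVert\xi\rVert\big)\, d\xi,
\qquad \text{with } w_{\sigma} \text{ increasing in } \lVert\xi\rVert .
```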
https://arxiv.org/abs/2503.10587
Vision-Language Models have made significant progress on many perception-focused tasks; however, their progress on reasoning-focused tasks seems limited due to the lack of high-quality and diverse training data. In this work, we aim to address the scarcity of reasoning-focused multimodal datasets. We propose VisualWebInstruct, a novel approach that leverages search engines to create a diverse, high-quality dataset spanning multiple disciplines such as math, physics, finance, and chemistry. Starting with meticulously selected 30,000 seed images, we employ Google Image search to identify websites containing similar images. We collect and process the HTML from over 700K unique URL sources. Through a pipeline of content extraction, filtering, and synthesis, we build a dataset of approximately 900K question-answer pairs, with 40% being visual QA pairs and the rest text QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance gains: (1) training from Llava-OV-mid shows 10-20% absolute point gains across benchmarks, and (2) training from MAmmoTH-VL shows a 5% absolute gain. Our best model, MAmmoTH-VL2, shows state-of-the-art performance within the 10B parameter class on MMMU-Pro-std (40.7%), MathVerse (42.6%), and DynaMath (55.7%). These remarkable results highlight the effectiveness of our dataset in enhancing VLMs' reasoning capabilities for complex multimodal tasks.
https://arxiv.org/abs/2503.10582
With the rapid advancement of large language models (LLMs) and vision-language models (VLMs), significant progress has been made in developing open-vocabulary robotic manipulation systems. However, many existing approaches overlook the importance of object dynamics, limiting their applicability to more complex, dynamic tasks. In this work, we introduce KUDA, an open-vocabulary manipulation system that integrates dynamics learning and visual prompting through keypoints, leveraging both VLMs and learning-based neural dynamics models. Our key insight is that a keypoint-based target specification is simultaneously interpretable by VLMs and can be efficiently translated into cost functions for model-based planning. Given language instructions and visual observations, KUDA first assigns keypoints to the RGB image and queries the VLM to generate target specifications. These abstract keypoint-based representations are then converted into cost functions, which are optimized using a learned dynamics model to produce robotic trajectories. We evaluate KUDA on a range of manipulation tasks, including free-form language instructions across diverse object categories, multi-object interactions, and deformable or granular objects, demonstrating the effectiveness of our framework. The project page is available at this http URL.
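As an illustration of how a keypoint-based target specification can be "efficiently translated into cost functions," the sketch below assumes the specification maps keypoint indices to desired 3D locations; a learned dynamics model would then be rolled out and actions optimized against this cost (e.g., with a sampling-based planner). The specification format is an assumption, not the paper's exact interface.

```python
import torch

def keypoint_cost(predicted_keypoints, target_spec):
    """Sum of squared distances between predicted keypoint positions
    (e.g., from rolling out a learned dynamics model) and the target
    locations named in the VLM-generated specification."""
    cost = torch.zeros(())
    for idx, target_xyz in target_spec.items():
        cost = cost + ((predicted_keypoints[idx] - torch.as_tensor(target_xyz)) ** 2).sum()
    return cost
```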
https://arxiv.org/abs/2503.10546
This work concerns the path-star task, a minimal example of searching over a graph. The graph, $G$, is star-shaped with $D$ arms radiating from a start node, $s$. A language model (LM) is given $G$, $s$, and a target node $t$, which ends one of the arms and is tasked with generating the arm containing $t$. The minimal nature of this task means only a single choice needs to be made: which of the $D$ arms contains $t$? Decoder-only LMs fail to solve this elementary task above $1/D$ chance due to a learned shortcut that absorbs training supervision. We show how this pathology is caused by excess supervision and we present a series of solutions demonstrating that the task is solvable via decoder-only LMs. We find that the task's minimal nature causes its difficulty, as it prevents task decomposition. Our solutions provide insight into the pathology and its implications for LMs trained via next-token prediction.
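A minimal generator for the task instance described above may help fix notation: $D$ arms of equal length radiate from the start node $s$, and the model must output the arm ending at the target $t$. The serialization details are assumptions for illustration.

```python
import random

def make_path_star(num_arms, arm_length, vocab_size=1000, seed=0):
    """Build one path-star instance: a shuffled edge list, the start node s,
    the target node t (end of one arm), and the gold arm the LM must generate."""
    rng = random.Random(seed)
    nodes = rng.sample(range(vocab_size), 1 + num_arms * arm_length)
    s, rest = nodes[0], nodes[1:]
    arms = [[s] + rest[i * arm_length:(i + 1) * arm_length] for i in range(num_arms)]
    edges = [(arm[j], arm[j + 1]) for arm in arms for j in range(arm_length)]
    rng.shuffle(edges)                      # the graph is given as an unordered edge list
    gold_arm = rng.choice(arms)
    t = gold_arm[-1]
    return edges, s, t, gold_arm
```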
https://arxiv.org/abs/2503.10542
Support Vector Regression (SVR) and its variants are widely used for regression tasks; however, because their solution requires solving an expensive quadratic programming problem, their applicability is limited, especially on large datasets. Additionally, SVR uses an epsilon-insensitive loss function that is sensitive to outliers and can therefore adversely affect its performance. We propose Granular Ball Support Vector Regression (GBSVR) to tackle the regression problem using the granular-ball concept. These balls are useful for simplifying complex data spaces in machine learning tasks; however, to the best of our knowledge, they have not been sufficiently explored for regression problems. Granular balls group the data points into balls based on their proximity and reduce the computational cost of SVR by replacing the large number of data points with far fewer granular balls. This work also suggests a discretization method for continuous-valued attributes to facilitate the construction of granular balls. The effectiveness of the proposed approach is evaluated on several benchmark datasets, where it outperforms existing state-of-the-art approaches.
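A rough sketch of the general idea, not the paper's algorithm: group nearby points into a small number of balls, then fit SVR on the ball centers so the quadratic program scales with the number of balls rather than the number of points. The use of k-means and of mean targets per ball is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVR

def granular_ball_svr(X, y, num_balls=50, **svr_kwargs):
    """Replace the raw training set with far fewer 'balls' (cluster centers),
    each labeled with the mean target of its members, then fit a standard SVR."""
    km = KMeans(n_clusters=num_balls, n_init=10, random_state=0).fit(X)
    centers = km.cluster_centers_
    targets = np.array([y[km.labels_ == k].mean() for k in range(num_balls)])
    return SVR(**svr_kwargs).fit(centers, targets)   # QP size now governed by num_balls
```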
https://arxiv.org/abs/2503.10539
High-quality test items are essential for educational assessments, particularly within Item Response Theory (IRT). Traditional validation methods rely on resource-intensive pilot testing to estimate item difficulty and discrimination. More recently, Item-Writing Flaw (IWF) rubrics have emerged as a domain-general approach for evaluating test items based on textual features. However, their relationship to IRT parameters remains underexplored. To address this gap, we conducted a study involving over 7,000 multiple-choice questions across various STEM subjects (e.g., math and biology). Using an automated approach, we annotated each question with a 19-criteria IWF rubric and studied relationships to data-driven IRT parameters. Our analysis revealed statistically significant links between the number of IWFs and IRT difficulty and discrimination parameters, particularly in the life and physical science domains. We further observed how specific IWF criteria can impact item quality more or less severely (e.g., negative wording vs. implausible distractors). Overall, while IWFs are useful for predicting IRT parameters -- particularly for screening low-difficulty MCQs -- they cannot replace traditional data-driven validation methods. Our findings highlight the need for further research on domain-general evaluation rubrics and on algorithms that understand domain-specific content for robust item validation.
https://arxiv.org/abs/2503.10533
In this study, we present an approach for efficient spatiotemporal feature extraction using MobileNetV4 and a multi-scale 3D MLP-Mixer-based temporal aggregation module. MobileNetV4, with its Universal Inverted Bottleneck (UIB) blocks, serves as the backbone for extracting hierarchical feature representations from input image sequences, ensuring both computational efficiency and rich semantic encoding. To capture temporal dependencies, we introduce a three-level MLP-Mixer module, which processes spatial features at multiple resolutions while maintaining structural integrity. Experimental results on the ABAW 8th competition demonstrate the effectiveness of our approach, showing promising performance in affective behavior analysis. By integrating an efficient vision backbone with a structured temporal modeling mechanism, the proposed framework achieves a balance between computational efficiency and predictive accuracy, making it well-suited for real-time applications in mobile and embedded computing environments.
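One level of the temporal aggregation module can be sketched as a standard MLP-Mixer block applied along the time axis (token mixing over frames, then channel mixing); stacking three such levels at different temporal resolutions is an assumed reading of the multi-scale design, not the exact architecture.

```python
import torch
import torch.nn as nn

class TemporalMixerBlock(nn.Module):
    """MLP-Mixer-style block over a sequence of frame features:
    mix across time, then across channels, with residual connections."""
    def __init__(self, num_frames, dim, hidden=256):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.time_mlp = nn.Sequential(nn.Linear(num_frames, hidden), nn.GELU(),
                                      nn.Linear(hidden, num_frames))
        self.chan_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                      nn.Linear(hidden, dim))

    def forward(self, x):                          # x: (batch, frames, dim)
        x = x + self.time_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.chan_mlp(self.norm2(x))
```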
https://arxiv.org/abs/2503.10530
3D Multimodal Large Language Models (MLLMs) have recently made substantial advancements. However, their potential remains untapped, primarily due to the limited quantity and suboptimal quality of 3D datasets. Current approaches attempt to transfer knowledge from 2D MLLMs to expand 3D instruction data, but still face modality and domain gaps. To this end, we introduce PiSA-Engine (Point-Self-Augmented-Engine), a new framework for generating instruction point-language datasets enriched with 3D spatial semantics. We observe that existing 3D MLLMs offer a comprehensive understanding of point clouds for annotation, while 2D MLLMs excel at cross-validation by providing complementary information. By integrating holistic 2D and 3D insights from off-the-shelf MLLMs, PiSA-Engine enables a continuous cycle of high-quality data generation. We select PointLLM as the baseline and adopt this co-evolution training framework to develop an enhanced 3D MLLM, termed PointLLM-PiSA. Additionally, we identify limitations in previous 3D benchmarks, which often feature coarse language captions and insufficient category diversity, resulting in inaccurate evaluations. To address this gap, we further introduce PiSA-Bench, a comprehensive 3D benchmark covering six key aspects with detailed and diverse labels. Experimental results demonstrate PointLLM-PiSA's state-of-the-art performance in zero-shot 3D object captioning and generative classification on our PiSA-Bench, achieving significant improvements of 46.45% (+8.33%) and 63.75% (+16.25%), respectively. We will release the code, datasets, and benchmark.
https://arxiv.org/abs/2503.10529