Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at this https URL.
Current image generation and editing methods mainly treat textual prompts as direct inputs, without reasoning about visual composition or explicit operations. We propose Generation Chain-of-Thought (GoT), a new paradigm that enables generation and editing through an explicit language reasoning process before an image is produced. This approach turns conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets with over 9M samples, each carrying a detailed reasoning chain that captures semantic-spatial relationships. To exploit the advantages of GoT, we implement a unified framework that uses Qwen2.5-VL for reasoning-chain generation together with an end-to-end diffusion model enhanced by our newly proposed Semantic-Spatial Guidance Module. Experiments show that the GoT framework performs strongly on both image generation and editing tasks and clearly outperforms baseline methods. In addition, our approach supports interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT opens a new direction for reasoning-driven visual generation and editing, producing images that better match human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at this https URL.
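For readers who want a concrete picture of the reason-then-generate pattern, the sketch below chains a reasoning model into a diffusion pipeline. It is only a rough approximation under stated assumptions: the model identifiers are stand-ins, and the reasoning chain is naively concatenated into the prompt rather than injected through the paper's Semantic-Spatial Guidance Module.

```python
from transformers import pipeline
from diffusers import StableDiffusionPipeline
import torch

# Stand-in models: the paper uses Qwen2.5-VL plus a diffusion model with a
# Semantic-Spatial Guidance Module; here we only mimic the two-stage flow.
reasoner = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct",
                    torch_dtype=torch.bfloat16, device_map="auto")
painter = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16).to("cuda")

user_prompt = "a red mug to the left of a silver laptop on a wooden desk"

# Stage 1: produce an explicit semantic-spatial reasoning chain before generating.
chain = reasoner(
    "List the objects, their attributes, and their spatial layout for this scene: "
    + user_prompt,
    max_new_tokens=96, return_full_text=False)[0]["generated_text"]

# Stage 2: condition generation on the chain. Naive concatenation is used here
# (and will be truncated by the CLIP text encoder); the paper instead feeds the
# chain through its guidance module.
image = painter(prompt=f"{user_prompt}. {chain}").images[0]
image.save("got_style_sketch.png")
```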
https://arxiv.org/abs/2503.10639
Classifier-free guidance has become a staple for conditional generation with denoising diffusion models. However, a comprehensive understanding of classifier-free guidance is still missing. In this work, we carry out an empirical study to provide a fresh perspective on classifier-free guidance. Concretely, instead of solely focusing on classifier-free guidance, we trace back to the root, i.e., classifier guidance, pinpoint the key assumption for the derivation, and conduct a systematic study to understand the role of the classifier. We find that both classifier guidance and classifier-free guidance achieve conditional generation by pushing the denoising diffusion trajectories away from decision boundaries, i.e., areas where conditional information is usually entangled and is hard to learn. Based on this classifier-centric understanding, we propose a generic postprocessing step built upon flow-matching to shrink the gap between the learned distribution for a pre-trained denoising diffusion model and the real data distribution, majorly around the decision boundaries. Experiments on various datasets verify the effectiveness of the proposed approach.
Classifier-free guidance has become the standard approach for conditional generation with denoising diffusion models. However, a comprehensive understanding of classifier-free guidance is still missing. In this work, we carry out an empirical study to offer a fresh perspective on classifier-free guidance. Specifically, rather than focusing solely on classifier-free guidance, we trace back to its root, classifier guidance, pinpoint the key assumption behind the derivation, and systematically study the role of the classifier. We find that both classifier guidance and classifier-free guidance achieve conditional generation by pushing the denoising diffusion trajectories away from decision boundaries, i.e., regions where conditional information is usually entangled and hard to learn. Based on this classifier-centric understanding, we propose a generic post-processing step built on flow matching that shrinks the gap between the distribution learned by a pre-trained denoising diffusion model and the real data distribution, mainly around the decision boundaries. Experimental results verify the effectiveness of the proposed approach.
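For reference, the guidance rule the study revisits combines a conditional and an unconditional noise prediction. The sketch below shows the standard classifier-free guidance update (not the paper's flow-matching post-processing step), assuming `eps_model(x_t, t, cond)` returns the model's noise prediction and that `cond=None` yields the unconditional branch.

```python
import torch

@torch.no_grad()
def cfg_noise(eps_model, x_t, t, cond, guidance_scale=5.0):
    eps_uncond = eps_model(x_t, t, None)   # unconditional prediction
    eps_cond = eps_model(x_t, t, cond)     # conditional prediction
    # Extrapolate in the conditional direction: this is what pushes the sampling
    # trajectory away from regions where the condition is ambiguous
    # (the "decision boundary" picture used in the paper).
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```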
https://arxiv.org/abs/2503.10638
Distilled diffusion models suffer from a critical limitation: reduced sample diversity compared to their base counterparts. In this work, we uncover that despite this diversity loss, distilled models retain the fundamental concept representations of base models. We demonstrate control distillation - where control mechanisms like Concept Sliders and LoRAs trained on base models can be seamlessly transferred to distilled models and vice-versa, effectively distilling control without any retraining. This preservation of representational structure prompted our investigation into the mechanisms of diversity collapse during distillation. To understand how distillation affects diversity, we introduce Diffusion Target (DT) Visualization, an analysis and debugging tool that reveals how models predict final outputs at intermediate steps. Through DT-Visualization, we identify generation artifacts, inconsistencies, and demonstrate that initial diffusion timesteps disproportionately determine output diversity, while later steps primarily refine details. Based on these insights, we introduce diversity distillation - a hybrid inference approach that strategically employs the base model for only the first critical timestep before transitioning to the efficient distilled model. Our experiments demonstrate that this simple modification not only restores the diversity capabilities from base to distilled models but surprisingly exceeds it, while maintaining nearly the computational efficiency of distilled inference, all without requiring additional training or model modifications. Our code and data are available at this https URL
Distilled diffusion models suffer from a key limitation: reduced sample diversity compared with their base counterparts. In this work, we find that despite this loss of diversity, distilled models retain the fundamental concept representations of the base models. We demonstrate control distillation: control mechanisms trained on base models, such as Concept Sliders and LoRAs, can be transferred seamlessly to distilled models and vice versa, effectively distilling control without any retraining. This preservation of representational structure prompted us to investigate the mechanisms behind diversity collapse during distillation. To understand how distillation affects diversity, we introduce Diffusion Target (DT) Visualization, an analysis and debugging tool that reveals how models predict final outputs at intermediate steps. Using DT-Visualization, we identify generation artifacts and inconsistencies and show that the initial diffusion timesteps disproportionately determine output diversity, while later steps mainly refine details. Based on these insights, we propose diversity distillation, a hybrid inference approach that strategically uses the base model only for the first critical timestep before switching to the efficient distilled model. Our experiments show that this simple modification not only restores the diversity of the base model in the distilled model but surprisingly exceeds it, while nearly preserving the computational efficiency of distilled inference, all without additional training or model modifications. Code and data are available at this https URL.
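A minimal sketch of the hybrid inference idea is given below. It assumes both models expose the same per-step denoising interface and share a timestep schedule, which is a simplification; only the switch after the first critical step(s) reflects the description above.

```python
import torch

@torch.no_grad()
def hybrid_sample(base_step, distilled_step, x_T, timesteps, n_base_steps=1):
    """base_step / distilled_step: callables (x_t, t) -> x_{t-1}."""
    x = x_T
    for i, t in enumerate(timesteps):
        if i < n_base_steps:
            # The first, critical timesteps set global layout and diversity,
            # so they are run with the slower base model ...
            x = base_step(x, t)
        else:
            # ... while the remaining steps, which mostly refine detail,
            # use the cheap distilled model.
            x = distilled_step(x, t)
    return x
```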
https://arxiv.org/abs/2503.10637
Minibatch optimal transport coupling straightens paths in unconditional flow matching. This leads to computationally less demanding inference as fewer integration steps and less complex numerical solvers can be employed when numerically solving an ordinary differential equation at test time. However, in the conditional setting, minibatch optimal transport falls short. This is because the default optimal transport mapping disregards conditions, resulting in a conditionally skewed prior distribution during training. In contrast, at test time, we have no access to the skewed prior, and instead sample from the full, unbiased prior distribution. This gap between training and testing leads to a subpar performance. To bridge this gap, we propose conditional optimal transport C^2OT that adds a conditional weighting term in the cost matrix when computing the optimal transport assignment. Experiments demonstrate that this simple fix works with both discrete and continuous conditions in 8gaussians-to-moons, CIFAR-10, ImageNet-32x32, and ImageNet-256x256. Our method performs better overall compared to the existing baselines across different function evaluation budgets. Code is available at this https URL
Minibatch optimal transport coupling straightens paths in unconditional flow matching, which reduces the computational demands of inference: fewer integration steps and simpler numerical solvers can be used when numerically solving an ordinary differential equation at test time. In the conditional setting, however, minibatch optimal transport falls short. This is because the default optimal transport mapping ignores conditions, producing a conditionally skewed prior distribution during training. At test time, in contrast, we have no access to this skewed prior and instead sample from the full, unbiased prior distribution. This gap between training and testing leads to subpar performance. To bridge it, we propose conditional optimal transport (C^2OT), which adds a condition-dependent weighting term to the cost matrix when computing the optimal transport assignment. Experiments show that this simple fix works for both discrete and continuous conditions on 8gaussians-to-moons, CIFAR-10, ImageNet-32x32, and ImageNet-256x256. Compared with existing baselines, our method performs better overall across different function-evaluation budgets. Code is available at this https URL.
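The sketch below shows one plausible way to add such a condition-dependent weighting to a minibatch OT pairing, assuming discrete conditions and that each noise sample carries an independently drawn condition label; the exact weighting used by C^2OT may differ.

```python
from scipy.optimize import linear_sum_assignment

def conditional_ot_pairing(x0, x1, c0, c1, mismatch_weight=1e3):
    """x0: (B, D) prior samples, x1: (B, D) data samples (NumPy arrays),
    c0 / c1: (B,) integer condition labels attached to each side."""
    # Standard squared-Euclidean transport cost between prior and data samples.
    cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)
    # Condition-dependent weighting: discourage pairing samples whose conditions
    # differ, so the training-time coupling no longer skews the prior per condition.
    cost = cost + mismatch_weight * (c0[:, None] != c1[None, :])
    rows, cols = linear_sum_assignment(cost)
    return rows, cols  # pair x0[rows[k]] with x1[cols[k]] for flow-matching training
```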
https://arxiv.org/abs/2503.10636
This paper introduces V$^2$Edit, a novel training-free framework for instruction-guided video and 3D scene editing. Addressing the critical challenge of balancing original content preservation with editing task fulfillment, our approach employs a progressive strategy that decomposes complex editing tasks into a sequence of simpler subtasks. Each subtask is controlled through three key synergistic mechanisms: the initial noise, noise added at each denoising step, and cross-attention maps between text prompts and video content. This ensures robust preservation of original video elements while effectively applying the desired edits. Beyond its native video editing capability, we extend V$^2$Edit to 3D scene editing via a "render-edit-reconstruct" process, enabling high-quality, 3D-consistent edits even for tasks involving substantial geometric changes such as object insertion. Extensive experiments demonstrate that our V$^2$Edit achieves high-quality and successful edits across various challenging video editing tasks and complex 3D scene editing tasks, thereby establishing state-of-the-art performance in both domains.
This paper introduces V$^2$Edit, a novel training-free framework for instruction-guided video and 3D scene editing. To address the key challenge of balancing preservation of the original content against fulfillment of the editing task, our approach adopts a progressive strategy that decomposes complex editing tasks into a sequence of simpler subtasks. Each subtask is controlled through three key synergistic mechanisms: the initial noise, the noise added at each denoising step, and the cross-attention maps between text prompts and video content. This ensures robust preservation of the original video elements while effectively applying the desired edits. Beyond its native video-editing capability, we extend V$^2$Edit to 3D scene editing through a "render-edit-reconstruct" process, enabling high-quality, 3D-consistent edits even for tasks involving substantial geometric changes such as object insertion. Extensive experiments show that V$^2$Edit achieves high-quality, successful edits across a range of challenging video-editing tasks and complex 3D scene-editing tasks, establishing state-of-the-art performance in both domains.
https://arxiv.org/abs/2503.10634
Despite promising performance on open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against black-box commercial LVLMs. Analyzing failed adversarial perturbations reveals that the learned perturbations typically originate from a uniform distribution and lack clear semantic details, resulting in unintended responses. This critical absence of semantic information leads commercial LVLMs to either ignore the perturbation entirely or misinterpret its embedded semantics, thereby causing the attack to fail. To overcome these issues, we notice that identifying core semantic objects is a key objective for models trained with various datasets and methodologies. This insight motivates our approach that refines semantic clarity by encoding explicit semantic details within local regions, thus ensuring interoperability and capturing finer-grained features, and by concentrating modifications on semantically rich areas rather than applying them uniformly. To achieve this, we propose a simple yet highly effective solution: at each optimization step, the adversarial image is cropped randomly by a controlled aspect ratio and scale, resized, and then aligned with the target image in the embedding space. Experimental results confirm our hypothesis. Our adversarial examples crafted with local-aggregated perturbations focused on crucial regions exhibit surprisingly good transferability to commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and even reasoning models like o1, Claude-3.7-thinking and Gemini-2.0-flash-thinking. Our approach achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly outperforming all prior state-of-the-art attack methods. Our optimized adversarial examples under different configurations and training code are available at this https URL.
Despite strong performance against open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against black-box commercial LVLMs. Analyzing failed adversarial perturbations reveals that the learned perturbations typically originate from a uniform distribution and lack clear semantic details, leading to unintended responses. This critical absence of semantic information causes commercial LVLMs either to ignore the perturbation entirely or to misinterpret its embedded semantics, and the attack therefore fails. To address these issues, we observe that identifying core semantic objects is a key objective for models trained on diverse datasets and methodologies. This insight motivates an approach that refines semantic clarity by encoding explicit semantic details within local regions, ensuring interoperability and capturing finer-grained features, and by concentrating modifications on semantically rich areas rather than applying them uniformly. To this end, we propose a simple yet highly effective solution: at each optimization step, the adversarial image is randomly cropped under a controlled aspect ratio and scale, resized, and then aligned with the target image in the embedding space. Experimental results confirm our hypothesis. Adversarial examples crafted with local-aggregated perturbations focused on crucial regions show surprisingly good transferability to commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and even reasoning models such as o1, Claude-3.7-thinking, and Gemini-2.0-flash-thinking. Our approach achieves success rates above 90% on GPT-4.5, 4o, and o1, significantly outperforming all prior state-of-the-art attack methods. Our optimized adversarial examples under different configurations and the training code are available at this https URL.
https://arxiv.org/abs/2503.10635
As there are now millions of publicly available neural networks, searching and analyzing large model repositories becomes increasingly important. Navigating so many models requires an atlas, but as most models are poorly documented charting such an atlas is challenging. To explore the hidden potential of model repositories, we chart a preliminary atlas representing the documented fraction of Hugging Face. It provides stunning visualizations of the model landscape and evolution. We demonstrate several applications of this atlas including predicting model attributes (e.g., accuracy), and analyzing trends in computer vision models. However, as the current atlas remains incomplete, we propose a method for charting undocumented regions. Specifically, we identify high-confidence structural priors based on dominant real-world model training practices. Leveraging these priors, our approach enables accurate mapping of previously undocumented areas of the atlas. We publicly release our datasets, code, and interactive atlas.
With millions of publicly available neural networks now in existence, searching and analyzing large model repositories is increasingly important. Navigating so many models requires an atlas, but because most models are poorly documented, charting such an atlas is challenging. To explore the hidden potential of model repositories, we chart a preliminary atlas covering the documented fraction of Hugging Face. It provides striking visualizations of the model landscape and its evolution, and we demonstrate several applications of this atlas, including predicting model attributes (e.g., accuracy) and analyzing trends in computer vision models. However, because the current atlas remains incomplete, we propose a method for charting its undocumented regions. Specifically, we identify high-confidence structural priors based on dominant real-world model-training practices. Leveraging these priors, our approach enables accurate mapping of previously undocumented areas of the atlas. We publicly release our datasets, code, and interactive atlas.
https://arxiv.org/abs/2503.10633
Kolmogorov-Arnold networks (KANs) are a remarkable innovation consisting of learnable activation functions with the potential to capture more complex relationships from data. Although KANs are useful in finding symbolic representations and continual learning of one-dimensional functions, their effectiveness in diverse machine learning (ML) tasks, such as vision, remains questionable. Presently, KANs are deployed by replacing multilayer perceptrons (MLPs) in deep network architectures, including advanced architectures such as vision Transformers (ViTs). In this paper, we are the first to design a general learnable Kolmogorov-Arnold Attention (KArAt) for vanilla ViTs that can operate on any choice of basis. However, the computing and memory costs of training them motivated us to propose a more modular version, and we designed particular learnable attention, called Fourier-KArAt. Fourier-KArAt and its variants either outperform their ViT counterparts or show comparable performance on CIFAR-10, CIFAR-100, and ImageNet-1K datasets. We dissect these architectures' performance and generalization capacity by analyzing their loss landscapes, weight distributions, optimizer path, attention visualization, and spectral behavior, and contrast them with vanilla ViTs. The goal of this paper is not to produce parameter- and compute-efficient attention, but to encourage the community to explore KANs in conjunction with more advanced architectures that require a careful understanding of learnable activations. Our open-source code and implementation details are available on: this https URL
Kolmogorov-Arnold networks (KANs) are a remarkable innovation built on learnable activation functions, with the potential to capture more complex relationships from data. Although KANs are useful for finding symbolic representations and for continual learning of one-dimensional functions, their effectiveness on diverse machine learning (ML) tasks, such as vision, remains questionable. At present, KANs are deployed by replacing multilayer perceptrons (MLPs) in deep network architectures, including advanced architectures such as vision Transformers (ViTs). In this paper, we are the first to design a general learnable Kolmogorov-Arnold Attention (KArAt) for vanilla ViTs that can operate on any choice of basis. However, the computing and memory costs of training it motivated a more modular version, and we designed a particular learnable attention called Fourier-KArAt. Fourier-KArAt and its variants either outperform their ViT counterparts or show comparable performance on CIFAR-10, CIFAR-100, and ImageNet-1K. We dissect the performance and generalization capacity of these architectures by analyzing their loss landscapes, weight distributions, optimizer paths, attention visualizations, and spectral behavior, and contrast them with vanilla ViTs. The goal of this paper is not to produce parameter- and compute-efficient attention, but to encourage the community to explore KANs in conjunction with more advanced architectures that require a careful understanding of learnable activations. Our open-source code and implementation details are available at this https URL.
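To make the idea of a learnable, basis-parameterized attention unit concrete, the sketch below implements an element-wise Fourier-series activation and applies it to attention logits. The number of frequencies, the initialization, and the exact placement inside the attention block are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class FourierActivation(nn.Module):
    """phi(x) = sum_k a_k * cos(k x) + b_k * sin(k x), applied element-wise."""
    def __init__(self, num_frequencies: int = 5):
        super().__init__()
        self.register_buffer("freqs", torch.arange(1, num_frequencies + 1).float())
        self.a = nn.Parameter(torch.zeros(num_frequencies))
        self.b = nn.Parameter(torch.randn(num_frequencies) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        kx = x.unsqueeze(-1) * self.freqs          # add a frequency axis: (..., K)
        return (self.a * torch.cos(kx) + self.b * torch.sin(kx)).sum(-1)

# Illustrative use: pass attention logits through the learnable unit before the
# usual softmax normalization.
attn_logits = torch.randn(2, 8, 16, 16)            # (batch, heads, queries, keys)
phi = FourierActivation(num_frequencies=5)
attn_weights = torch.softmax(phi(attn_logits), dim=-1)
```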
https://arxiv.org/abs/2503.10632
Recent advancements in vision-language models (VLMs) for common-sense reasoning have led to the development of vision-language-action (VLA) models, enabling robots to perform generalized manipulation. Although existing autoregressive VLA methods leverage large-scale pretrained knowledge, they disrupt the continuity of actions. Meanwhile, some VLA methods incorporate an additional diffusion head to predict continuous actions, relying solely on VLM-extracted features, which limits their reasoning capabilities. In this paper, we introduce HybridVLA, a unified framework that seamlessly integrates the strengths of both autoregressive and diffusion policies within a single large language model, rather than simply connecting them. To bridge the generation gap, a collaborative training recipe is proposed that injects the diffusion modeling directly into the next-token prediction. With this recipe, we find that these two forms of action prediction not only reinforce each other but also exhibit varying performance across different tasks. Therefore, we design a collaborative action ensemble mechanism that adaptively fuses these two predictions, leading to more robust control. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across various simulation and real-world tasks, including both single-arm and dual-arm robots, while demonstrating stable manipulation in previously unseen configurations.
Recent advances in vision-language models (VLMs) for common-sense reasoning have driven the development of vision-language-action (VLA) models, enabling robots to perform generalized manipulation. Although existing autoregressive VLA methods leverage large-scale pretrained knowledge, they disrupt the continuity of actions. Meanwhile, some VLA methods add a diffusion head to predict continuous actions, but they rely solely on VLM-extracted features, which limits their reasoning capability. In this paper, we introduce HybridVLA, a unified framework that seamlessly integrates the strengths of autoregressive and diffusion policies within a single large language model, rather than simply connecting them. To bridge the generation gap, we propose a collaborative training recipe that injects diffusion modeling directly into next-token prediction. With this recipe, we find that the two forms of action prediction not only reinforce each other but also perform differently across tasks. We therefore design a collaborative action ensemble mechanism that adaptively fuses the two predictions, leading to more robust control. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across a variety of simulation and real-world tasks, including both single-arm and dual-arm robots, while demonstrating stable manipulation in previously unseen configurations.
https://arxiv.org/abs/2503.10631
In this paper, we propose a general framework for universal zero-shot goal-oriented navigation. Existing zero-shot methods build inference frameworks upon large language models (LLMs) for specific tasks; these frameworks differ considerably in their overall pipelines and fail to generalize across different types of goals. Towards the aim of universal zero-shot navigation, we propose a uniform graph representation to unify different goals, including object category, instance image, and text description. We also convert the agent's observations into an online-maintained scene graph. With this consistent scene and goal representation, we preserve most structural information compared with pure text and are able to leverage an LLM for explicit graph-based reasoning. Specifically, we conduct graph matching between the scene graph and the goal graph at each time instant and propose different strategies to generate the long-term exploration goal according to the matching state. The agent first iteratively searches for a subgraph of the goal when zero-matched. With partial matching, the agent then utilizes coordinate projection and anchor-pair alignment to infer the goal location. Finally, scene graph correction and goal verification are applied for perfect matching. We also present a blacklist mechanism to enable robust switching between stages. Extensive experiments on several benchmarks show that our UniGoal achieves state-of-the-art zero-shot performance on three studied navigation tasks with a single model, even outperforming task-specific zero-shot methods and supervised universal methods.
In this paper, we propose a general framework for universal zero-shot goal-oriented navigation. Existing zero-shot methods build task-specific inference frameworks on top of large language models (LLMs); they differ considerably in their overall pipelines and fail to generalize across different types of goals. Toward universal zero-shot navigation, we propose a uniform graph representation that unifies different goals, including object categories, instance images, and text descriptions. We also convert the agent's observations into an online-maintained scene graph. With this consistent scene and goal representation, we preserve more structural information than pure text and can leverage an LLM for explicit graph-based reasoning. Specifically, we perform graph matching between the scene graph and the goal graph at each time step and propose different strategies for generating a long-term exploration goal according to the matching state. When there is no match, the agent iteratively searches for subgraphs of the goal; under partial matching, it uses coordinate projection and anchor-pair alignment to infer the goal location; and under perfect matching, scene-graph correction and goal verification are applied. We also present a blacklist mechanism to enable robust switching between stages. Extensive experiments on several benchmarks show that UniGoal achieves state-of-the-art zero-shot performance on the three studied navigation tasks with a single model, even outperforming task-specific zero-shot methods and supervised universal methods.
https://arxiv.org/abs/2503.10630
Adversarial attacks pose significant challenges for vision models in critical fields like healthcare, where reliability is essential. Although adversarial training has been well studied in natural images, its application to biomedical and microscopy data remains limited. Existing self-supervised adversarial training methods overlook the hierarchical structure of histopathology images, where patient-slide-patch relationships provide valuable discriminative signals. To address this, we propose Hierarchical Self-Supervised Adversarial Training (HSAT), which exploits these properties to craft adversarial examples using multi-level contrastive learning and integrate it into adversarial training for enhanced robustness. We evaluate HSAT on multiclass histopathology dataset OpenSRH and the results show that HSAT outperforms existing methods from both biomedical and natural image domains. HSAT enhances robustness, achieving an average gain of 54.31% in the white-box setting and reducing performance drops to 3-4% in the black-box setting, compared to 25-30% for the baseline. These results set a new benchmark for adversarial training in this domain, paving the way for more robust models. Our Code for training and evaluation is available at this https URL.
Adversarial attacks pose significant challenges for vision models in critical fields such as healthcare, where reliability is essential. Although adversarial training has been studied extensively on natural images, its application to biomedical and microscopy data remains limited. Existing self-supervised adversarial training methods overlook the hierarchical structure of histopathology images, in which patient-slide-patch relationships provide valuable discriminative signals. To address this, we propose Hierarchical Self-Supervised Adversarial Training (HSAT), which exploits these properties to craft adversarial examples with multi-level contrastive learning and integrates this into adversarial training for improved robustness. We evaluate HSAT on the multiclass histopathology dataset OpenSRH, and the results show that HSAT outperforms existing methods from both the biomedical and natural image domains. HSAT enhances robustness, achieving an average gain of 54.31% in the white-box setting and reducing the performance drop to 3-4% in the black-box setting, compared with 25-30% for the baseline. These results set a new benchmark for adversarial training in this domain and pave the way for more robust models. Our training and evaluation code is available at this https URL.
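The sketch below illustrates a hierarchy-aware contrastive objective in this spirit, with positives defined at the slide and patient levels; the patch-level (augmented-view) term, the adversarial-example crafting step that maximizes this loss under a perturbation budget, and the level weights and temperature are simplified or omitted here.

```python
import torch
import torch.nn.functional as F

def level_contrastive(z, same_group, temperature=0.1):
    """Supervised-contrastive-style loss where same_group[i, j] marks positives."""
    z = F.normalize(z, dim=-1)
    logits = z @ z.t() / temperature
    diag = torch.eye(len(z), dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(diag, -1e9)          # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos = same_group.float().masked_fill(diag, 0.0)
    return -((pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)).mean()

def hierarchical_contrastive_loss(z, slide_ids, patient_ids,
                                  w_slide=1.0, w_patient=0.5):
    # Positives at two levels of the patient-slide-patch hierarchy.
    slide_pos = slide_ids[:, None] == slide_ids[None, :]
    patient_pos = patient_ids[:, None] == patient_ids[None, :]
    return (w_slide * level_contrastive(z, slide_pos)
            + w_patient * level_contrastive(z, patient_pos))
```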
https://arxiv.org/abs/2503.10629
Expressing confidence is challenging for embodied agents navigating dynamic multimodal environments, where uncertainty arises from both perception and decision-making processes. We present the first work investigating embodied confidence elicitation in open-ended multimodal environments. We introduce Elicitation Policies, which structure confidence assessment across inductive, deductive, and abductive reasoning, along with Execution Policies, which enhance confidence calibration through scenario reinterpretation, action sampling, and hypothetical reasoning. Evaluating agents in calibration and failure prediction tasks within the Minecraft environment, we show that structured reasoning approaches, such as Chain-of-Thoughts, improve confidence calibration. However, our findings also reveal persistent challenges in distinguishing uncertainty, particularly under abductive settings, underscoring the need for more sophisticated embodied confidence elicitation methods.
Expressing confidence is challenging for embodied agents navigating dynamic multimodal environments, where uncertainty arises from both perception and decision-making. We present the first work to investigate embodied confidence elicitation in open-ended multimodal environments. We introduce Elicitation Policies, which structure confidence assessment across inductive, deductive, and abductive reasoning, together with Execution Policies, which improve confidence calibration through scenario reinterpretation, action sampling, and hypothetical reasoning. Evaluating agents on calibration and failure-prediction tasks in the Minecraft environment, we show that structured reasoning approaches such as Chain-of-Thought improve confidence calibration. However, our findings also reveal persistent challenges in distinguishing uncertainty, particularly under abductive settings, underscoring the need for more sophisticated embodied confidence elicitation methods.
https://arxiv.org/abs/2503.10628
The rapid advancement of Large Multi-modal Models (LMMs) has enabled their application in scientific problem-solving, yet their fine-grained capabilities remain under-explored. In this paper, we introduce SciVerse, a multi-modal scientific evaluation benchmark to thoroughly assess LMMs across 5,735 test instances in five distinct versions. We aim to investigate three key dimensions of LMMs: scientific knowledge comprehension, multi-modal content interpretation, and Chain-of-Thought (CoT) reasoning. To unveil whether LMMs possess sufficient scientific expertise, we first transform each problem into three versions containing different levels of knowledge required for solving, i.e., Knowledge-free, -lite, and -rich. Then, to explore how LMMs interpret multi-modal scientific content, we annotate another two versions, i.e., Vision-rich and -only, marking more question information from texts to diagrams. Comparing the results of different versions, SciVerse systematically examines the professional knowledge stock and visual perception skills of LMMs in scientific domains. In addition, to rigorously assess CoT reasoning, we propose a new scientific CoT evaluation strategy, conducting a step-wise assessment on knowledge and logical errors in model outputs. Our extensive evaluation of different LMMs on SciVerse reveals critical limitations in their scientific proficiency and provides new insights into future developments. Project page: this https URL
The rapid advancement of Large Multi-modal Models (LMMs) has made their application to scientific problem solving possible, yet their fine-grained capabilities remain under-explored. This paper introduces SciVerse, a multi-modal scientific evaluation benchmark for thoroughly assessing LMMs across 5,735 test instances in five distinct versions. We aim to investigate three key dimensions of LMMs: scientific knowledge comprehension, multi-modal content interpretation, and Chain-of-Thought (CoT) reasoning. To reveal whether LMMs possess sufficient scientific expertise, we first transform each problem into three versions that require different levels of knowledge to solve, i.e., Knowledge-free, -lite, and -rich. Then, to explore how LMMs interpret multi-modal scientific content, we annotate two further versions, i.e., Vision-rich and -only, in which more of the question information is moved from text into diagrams. By comparing results across the different versions, SciVerse systematically examines the professional knowledge stock and visual perception skills of LMMs in scientific domains. In addition, to rigorously assess CoT reasoning, we propose a new scientific CoT evaluation strategy that performs a step-wise assessment of knowledge and logical errors in model outputs. Our extensive evaluation of different LMMs on SciVerse reveals critical limitations in their scientific proficiency and provides new insights for future development. Project page: this https URL
https://arxiv.org/abs/2503.10627
Acquiring physically plausible motor skills across diverse and unconventional morphologies-including humanoid robots, quadrupeds, and animals-is essential for advancing character simulation and robotics. Traditional methods, such as reinforcement learning (RL) are task- and body-specific, require extensive reward function engineering, and do not generalize well. Imitation learning offers an alternative but relies heavily on high-quality expert demonstrations, which are difficult to obtain for non-human morphologies. Video diffusion models, on the other hand, are capable of generating realistic videos of various morphologies, from humans to ants. Leveraging this capability, we propose a data-independent approach for skill acquisition that learns 3D motor skills from 2D-generated videos, with generalization capability to unconventional and non-human forms. Specifically, we guide the imitation learning process by leveraging vision transformers for video-based comparisons by calculating pair-wise distance between video embeddings. Along with video-encoding distance, we also use a computed similarity between segmented video frames as a guidance reward. We validate our method on locomotion tasks involving unique body configurations. In humanoid robot locomotion tasks, we demonstrate that 'No-data Imitation Learning' (NIL) outperforms baselines trained on 3D motion-capture data. Our results highlight the potential of leveraging generative video models for physically plausible skill learning with diverse morphologies, effectively replacing data collection with data generation for imitation learning.
Acquiring physically plausible motor skills across diverse and unconventional morphologies, including humanoid robots, quadrupeds, and animals, is essential for advancing character simulation and robotics. Traditional methods such as reinforcement learning (RL) are task- and body-specific, require extensive reward-function engineering, and generalize poorly. Imitation learning offers an alternative, but it relies heavily on high-quality expert demonstrations, which are hard to obtain for non-human morphologies. Video diffusion models, on the other hand, can generate realistic videos of a wide range of morphologies, from humans to ants. Leveraging this capability, we propose a data-independent approach to skill acquisition that learns 3D motor skills from 2D-generated videos and generalizes to unconventional and non-human forms. Specifically, we guide the imitation-learning process with vision transformers for video-based comparison, computing pair-wise distances between video embeddings. Alongside the video-encoding distance, we also use a computed similarity between segmented video frames as a guidance reward. We validate the method on locomotion tasks involving unique body configurations. On humanoid-robot locomotion tasks, we show that 'No-data Imitation Learning' (NIL) outperforms baselines trained on 3D motion-capture data. Our results highlight the potential of generative video models for physically plausible skill learning across diverse morphologies, effectively replacing data collection with data generation for imitation learning.
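A minimal sketch of such a guidance reward is shown below. Here `encode_video` and `segment` stand in for a pretrained vision-transformer video encoder and a segmentation model, and the weighting between the two terms is illustrative.

```python
import torch
import torch.nn.functional as F

def imitation_reward(agent_frames, generated_frames, encode_video, segment, w_seg=0.5):
    # Embedding term: distance between the agent's rollout video and the
    # diffusion-generated reference video in the encoder's embedding space.
    z_agent = encode_video(agent_frames)            # (D,)
    z_ref = encode_video(generated_frames)          # (D,)
    embed_sim = F.cosine_similarity(z_agent, z_ref, dim=-1)

    # Segmentation term: IoU between segmented frames of the two videos.
    m_agent, m_ref = segment(agent_frames), segment(generated_frames)  # (T, H, W) bool
    inter = (m_agent & m_ref).float().sum((-1, -2))
    union = (m_agent | m_ref).float().sum((-1, -2)).clamp(min=1)
    seg_sim = (inter / union).mean()

    return embed_sim + w_seg * seg_sim
```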
https://arxiv.org/abs/2503.10626
Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation. Recent advances in 3D human reconstruction mainly focus on static human modeling, and their reliance on synthetic 3D scans for training limits their generalization ability. Conversely, optimization-based video methods achieve higher fidelity but demand controlled capture conditions and computationally intensive refinement processes. Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feed-forward pass. Our model leverages a multimodal transformer architecture to effectively encode the human body positional features and image features with an attention mechanism, enabling detailed preservation of clothing geometry and texture. To further boost face identity preservation and fine detail recovery, we propose a head feature pyramid encoding scheme to aggregate multi-scale features of the head regions. Extensive experiments demonstrate that our LHM generates plausible animatable humans in seconds without post-processing for the face and hands, outperforming existing methods in both reconstruction accuracy and generalization ability.
Animatable 3D human reconstruction from a single image is a challenging problem because of the difficulty of decoupling geometry, appearance, and deformation. Recent advances in 3D human reconstruction mainly focus on static human modeling, and the reliance on synthetic 3D scans for training limits their generalization ability. In contrast, optimization-based video methods achieve higher fidelity but require controlled capture conditions and computationally intensive refinement. Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model), which infers high-fidelity avatars represented as 3D Gaussian splatting in a single feed-forward pass. Our model uses a multimodal transformer architecture to effectively encode human-body positional features and image features with an attention mechanism, enabling detailed preservation of clothing geometry and texture. To further improve face-identity preservation and fine-detail recovery, we propose a head feature pyramid encoding scheme that aggregates multi-scale features of the head region. Extensive experiments show that LHM generates plausible animatable humans in seconds without post-processing of the face and hands, outperforming existing methods in both reconstruction accuracy and generalization ability.
https://arxiv.org/abs/2503.10625
Fitting a body to a 3D clothed human point cloud is a common yet challenging task. Traditional optimization-based approaches use multi-stage pipelines that are sensitive to pose initialization, while recent learning-based methods often struggle with generalization across diverse poses and garment types. We propose Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline that estimates cloth-to-body surface mapping through locally approximate SE(3) equivariance, encoding tightness as displacement vectors from the cloth surface to the underlying body. Following this mapping, pose-invariant body features regress sparse body markers, simplifying clothed human fitting into an inner-body marker fitting task. Extensive experiments on CAPE and 4D-Dress show that ETCH significantly outperforms state-of-the-art methods -- both tightness-agnostic and tightness-aware -- in body fitting accuracy on loose clothing (16.7% ~ 69.5%) and shape accuracy (average 49.9%). Our equivariant tightness design can even reduce directional errors by (67.2% ~ 89.8%) in one-shot (or out-of-distribution) settings. Qualitative results demonstrate strong generalization of ETCH, regardless of challenging poses, unseen shapes, loose clothing, and non-rigid dynamics. We will release the code and models soon for research purposes at this https URL.
Fitting a body to a 3D clothed human point cloud is a common yet challenging task. Traditional optimization-based approaches use multi-stage pipelines that are sensitive to pose initialization, while recent learning-based methods often struggle to generalize across diverse poses and garment types. We propose Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline that estimates the cloth-to-body surface mapping through locally approximate SE(3) equivariance, encoding tightness as displacement vectors from the cloth surface to the underlying body. Following this mapping, pose-invariant body features regress sparse body markers, simplifying clothed-human fitting into an inner-body marker-fitting task. Extensive experiments on CAPE and 4D-Dress show that ETCH significantly outperforms state-of-the-art methods, both tightness-agnostic and tightness-aware, in body-fitting accuracy on loose clothing (16.7% ~ 69.5%) and in shape accuracy (average 49.9%). Our equivariant tightness design can even reduce directional errors by 67.2% ~ 89.8% in one-shot (or out-of-distribution) settings. Qualitative results demonstrate strong generalization of ETCH regardless of challenging poses, unseen shapes, loose clothing, and non-rigid dynamics. We will release the code and models soon for research purposes at this https URL.
https://arxiv.org/abs/2503.10624
Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, $S$-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.
Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work shows that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, S-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks and offer new insights into their role in deep networks.
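A minimal sketch of DyT as a module is given below; the learnable per-channel scale and shift and the initial value of alpha are reasonable assumptions here rather than a guaranteed reproduction of the paper's exact recipe.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # shared scalar
        self.gamma = nn.Parameter(torch.ones(dim))                # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))                # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # tanh(alpha * x) squashes activations with an S-shaped, LayerNorm-like curve
        # without computing any per-token statistics.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

# Usage: replace `nn.LayerNorm(dim)` in a Transformer block with `DyT(dim)`.
```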
https://arxiv.org/abs/2503.10622
While large multimodal models (LMMs) have demonstrated strong performance across various Visual Question Answering (VQA) tasks, certain challenges require complex multi-step reasoning to reach accurate answers. One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. Nevertheless, common VQA benchmarks often focus on the accuracy of the final answer while overlooking the reasoning process that enables the generation of accurate responses. Moreover, existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new dataset and benchmark specifically designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model that is fine-tuned on our reasoning dataset, demonstrating robust performance in complex driving scenarios. In addition, we benchmark various open-source and closed-source methods on our proposed dataset, systematically comparing their reasoning capabilities for autonomous driving tasks. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model. Our framework, dataset, and model are available at this https URL.
Although large multimodal models (LMMs) perform well on many Visual Question Answering (VQA) tasks, some challenges require complex multi-step reasoning to reach accurate answers. Autonomous driving is a particularly challenging task that demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. However, common VQA benchmarks tend to focus on the accuracy of the final answer while overlooking the reasoning process required to generate an accurate response, and existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new dataset and benchmark designed specifically to advance step-wise visual reasoning for autonomous driving. Our benchmark contains over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model fine-tuned on our reasoning dataset that performs robustly in complex driving scenarios. In addition, we benchmark a variety of open-source and closed-source methods on the proposed dataset, systematically comparing their reasoning capabilities for autonomous-driving tasks. Our model achieves a +7.49% gain in final-answer accuracy and a 3.62% improvement in reasoning score over the previous best open-source model. Our framework, dataset, and model are available at this https URL.
https://arxiv.org/abs/2503.10621
Large language models (LLMs) have shown remarkable performance and generalization capabilities across multiple languages and tasks, making them very attractive targets for multi-modality integration (e.g., images or speech). In this work, we extend an existing LLM to the speech modality via speech discretization and continued pre-training. In particular, we are interested in multilingual LLMs, such as TOWER, as their pre-training setting allows us to treat discretized speech input as an additional translation language. The resulting open-source model, SPIRE, is able to transcribe and translate English speech input while maintaining TOWER's original performance on translation-related tasks, showcasing that discretized speech input integration as an additional language is feasible during LLM adaptation. We make our code and models available to the community.
Large language models (LLMs) have shown remarkable performance and generalization across multiple languages and tasks, making them very attractive targets for multi-modality integration (e.g., images or speech). In this work, we extend an existing LLM to the speech modality via speech discretization and continued pre-training. We are particularly interested in multilingual LLMs such as TOWER, because their pre-training setting allows us to treat discretized speech input as an additional translation language. The resulting open-source model, SPIRE, can transcribe and translate English speech input while maintaining TOWER's original performance on translation-related tasks, showing that integrating discretized speech input as an additional language is feasible during LLM adaptation. We make our code and models available to the community.
https://arxiv.org/abs/2503.10620
We introduce Siege, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety through a tree search perspective. Unlike single-turn jailbreaks that rely on one meticulously engineered prompt, Siege expands the conversation at each turn in a breadth-first fashion, branching out multiple adversarial prompts that exploit partial compliance from previous responses. By tracking these incremental policy leaks and re-injecting them into subsequent queries, Siege reveals how minor concessions can accumulate into fully disallowed outputs. Evaluations on the JailbreakBench dataset show that Siege achieves a 100% success rate on GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries than baselines such as Crescendo or GOAT. This tree search methodology offers an in-depth view of how model safeguards degrade over successive dialogue turns, underscoring the urgency of robust multi-turn testing procedures for language models.
We introduce Siege, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety from a tree-search perspective. Unlike single-turn jailbreaks that rely on one meticulously engineered prompt, Siege expands the conversation at each turn in a breadth-first fashion, branching into multiple adversarial prompts that exploit partial compliance from previous responses. By tracking these incremental policy leaks and re-injecting them into subsequent queries, Siege shows how minor concessions can accumulate into fully disallowed outputs. Evaluations on the JailbreakBench dataset show that Siege achieves a 100% success rate on GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries than baselines such as Crescendo or GOAT. This tree-search methodology offers an in-depth view of how model safeguards degrade over successive dialogue turns, underscoring the urgency of robust multi-turn testing procedures for language models.
https://arxiv.org/abs/2503.10619