As Large Multimodal Models (LMMs) become more capable, there is growing interest in evaluating their reasoning processes alongside their final outputs. However, most benchmarks remain focused on English, overlooking languages with rich linguistic and cultural contexts, such as Arabic. To address this gap, we introduce the Comprehensive Arabic Multimodal Reasoning Benchmark (ARB), the first benchmark designed to evaluate step-by-step reasoning in Arabic across both textual and visual modalities. ARB spans 11 diverse domains, including visual reasoning, document understanding, OCR, scientific analysis, and cultural interpretation. It comprises 1,356 multimodal samples paired with 5,119 human-curated reasoning steps and corresponding actions. We evaluated 12 state-of-the-art open- and closed-source LMMs and found persistent challenges in coherence, faithfulness, and cultural grounding. ARB offers a structured framework for diagnosing multimodal reasoning in underrepresented languages and marks a critical step toward inclusive, transparent, and culturally aware AI systems. We release the benchmark, rubric, and evaluation suite to support future research and reproducibility. Code available at: this https URL
https://arxiv.org/abs/2505.17021
The advent of Large Multimodal Models (LMMs) has significantly enhanced the ability of Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long video sequences, the number of required tokens has grown significantly, leading to quadratically growing computational costs. This has made the efficient compression of video tokens in LMMs, while maintaining performance integrity, a pressing research challenge. In this paper, we introduce CrossLMM, which decouples long video sequences from LMMs via a dual cross-attention mechanism, substantially reducing the visual token count with minimal performance degradation. Specifically, we first implement a significant token reduction from pretrained visual encoders through a pooling methodology. Then, within the LLM layers, we employ a visual-to-visual cross-attention mechanism, wherein the pooled visual tokens function as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, in which the text tokens are enhanced through interaction with the original visual tokens, enriching the visual comprehension of the text tokens. Comprehensive empirical evaluation demonstrates that our approach achieves comparable or superior performance across diverse video-based LMM benchmarks, despite utilizing substantially fewer computational resources.
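A minimal PyTorch-style sketch of the visual-to-visual cross-attention idea described above, where pooled visual tokens query the full-resolution token set; the class name, pooling stride, and dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class VisualToVisualCrossAttention(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 8, pool_stride: int = 4):
        super().__init__()
        # Average pooling over the token axis produces the reduced query set.
        self.pool = nn.AvgPool1d(kernel_size=pool_stride, stride=pool_stride)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, dim) from a pretrained visual encoder.
        pooled = self.pool(visual_tokens.transpose(1, 2)).transpose(1, 2)
        # Pooled tokens attend to the original tokens, recovering fine-grained detail
        # while only the reduced set is passed on to the LLM.
        refined, _ = self.attn(query=pooled, key=visual_tokens, value=visual_tokens)
        return self.norm(pooled + refined)

tokens = torch.randn(2, 256, 1024)                    # e.g. 256 patch tokens per frame
print(VisualToVisualCrossAttention()(tokens).shape)   # torch.Size([2, 64, 1024])
```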
https://arxiv.org/abs/2505.17020
Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel at basic Visual Question Answering (VQA) tasks, they face a fundamental limitation on image implication tasks: contextual gaps that obscure the relationships between different visual elements and their abstract meanings. Inspired by the human cognitive process, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich, multi-level textual representations; (2) Search: iteratively searching and integrating cross-domain knowledge to resolve ambiguity; and (3) Reasoning: generating context-aligned image implications via explicit reasoning. Our framework with the lightweight GPT-4o-mini model achieves SOTA performance compared to 15+ MLLMs on the English image implication benchmark and a huge improvement on the Chinese benchmark, performing comparably with the GPT-4o model on Multiple-Choice Questions (MCQ) and outperforming it by 36.7% on Open-Style Questions (OSQ). Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the field of vision-language reasoning and human-AI interaction. Our project is publicly available at this https URL.
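A minimal orchestration sketch of a perception-search-reasoning pipeline in the spirit described above; the callables are placeholders that a real system would back with an MLLM, a retrieval API, and an LLM, and all names here are illustrative assumptions.

```python
from typing import Callable, List

def understand_implication(
    image_path: str,
    perceive: Callable[[str], str],           # image -> multi-level textual description
    search: Callable[[str], List[str]],       # query -> cross-domain knowledge snippets
    reason: Callable[[str, List[str]], str],  # (description, knowledge) -> implication
    max_rounds: int = 3,
) -> str:
    description = perceive(image_path)                       # stage 1: perception
    knowledge: List[str] = []
    for _ in range(max_rounds):                              # stage 2: iterative search
        new_facts = search(description + " " + " ".join(knowledge))
        if not new_facts:
            break
        knowledge.extend(new_facts)
    return reason(description, knowledge)                    # stage 3: explicit reasoning

# Toy stand-ins so the sketch runs end to end.
result = understand_implication(
    "poster.png",
    perceive=lambda p: f"description of {p}",
    search=lambda q: [],
    reason=lambda d, k: f"implication inferred from: {d}",
)
print(result)
```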
https://arxiv.org/abs/2505.17019
We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm that fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. Existing VLA training pipelines rely heavily on offline expert demonstration data and supervised imitation, limiting their ability to adapt to new tasks and environments under low-data regimes. RIPT-VLA addresses this by enabling interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. RIPT-VLA has the following characteristics. First, it applies to various VLA models, improving the lightweight QueST model by 21.2% and raising the 7B OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it is computationally efficient and data-efficient: with only one demonstration, RIPT-VLA enables an otherwise unworkable SFT model (4% success rate) to reach a 97% success rate within 15 iterations. Furthermore, we demonstrate that the policy learned by RIPT-VLA generalizes across different tasks and scenarios and is robust to the initial state context. These results highlight RIPT-VLA as a practical and effective paradigm for post-training VLA models with minimal supervision.
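A minimal sketch of leave-one-out advantage estimation over a group of rollouts with sparse binary success rewards (1 = success, 0 = failure); the grouping and baseline follow the generic leave-one-out idea, and the rest of RIPT-VLA's optimizer is not reproduced here.

```python
from typing import List

def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    k = len(rewards)
    if k < 2:
        return [0.0] * k  # a single rollout has no baseline to compare against
    total = sum(rewards)
    # Each rollout is compared to the mean reward of the *other* rollouts.
    return [r - (total - r) / (k - 1) for r in rewards]

print(leave_one_out_advantages([1, 0, 0, 1]))  # approx [0.667, -0.667, -0.667, 0.667]
```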
https://arxiv.org/abs/2505.17016
Learning latent motion from Internet videos is crucial for building generalist robots. However, existing discrete latent action methods suffer from information loss and struggle with complex and fine-grained dynamics. We propose CoMo, which aims to learn more informative continuous motion representations from diverse, internet-scale videos. CoMo employs an early temporal feature difference mechanism to prevent model collapse and suppress static appearance noise, effectively discouraging shortcut learning. Furthermore, guided by the information bottleneck principle, we constrain the latent motion embedding dimensionality to achieve a better balance between retaining sufficient action-relevant information and minimizing the inclusion of action-irrelevant appearance noise. Additionally, we introduce two new metrics for more robustly and affordably evaluating motion and guiding the development of motion learning methods: (i) the linear probing MSE of action prediction, and (ii) the cosine similarity between past-to-current and future-to-current motion embeddings. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate continuous pseudo actions for previously unseen video domains. This capability facilitates unified policy joint learning using pseudo actions derived from various action-less video datasets (such as cross-embodiment videos and, notably, human demonstration videos), potentially augmented with limited labeled robot data. Extensive experiments show that policies co-trained with CoMo pseudo actions achieve superior performance with both diffusion and autoregressive architectures in simulated and real-world settings.
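A minimal sketch of the two evaluation signals named above, assuming motion embeddings and ground-truth actions are available as arrays; array shapes and the toy data are assumptions for illustration only.

```python
import numpy as np
from numpy.linalg import lstsq, norm

def linear_probe_mse(motion_emb: np.ndarray, actions: np.ndarray) -> float:
    """Fit a linear map from motion embeddings to actions; lower MSE suggests the
    embedding retains more action-relevant information."""
    X = np.hstack([motion_emb, np.ones((len(motion_emb), 1))])  # add bias column
    W, *_ = lstsq(X, actions, rcond=None)
    return float(np.mean((X @ W - actions) ** 2))

def motion_cosine(past_to_now: np.ndarray, future_to_now: np.ndarray) -> float:
    """Cosine similarity between past-to-current and future-to-current motion
    embeddings, used as a cheap consistency signal for motion representations."""
    return float(past_to_now @ future_to_now / (norm(past_to_now) * norm(future_to_now)))

emb = np.random.randn(128, 32)
act = emb[:, :7] @ np.random.randn(7, 7)   # toy 7-DoF actions correlated with the embedding
print(linear_probe_mse(emb, act), motion_cosine(np.ones(32), -np.ones(32)))
```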
https://arxiv.org/abs/2505.17006
Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping layer that dynamically assigns tokens to a reduced set based on image boundaries and their semantic content. Stacking our grouping layer across consecutive backbone stages yields hierarchical segmentation that arises natively in the feature extraction process, giving rise to our Native Segmentation Vision Transformer. We show that a careful design of our architecture enables strong segmentation masks to emerge solely from the grouping layers, that is, without additional segmentation-specific heads. This sets the foundation for a new paradigm of native, backbone-level segmentation, which enables strong zero-shot results without mask supervision, as well as a minimal and efficient standalone model design for downstream segmentation tasks. Our project page is this https URL.
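A minimal sketch of a content-aware grouping step: each token is softly assigned to one of a smaller set of group tokens by feature similarity, and groups are aggregated from their members. The seed initialization, temperature, and lack of learned projections or boundary cues are simplifying assumptions, not the paper's design.

```python
import torch
import torch.nn.functional as F

def group_tokens(tokens: torch.Tensor, num_groups: int, tau: float = 0.1):
    # tokens: (batch, n, dim). Initialize group seeds by strided sampling of tokens.
    b, n, d = tokens.shape
    seeds = tokens[:, torch.linspace(0, n - 1, num_groups).long(), :]     # (b, g, d)
    sim = torch.einsum("bnd,bgd->bng", F.normalize(tokens, dim=-1),
                       F.normalize(seeds, dim=-1))                        # cosine similarity
    assign = F.softmax(sim / tau, dim=-1)                                 # soft assignment (b, n, g)
    groups = torch.einsum("bng,bnd->bgd", assign, tokens)                 # weighted aggregation
    groups = groups / assign.sum(dim=1).unsqueeze(-1).clamp(min=1e-6)     # normalize by group mass
    return groups, assign

feats = torch.randn(1, 196, 256)          # e.g. 14x14 patch tokens
groups, assign = group_tokens(feats, num_groups=49)
print(groups.shape, assign.shape)         # torch.Size([1, 49, 256]) torch.Size([1, 196, 49])
```

The soft assignment map doubles as a segmentation signal: thresholding or arg-maxing it per token gives region masks without any segmentation head, which is the intuition behind backbone-level segmentation.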
https://arxiv.org/abs/2505.16993
Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. The primary challenge involves preserving the visual authenticity of the garment while dynamically adapting to the pose and physique of the subject. While existing methods have predominantly focused on image-based virtual try-on, extending these techniques directly to videos often results in temporal inconsistencies. Most current video virtual try-on approaches alleviate this challenge by incorporating temporal modules, yet still overlook the critical spatiotemporal pose interactions between human and garment. Effective pose interactions in videos should not only consider spatial alignment between human and garment poses in each frame but also account for the temporal dynamics of human poses throughout the entire video. With such motivation, we propose a new framework, namely Dynamic Pose Interaction Diffusion Models (DPIDM), to leverage diffusion models to delve into dynamic pose interactions for video virtual try-on. Technically, DPIDM introduces a skeleton-based pose adapter to integrate synchronized human and garment poses into the denoising network. A hierarchical attention module is then carefully designed to model intra-frame human-garment pose interactions and long-term human pose dynamics across frames through pose-aware spatial and temporal attention mechanisms. Moreover, DPIDM capitalizes on a temporal regularized attention loss between consecutive frames to enhance temporal consistency. Extensive experiments conducted on the VITON-HD, VVT, and ViViD datasets demonstrate the superiority of our DPIDM over the baseline methods. Notably, DPIDM achieves a VFID score of 0.506 on the VVT dataset, a 60.5% improvement over the state-of-the-art GPD-VVTO approach.
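A minimal sketch of one possible form of a temporal regularization term on attention maps, penalizing abrupt changes between consecutive frames; the exact loss used by DPIDM, its weighting, and where it is applied are not specified here and this formulation is an assumption.

```python
import torch

def temporal_attention_regularizer(attn_maps: torch.Tensor) -> torch.Tensor:
    # attn_maps: (batch, frames, heads, queries, keys) attention probabilities.
    diff = attn_maps[:, 1:] - attn_maps[:, :-1]   # difference between consecutive frames
    return diff.pow(2).mean()                     # scalar penalty added to the training objective

attn = torch.softmax(torch.randn(2, 8, 4, 64, 64), dim=-1)
print(temporal_attention_regularizer(attn))
```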
https://arxiv.org/abs/2505.16980
Despite their impressive capabilities, Large Language Models struggle with generalisation beyond their training distribution, often exhibiting sophisticated pattern interpolation rather than true abstract reasoning (extrapolation). In this work, we approach this limitation through the lens of Information Bottleneck (IB) theory, which posits that model generalisation emerges from an optimal balance between input compression and retention of predictive information in latent representations. Using IB theory, we prove that decoder-only Transformers are inherently constrained in their ability to form task-optimal sequence representations. We then use this result to demonstrate that periodic global transformation of the internal sequence-level representations (KV cache) is a necessary computational step for improving Transformer generalisation in reasoning tasks. Based on these theoretical insights, we propose a modification to the Transformer architecture in the form of an additional module that globally rewrites the KV cache at periodic intervals, shifting its capacity away from memorising input prefixes and toward encoding the features most useful for predicting future tokens. Our model delivers substantial gains on mathematical reasoning benchmarks, outperforming both vanilla Transformers with up to 3.5x more parameters and heuristic-driven pruning mechanisms for cache compression. Our approach can be seen as a principled generalisation of existing KV-cache compression methods; whereas such methods focus solely on compressing input representations, they often do so at the expense of retaining predictive information, and thus their capabilities are inherently bounded by those of an unconstrained model. This establishes a principled framework for manipulating Transformer memory using information theory, addressing fundamental reasoning limitations that scaling alone cannot overcome.
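A minimal sketch of one conceivable shape for a periodic global cache-rewriting module, using a single self-attention mixer over a flattened view of one layer's cache; the module name, the mixer choice, and the rewrite period are assumptions, not the architecture proposed in the paper.

```python
import torch
import torch.nn as nn

class KVCacheRewriter(nn.Module):
    def __init__(self, head_dim: int = 64, num_heads: int = 8, period: int = 128):
        super().__init__()
        self.period = period
        dim = head_dim * num_heads
        # One global pass over all cached positions at each rewrite step.
        self.mixer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                                dim_feedforward=2 * dim,
                                                batch_first=True)

    def maybe_rewrite(self, step: int, kv: torch.Tensor) -> torch.Tensor:
        # kv: (batch, seq_len, num_heads * head_dim), a flattened view of one layer's cache.
        if step % self.period != 0:
            return kv                 # between rewrites the cache is left untouched
        return self.mixer(kv)         # periodic global transformation of the whole cache

cache = torch.randn(1, 256, 512)
rewriter = KVCacheRewriter()
print(rewriter.maybe_rewrite(step=128, kv=cache).shape)  # torch.Size([1, 256, 512])
```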
https://arxiv.org/abs/2505.16950
Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce NovelSeek, a unified closed-loop multi-agent framework for conducting Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. NovelSeek highlights three key advantages: 1) Scalability: NovelSeek has demonstrated its versatility across 12 scientific research tasks, generating innovative ideas that enhance the performance of baseline code. 2) Interactivity: NovelSeek provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: NovelSeek has achieved promising performance gains in several scientific fields at significantly lower time cost than human effort. For instance, in reaction yield prediction, performance increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.52 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.
https://arxiv.org/abs/2505.16938
We introduce $\infty$-THOR, a new framework for long-horizon embodied tasks that advances long-context understanding in embodied AI. $\infty$-THOR provides: (1) a generation framework for synthesizing scalable, reproducible, and unlimited long-horizon trajectories; (2) a novel embodied QA task, Needle(s) in the Embodied Haystack, where multiple scattered clues across extended trajectories test agents' long-context reasoning ability; and (3) a long-horizon dataset and benchmark suite featuring complex tasks that span hundreds of environment steps, each paired with ground-truth action sequences. To enable this capability, we explore architectural adaptations, including interleaved Goal-State-Action modeling, context extension techniques, and Context Parallelism, to equip LLM-based agents for extreme long-context reasoning and interaction. Experimental results and analyses highlight the challenges posed by our benchmark and provide insights into training strategies and model behaviors under long-horizon conditions. Our work provides a foundation for the next generation of embodied AI systems capable of robust, long-term reasoning and planning.
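A minimal sketch of how interleaved Goal-State-Action modeling might flatten a long-horizon trajectory into a single token stream that long-context techniques can then operate on; the tag strings and formatting are illustrative assumptions.

```python
from typing import List, Tuple

def interleave_gsa(goal: str, trajectory: List[Tuple[str, str]]) -> List[str]:
    # Flatten to [goal, state_1, action_1, state_2, action_2, ...].
    stream = [f"<goal> {goal}"]
    for state, action in trajectory:
        stream.append(f"<state> {state}")
        stream.append(f"<action> {action}")
    return stream

steps = [("kitchen, mug on counter", "pick up mug"),
         ("holding mug", "walk to sink")]
print(interleave_gsa("put the mug in the sink", steps))
```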
https://arxiv.org/abs/2505.16928
Driving simulation plays a crucial role in developing reliable driving agents by providing controlled, evaluative environments. To enable meaningful assessments, a high-quality driving simulator must satisfy several key requirements: multi-modal sensing capabilities (e.g., camera and LiDAR) with realistic scene rendering to minimize observational discrepancies; closed-loop evaluation to support free-form trajectory behaviors; highly diverse traffic scenarios for thorough evaluation; multi-agent cooperation to capture interaction dynamics; and high computational efficiency to ensure affordability and scalability. However, existing simulators and benchmarks fail to comprehensively meet these fundamental criteria. To bridge this gap, this paper introduces RealEngine, a novel driving simulation framework that holistically integrates 3D scene reconstruction and novel view synthesis techniques to achieve realistic and flexible closed-loop simulation in the driving context. By leveraging real-world multi-modal sensor data, RealEngine reconstructs background scenes and foreground traffic participants separately, allowing for highly diverse and realistic traffic scenarios through flexible scene composition. This synergistic fusion of scene reconstruction and view synthesis enables photorealistic rendering across multiple sensor modalities, ensuring both perceptual fidelity and geometric accuracy. Building upon this environment, RealEngine supports three essential driving simulation categories: non-reactive simulation, safety testing, and multi-agent interaction, collectively forming a reliable and comprehensive benchmark for evaluating the real-world performance of driving agents.
https://arxiv.org/abs/2505.16902
Shared autonomy is an enabling technology that provides users with control authority over robots that would otherwise be difficult, if not impossible, to directly control. Yet, standard methods make assumptions that limit their adoption in practice: for example, prior knowledge of the user's goals or the objective (i.e., reward) function that they wish to optimize, knowledge of the user's policy, or query-level access to the user during training. Diffusion-based approaches to shared autonomy do not make such assumptions and instead only require access to demonstrations of desired behaviors, while allowing the user to maintain control authority. However, these advantages have come at the expense of high computational complexity, which has made real-time shared autonomy all but impossible. To overcome this limitation, we propose Consistency Shared Autonomy (CSA), a shared autonomy framework that employs a consistency model-based formulation of diffusion. Key to CSA is that it employs the distilled probability flow of ordinary differential equations (PF ODE) to generate high-fidelity samples in a single step. This results in inference speeds significantly faster than what is possible with previous diffusion-based approaches to shared autonomy, enabling real-time assistance in complex domains with only a single function evaluation. Further, by intervening on flawed actions at intermediate states of the PF ODE, CSA enables varying levels of assistance. We evaluate CSA on a variety of challenging simulated and real-world robot control problems, demonstrating significant improvements over state-of-the-art methods both in terms of task performance and computational efficiency.
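A minimal sketch of the single-step consistency sampling idea and the partial intervention used for variable assistance: the user's action is moved to an intermediate noise level and mapped back with one function evaluation. The tiny network, the noise schedule, and the lack of observation conditioning are stand-in assumptions, not CSA's trained model.

```python
import torch
import torch.nn as nn

class ConsistencyActionModel(nn.Module):
    def __init__(self, action_dim: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(action_dim + 1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))

    def forward(self, noisy_action: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Consistency property: f(x_t, t) approximates the clean action x_0 for any t.
        return self.net(torch.cat([noisy_action, t], dim=-1))

def assist(model, user_action: torch.Tensor, assistance: float) -> torch.Tensor:
    """assistance in [0, 1]: 0 keeps the user's action, 1 fully regenerates it."""
    t = torch.full((user_action.shape[0], 1), assistance)
    noisy = user_action + assistance * torch.randn_like(user_action)  # intermediate PF ODE state
    return model(noisy, t)                                            # single-step correction

model = ConsistencyActionModel()
print(assist(model, torch.tensor([[0.3, -0.7]]), assistance=0.5))
```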
https://arxiv.org/abs/2505.16892
Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83$\times$ speedup with 0.01\% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: this https URL
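A minimal sketch of ordering video latent blocks along a 3D space-filling curve so that block-wise attention can be restricted to contiguous, spatio-temporally local runs; a Morton (Z-order) curve is used here for simplicity, and the paper's actual curve and selection rule may differ.

```python
def morton3d(t: int, y: int, x: int, bits: int = 8) -> int:
    code = 0
    for i in range(bits):                     # interleave the bits of (t, y, x)
        code |= ((t >> i) & 1) << (3 * i + 2)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((x >> i) & 1) << (3 * i)
    return code

def curve_order(num_frames: int, height: int, width: int):
    blocks = [(t, y, x) for t in range(num_frames)
                        for y in range(height)
                        for x in range(width)]
    return sorted(blocks, key=lambda b: morton3d(*b))

# Blocks adjacent on the curve are close in space-time, so a sliding attention
# window over this ordering keeps mostly local token interactions.
print(curve_order(2, 2, 2))
```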
https://arxiv.org/abs/2505.16864
Improving the performance of pre-trained policies through online reinforcement learning (RL) is a critical yet challenging topic. Existing online RL fine-tuning methods require continued training with offline pretrained Q-functions for stability and performance. However, these offline pretrained Q-functions commonly underestimate state-action pairs beyond the offline dataset due to the conservatism of most offline RL methods, which hinders further exploration when transitioning from the offline to the online setting. Additionally, this requirement limits their applicability in scenarios where only pre-trained policies are available but pre-trained Q-functions are absent, such as in imitation learning (IL) pre-training. To address these challenges, we propose a method for efficient online RL fine-tuning using solely the offline pre-trained policy, eliminating reliance on pre-trained Q-functions. We introduce PORL (Policy-Only Reinforcement Learning Fine-Tuning), which rapidly initializes the Q-function from scratch during the online phase to avoid detrimental pessimism. Our method not only achieves performance competitive with advanced offline-to-online RL algorithms and with online RL approaches that leverage prior data or policies, but also pioneers a new path for directly fine-tuning behavior cloning (BC) policies.
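A minimal sketch of the policy-only setting: fine-tuning starts from a pretrained policy while the critic is freshly initialized online, avoiding the pessimism baked into offline-pretrained Q-functions. The update shown is a generic TD step with dummy data, not PORL's full algorithm, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

def td_update(q_net: nn.Module, target_q: nn.Module, batch, gamma: float = 0.99):
    obs, act, rew, next_obs, next_act, done = batch
    with torch.no_grad():
        target = rew + gamma * (1 - done) * target_q(torch.cat([next_obs, next_act], -1))
    return nn.functional.mse_loss(q_net(torch.cat([obs, act], -1)), target)

obs_dim, act_dim = 8, 2
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # from scratch
target_q = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
target_q.load_state_dict(q_net.state_dict())   # target network starts as a copy

batch = (torch.randn(32, obs_dim), torch.randn(32, act_dim), torch.randn(32, 1),
         torch.randn(32, obs_dim), torch.randn(32, act_dim), torch.zeros(32, 1))
print(td_update(q_net, target_q, batch))       # critic loss; the actor is the pretrained policy
```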
https://arxiv.org/abs/2505.16856
We cast nested named entity recognition (NNER) as a sequence labeling task by leveraging prior work that linearizes constituency structures, effectively reducing the complexity of this structured prediction problem to straightforward token classification. By combining these constituency linearizations with pretrained encoders, our method captures nested entities while performing exactly $n$ tagging actions. Our approach achieves competitive performance compared to less efficient systems, and it can be trained using any off-the-shelf sequence labeling library.
https://arxiv.org/abs/2505.16855
Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations: they lack high-quality training trajectories, or they suffer from distributional mismatches in simulated environments and prohibitive computational costs for real-world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high-quality training data by simulating realistic user interactions in live web search environments, coupled with a multi-criteria curation strategy that optimizes the diversity and quality of both the input and output sides. Experiments on five benchmarks across diverse domains demonstrate that SFT on only 871 curated samples yields significant improvements over RL-based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data-scarce bottleneck, offering practical insights for efficient deep search systems. Our code is available at this https URL.
https://arxiv.org/abs/2505.16834
GUI automation faces critical challenges in dynamic environments. MLLMs suffer from two key issues: misinterpreting UI components and relying on outdated knowledge. Traditional fine-tuning methods are costly for app-specific knowledge updates. We propose GUI-explorer, a training-free GUI agent that incorporates two fundamental mechanisms: (1) Autonomous Exploration of Function-aware Trajectories. To comprehensively cover all application functionalities, we design a Function-aware Task Goal Generator that automatically constructs exploration goals by analyzing GUI structural information (e.g., screenshots and activity hierarchies). This enables systematic exploration that collects diverse trajectories. (2) Unsupervised Mining of Transition-aware Knowledge. To establish precise screen-operation logic, we develop a Transition-aware Knowledge Extractor that extracts effective screen-operation logic through unsupervised analysis of the state transitions in structured interaction triples (observation, action, outcome). This eliminates the need for human involvement in knowledge extraction. With a task success rate of 53.7% on SPA-Bench and 47.4% on AndroidWorld, GUI-explorer shows significant improvements over SOTA agents. It requires no parameter updates for new apps. GUI-explorer is open-sourced and publicly available at this https URL.
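A minimal sketch of mining screen-operation knowledge from (observation, action, outcome) triples without supervision: for each (screen, action) pair, keep the outcome it most reliably produces. Screen and action signatures are simplified to strings here; the actual extractor's representations are not reproduced.

```python
from collections import Counter, defaultdict

def mine_transition_knowledge(triples):
    stats = defaultdict(Counter)
    for observation, action, outcome in triples:
        stats[(observation, action)][outcome] += 1
    # Keep the dominant outcome per (screen, action) pair as the learned operation logic.
    return {key: outcomes.most_common(1)[0][0] for key, outcomes in stats.items()}

triples = [
    ("settings_screen", "tap:wifi_toggle", "wifi_enabled"),
    ("settings_screen", "tap:wifi_toggle", "wifi_enabled"),
    ("settings_screen", "tap:back", "home_screen"),
]
knowledge = mine_transition_knowledge(triples)
print(knowledge[("settings_screen", "tap:wifi_toggle")])   # "wifi_enabled"
```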
https://arxiv.org/abs/2505.16827
Recent advances in scene-based video generation have enabled systems to synthesize coherent visual narratives from structured prompts. However, a crucial dimension of storytelling -- character-driven dialogue and speech -- remains underexplored. In this paper, we present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue, enriching visual storytelling with natural voice and character expression. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character's behavior. While a story generation model such as Text2Story generates the corresponding visual scene, we focus on generating expressive character utterances from these prompts and the scene image. We apply a pretrained vision-language encoder to extract a high-level semantic feature from the representative frame, capturing salient visual context. This feature is then combined with the structured prompts and used to guide a large language model in synthesizing natural, character-consistent dialogue. To ensure contextual consistency across scenes, we introduce a Recursive Narrative Bank that conditions each dialogue generation on the accumulated dialogue history from prior scenes. This approach enables characters to speak in ways that reflect their evolving goals and interactions throughout a story. Finally, we render each utterance as expressive, character-consistent speech, resulting in fully-voiced video narratives. Our framework requires no additional training and demonstrates applicability across a variety of story settings, from fantasy adventures to slice-of-life episodes.
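A minimal sketch of a Recursive Narrative Bank: each scene's utterance is generated conditioned on the accumulated dialogue of earlier scenes. The prompt template and the `generate` callable (standing in for the large language model) are illustrative assumptions.

```python
from typing import Callable, List

class RecursiveNarrativeBank:
    def __init__(self, generate: Callable[[str], str]):
        self.generate = generate
        self.history: List[str] = []

    def next_utterance(self, setting: str, behavior: str, scene_caption: str) -> str:
        prompt = (
            f"Previous dialogue:\n{chr(10).join(self.history) or '(none)'}\n"
            f"Setting: {setting}\nCharacter behavior: {behavior}\n"
            f"Scene description: {scene_caption}\n"
            "Write the character's next line, consistent with the history."
        )
        utterance = self.generate(prompt)
        self.history.append(utterance)     # recursion: later scenes condition on this line
        return utterance

bank = RecursiveNarrativeBank(generate=lambda p: "A placeholder line of dialogue.")
print(bank.next_utterance("a rainy harbor at dusk", "the captain hesitates", "ships at anchor"))
```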
https://arxiv.org/abs/2505.16819
Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories, with various distortions in the real world limiting its application. Traditionally, Image Quality Assessment (IQA) methods are applied to predict human preferences for distorted images; however, there is no IQA method that assesses the usability of an image in embodied tasks, namely, its perceptual quality for robots. To provide accurate and reliable quality indicators for future embodied scenarios, we first propose the topic: IQA for Embodied AI. Specifically, we (1) based on the Mertonian system and meta-cognitive theory, constructed a perception-cognition-decision-execution pipeline and defined a comprehensive subjective score collection process; (2) established the Embodied-IQA database, containing over 36k reference/distorted image pairs with more than 5M fine-grained annotations provided by Vision Language Models, Vision Language Action models, and real-world robots; (3) trained and validated the performance of mainstream IQA methods on Embodied-IQA, demonstrating the need to develop more accurate quality indicators for Embodied AI. We sincerely hope that, through evaluation, we can promote the application of Embodied AI under complex real-world distortions. Project page: this https URL
https://arxiv.org/abs/2505.16815
The integration of Vision-Language Models (VLMs) into autonomous driving systems has shown promise in addressing key challenges such as learning complexity, interpretability, and common-sense reasoning. However, existing approaches often struggle with efficient integration and real-time decision-making due to computational demands. In this paper, we introduce SOLVE, an innovative framework that synergizes VLMs with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components. We propose a Trajectory Chain-of-Thought (T-CoT) paradigm, which progressively refines trajectory predictions, reducing uncertainty and improving accuracy. By employing a temporal decoupling strategy, SOLVE achieves efficient cooperation by aligning high-quality VLM outputs with E2E real-time performance. Evaluated on the nuScenes dataset, our method demonstrates significant improvements in trajectory prediction accuracy, paving the way for more robust and reliable autonomous driving systems.
https://arxiv.org/abs/2505.16805