Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
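To make the noise-calibrated Thurstone likelihood concrete, the sketch below scores a preferred/dispreferred pair of noisy latents with a probit likelihood whose comparison noise grows with the diffusion timestep, so labels attached to noisier states are automatically down-weighted. The linear sigma schedule, constants, and function names are illustrative assumptions, not the paper's exact calibration.

```python
import math

def thurstone_preference_loss(r_win, r_lose, t, t_max=1000,
                              sigma_min=0.1, sigma_max=2.0):
    """Negative log-likelihood of a Thurstone (probit) preference whose
    comparison noise grows with the diffusion timestep t. The linear
    sigma schedule and all constants are illustrative assumptions."""
    sigma = sigma_min + (sigma_max - sigma_min) * (t / t_max)
    z = (r_win - r_lose) / (math.sqrt(2.0) * sigma)      # standardized margin
    p_win = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # Phi(z)
    return -math.log(max(p_win, 1e-12))
```

Inference-time noise ensembling then reduces to averaging the reward head's output over several independent noisings of the same latent.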
https://arxiv.org/abs/2602.11146
Neural PDE surrogates are often deployed in data-limited or partially observed regimes where downstream decisions depend on calibrated uncertainty in addition to low prediction error. Existing approaches obtain uncertainty through ensemble replication, fixed stochastic noise such as dropout, or post hoc calibration. We propose cross-regularized uncertainty, which learns uncertainty parameters during training using gradients routed through a held-out regularization split. The predictor is optimized on the training split for fit, while low-dimensional uncertainty controls are optimized on the regularization split to reduce train-test mismatch, yielding regime-adaptive uncertainty without per-regime noise tuning. The framework can learn continuous noise levels at the output head, within hidden features, or within operator-specific components such as spectral modes. We instantiate the approach in Fourier Neural Operators and evaluate on APEBench sweeps over observed fraction and training-set size. Across these sweeps, the learned predictive distributions are better calibrated on held-out splits and the resulting uncertainty fields concentrate in high-error regions in one-step spatial diagnostics.
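The gradient routing can be illustrated on a toy 1-D regression: the predictor weight receives gradients only from the train split, while a single log-noise parameter receives Gaussian-NLL gradients only from a held-out regularization split. This is a minimal sketch under assumed constants, not the paper's Fourier Neural Operator instantiation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem y = 2x + noise (true noise std 0.5). The weight w is fit on
# the TRAIN split; the log-noise parameter is fit on the REG split only,
# mimicking the cross-regularized gradient routing (illustrative sketch).
x_tr, x_reg = rng.normal(size=64), rng.normal(size=64)
y_tr = 2.0 * x_tr + 0.5 * rng.normal(size=64)
y_reg = 2.0 * x_reg + 0.5 * rng.normal(size=64)

w, log_sigma, lr = 0.0, 0.0, 0.05
for _ in range(500):
    # predictor step: squared-error gradient on the train split only
    w -= lr * np.mean(2.0 * (w * x_tr - y_tr) * x_tr)
    # uncertainty step: Gaussian-NLL gradient w.r.t. log_sigma on the
    # regularization split only: d/dlog_sigma = 1 - resid^2 / sigma^2
    resid2 = np.mean((w * x_reg - y_reg) ** 2)
    log_sigma -= lr * (1.0 - resid2 / np.exp(2.0 * log_sigma))

sigma_hat = float(np.exp(log_sigma))  # recovers roughly the held-out residual scale
```

Because the noise parameter only ever sees held-out residuals, it calibrates to the train-test mismatch rather than to the training fit.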
https://arxiv.org/abs/2602.11090
We propose PuriLight, a lightweight and efficient framework for self-supervised monocular depth estimation, to address the dual challenges of computational efficiency and detail preservation. While recent advances in self-supervised depth estimation have reduced reliance on ground truth supervision, existing approaches remain constrained by either bulky architectures compromising practicality or lightweight models sacrificing structural precision. These dual limitations underscore the critical need to develop lightweight yet structurally precise architectures. Our framework addresses these limitations through a three-stage architecture incorporating three novel modules: the Shuffle-Dilation Convolution (SDC) module for local feature extraction, the Rotation-Adaptive Kernel Attention (RAKA) module for hierarchical feature enhancement, and the Deep Frequency Signal Purification (DFSP) module for global feature purification. Through effective collaboration, these modules enable PuriLight to achieve both lightweight and accurate feature extraction and processing. Extensive experiments demonstrate that PuriLight achieves state-of-the-art performance with minimal training parameters while maintaining exceptional computational efficiency. Codes will be available at this https URL.
https://arxiv.org/abs/2602.11066
Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this perceptual pathway is key to building natural full-duplex interactive systems. We introduce a framework that models this process as multi-level perception, and then reasons over conversational behaviors via a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a high quality corpus that pairs controllable, event-rich dialogue data with human-annotated labels. The GoT framework structures streaming predictions as an evolving graph, enabling a transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.
https://arxiv.org/abs/2602.11065
Graphs are foundational across domains but remain hard to use without deep expertise. LLMs promise accessible natural language (NL) graph analytics, yet they fail to process industry-scale property graphs effectively and efficiently: such datasets are large, highly heterogeneous, structurally complex, and evolve dynamically. To address this, we devise a novel abstraction for complex multi-query analytics over such graphs. Its key idea is to replace brittle generation of graph queries directly from NL with planning over a Semantic Catalog that describes both the graph schema and the graph operations. Concretely, this induces a clean separation between a Semantic Plane for LLM planning and broader reasoning, and an Execution Plane for deterministic, database-grade query execution over the full dataset and tool implementations. This design yields substantial gains in both token efficiency and task effectiveness even with small-context LLMs. We use this abstraction as the basis of the first LLM-enhanced graph analytics framework called GraphSeek. GraphSeek achieves substantially higher success rates (e.g., 86% over enhanced LangChain) and points toward the next generation of affordable and accessible graph analytics that unify LLM reasoning with database-grade execution over large and complex property graphs.
https://arxiv.org/abs/2602.11052
Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning, yet growing evidence indicates an exploration ceiling: it often reweights existing solution traces rather than discovering new strategies, limiting gains under large sampling budgets (e.g., pass-at-256). We address this limitation with PSN-RLVR, which perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration that better preserves long-horizon chain-of-thought coherence than action-space noise. To mitigate the resulting sampling-update mismatch, we incorporate truncated importance sampling (TIS). To avoid expensive KL-based adaptive noise control, we propose a computationally efficient real-time adaptive noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty. Instantiated on GRPO, a widely used RLVR method, PSN-GRPO consistently expands the effective reasoning capability boundary across multiple mathematical reasoning benchmarks and model families, yielding higher pass-at-k under large sampling budgets and outperforming prior exploration-oriented RLVR methods (e.g., Pass-at-k-style training) while remaining orthogonal and thus composable for additional gains.
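The truncated importance sampling correction mentioned above can be sketched in one function: rollouts are generated by the parameter-noised policy, and each token's ratio against the clean policy being updated is clipped to bound variance. The clipping constant and log-probability interface are illustrative assumptions.

```python
import math

def tis_weight(logp_current, logp_behavior, clip_c=2.0):
    """Truncated importance weight between the policy being updated and
    the parameter-noised behavior policy that generated the rollout.
    Clipping the ratio at clip_c (an illustrative constant) bounds the
    variance introduced by the sampling-update mismatch."""
    return min(math.exp(logp_current - logp_behavior), clip_c)
```

When the two policies agree the weight is exactly 1, so TIS is a no-op in the unperturbed limit.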
https://arxiv.org/abs/2602.02555
Developing world models that understand complex physical interactions is essential for advancing robotic planning and reinforcement learning. However, existing methods often struggle to accurately model the environment under conditions of data scarcity and complex contact-rich dynamics. To address these challenges, we propose ContactGaussian-WM, a differentiable physics-grounded rigid-body world model capable of learning intricate physical laws directly from sparse and contact-rich video data. The framework consists of two core components: (1) a unified Gaussian representation for both visual appearance and collision geometry, and (2) an end-to-end differentiable learning framework that differentiates through a closed-form physics engine to infer physical properties from sparse visual observations. Simulations and real-world evaluations demonstrate that ContactGaussian-WM outperforms state-of-the-art methods in learning complex scenarios, exhibiting robust generalization. Furthermore, we showcase the practical utility of our framework in downstream applications, including data synthesis and real-time MPC.
https://arxiv.org/abs/2602.11021
Monocular depth estimation is a central problem in computer vision with applications in robotics, AR, and autonomous driving, yet the self-attention mechanisms that drive modern Transformer architectures remain opaque. We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), providing the first spectrally structured formulation of attention for dense prediction tasks. SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions, enabling attention maps that are intrinsically interpretable rather than post-hoc approximations. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead. More importantly, SVDA unlocks six spectral indicators that quantify entropy, rank, sparsity, alignment, selectivity, and robustness. These reveal consistent cross-dataset and depth-wise patterns in how attention organizes during training, insights that remain inaccessible in standard Transformers. By shifting the role of attention from opaque mechanism to quantifiable descriptor, SVDA redefines interpretability in monocular depth estimation and opens a principled avenue toward transparent dense prediction models.
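The decoupling SVDA performs can be sketched as follows, assuming row-wise L2 normalization of queries and keys (directional alignment) and a learnable diagonal that rescales each feature dimension like a spectrum; the actual DPT integration (multi-head layout, value projection, scaling) is omitted and the formulation here is an assumption for illustration.

```python
import numpy as np

def svda_scores(q, k, s):
    """SVD-inspired attention sketch: normalized query-key interactions
    modulated by a learnable diagonal s. Rows of q, k have shape
    (n_tokens, d); s has shape (d,)."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    logits = qn @ np.diag(s) @ kn.T          # alignment x spectral modulation
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)  # row-stochastic attention map
```

Because the diagonal is explicit, spectral indicators such as entropy or effective rank can be read off the learned s and the resulting attention maps directly.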
https://arxiv.org/abs/2602.11005
Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs) demonstrate strong capabilities in general sequential code generation, they face significant challenges in GPU code generation because of the scarcity of high-quality labeled training data, compiler biases when generating synthetic solutions, and limited generalization across hardware generations. This precludes supervised fine-tuning (SFT) as a scalable methodology for improving current LLMs. In contrast, reinforcement learning (RL) offers a data-efficient and adaptive alternative but requires access to relevant tools, careful selection of training problems, and a robust evaluation environment. We present Makora's environment and tools for reinforcement learning finetuning of frontier models and report our results from fine-tuning GPT-5 for Triton code generation. In the single-attempt setting, our fine-tuned model improves kernel correctness from 43.7% to 77.0% (+33.3 percentage points) and increases the fraction of problems outperforming TorchInductor from 14.8% to 21.8% (+7 percentage points) compared to baseline GPT-5, while exceeding prior state-of-the-art models on KernelBench. When integrated into a full coding agent, it is able to solve up to 97.4% of problems in an expanded KernelBench suite, outperforming the PyTorch TorchInductor compiler on 72.9% of problems with a geometric mean speedup of 2.12x. Our work demonstrates that targeted post-training with reinforcement learning can unlock LLM capabilities in highly specialized technical domains where traditional supervised learning is limited by data availability, opening new pathways for AI-assisted accelerator programming.
https://arxiv.org/abs/2602.11000
Graphs are ubiquitous, and learning on graphs has become a cornerstone in artificial intelligence and data mining communities. Unlike pixel grids in images or sequential structures in language, graphs exhibit a typical non-Euclidean structure with complex interactions among the objects. This paper argues that Riemannian geometry provides a principled and necessary foundation for graph representation learning, and that Riemannian graph learning should be viewed as a unifying paradigm rather than a collection of isolated techniques. While recent studies have explored the integration of graph learning and Riemannian geometry, most existing approaches are limited to a narrow class of manifolds, particularly hyperbolic spaces, and often adopt extrinsic manifold formulations. We contend that the central mission of Riemannian graph learning is to endow graph neural networks with intrinsic manifold structures, which remains underexplored. To advance this perspective, we identify key conceptual and methodological gaps in existing approaches and outline a structured research agenda along three dimensions: manifold type, neural architecture, and learning paradigm. We further discuss open challenges, theoretical foundations, and promising directions that are critical for unlocking the full potential of Riemannian graph learning. This paper aims to provide a coherent viewpoint and to stimulate broader exploration of Riemannian geometry as a foundational framework for future graph learning research.
https://arxiv.org/abs/2602.10982
The evolution of Large Language Models (LLMs) has shifted mobile computing from App-centric interactions to system-level autonomous agents. Current implementations predominantly rely on a "Screen-as-Interface" paradigm, which inherits structural vulnerabilities and conflicts with the mobile ecosystem's economic foundations. In this paper, we conduct a systematic security analysis of state-of-the-art mobile agents using Doubao Mobile Assistant as a representative case. We decompose the threat landscape into four dimensions - Agent Identity, External Interface, Internal Reasoning, and Action Execution - revealing critical flaws such as fake App identity, visual spoofing, indirect prompt injection, and unauthorized privilege escalation stemming from a reliance on unstructured visual data. To address these challenges, we propose Aura, an Agent Universal Runtime Architecture for a clean-slate secure agent OS. Aura replaces brittle GUI scraping with a structured, agent-native interaction model. It adopts a Hub-and-Spoke topology where a privileged System Agent orchestrates intent, sandboxed App Agents execute domain-specific tasks, and the Agent Kernel mediates all communication. The Agent Kernel enforces four defense pillars: (i) cryptographic identity binding via a Global Agent Registry; (ii) semantic input sanitization through a multilayer Semantic Firewall; (iii) cognitive integrity via taint-aware memory and plan-trajectory alignment; and (iv) granular access control with non-deniable auditing. Evaluation on MobileSafetyBench shows that, compared to Doubao, Aura improves low-risk Task Success Rate from roughly 75% to 94.3%, reduces high-risk Attack Success Rate from roughly 40% to 4.4%, and achieves near-order-of-magnitude latency gains. These results demonstrate Aura as a viable, secure alternative to the "Screen-as-Interface" paradigm.
https://arxiv.org/abs/2602.10915
The comprehensive understanding capabilities of world models for driving scenarios have significantly improved the planning accuracy of end-to-end autonomous driving frameworks. However, the redundant modeling of static regions and the lack of deep interaction with trajectories hinder world models from exerting their full effectiveness. In this paper, we propose Temporal Residual World Model (TR-World), which focuses on dynamic object modeling. By calculating the temporal residuals of scene representations, the information of dynamic objects can be extracted without relying on detection and tracking. TR-World takes only temporal residuals as input, thus predicting the future spatial distribution of dynamic objects more precisely. By combining the prediction with the static object information contained in the current BEV features, accurate future BEV features can be obtained. Furthermore, we propose Future-Guided Trajectory Refinement (FGTR) module, which conducts interaction between prior trajectories (predicted from the current scene representation) and the future BEV features. This module can not only utilize future road conditions to refine trajectories, but also provides sparse spatial-temporal supervision on future BEV features to prevent world model collapse. Comprehensive experiments conducted on the nuScenes and NAVSIM datasets demonstrate that our method, namely ResWorld, achieves state-of-the-art planning performance. The code is available at this https URL.
https://arxiv.org/abs/2602.10884
Systematic reviews and meta-analyses rely on converting narrative articles into structured, numerically grounded study records. Despite rapid advances in large language models (LLMs), it remains unclear whether they can meet the structural requirements of this process, which hinge on preserving roles, methods, and effect-size attribution across documents rather than on recognizing isolated entities. We propose a structural, diagnostic framework that evaluates LLM-based evidence extraction as a progression of schema-constrained queries with increasing relational and numerical complexity, enabling precise identification of failure points beyond atom-level extraction. Using a manually curated corpus spanning five scientific domains, together with a unified query suite and evaluation protocol, we evaluate two state-of-the-art LLMs under both per-document and long-context, multi-document input regimes. Across domains and models, performance remains moderate for single-property queries but degrades sharply once tasks require stable binding between variables, roles, statistical methods, and effect sizes. Full meta-analytic association tuples are extracted with near-zero reliability, and long-context inputs further exacerbate these failures. Downstream aggregation amplifies even minor upstream errors, rendering corpus-level statistics unreliable. Our analysis shows that these limitations stem not from entity recognition errors, but from systematic structural breakdowns, including role reversals, cross-analysis binding drift, instance compression in dense result sections, and numeric misattribution, indicating that current LLMs lack the structural fidelity, relational binding, and numerical grounding required for automated meta-analysis. The code and data are publicly available at GitHub (this https URL).
https://arxiv.org/abs/2602.10881
Semantic segmentation of 3D point clouds is important for many applications, such as autonomous driving. To train semantic segmentation models, labeled point cloud segmentation datasets are essential. However, point cloud labeling is time-consuming for annotators, typically involving tuning the camera viewpoint and selecting points by lasso. To reduce this time cost, we propose a viewpoint recommendation approach. We adapt Fitts' law to model the time cost of lasso selection in point clouds. Using the modeled time cost, the viewpoint that minimizes the lasso selection time is recommended to the annotator. We build a data labeling system for semantic segmentation of 3D point clouds that integrates our viewpoint recommendation approach. The system enables users to navigate to recommended viewpoints for efficient annotation. Through an ablation study, we observed that our approach effectively reduced the data labeling time cost. We also qualitatively compare our approach with previous viewpoint selection approaches on different datasets.
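A Fitts'-law cost of this kind can be sketched directly: each lasso stroke is scored by the Shannon form T = a + b * log2(D/W + 1), and the viewpoint minimizing the summed stroke time is recommended. The constants and the per-viewpoint (distance, width) interface are hypothetical placeholders, not the paper's fitted model.

```python
import math

def lasso_time_cost(distance, width, a=0.2, b=0.3):
    """Fitts'-law estimate of one lasso stroke: T = a + b * log2(D/W + 1).
    a and b are illustrative device-dependent constants."""
    return a + b * math.log2(distance / width + 1.0)

def best_viewpoint(viewpoints):
    """Pick the viewpoint whose projected targets minimize total stroke
    time. Each viewpoint is (name, [(distance, width), ...])."""
    return min(viewpoints,
               key=lambda v: sum(lasso_time_cost(d, w) for d, w in v[1]))
```

Intuitively, viewpoints that project the target region large (big W) and compact (small D) win, which matches the goal of minimizing lasso effort.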
https://arxiv.org/abs/2602.10871
Despite the strong performance achieved by reinforcement learning-trained information-seeking agents, learning in open-ended web environments remains severely constrained by low signal-to-noise feedback. Text-based parsers often discard layout semantics and introduce unstructured noise, while long-horizon training typically relies on sparse outcome rewards that obscure which retrieval actions actually matter. We propose a visual-native search framework that represents webpages as visual snapshots, allowing agents to leverage layout cues to quickly localize salient evidence and suppress distractors. To learn effectively from these high-dimensional observations, we introduce Information-Aware Credit Assignment (ICA), a post-hoc method that estimates each retrieved snapshot's contribution to the final outcome via posterior analysis and propagates dense learning signals back to key search turns. Integrated with a GRPO-based training pipeline, our approach consistently outperforms text-based baselines on diverse information-seeking benchmarks, providing evidence that visual snapshot grounding with information-level credit assignment alleviates the credit-assignment bottleneck in open-ended web environments. The code and datasets will be released in this https URL.
https://arxiv.org/abs/2602.10863
Smoke segmentation is critical for wildfire management and industrial safety applications. Traditional visible-light-based methods face limitations due to insufficient spectral information, particularly struggling with cloud interference and semi-transparent smoke regions. To address these challenges, we introduce hyperspectral imaging for smoke segmentation and present the first hyperspectral smoke segmentation dataset (HSSDataset) with carefully annotated samples collected from over 18,000 frames across 20 real-world scenarios using a Many-to-One annotation protocol. However, different spectral bands exhibit varying discriminative capabilities across spatial regions, necessitating adaptive band weighting strategies. We decompose this into three technical challenges: spectral interaction contamination, limited spectral pattern modeling, and complex weighting router problems. We propose a mixture of prototypes (MoP) network with: (1) band split for spectral isolation, (2) prototype-based spectral representation for diverse patterns, and (3) a dual-level router for adaptive spatial-aware band weighting. We further construct a multispectral dataset (MSSDataset) with RGB-infrared images. Extensive experiments validate superior performance across both hyperspectral and multispectral modalities, establishing a new paradigm for spectral-based smoke segmentation.
https://arxiv.org/abs/2602.10858
Action recognition on edge devices poses stringent constraints on latency, memory, storage, and power consumption. While auxiliary modalities such as skeleton and depth information can enhance recognition performance, they often require additional sensors or computationally expensive pose-estimation pipelines, limiting practicality for edge use. In this work, we propose a compact RGB-only network tailored for efficient on-device inference. Our approach builds upon an X3D-style backbone augmented with Temporal Shift, and further introduces selective temporal adaptation and parameter-free attention. Extensive experiments on the NTU RGB+D 60 and 120 benchmarks demonstrate a strong accuracy-efficiency balance. Moreover, deployment-level profiling on the Jetson Orin Nano verifies a smaller on-device footprint and practical resource utilization compared to existing RGB-based action recognition techniques.
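The Temporal Shift operation the backbone relies on is parameter-free and can be sketched in a few lines: a fraction of channels is shifted one step forward or backward along the time axis so subsequent 2D convolutions mix temporal context for free. This is a TSM-style sketch of the shift itself; the surrounding X3D-style backbone, selective adaptation, and attention modules are not shown.

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    """Parameter-free temporal shift: move 1/shift_div of the channels one
    step back in time, the next 1/shift_div one step forward, and leave the
    rest untouched. x has shape (T, C, H, W)."""
    t, c, h, w = x.shape
    fold = c // shift_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # shift toward the past
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # shift toward the future
    out[:, 2 * fold:] = x[:, 2 * fold:]              # identity channels
    return out
```

Because the shift moves data rather than adding weights, it costs no parameters and almost no FLOPs, which is what makes it attractive under edge constraints.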
https://arxiv.org/abs/2602.10818
Block-based programming environments such as Scratch play a central role in low-code education, yet evaluating the capabilities of AI agents to construct programs through Graphical User Interfaces (GUIs) remains underexplored. We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-by-construction tasks in Scratch. Grounded in the Use-Modify-Create pedagogical framework, ScratchWorld comprises 83 curated tasks spanning four distinct problem categories: Create, Debug, Extend, and Compute. To rigorously diagnose the source of agent failures, the benchmark employs two complementary interaction modes: primitive mode requires fine-grained drag-and-drop manipulation to directly assess visuomotor control, while composite mode uses high-level semantic APIs to disentangle program reasoning from GUI execution. To ensure reliable assessment, we propose an execution-based evaluation protocol that validates the functional correctness of the constructed Scratch programs through runtime tests within the browser environment. Extensive experiments across state-of-the-art multimodal language models and GUI agents reveal a substantial reasoning--acting gap, highlighting persistent challenges in fine-grained GUI manipulation despite strong planning capabilities.
https://arxiv.org/abs/2602.10814
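The execution-based protocol described above can be sketched as a minimal harness: a task passes only if every runtime assertion on the constructed program's observed behavior holds. The function names and state representation are hypothetical; the benchmark itself runs Scratch programs inside a browser.

```python
def evaluate_scratch_task(run_program, runtime_tests):
    """Execution-based check for a constructed program.

    run_program   -- executes the agent-built program and returns its
                     final runtime state (assumed interface).
    runtime_tests -- predicates over that state, e.g. sprite position
                     or variable values after the run.
    """
    state = run_program()
    return all(test(state) for test in runtime_tests)
```

Validating runtime behavior rather than block layout means two structurally different programs that compute the same result both pass, which matches the benchmark's focus on functional correctness.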
Vision-Language-Action (VLA) driving augments end-to-end (E2E) planning with language-enabled backbones, yet it remains unclear what changes beyond the usual accuracy--cost trade-off. We revisit this question with a three-research-question (RQ) analysis in RecogDrive, instantiating the system with a full VLM and with vision-only backbones, all under an identical diffusion Transformer planner. RQ1: At the backbone level, the VLM introduces additional feature subspaces beyond those of the vision-only backbones. RQ2: These unique subspaces lead to different behavior in some long-tail scenarios: the VLM tends to be more aggressive whereas ViT is more conservative, and each decisively wins on about 2--3% of test scenarios. With an oracle that selects, per scenario, the better trajectory between the VLM and ViT branches, we obtain an upper bound of 93.58 PDMS. RQ3: To exploit this observation, we propose HybridDriveVLA, which runs both ViT and VLM branches and selects between their endpoint trajectories using a learned scorer, improving PDMS to 92.10. Finally, DualDriveVLA implements a practical fast--slow policy: it runs ViT by default and invokes the VLM only when the scorer's confidence falls below a threshold; calling the VLM on 15% of scenarios achieves 91.00 PDMS while improving throughput by 3.2x. Code will be released.
https://arxiv.org/abs/2602.10719
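The fast--slow policy behind DualDriveVLA can be sketched as a simple confidence-gated dispatcher. All callables and the threshold `tau` below are illustrative assumptions about the interface, not the paper's implementation.

```python
def dual_drive_plan(scene, vit_branch, vlm_branch, scorer, tau=0.7):
    """Fast path: plan with the vision-only (ViT) branch by default.
    Slow path: invoke the VLM branch only when the learned scorer's
    confidence in the fast trajectory falls below tau."""
    traj = vit_branch(scene)            # cheap default plan
    if scorer(scene, traj) >= tau:      # confident: keep the ViT trajectory
        return traj, "vit"
    return vlm_branch(scene), "vlm"     # uncertain: pay for the VLM
```

The threshold directly trades quality for throughput: raising `tau` invokes the VLM on more scenarios, which is how the reported 15% call rate and 3.2x speedup arise as one operating point.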
Long-term conversational memory is a core capability for LLM-based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface-level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To evaluate this setting, we introduce \textbf{LoCoMo-Plus}, a benchmark for assessing cognitive memory under cue--trigger semantic disconnect, where models must retain and apply latent constraints across long conversational contexts. We further show that conventional string-matching metrics and explicit task-type prompting are misaligned with such scenarios, and propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval-based methods, and memory systems demonstrate that cognitive memory remains challenging and reveal failures not captured by existing benchmarks. Our code and evaluation framework are publicly available at: this https URL.
https://arxiv.org/abs/2602.10715
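A constraint-consistency score of the kind the abstract proposes can be sketched as the fraction of latent constraints a response satisfies, replacing exact string matching against a single reference answer. The `satisfies` judge interface below is an assumption (in practice it could be a model-based checker).

```python
def constraint_consistency(response, constraints, satisfies):
    """Fraction of latent constraints the response satisfies.

    satisfies(response, constraint) -> bool is an assumed judge;
    with no active constraints the response is trivially consistent.
    """
    if not constraints:
        return 1.0
    hits = sum(bool(satisfies(response, c)) for c in constraints)
    return hits / len(constraints)
```

Unlike string matching, this rewards a response that respects an earlier stated preference even when the later query never mentions it, which is exactly the cue--trigger disconnect the benchmark targets.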