If human experience is any guide, operating effectively in unstructured environments -- like homes and offices -- requires robots to sense the forces during physical interaction. Yet, the lack of a versatile, accessible, and easily customizable tactile sensor has led to fragmented, sensor-specific solutions in robotic manipulation -- and in many cases, to force-unaware, sensorless approaches. With eFlesh, we bridge this gap by introducing a magnetic tactile sensor that is low-cost, easy to fabricate, and highly customizable. Building an eFlesh sensor requires only four components: a hobbyist 3D printer, off-the-shelf magnets (<$5), a CAD model of the desired shape, and a magnetometer circuit board. The sensor is constructed from tiled, parameterized microstructures, which allow for tuning the sensor's geometry and its mechanical response. We provide an open-source design tool that converts convex OBJ/STL files into 3D-printable STLs for fabrication. This modular design framework enables users to create application-specific sensors, and to adjust sensitivity depending on the task. Our sensor characterization experiments demonstrate the capabilities of eFlesh: contact localization RMSE of 0.5 mm, and force prediction RMSE of 0.27 N for normal force and 0.12 N for shear force. We also present a learned slip detection model that generalizes to unseen objects with 95% accuracy, and visuotactile control policies that improve manipulation performance by 40% over vision-only baselines -- achieving 91% average success rate for four precise tasks that require sub-mm accuracy for successful completion. All design files, code and the CAD-to-eFlesh STL conversion tool are open-sourced and available on this https URL.
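To make the sensing pipeline concrete, below is a minimal calibration sketch for a magnet-plus-magnetometer tactile sensor: a regression is fit from raw magnetic-field readings to forces measured by a reference force/torque sensor. The feature layout, the use of ridge regression, and the placeholder data are assumptions for illustration, not the eFlesh characterization pipeline itself.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical calibration sketch: map raw magnetometer readings to forces
# measured by a reference force/torque sensor. Feature layout and model choice
# are assumptions, not the eFlesh pipeline.

def fit_force_model(B_readings, forces):
    """B_readings: (N, 3 * num_magnetometers) field samples; forces: (N, 3) [Fx, Fy, Fz]."""
    model = Ridge(alpha=1.0)
    model.fit(B_readings, forces)   # multi-output regression: shear (Fx, Fy) and normal (Fz)
    return model

# Placeholder calibration data (random, for illustration only).
B_cal = np.random.randn(1000, 12)
F_cal = np.random.randn(1000, 3)
model = fit_force_model(B_cal, F_cal)
print(model.predict(B_cal[:1]))     # -> array of shape (1, 3): predicted [Fx, Fy, Fz]
```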
https://arxiv.org/abs/2506.09994
Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training. This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy. See our project page: this https URL
https://arxiv.org/abs/2506.09993
We present Chain-of-Action (CoA), a novel visuo-motor policy paradigm built upon Trajectory Autoregressive Modeling. Unlike conventional approaches that predict the next action(s) in a forward manner, CoA generates an entire trajectory by explicit backward reasoning from task-specific goals through an action-level Chain-of-Thought (CoT) process. This process is unified within a single autoregressive structure: (1) the first token corresponds to a stable keyframe action that encodes the task-specific goals; and (2) subsequent action tokens are generated autoregressively, conditioned on the initial keyframe and previously predicted actions. This backward action reasoning enforces a global-to-local structure, allowing each local action to be tightly constrained by the final goal. To further realize the action reasoning structure, CoA incorporates four complementary designs: continuous action token representation; dynamic stopping for variable-length trajectory generation; reverse temporal ensemble; and multi-token prediction to balance action chunk modeling with global structure. As a result, CoA provides strong spatial generalization capabilities while preserving the flexibility and simplicity of a visuo-motor policy. Empirically, we observe that CoA achieves state-of-the-art performance across 60 RLBench tasks and 8 real-world manipulation tasks.
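The keyframe-first, backward autoregressive rollout with dynamic stopping can be sketched as follows; `policy.predict_keyframe` and `policy.predict_next` are hypothetical interfaces standing in for the paper's model, and the stop rule is an assumption.

```python
import torch

# Illustrative sketch (not the authors' code) of the keyframe-first, backward
# autoregressive rollout. `policy.predict_keyframe` and `policy.predict_next`
# are hypothetical interfaces; the threshold rule stands in for dynamic stopping.

def chain_of_action_rollout(policy, observation, max_steps=64, stop_threshold=0.5):
    # (1) First token: a stable keyframe action encoding the task-specific goal.
    keyframe = policy.predict_keyframe(observation)            # tensor of shape (action_dim,)
    actions = [keyframe]
    # (2) Subsequent tokens: generated autoregressively, conditioned on the
    #     keyframe and all previously predicted actions (global-to-local).
    for _ in range(max_steps):
        prefix = torch.stack(actions)                          # (t, action_dim)
        next_action, stop_logit = policy.predict_next(observation, prefix)
        actions.append(next_action)
        if torch.sigmoid(stop_logit) > stop_threshold:         # dynamic stopping
            break
    # The chain is generated goal-first, so reverse it to obtain a trajectory
    # ordered from the current state toward the keyframe/goal.
    return torch.stack(actions[::-1])
```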
https://arxiv.org/abs/2506.09990
We study the problem of making 3D scene reconstructions interactive by asking the following question: can we predict the sounds of human hands physically interacting with a scene? First, we record a video of a human manipulating objects within a 3D scene using their hands. We then use these action-sound pairs to train a rectified flow model to map 3D hand trajectories to their corresponding audio. At test time, a user can query the model for other actions, parameterized as sequences of hand poses, to estimate their corresponding sounds. In our experiments, we find that our generated sounds accurately convey material properties and actions, and that they are often indistinguishable to human observers from real sounds. Project page: this https URL
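For readers unfamiliar with rectified flow, the training objective used to map hand trajectories to audio can be illustrated with a generic training step: the network regresses the constant velocity along a straight line between noise and the audio target, conditioned on trajectory features. The encodings and model interface below are assumptions.

```python
import torch
import torch.nn.functional as F

# Generic rectified-flow training step for conditional audio generation.
# `model` is a hypothetical network v(x_t, t, cond); audio latents and
# hand-trajectory features are assumed to be precomputed.

def rectified_flow_step(model, audio, traj_cond):
    """audio: (B, D) target audio latents; traj_cond: (B, C) hand-trajectory features."""
    noise = torch.randn_like(audio)                       # x_0 ~ N(0, I)
    t = torch.rand(audio.shape[0], 1, device=audio.device)
    x_t = (1.0 - t) * noise + t * audio                   # straight-line interpolation
    target_velocity = audio - noise                       # constant velocity along the path
    pred_velocity = model(x_t, t, traj_cond)
    return F.mse_loss(pred_velocity, target_velocity)
```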
https://arxiv.org/abs/2506.09989
Existing benchmarks for assessing the spatio-temporal understanding and reasoning abilities of video language models are susceptible to score inflation due to the presence of shortcut solutions based on superficial visual or textual cues. This paper mitigates the challenges in accurately assessing model performance by introducing the Minimal Video Pairs (MVP) benchmark, a simple shortcut-aware video QA benchmark for assessing the physical understanding of video language models. The benchmark comprises 55K high-quality multiple-choice video QA examples focusing on physical world understanding. Examples are curated from nine video data sources, spanning first-person egocentric and exocentric videos, robotic interaction data, and cognitive science intuitive physics benchmarks. To mitigate shortcut solutions that rely on superficial visual or textual cues and biases, each sample in MVP has a minimal-change pair -- a visually similar video accompanied by an identical question but an opposing answer. To answer a question correctly, a model must provide correct answers for both examples in the minimal-change pair; as such, models that solely rely on visual or textual biases would achieve below-random performance. Human performance on MVP is 92.9%, while the best open-source state-of-the-art video-language model achieves 40.2%, compared to random performance at 25%.
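The paired scoring rule is simple enough to state in code; the sketch below (with a hypothetical data layout) shows why bias-only strategies fall below chance on minimal-change pairs.

```python
# Paired scoring: a model is credited only if it answers BOTH videos of a
# minimal-change pair correctly. The (prediction, answer) tuple layout is a
# hypothetical stand-in for the benchmark's actual format.

def paired_accuracy(pairs):
    """pairs: iterable of ((pred_a, ans_a), (pred_b, ans_b)) tuples."""
    correct = sum(
        1 for (pred_a, ans_a), (pred_b, ans_b) in pairs
        if pred_a == ans_a and pred_b == ans_b
    )
    return correct / len(pairs)

# A model that always answers "A" gets credit on neither pair below, because
# the two questions in each minimal-change pair have opposing answers.
always_a = [(("A", "A"), ("A", "B")), (("A", "B"), ("A", "A"))]
print(paired_accuracy(always_a))   # 0.0 -- bias-only strategies fall below chance
```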
https://arxiv.org/abs/2506.09987
End-to-end human animation with rich multi-modal conditions, e.g., text, image, and audio, has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a global manner, ignoring scenarios in which multiple concepts appear in the same video with rich human-human and human-object interactions. Such a global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from each modality to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method can automatically infer layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject the local audio condition into its corresponding region to ensure layout-aligned modality matching in an iterative manner. This design enables high-quality generation of controllable multi-concept human-centric videos. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.
https://arxiv.org/abs/2506.09984
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories) to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100), surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
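A generic latent-space planning loop with an image goal, in the spirit of the zero-shot deployment described above, might look like the sketch below; the encoder/world-model interfaces and the random-shooting planner are illustrative assumptions rather than the paper's exact procedure.

```python
import torch

# Hypothetical interfaces: `encoder` maps an image to a latent, `world_model`
# predicts the next latent given (latent, action). A simple random-shooting
# planner picks the action sequence whose predicted final latent is closest to
# the goal embedding, then executes its first action (MPC-style).

def plan_to_image_goal(encoder, world_model, obs_image, goal_image,
                       horizon=8, num_candidates=256, action_dim=7):
    z_0 = encoder(obs_image)                       # current latent state
    z_goal = encoder(goal_image)                   # latent goal
    candidates = torch.randn(num_candidates, horizon, action_dim)

    costs = []
    for actions in candidates:
        z = z_0
        for a in actions:                          # roll the world model forward in latent space
            z = world_model(z, a)
        costs.append(torch.norm(z - z_goal))       # distance to the goal embedding
    best = torch.argmin(torch.stack(costs))
    return candidates[best, 0]                     # first action of the best sequence
```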
https://arxiv.org/abs/2506.09985
How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring a diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open-world driving scenarios under various actions, including hazardous non-expert ones. To close the gap between high-fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates a reward from ReSim's simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.
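A Video2Reward-style head can be sketched as a small regressor over pooled features of the simulated future; the feature extractor, pooling scheme, and dimensions below are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Toy Video2Reward-style head: pool per-frame features of a simulated future
# clip and regress a scalar reward used to rank candidate actions. Dimensions
# and pooling are assumptions.

class Video2Reward(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, frame_features):
        """frame_features: (B, T, feat_dim) features of the simulated future frames."""
        pooled = frame_features.mean(dim=1)        # temporal average pooling
        return self.head(pooled).squeeze(-1)       # (B,) scalar reward per rollout

# Usage: score the simulated futures of several candidate actions, pick the best.
# rewards = video2reward(features_per_candidate)
# best_action = candidate_actions[rewards.argmax()]
```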
https://arxiv.org/abs/2506.09981
Computing stabilizing and optimal control actions for legged locomotion in real time is difficult due to the nonlinear, hybrid, and high-dimensional nature of these robots. The hybrid nature of the system introduces a combination of discrete and continuous variables, which causes issues for numerical optimal control. To address these challenges, we propose a layered architecture that separates the choice of discrete variables from a smooth Model Predictive Controller (MPC). The layered formulation allows for online flexibility and optimality without sacrificing real-time performance through a combination of gradient-free and gradient-based methods. The architecture leverages a sampling-based method for determining discrete variables, and a classical smooth MPC formulation using these fixed discrete variables. We demonstrate the approach on a quadrupedal robot stepping over gaps and onto terrain with varying heights. In simulation, we demonstrate the controller on a humanoid robot for gap traversal. The layered approach is shown to be more optimal and reliable than common heuristic-based approaches and faster to compute than pure sampling methods.
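The layered idea can be summarized as a gradient-free outer loop over discrete variables wrapped around a smooth MPC solve; in the sketch below, `sample_discrete_candidates` and `solve_smooth_mpc` are hypothetical stand-ins for the paper's components.

```python
# Gradient-free outer loop over discrete variables (e.g., stepping/contact
# sequences) wrapped around a smooth MPC solve with those variables held fixed.
# `sample_discrete_candidates` and `solve_smooth_mpc` are hypothetical stand-ins.

def layered_control(state, sample_discrete_candidates, solve_smooth_mpc, num_samples=32):
    best_cost, best_plan, best_discrete = float("inf"), None, None
    for discrete_vars in sample_discrete_candidates(state, num_samples):
        # Inner layer: classical smooth MPC with the discrete choices fixed.
        cost, plan = solve_smooth_mpc(state, discrete_vars)
        if cost < best_cost:
            best_cost, best_plan, best_discrete = cost, plan, discrete_vars
    return best_discrete, best_plan
```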
https://arxiv.org/abs/2506.09979
We introduce CausalVQA, a benchmark dataset for video question answering (VQA) composed of question-answer pairs that probe models' understanding of causality in the physical world. Existing VQA benchmarks either tend to focus on surface perceptual understanding of real-world videos, or on narrow physical reasoning questions created using simulation environments. CausalVQA fills an important gap by presenting challenging questions that are grounded in real-world scenarios, while focusing on models' ability to predict the likely outcomes of different actions and events through five question types: counterfactual, hypothetical, anticipation, planning and descriptive. We designed quality control mechanisms that prevent models from exploiting trivial shortcuts, requiring models to base their answers on deep visual understanding instead of linguistic cues. We find that current frontier multimodal models fall substantially below human performance on the benchmark, especially on anticipation and hypothetical questions. This highlights a challenge for current systems to leverage spatial-temporal reasoning, understanding of physical principles, and comprehension of possible alternatives to make accurate predictions in real-world settings.
https://arxiv.org/abs/2506.09943
Information asymmetry is a pervasive feature of multi-agent systems, especially evident in economics and social sciences. In these settings, agents tailor their actions based on private information to maximize their rewards. These strategic behaviors often introduce complexities due to confounding variables. Simultaneously, knowledge transportability poses another significant challenge, arising from the difficulties of conducting experiments in target environments: knowledge must be transferred from environments where empirical data is more readily available. Against this backdrop, this paper explores a fundamental question in online learning: Can we employ non-i.i.d. actions to learn about confounders even when knowledge transfer is required? We present a sample-efficient algorithm designed to accurately identify system dynamics under information asymmetry and to navigate the challenges of knowledge transfer effectively in reinforcement learning, framed within an online strategic interaction model. Our method provably learns an $\epsilon$-optimal policy with a tight sample complexity of $O(1/\epsilon^2)$.
https://arxiv.org/abs/2506.09940
While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out-of-the-box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested only on one or a few specific tasks, while VLAs require the detector to generalize and detect failures also in unseen tasks and novel environments. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, which is generic across different tasks. Based on this insight, we design SAFE to learn from VLA internal features and predict a single scalar indicating the likelihood of task failure. SAFE is trained on both successful and failed rollouts, and is evaluated on unseen tasks. SAFE is compatible with different policy architectures. We extensively test it on OpenVLA, $\pi_0$, and $\pi_0$-FAST in both simulated and real-world environments. We compare SAFE with diverse baselines and show that SAFE achieves state-of-the-art failure detection performance and the best trade-off between accuracy and detection time using conformal prediction. More qualitative results can be found at this https URL.
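As one way to picture the accuracy/detection-time trade-off via conformal prediction, the sketch below calibrates an alarm threshold for a scalar failure score on held-out successful rollouts; the score source and calibration split are assumptions, and only the split-conformal quantile rule is standard.

```python
import numpy as np

# Calibrate an alarm threshold for a scalar failure score using split conformal
# prediction on held-out SUCCESSFUL rollouts, then flag the first timestep whose
# score exceeds it. The score source (e.g., a head over VLA features) is assumed.

def conformal_threshold(calibration_scores, alpha=0.05):
    n = len(calibration_scores)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)   # finite-sample correction
    return float(np.quantile(calibration_scores, q_level))

def detect_failure(score_trace, threshold):
    for t, score in enumerate(score_trace):
        if score > threshold:
            return t        # detection time (earlier is better, at fixed accuracy)
    return None             # no failure flagged
```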
https://arxiv.org/abs/2506.09937
One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. Emerging benchmarks for VLAs that incorporate language often come with limited evaluation tasks and do not intend to investigate how much VLM pretraining truly contributes to the generalization capabilities of the downstream robotic policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier for reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high level planning, which we refer to as good intentions, this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions, but falter in action execution. Moreover, finetuning on action data can erode the original VLM's generalist reasoning abilities. We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at this https URL
https://arxiv.org/abs/2506.09930
Hyperspectral image (HSI) clustering assigns similar pixels to the same class without any annotations, which is an important yet challenging task. For large-scale HSIs, most methods rely on superpixel segmentation and perform superpixel-level clustering based on graph neural networks (GNNs). However, existing GNNs cannot fully exploit the spectral information of the input HSI, and an inaccurate superpixel topological graph may lead to the confusion of different class semantics during information aggregation. To address these challenges, we first propose a structural-spectral graph convolutional operator (SSGCO) tailored for graph-structured HSI superpixels to improve their representation quality through the co-extraction of spatial and spectral features. Second, we propose an evidence-guided adaptive edge learning (EGAEL) module that adaptively predicts and refines edge weights in the superpixel topological graph. We integrate the proposed method into a contrastive learning framework to achieve clustering, where representation learning and clustering are conducted simultaneously. Experiments demonstrate that the proposed method improves clustering accuracy by 2.61%, 6.06%, 4.96% and 3.15% over the best competing methods on four HSI datasets. Our code is available at this https URL.
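The intuition behind co-extracting spatial and spectral features on the superpixel graph can be illustrated with a toy operator that applies a 1-D convolution along each superpixel's spectrum and then aggregates over the graph; this is a conceptual sketch, not the paper's SSGCO definition.

```python
import torch
import torch.nn as nn

# Toy structural-spectral operator: a 1-D convolution along each superpixel's
# spectrum (spectral) followed by aggregation over the superpixel graph
# (structural). Conceptual sketch only, not the paper's SSGCO.

class StructuralSpectralConv(nn.Module):
    def __init__(self, bands, hidden):
        super().__init__()
        self.spectral = nn.Conv1d(1, hidden, kernel_size=7, padding=3)   # per-node spectral filter
        self.proj = nn.Linear(bands * hidden, hidden)

    def forward(self, x, adj):
        """x: (N, bands) superpixel spectra; adj: (N, N) normalized adjacency."""
        h = self.spectral(x.unsqueeze(1))      # (N, hidden, bands)
        h = self.proj(h.flatten(1))            # (N, hidden)
        return torch.relu(adj @ h)             # aggregate over graph neighbors
```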
https://arxiv.org/abs/2506.09920
Large language models (LLMs) have advanced conversational AI assistants. However, systematically evaluating how well these assistants apply personalization--adapting to individual user preferences while completing tasks--remains challenging. Existing personalization benchmarks focus on chit-chat, non-conversational tasks, or narrow domains, failing to capture the complexities of personalized task-oriented assistance. To address this, we introduce PersonaLens, a comprehensive benchmark for evaluating personalization in task-oriented AI assistants. Our benchmark features diverse user profiles equipped with rich preferences and interaction histories, along with two specialized LLM-based agents: a user agent that engages in realistic task-oriented dialogues with AI assistants, and a judge agent that employs the LLM-as-a-Judge paradigm to assess personalization, response quality, and task success. Through extensive experiments with current LLM assistants across diverse tasks, we reveal significant variability in their personalization capabilities, providing crucial insights for advancing conversational AI systems.
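A minimal LLM-as-a-Judge call in the spirit of the judge agent might look like the sketch below; `call_llm`, the rubric fields, and the JSON schema are assumptions mirroring the abstract rather than the benchmark's actual prompts.

```python
import json

# Minimal LLM-as-a-Judge call: `call_llm` is a hypothetical text-completion
# function; the rubric fields mirror the abstract, while the prompt wording and
# JSON schema are assumptions.

JUDGE_PROMPT = """You are evaluating an AI assistant's dialogue with a user.
User profile and preferences:
{profile}

Dialogue transcript:
{dialogue}

Score the assistant from 1-5 on: personalization, response_quality, task_success.
Reply with a JSON object containing these three keys and a short justification."""

def judge_dialogue(call_llm, profile, dialogue):
    prompt = JUDGE_PROMPT.format(profile=profile, dialogue=dialogue)
    raw = call_llm(prompt)
    return json.loads(raw)   # e.g. {"personalization": 4, "response_quality": 5, ...}
```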
https://arxiv.org/abs/2506.09902
In this paper, we propose a novel hierarchical framework for robot navigation in dynamic environments with heterogeneous constraints. Our approach leverages a graph neural network trained via reinforcement learning (RL) to efficiently estimate the robot's cost-to-go, formulated as local goal recommendations. A spatio-temporal path-searching module, which accounts for kinematic constraints, is then employed to generate a reference trajectory to facilitate solving the non-convex optimization problem used for explicit constraint enforcement. More importantly, we introduce an incremental action-masking mechanism and a privileged learning strategy, enabling end-to-end training of the proposed planner. Both simulation and real-world experiments demonstrate that the proposed method effectively addresses local planning in complex dynamic environments, achieving state-of-the-art (SOTA) performance. Compared with existing learning-optimization hybrid methods, our approach eliminates the dependency on high-fidelity simulation environments, offering significant advantages in computational efficiency and training scalability. The code will be released as open-source upon acceptance of the paper.
https://arxiv.org/abs/2506.09859
Embodied navigation stands as a foundational pillar within the broader pursuit of embodied AI. However, previous navigation research is divided into different tasks/capabilities, e.g., ObjNav, ImgNav and VLN, which differ in task objectives and modalities, so datasets and methods are designed individually. In this work, we take steps toward generalist navigation agents that can follow free-form instructions combining arbitrary modalities and capabilities. To achieve this, we propose a large-scale benchmark and a corresponding method, termed OctoNav-Bench and OctoNav-R1. Specifically, OctoNav-Bench features continuous environments and is constructed via a designed annotation pipeline. We carefully craft instruction-trajectory pairs, where the instructions are diverse, free-form, and span arbitrary modalities and capabilities. We also construct a Think-Before-Action (TBA-CoT) dataset within OctoNav-Bench to provide the thinking process behind actions. For OctoNav-R1, we build it upon MLLMs and adapt it into a VLA-type model that produces low-level actions solely from 2D visual observations. Moreover, we design a Hybrid Training Paradigm (HTP) that consists of three stages, i.e., Action-/TBA-SFT, Nav-GRPO, and online RL, each with specifically designed learning policies and rewards. Importantly, the TBA-SFT and Nav-GRPO designs are inspired by OpenAI-o1 and DeepSeek-R1, which show impressive reasoning ability via thinking-before-answering. We therefore investigate how to achieve thinking-before-action in embodied navigation, to improve the model's reasoning ability toward generalist agents. Specifically, we propose TBA-SFT, which uses the TBA-CoT dataset to fine-tune the model as a cold-start phase, and then leverage Nav-GRPO to improve its thinking ability. Finally, OctoNav-R1 shows superior performance compared with previous methods.
https://arxiv.org/abs/2506.09839
Micro-expressions (MEs) are subtle, fleeting nonverbal cues that reveal an individual's genuine emotional state. Their analysis has attracted considerable interest due to its promising applications in fields such as healthcare, criminal investigation, and human-computer interaction. However, existing ME research is limited to a single visual modality, overlooking the rich emotional information conveyed by other physiological modalities, resulting in ME recognition and spotting performance far below the needs of practical applications. Therefore, exploring the cross-modal association mechanism between ME visual features and physiological signals (PS), and developing a multimodal fusion framework, represents a pivotal step toward advancing ME analysis. This study introduces a novel ME dataset, MMME, which, for the first time, enables synchronized collection of facial action signals (MEs), central nervous system signals (EEG), and peripheral PS (PPG, RSP, SKT, EDA, and ECG). By overcoming the constraints of existing ME corpora, MMME comprises 634 MEs, 2,841 macro-expressions (MaEs), and 2,890 trials of synchronized multimodal PS, establishing a robust foundation for investigating ME neural mechanisms and conducting multimodal fusion-based analyses. Extensive experiments validate the dataset's reliability and provide benchmarks for ME analysis, demonstrating that integrating MEs with PS significantly enhances recognition and spotting performance. To the best of our knowledge, MMME is the most comprehensive ME dataset to date in terms of modality diversity. It provides critical data support for exploring the neural mechanisms of MEs and uncovering the visual-physiological synergistic effects, driving a paradigm shift in ME research from single-modality visual analysis to multimodal fusion. The dataset will be publicly available upon acceptance of this paper.
https://arxiv.org/abs/2506.09834
Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable progress in natural language reasoning with long chain-of-thought (CoT), yet they remain inefficient or inaccurate when handling complex mathematical operations. Addressing these limitations through computational tools (e.g., computation libraries and symbolic solvers) is promising, but it introduces a technical challenge: Code Interpreter (CI) brings external knowledge beyond the model's internal text representations, thus the direct combination is not efficient. This paper introduces CoRT, a post-training framework for teaching LRMs to leverage CI effectively and efficiently. As a first step, we address the data scarcity issue by synthesizing code-integrated reasoning data through Hint-Engineering, which strategically inserts different hints at appropriate positions to optimize LRM-CI interaction. We manually create 30 high-quality samples, upon which we post-train models ranging from 1.5B to 32B parameters, with supervised fine-tuning, rejection fine-tuning and reinforcement learning. Our experimental results demonstrate that Hint-Engineering models achieve 4% and 8% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B, respectively, across five challenging mathematical reasoning datasets. Furthermore, Hint-Engineering models use about 30% fewer tokens for the 32B model and 50% fewer tokens for the 1.5B model compared with the natural language models. The models and code are available at this https URL.
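The hint idea can be illustrated with a toy trigger that appends a code-use hint to a reasoning prefix; the trigger heuristic and hint wording below are assumptions, not the paper's manually curated hints.

```python
# Toy hint trigger: append a code-use hint when the latest reasoning sentence
# suggests manual calculation. Trigger phrases and hint wording are assumptions,
# not the paper's manually curated hints.

CODE_HINT = "\nThis step involves heavy computation; let me write code and run it instead.\n"

def insert_code_hint(reasoning_prefix, trigger_phrases=("compute", "simplify", "solve")):
    last_sentence = reasoning_prefix.strip().split(".")[-1].lower()
    if any(phrase in last_sentence for phrase in trigger_phrases):
        return reasoning_prefix + CODE_HINT
    return reasoning_prefix
```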
https://arxiv.org/abs/2506.09820
Neural front-ends are an appealing alternative to traditional, fixed feature extraction pipelines for automatic speech recognition (ASR) systems since they can be directly trained to fit the acoustic model. However, their performance often falls short compared to classical methods, which we show is largely due to their increased susceptibility to overfitting. This work therefore investigates regularization methods for training ASR models with learnable feature extraction front-ends. First, we examine audio perturbation methods and show that larger relative improvements can be obtained for learnable features. Additionally, we identify two limitations in the standard use of SpecAugment for these front-ends and propose masking in the short time Fourier transform (STFT)-domain as a simple but effective modification to address these challenges. Finally, integrating both regularization approaches effectively closes the performance gap between traditional and learnable features.
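Masking in the STFT domain can be sketched as: transform the waveform, zero out random time and frequency bands, and invert back so the learnable front-end still consumes raw audio; mask sizes and zero-filling below are illustrative assumptions.

```python
import torch

# Hedged sketch of STFT-domain masking applied before a learnable front-end.
# Mask widths, counts, and zero-filling are illustrative assumptions.

def stft_domain_mask(waveform, n_fft=512, hop=160, max_freq_bins=20, max_time_frames=30):
    """waveform: (batch, samples) float tensor."""
    window = torch.hann_window(n_fft, device=waveform.device)
    spec = torch.stft(waveform, n_fft, hop_length=hop, window=window, return_complex=True)
    batch, freq_bins, frames = spec.shape

    for b in range(batch):
        # One frequency mask and one time mask per utterance (illustrative).
        f_width = int(torch.randint(0, max_freq_bins + 1, (1,)))
        f_start = int(torch.randint(0, max(1, freq_bins - f_width), (1,)))
        spec[b, f_start:f_start + f_width, :] = 0

        t_width = int(torch.randint(0, max_time_frames + 1, (1,)))
        t_start = int(torch.randint(0, max(1, frames - t_width), (1,)))
        spec[b, :, t_start:t_start + t_width] = 0

    # Back to the waveform so the learnable front-end still sees raw audio.
    return torch.istft(spec, n_fft, hop_length=hop, window=window, length=waveform.shape[-1])
```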
https://arxiv.org/abs/2506.09804