Spatial reasoning in vision language models (VLMs) remains fragile when semantics hinge on subtle temporal or geometric cues. We introduce a synthetic benchmark that probes two complementary skills: situational awareness (recognizing whether an interaction is harmful or benign) and spatial awareness (tracking who does what to whom, and reasoning about relative positions and motion). Through minimal video pairs, we test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. While we evaluate recent VLMs in a training-free setting, the benchmark is applicable to any video classification model. Results show performance only slightly above chance across tasks. A simple aid, stable color cues, partly reduces assailant role confusions but does not resolve the underlying weakness. By releasing data and code, we aim to provide reproducible diagnostics and seed exploration of lightweight spatial priors to complement large-scale pretraining.
https://arxiv.org/abs/2601.15780
As environmental disasters happen more frequently and severely, seeking the source of pollutants or harmful particulates using plume tracking becomes even more important. Plume tracking on small quadrotors would allow these systems to operate around humans and fly in more confined spaces, but can be challenging due to the poor sensitivity and long response times of gas sensors that fit on small quadrotors. In this work, we present an approach to complement chemical plume tracking with airflow source-seeking behavior, using a custom flow sensor that can sense both airflow magnitude and direction on small (< 100 g) quadrotors. We use this sensor to implement a modified version of the 'Cast and Surge' algorithm that takes advantage of flow-direction sensing to find and navigate towards flow sources. A series of characterization experiments verified that the system can detect airflow while in flight and reorient the quadrotor toward the airflow. Several trials with random starting locations and orientations showed that our source-seeking algorithm can reliably find a flow source. This work aims to provide a foundation for future platforms that can use flow sensors in concert with other sensors to enable richer plume-tracking data collection and source-seeking.
https://arxiv.org/abs/2601.15607
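The cast-and-surge behavior described above can be sketched as a two-mode policy: surge upwind when flow is sensed, cast crosswind with widening sweeps when it is lost. The threshold, sweep-widening factor, and state layout below are illustrative assumptions, not the paper's implementation:

```python
import math

def cast_and_surge_step(state, flow_mag, flow_dir, threshold=0.2):
    """One decision step of a simplified cast-and-surge policy.

    state    : dict with 'mode' ('cast' or 'surge'), 'cast_sign', 'cast_len'
    flow_mag : sensed airflow magnitude
    flow_dir : most recent flow direction estimate, in radians (the
               direction the flow comes FROM, i.e. upwind heading)
    Returns a heading command in radians and the updated state.
    All numeric values here are illustrative, not tuned.
    """
    if flow_mag >= threshold:
        # Flow detected: surge straight into the oncoming flow.
        state['mode'] = 'surge'
        state['cast_len'] = 1.0          # reset the casting amplitude
        return flow_dir, state
    # Flow lost: cast crosswind, widening the sweep at each reversal.
    state['mode'] = 'cast'
    state['cast_sign'] *= -1             # alternate left/right sweeps
    state['cast_len'] *= 1.5             # widen the search each time
    heading = flow_dir + state['cast_sign'] * math.pi / 2
    return heading, state
```

A caller would feed this the flow sensor's magnitude/direction estimates each control cycle and convert the returned heading into a yaw setpoint.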
Targeted drug delivery in the gastrointestinal (GI) tract using magnetic robots offers a promising alternative to systemic treatments. However, controlling these robots is a major challenge. Stationary magnetic systems have a limited workspace, while mobile systems (e.g., coils on a robotic arm) suffer from a "model-calibration bottleneck", requiring complex, pre-calibrated physical models that are time-consuming to create and computationally expensive. This paper presents a compact, low-cost mobile magnetic manipulation platform that overcomes this limitation using Deep Reinforcement Learning (DRL). Our system features a compact four-electromagnet array mounted on a UR5 collaborative robot. A Soft Actor-Critic (SAC)-based control strategy is trained through a sim-to-real pipeline, enabling effective policy deployment within 15 minutes and significantly reducing setup time. We validated the platform by controlling a 7-mm magnetic capsule along 2D trajectories. Our DRL-based controller achieved a root-mean-square error (RMSE) of 1.18 mm for a square path and 1.50 mm for a circular path. We also demonstrated successful tracking over a clinically relevant 30 cm × 20 cm workspace. This work demonstrates a rapidly deployable, model-free control framework capable of precise magnetic manipulation in a large workspace, validated using a 2D GI phantom.
https://arxiv.org/abs/2601.15545
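The RMSE figures quoted above are computed over matched trajectory samples. A minimal sketch of that metric, assuming paired 2D reference/actual points in millimetres:

```python
import math

def trajectory_rmse(reference, actual):
    """Root-mean-square error between matched 2D trajectory samples.

    reference, actual : equal-length sequences of (x, y) points (e.g. in mm).
    Each term is the squared Euclidean distance between matched samples.
    """
    assert len(reference) == len(actual), "trajectories must be sample-matched"
    sq = [(rx - ax) ** 2 + (ry - ay) ** 2
          for (rx, ry), (ax, ay) in zip(reference, actual)]
    return math.sqrt(sum(sq) / len(sq))
```

For example, a single sample offset by a 3-4-5 triangle gives an RMSE of exactly 5 mm.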
A common solution for mitigating outdated or incorrect information in Large Language Models (LLMs) is to provide updated facts in-context or through knowledge editing. However, these methods introduce knowledge conflicts when the knowledge update fails to overwrite the model's parametric knowledge, and these conflicts propagate into faulty reasoning. Current benchmarks for this problem, however, largely focus on single knowledge updates and fact recall without evaluating how these updates affect downstream reasoning. In this work, we introduce TRACK (Testing Reasoning Amid Conflicting Knowledge), a new benchmark for studying how LLMs propagate new knowledge through multi-step reasoning when it conflicts with the model's initial parametric knowledge. Spanning three reasoning-intensive scenarios (WIKI, CODE, and MATH), TRACK introduces multiple, realistic conflicts to mirror real-world complexity. Our results on TRACK reveal that providing updated facts to models for reasoning can worsen performance compared to providing no updated facts at all, and that this degradation worsens as more updated facts are provided. We show this failure stems both from an inability to faithfully integrate updated facts and from flawed reasoning even when the knowledge is integrated. TRACK provides a rigorous new benchmark to measure and guide future progress on propagating conflicting knowledge in multi-step reasoning.
https://arxiv.org/abs/2601.15495
Autonomous drone racing represents a major frontier in robotics research. It requires an Artificial Intelligence (AI) that can run onboard lightweight flying robots under tight resource and time constraints, while pushing the physical system to its limits. The state of the art in this area is a system with a stereo camera and an inertial measurement unit (IMU) that beat human drone racing champions in a controlled indoor environment. Here, we present MonoRace: an onboard drone racing approach that uses a monocular rolling-shutter camera and an IMU, and that generalizes to a competition environment without any external motion-tracking system. The approach features robust state estimation that combines neural-network-based gate segmentation with a drone model. Moreover, it includes an offline optimization procedure that leverages the known geometry of the gates to refine any state-estimation parameter. This offline optimization is based purely on onboard flight data and is important for fine-tuning the vital extrinsic camera calibration parameters. Furthermore, guidance and control are performed by a neural network that forgoes inner-loop controllers by directly sending motor commands. This small network runs on the flight controller at 500 Hz. The proposed approach won the 2025 Abu Dhabi Autonomous Drone Racing Competition (A2RL), outperforming all competing AI teams and three human world-champion pilots in a direct knockout tournament. It set a new milestone in autonomous drone racing research, reaching speeds of up to 100 km/h on the competition track and successfully coping with problems such as camera interference and IMU saturation.
https://arxiv.org/abs/2601.15222
Everyday communication is dynamic and multisensory, often involving shifting attention, overlapping speech, and visual cues. Yet most neural attention-tracking studies are still limited to highly controlled lab settings, using clean, often audio-only stimuli and requiring sustained attention to a single talker. This work addresses that gap by introducing a novel dataset from 24 normal-hearing participants. We used a mobile electroencephalography (EEG) system (44 scalp electrodes and 20 cEEGrid electrodes) in an audiovisual (AV) paradigm with three conditions: sustained attention to a single talker in a two-talker environment, attention switching between two talkers, and unscripted two-talker conversations with a competing single talker. Analysis included temporal response function (TRF) modeling, optimal lag analysis, selective attention classification with decision windows ranging from 1.1 s to 35 s, and comparisons of TRFs for attention to AV conversations versus side audio-only talkers. Key findings show significant differences in the attention-related P2 peak between attended and ignored speech across conditions for scalp EEG. No significant change in performance between switching and sustained attention suggests robustness to attention switches. Optimal lag analysis revealed a narrower peak for conversation than for single-talker AV stimuli, reflecting the additional complexity of multi-talker processing. Classification of selective attention was consistently above chance (55-70% accuracy) for scalp EEG, while cEEGrid data yielded lower correlations, highlighting the need for further methodological improvements. These results demonstrate that mobile EEG can reliably track selective attention in dynamic, multisensory listening scenarios and provide guidance for designing future AV paradigms and real-world attention-tracking applications.
https://arxiv.org/abs/2601.15097
Driver distraction remains a leading contributor to motor vehicle crashes, necessitating rigorous evaluation of new in-vehicle technologies. This study assessed the visual and cognitive demands associated with an advanced Large Language Model (LLM) conversational agent (Gemini Live) during on-road driving, comparing it against hands-free phone calls, visual turn-by-turn guidance (low-load baseline), and the Operation Span (OSPAN) task (high-load anchor). Thirty-two licensed drivers completed five secondary tasks while visual and cognitive demands were measured using the Detection Response Task (DRT) for cognitive load, eye-tracking for visual attention, and subjective workload ratings. Results indicated that Gemini Live interactions (both single-turn and multi-turn) and hands-free phone calls shared similar levels of cognitive load, between that of visual turn-by-turn guidance and OSPAN. Exploratory analysis showed that cognitive load remained stable across extended multi-turn conversations. All tasks maintained mean glance durations well below the well-established 2-second safety threshold, confirming low visual demand. Furthermore, drivers consistently dedicated longer glances to the roadway between brief off-road glances toward the device during task completion, particularly during voice-based interactions, rendering longer total-eyes-off-road time findings less consequential. Subjective ratings mirrored objective data, with participants reporting low effort, demands, and perceived distraction for Gemini Live. These findings demonstrate that advanced LLM conversational agents, when implemented via voice interfaces, impose cognitive and visual demands comparable to established, low-risk hands-free benchmarks, supporting their safe deployment in the driving environment.
https://arxiv.org/abs/2601.15034
Pure pursuit and its variants are widely used for mobile robot path tracking owing to their simplicity and computational efficiency. However, many conventional approaches do not explicitly account for velocity and acceleration constraints, leading to discrepancies between commanded and actual velocities that cause overshoot and degraded tracking performance. To address this problem, this paper proposes dynamic window pure pursuit (DWPP), which fundamentally reformulates the command velocity computation process to explicitly incorporate velocity and acceleration constraints. Specifically, DWPP formulates command velocity computation in the velocity space (the $v$-$\omega$ plane) and selects the command velocity as the point within the dynamic window that is closest to the line $\omega = \kappa v$. Experimental results demonstrate that DWPP avoids constraint-violating commands and achieves superior path-tracking accuracy compared with conventional pure pursuit methods. The proposed method has been integrated into the official Nav2 repository and is publicly available (this https URL).
https://arxiv.org/abs/2601.15006
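The velocity-space selection DWPP performs can be sketched as a search over the dynamic window for the point closest to the line ω = κv. The grid resolution, the acceleration/velocity limits, and the tie-break toward higher forward velocity below are illustrative assumptions, not the Nav2 implementation:

```python
def dwpp_command(v0, w0, kappa, v_max=1.0, w_max=2.0,
                 a_v=0.5, a_w=1.5, dt=0.1, n=50):
    """Pick the (v, w) pair in the dynamic window closest to w = kappa * v.

    The dynamic window is the set of velocities reachable within one control
    period dt under acceleration limits (a_v, a_w), intersected with the
    absolute limits [0, v_max] x [-w_max, w_max].  Since the candidate set is
    compared against a single line, the vertical residual |w - kappa*v| gives
    the same ordering as true perpendicular distance (constant 1/sqrt(1+k^2)
    factor), so it is used directly.
    """
    v_lo, v_hi = max(0.0, v0 - a_v * dt), min(v_max, v0 + a_v * dt)
    w_lo, w_hi = max(-w_max, w0 - a_w * dt), min(w_max, w0 + a_w * dt)
    best, best_d = None, float('inf')
    for i in range(n + 1):
        v = v_lo + (v_hi - v_lo) * i / n
        # For a fixed v, the best w is the clamped projection onto the line.
        w = min(max(kappa * v, w_lo), w_hi)
        d = abs(w - kappa * v)   # zero whenever the line is reachable
        # Tie-break (an assumed heuristic): prefer the on-line point with the
        # highest forward velocity, i.e. the most progress along the path.
        if d < best_d - 1e-12 or (abs(d - best_d) <= 1e-12
                                  and best is not None and v > best[0]):
            best, best_d = (v, w), d
    return best
```

When the line is unreachable within one control period, the selected command is the window corner nearest to it, so the robot approaches the pure-pursuit curvature as fast as its acceleration limits allow.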
Humanoid robots must adapt their contact behavior to diverse objects and tasks, yet most controllers rely on fixed, hand-tuned impedance gains and gripper settings. This paper introduces HumanoidVLM, a vision-language driven retrieval framework that enables the Unitree G1 humanoid to select task-appropriate Cartesian impedance parameters and gripper configurations directly from an egocentric RGB image. The system couples a vision-language model for semantic task inference with a FAISS-based Retrieval-Augmented Generation (RAG) module that retrieves experimentally validated stiffness-damping pairs and object-specific grasp angles from two custom databases, and executes them through a task-space impedance controller for compliant manipulation. We evaluate HumanoidVLM on 14 visual scenarios and achieve a retrieval accuracy of 93%. Real-world experiments show stable interaction dynamics, with z-axis tracking errors typically within 1-3.5 cm and virtual forces consistent with task-dependent impedance settings. These results demonstrate the feasibility of linking semantic perception with retrieval-based control as an interpretable path toward adaptive humanoid manipulation.
https://arxiv.org/abs/2601.14874
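The retrieval step, matching a task embedding against validated impedance/grasp entries, can be sketched with a brute-force cosine-similarity search standing in for FAISS. The database rows, parameter values, and toy embeddings below are invented for illustration and do not come from the paper:

```python
import math

# Hypothetical entries standing in for the paper's validated databases:
# (task label, (stiffness N/m, damping Ns/m, gripper angle deg), embedding).
# A real system would embed the task description with the VLM, not by hand.
DATABASE = [
    ("pick_soft_object",  (150.0, 12.0, 35.0), [0.9, 0.1, 0.0]),
    ("pick_rigid_object", (600.0, 40.0, 20.0), [0.1, 0.9, 0.0]),
    ("wipe_surface",      (300.0, 25.0,  0.0), [0.0, 0.2, 0.9]),
]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (math.sqrt(sum(x * x for x in a))
           * math.sqrt(sum(x * x for x in b)))
    return num / den

def retrieve_parameters(query_embedding):
    """Return the database row whose embedding best matches the query."""
    return max(DATABASE, key=lambda row: cosine(row[2], query_embedding))
```

FAISS replaces the linear scan with an index (e.g. an inner-product flat index over normalized vectors), but the selection logic is the same: the retrieved stiffness-damping pair and grasp angle are then handed to the impedance controller.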
Soft robotic instruments could navigate delicate, tortuous anatomy more safely than rigid tools, but clinical adoption is limited by insufficient tip functionalization and real-time feedback at the tissue interface. Few sensing and therapeutic modules are compact, robust, and adaptable enough to measure, and respond to, subtle physiological cues during intraluminal procedures. We present a 1.47 mm diameter modular soft robotic catheter that integrates sensing, actuation, and therapy while retaining the compliance needed for safe endoluminal navigation. Validated across multiple in vivo settings, we emphasize its utility in endoscopic retrograde cholangiopancreatography (ERCP), a highly technical procedure and a key access route to the pancreas, an organ that is fragile, difficult to instrument, and central to diseases such as pancreatic cancer. Our architecture supports up to four independently controlled functional units, allowing customizable combinations of anchoring, manipulation, sensing, and targeted drug delivery. In a live porcine model, we demonstrate semi-autonomous deployment into the pancreatic duct and 7.5 cm of endoscopic navigation within it, a region currently inaccessible with standard catheters. A closed-loop autonomous/shared-control system that combines a learned model, magnetic actuation, onboard shape sensing, and visual marker tracking further improves cannulation accuracy. Together, these results establish a scalable platform for multifunctional soft robotic catheters and a new paradigm for complex endoluminal interventions, with potential to reduce radiation exposure, shorten training, and accelerate clinical translation of soft robotic technologies.
https://arxiv.org/abs/2601.14837
Multi-modal object tracking has attracted considerable attention by integrating multiple complementary inputs (e.g., thermal, depth, and event data) to achieve outstanding performance. Although current general-purpose multi-modal trackers primarily unify various modal tracking tasks (i.e., RGB-Thermal infrared, RGB-Depth, or RGB-Event tracking) through prompt learning, they still overlook the effective capture of spatio-temporal cues. In this work, we introduce a novel multi-modal tracking framework based on a Mamba-style state-space model, termed UBATrack. Our UBATrack comprises two simple yet effective modules: a Spatio-temporal Mamba Adapter (STMA) and a Dynamic Multi-modal Feature Mixer. The former leverages Mamba's long-sequence modeling capability to jointly model cross-modal dependencies and spatio-temporal visual cues in an adapter-tuning manner. The latter further enhances multi-modal representation capacity across multiple feature dimensions to improve tracking robustness. In this way, UBATrack eliminates the need for costly full-parameter fine-tuning, thereby improving the training efficiency of multi-modal tracking algorithms. Experiments show that UBATrack outperforms state-of-the-art methods on RGB-T, RGB-D, and RGB-E tracking benchmarks, achieving outstanding results on the LasHeR, RGBT234, RGBT210, DepthTrack, VOT-RGBD22, and VisEvent datasets.
https://arxiv.org/abs/2601.14799
Optimizing scientific computing algorithms for modern GPUs is a labor-intensive and iterative process involving repeated code modification, benchmarking, and tuning across complex hardware and software stacks. Recent work has explored large language model (LLM)-assisted evolutionary methods for automated code optimization, but these approaches primarily rely on outcome-based selection and random mutation, underutilizing the rich trajectory information generated during iterative optimization. We propose PhyloEvolve, an LLM-agent system that reframes GPU-oriented algorithm optimization as an In-Context Reinforcement Learning (ICRL) problem. This formulation enables trajectory-conditioned reuse of optimization experience without model retraining. PhyloEvolve integrates Algorithm Distillation and prompt-based Decision Transformers into an iterative workflow, treating sequences of algorithm modifications and performance feedback as first-class learning signals. To organize optimization history, we introduce a phylogenetic tree representation that captures inheritance, divergence, and recombination among algorithm variants, enabling backtracking, cross-lineage transfer, and reproducibility. The system combines elite trajectory pooling, multi-island parallel exploration, and containerized execution to balance exploration and exploitation across heterogeneous hardware. We evaluate PhyloEvolve on scientific computing workloads including PDE solvers, manifold learning, and spectral graph algorithms, demonstrating consistent improvements in runtime, memory efficiency, and correctness over baseline and evolutionary methods. Code is published at: this https URL
https://arxiv.org/abs/2601.14523
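The phylogenetic-tree bookkeeping can be sketched as nodes that record parentage, so that lineages can be backtracked and recombination is expressed as a node with two parents. The class layout below is an illustrative guess at such a structure, not the released code:

```python
class VariantNode:
    """One algorithm variant in a phylogenetic optimization record.

    code_id : identifier for this variant's source code
    score   : measured performance (higher is better)
    parents : the variant(s) it was derived from; two parents = recombination.
    Ancestor chains of strong variants can be replayed as in-context
    trajectories, which is the reuse mechanism the abstract describes.
    """
    def __init__(self, code_id, score, parents=()):
        self.code_id = code_id
        self.score = score
        self.parents = list(parents)

    def lineage(self):
        """Flatten this variant's ancestry, oldest first, without repeats."""
        seen, order = set(), []
        def walk(node):
            for p in node.parents:
                walk(p)            # visit ancestors before the node itself
            if node.code_id not in seen:
                seen.add(node.code_id)
                order.append(node.code_id)
        walk(self)
        return order
```

Walking a lineage yields the sequence of (modification, score) pairs that Algorithm Distillation-style prompting treats as a learning signal.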
Most traditional Applicant Tracking Systems (ATS) depend on strict keyword matching, so highly qualified candidates are often disqualified over minor semantic differences. This article introduces a two-stage process for developing a more comprehensive resume-assessment system based on a small language model with fewer than 600M parameters, fine-tuned using GRPO with a purpose-built reward function. The first stage, Supervised Fine-Tuning (SFT), creates a strong base model able to assess resumes beyond superficial keyword overlap. In the second stage, this SFT model is further optimized with Reinforcement Learning (RL) via GRPO, using a multi-component reward that goes beyond simple token matching. Initial RL experiments ran into a severe difficulty in the form of reward hacking: overly aggressive penalty terms produced unstable training dynamics and prohibitively negative model behavior. This was solved through iterative refinement of the reward and careful tuning of training hyperparameters, yielding a stable, controlled process of gentle polishing. The GRPO-refined model performs well in practice, reaching 91% accuracy on unseen test data, with a recall of 0.85 on the SELECTED class at a precision of 1.0, underscoring its reliability for identifying qualified applicants. These findings demonstrate that an appropriately structured two-step fine-tuning pipeline can turn a small language model into a human-like candidate evaluator, overcoming the shortcomings of both traditional ATS systems and unrefined applications of reinforcement learning.
https://arxiv.org/abs/2511.16073
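A multi-component reward of the kind described, with a deliberately gentle wrong-answer penalty to avoid the reward-hacking instability the authors report, might look like the following. Every term and weight here is illustrative, not the paper's actual reward:

```python
def resume_reward(predicted_label, true_label, rationale, fmt_ok):
    """Toy multi-component reward for a resume-screening policy rollout.

    Combines: correctness of the SELECTED/REJECTED decision, a small bonus
    for a substantive rationale, and a formatting term.  The wrong-answer
    penalty is kept mild (-0.2 rather than -1.0): large penalties are the
    kind of term the abstract reports destabilized training.
    All weights and thresholds are made-up illustrations.
    """
    r = 0.0
    r += 1.0 if predicted_label == true_label else -0.2  # gentle penalty
    r += 0.2 if len(rationale.split()) >= 20 else 0.0    # non-empty rationale
    r += 0.1 if fmt_ok else -0.1                         # parsable output
    return r
```

In GRPO the group of sampled completions for one resume would each be scored this way, and advantages computed relative to the group mean.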
Geometric mechanics provides valuable insights into how biological and robotic systems use changes in shape to move by mechanically interacting with their environment. In high-friction environments, it shows that the entire interaction is captured by the "motility map". Here we compare methods for learning the motility map from motion-tracking data of a physical robot built specifically to test these methods, with under-actuated degrees of freedom and a hard-to-model interaction with its substrate. We compared four modeling approaches in terms of their ability to predict body velocity from shape change within the same gait, across gaits, and across speeds. Our results show a trade-off between simpler methods, which are superior on small training datasets, and more sophisticated methods, which are superior when more training data is available.
https://arxiv.org/abs/2601.13777
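When the motility map is approximated as a constant linear map from shape velocity to body velocity, each of its rows can be fit by ordinary least squares from tracking data. A minimal sketch for one row with two shape variables follows; the constant-linear model class is an assumption for illustration (the paper compares four approaches, which it does not spell out here):

```python
def fit_motility_row(shape_vel, body_vel):
    """Least-squares fit of one row of a constant motility map A.

    Model: body_vel[t] ~ a1 * shape_vel[t][0] + a2 * shape_vel[t][1],
    i.e. one component of g_dot = A * r_dot with A held constant.
    Solves the 2x2 normal equations directly, no external libraries.
    """
    s11 = sum(s[0] * s[0] for s in shape_vel)
    s12 = sum(s[0] * s[1] for s in shape_vel)
    s22 = sum(s[1] * s[1] for s in shape_vel)
    b1 = sum(s[0] * g for s, g in zip(shape_vel, body_vel))
    b2 = sum(s[1] * g for s, g in zip(shape_vel, body_vel))
    det = s11 * s22 - s12 * s12          # assumes shape data is not degenerate
    a1 = (s22 * b1 - s12 * b2) / det
    a2 = (s11 * b2 - s12 * b1) / det
    return a1, a2
```

More sophisticated methods let A vary with the shape r (e.g. regressing A's entries on shape coordinates), which is where the data-hungry end of the reported trade-off lives.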
Long video understanding presents significant challenges for vision-language models due to extremely long context windows. Existing solutions, which rely on naive chunking strategies with retrieval-augmented generation, typically suffer from information fragmentation and a loss of global coherence. We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning by integrating audiovisual entity cohesion and hierarchical video indexing with agentic search. First, we preserve semantic consistency by integrating entity-level representations across visual and auditory streams, while organizing content into a structured hierarchy spanning global summary, scene, segment, and entity levels. Then we employ an agentic search mechanism to enable dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking. Extensive experiments demonstrate that our method achieves good temporal coherence, entity consistency, and retrieval efficiency, establishing a new state-of-the-art with an overall accuracy of 84.1% on LVBench. Notably, it achieves outstanding performance in the challenging reasoning category, reaching 80.1%. These results highlight the effectiveness of structured, multimodal reasoning for comprehensive and context-consistent understanding of long-form videos.
https://arxiv.org/abs/2601.13719
Existing dynamic Theory of Mind (ToM) benchmarks mostly place language models in a passive role: the model reads a sequence of connected scenarios and reports what people believe, feel, intend, and do as these states change. In real social interaction, ToM is also used for action: a speaker plans what to say in order to shift another person's mental-state trajectory toward a goal. We introduce SocialMindChange, a benchmark that moves from tracking minds to changing minds in social interaction. Each instance defines a social context with four characters and five connected scenes. The model plays one character and generates dialogue across the five scenes to reach the target while remaining consistent with the evolving states of all participants. SocialMindChange also includes selected higher-order states. Using a structured four-step framework, we construct 1,200 social contexts, covering 6,000 scenarios and over 90,000 questions, each validated for realism and quality. Evaluations on ten state-of-the-art LLMs show that their average performance is 54.2% below human performance. This gap suggests that current LLMs still struggle to maintain and change mental-state representations across long, linked interactions.
https://arxiv.org/abs/2601.13687
This paper presents a deep reinforcement learning (DRL) based controller for collective navigation of unmanned aerial vehicle (UAV) swarms in communication-denied environments, enabling robust operation in complex, obstacle-rich environments. Inspired by biological swarms where informed individuals guide groups without explicit communication, we employ an implicit leader-follower framework. In this paradigm, only the leader possesses goal information, while follower UAVs learn robust policies using only onboard LiDAR sensing, without requiring any inter-agent communication or leader identification. Our system utilizes LiDAR point clustering and an extended Kalman filter for stable neighbor tracking, providing reliable perception independent of external positioning systems. The core of our approach is a DRL controller, trained in GPU-accelerated Nvidia Isaac Sim, that enables followers to learn complex emergent behaviors - balancing flocking and obstacle avoidance - using only local perception. This allows the swarm to implicitly follow the leader while robustly addressing perceptual challenges such as occlusion and limited field-of-view. The robustness and sim-to-real transfer of our approach are confirmed through extensive simulations and challenging real-world experiments with a swarm of five UAVs, which successfully demonstrated collective navigation across diverse indoor and outdoor environments without any communication or external localization.
https://arxiv.org/abs/2601.13657
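The neighbor-tracking step above (cluster centroids fed to an extended Kalman filter) can be illustrated with a minimal sketch. This is a hypothetical example, not the authors' implementation: a constant-velocity model tracked per axis from position-only LiDAR centroid measurements, for which the EKF update reduces to the standard linear Kalman filter equations.

```python
# Hypothetical sketch: per-axis constant-velocity Kalman filter for tracking a
# neighbor UAV from noisy LiDAR cluster centroids (position-only measurements).

class AxisKF:
    """Tracks position and velocity along one axis from position measurements."""
    def __init__(self, p0, dt=0.1, q=0.5, r=0.04):
        self.x = [p0, 0.0]                  # state: [position, velocity]
        self.P = [[1.0, 0.0], [0.0, 1.0]]   # state covariance
        self.dt, self.q, self.r = dt, q, r  # time step, process / meas. noise

    def predict(self):
        dt = self.dt
        p, v = self.x
        self.x = [p + v * dt, v]            # x = F x with F = [[1, dt], [0, 1]]
        P = self.P
        # P = F P F^T + Q (Q from white-noise acceleration, intensity q)
        p00 = P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1]
        p01 = P[0][1] + dt * P[1][1]
        p10 = P[1][0] + dt * P[1][1]
        p11 = P[1][1]
        self.P = [[p00 + self.q * dt**3 / 3, p01 + self.q * dt**2 / 2],
                  [p10 + self.q * dt**2 / 2, p11 + self.q * dt]]

    def update(self, z):
        # z is the cluster centroid coordinate; measurement model H = [1, 0]
        P = self.P
        s = P[0][0] + self.r                # innovation covariance
        k0, k1 = P[0][0] / s, P[1][0] / s   # Kalman gain
        y = z - self.x[0]                   # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        self.P = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
                  [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]

# Track a neighbor moving at 1 m/s from centroid measurements at 10 Hz.
kf = AxisKF(p0=0.0)
for k in range(1, 50):
    kf.predict()
    kf.update(0.1 * k)      # noiseless measurements for illustration
print(round(kf.x[1], 2))    # estimated velocity approaches 1.0 m/s
```

In practice one such filter per axis (or a coupled 4-state filter) smooths centroid jitter and bridges short occlusions via the predict step, which is what makes the tracking "stable" for downstream policy inputs.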
Current Vision-Language Models (VLMs) fail at quantitative spatial reasoning because their architectures destroy the pixel-level information required for counting and measurement. Vision encoders compress images through patch embeddings, coarsening spatial indexing and losing the precise pixel-level tracking required for accurate counting. We present two contributions to address this fundamental limitation. First, we introduce SQuID (Satellite Quantitative Intelligence Dataset), a benchmark of 2,000 satellite image question-answer pairs with both numerical-range and categorical answers, designed to evaluate quantitative spatial reasoning. The dataset spans three difficulty tiers, with annotations automatically generated from human labels and their learned variability. Second, we propose QVLM (Quantitative Vision-Language Model), a code-generation architecture that maintains pixel precision by decoupling language understanding from visual analysis. Instead of encoding images into embeddings, QVLM generates executable code that first calls a segmentation model to obtain pixel-level masks, then operates directly on these masks, preserving spatial indexing throughout the reasoning process. Our experiments show that QVLM, using GPT-5 as the coder, achieves 42.0% accuracy on SQuID, compared to 28.1% for a VLM prompted directly with image-question pairs. Our work shows that decoupling language understanding from pixel-level visual analysis enables better accuracy on quantitative spatial reasoning tasks.
https://arxiv.org/abs/2601.13401
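A toy illustration of the property QVLM exploits (this is not the paper's code): once a segmentation model produces a pixel-level mask, counting becomes an exact operation on indexed pixels, e.g. connected-component counting, rather than an estimate read out of patch embeddings.

```python
from collections import deque

def count_objects(mask):
    """Count 4-connected components of 1-pixels in a binary mask (list of lists).

    Every pixel keeps its (row, col) index, so counts stay exact -- the
    property lost when an image is compressed into patch embeddings.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                count += 1
                queue = deque([(r, c)])  # flood-fill this component
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

# Two separate "buildings" in a small binary mask:
mask = [[1, 1, 0, 0, 0, 0],
        [1, 1, 0, 0, 1, 1],
        [0, 0, 0, 0, 1, 1],
        [0, 0, 0, 0, 0, 0]]
print(count_objects(mask))  # -> 2
```

In the QVLM setting, code like this would be emitted by the language model and executed against masks from the segmentation model; the mask itself is assumed given here.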
Eco-driving strategies have demonstrated substantial potential for improving energy efficiency and reducing emissions, especially at signalized intersections. However, evaluations of eco-driving methods typically rely on simplified simulation or experimental conditions, where certain assumptions are made to manage complexity and maintain experimental control. This study introduces a unified framework to evaluate eco-driving strategies through the lens of two complementary criteria: control robustness and environmental resilience. We define formal indicators that quantify performance degradation caused by internal execution variability and external environmental disturbances, respectively. These indicators are then applied to assess multiple eco-driving controllers in real-world vehicle experiments. The results reveal key tradeoffs between tracking accuracy and adaptability, showing that optimization-based controllers offer more consistent performance across varying disturbance levels, while analytical controllers may perform comparably under nominal conditions but exhibit greater sensitivity to execution and timing variability.
https://arxiv.org/abs/2601.13389
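The two degradation indicators can be sketched as normalized performance loss relative to a nominal run, split by the source of the perturbation. The exact formal definitions are in the paper; the worst-case form below is only an illustrative assumption.

```python
def degradation(j_nominal, j_perturbed):
    """Relative degradation of a cost metric J (e.g. energy use per trip)
    between a nominal run and a perturbed one; 0 means no degradation."""
    return (j_perturbed - j_nominal) / j_nominal

def robustness_indicator(j_nominal, runs_internal):
    """Control robustness: worst-case degradation over runs that vary only
    internal execution (actuation delay, timing jitter)."""
    return max(degradation(j_nominal, j) for j in runs_internal)

def resilience_indicator(j_nominal, runs_external):
    """Environmental resilience: worst-case degradation over runs subject to
    external disturbances (traffic, signal-timing deviations)."""
    return max(degradation(j_nominal, j) for j in runs_external)

# Example: a controller with 1.00 kWh nominal energy use per trip.
print(round(robustness_indicator(1.00, [1.03, 1.05, 1.02]), 3))  # -> 0.05
print(round(resilience_indicator(1.00, [1.10, 1.08]), 3))        # -> 0.1
```

Comparing the two indicators for the same controller is what exposes the tradeoff the abstract describes: an analytical controller can score well on resilience under nominal timing yet poorly on robustness once execution variability enters.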
Tactile sensing provides a promising sensing modality for object pose estimation in manipulation settings where visual information is limited due to occlusion or environmental effects. However, efficiently leveraging tactile data for estimation remains a challenge due to partial observability, with single observations corresponding to multiple possible contact configurations. This limits conventional estimation approaches, which are largely tailored to vision. We propose to address these challenges by learning an inverse tactile sensor model using denoising diffusion. The model is conditioned on tactile observations from a distributed tactile sensor and trained in simulation using a geometric sensor model based on signed distance fields. Contact constraints are enforced during inference through single-step projection using distance and gradient information from the signed distance field. For online pose estimation, we integrate the inverse model with a particle filter through a proposal scheme that combines generated hypotheses with particles from the prior belief. Our approach is validated in simulated and real-world planar pose estimation settings, without access to visual data or tight initial pose priors. We further evaluate robustness to unmodeled contact and sensor dynamics for pose tracking in a box-pushing scenario. Compared to local sampling baselines, the inverse sensor model improves sampling efficiency and estimation accuracy while preserving multimodal beliefs across objects with varying tactile discriminability.
https://arxiv.org/abs/2601.13250
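The proposal scheme can be sketched schematically. Assumptions (all hypothetical, standing in for the paper's components): a 1-D pose, a Gaussian stand-in for the tactile measurement likelihood, and pre-generated hypotheses in place of samples from the diffusion-based inverse sensor model. A fraction of the new particle set comes from generated hypotheses, the rest is resampled from the prior belief, and all particles are reweighted.

```python
import math, random

def likelihood(pose, observation, sigma=0.2):
    """Stand-in measurement likelihood; the paper uses a tactile sensor model."""
    return math.exp(-0.5 * ((pose - observation) / sigma) ** 2)

def mixture_proposal_step(particles, weights, hypotheses, observation,
                          frac_generated=0.3, rng=random):
    """One filter step mixing generated hypotheses into the proposal.

    A fraction of the particle set is drawn from generated hypotheses
    (injecting globally plausible contact configurations); the remainder is
    resampled from the prior belief. All particles are then reweighted.
    """
    n = len(particles)
    n_gen = int(frac_generated * n)
    prior_draw = rng.choices(particles, weights=weights, k=n - n_gen)
    gen_draw = [rng.choice(hypotheses) for _ in range(n_gen)]
    new_particles = prior_draw + gen_draw
    w = [likelihood(p, observation) for p in new_particles]
    total = sum(w) or 1.0  # guard against all-zero weights
    return new_particles, [wi / total for wi in w]

random.seed(0)
# Prior belief concentrated far from the true pose (1.5); hypotheses near it.
particles = [random.gauss(0.0, 0.3) for _ in range(100)]
weights = [1.0 / 100] * 100
hypotheses = [1.5 + random.gauss(0.0, 0.05) for _ in range(10)]
particles, weights = mixture_proposal_step(particles, weights, hypotheses,
                                           observation=1.5)
estimate = sum(p * w for p, w in zip(particles, weights))
print(round(estimate, 1))  # weighted mean pulled toward the true pose
```

The design point this illustrates is why the mixture matters: a purely local proposal would need many steps to escape a wrong prior mode, while injected hypotheses let the belief jump to the correct mode in one update without discarding the prior particles that preserve multimodality.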