Multi-modal object tracking has attracted considerable attention by integrating multiple complementary inputs (e.g., thermal, depth, and event data) to achieve outstanding performance. Although current general-purpose multi-modal trackers primarily unify various modal tracking tasks (i.e., RGB-Thermal infrared, RGB-Depth, or RGB-Event tracking) through prompt learning, they still overlook the effective capture of spatio-temporal cues. In this work, we introduce a novel multi-modal tracking framework based on a Mamba-style state space model, termed UBATrack. Our UBATrack comprises two simple yet effective modules: a Spatio-temporal Mamba Adapter (STMA) and a Dynamic Multi-modal Feature Mixer. The former leverages Mamba's long-sequence modeling capability to jointly model cross-modal dependencies and spatio-temporal visual cues in an adapter-tuning manner. The latter further enhances multi-modal representation capacity across multiple feature dimensions to improve tracking robustness. In this way, UBATrack eliminates the need for costly full-parameter fine-tuning, thereby improving the training efficiency of multi-modal tracking algorithms. Experiments show that UBATrack outperforms state-of-the-art methods on RGB-T, RGB-D, and RGB-E tracking benchmarks, achieving outstanding results on the LasHeR, RGBT234, RGBT210, DepthTrack, VOT-RGBD22, and VisEvent datasets.
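As background on the adapter-tuning idea that STMA builds on: a small trainable bottleneck is inserted beside a frozen backbone, with a residual connection so the pretrained features pass through unchanged when the adapter is zero. In this minimal sketch a plain ReLU bottleneck stands in for STMA's Mamba state-space scan; all shapes and names are illustrative, not the paper's.

```python
import numpy as np

def adapter(x, W_down, W_up):
    """Generic bottleneck adapter for parameter-efficient tuning:
    down-project, nonlinearity, up-project, residual add. STMA would
    replace the middle step with a Mamba scan over the token sequence;
    a ReLU keeps this sketch dependency-free.

    x: (seq_len, d_model)
    W_down: (d_model, d_bottleneck), W_up: (d_bottleneck, d_model)
    """
    h = np.maximum(x @ W_down, 0.0)  # bottleneck (placeholder for the SSM scan)
    return x + h @ W_up              # residual: frozen backbone features pass through
```

Note the key property of adapter tuning: with the adapter weights at zero, the module is an exact identity, so training starts from the frozen backbone's behavior.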
https://arxiv.org/abs/2601.14799
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding -- either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single-image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
https://arxiv.org/abs/2601.10611
Satellite videos provide continuous observations of surface dynamics but pose significant challenges for multi-object tracking (MOT), especially under unstabilized conditions where platform jitter and the weak appearance of tiny objects jointly degrade tracking performance. To address this problem, we propose DeTracker, a joint detection-and-tracking framework tailored for unstabilized satellite videos. DeTracker introduces a Global--Local Motion Decoupling (GLMD) module that explicitly separates satellite platform motion from true object motion through global alignment and local refinement, leading to improved trajectory stability and motion estimation accuracy. In addition, a Temporal Dependency Feature Pyramid (TDFP) module is developed to perform cross-frame temporal feature fusion, enhancing the continuity and discriminability of tiny-object representations. We further construct a new benchmark dataset, SDM-Car-SU, which simulates multi-directional and multi-speed platform motions to enable systematic evaluation of tracking robustness under varying motion perturbations. Extensive experiments on both simulated and real unstabilized satellite videos demonstrate that DeTracker significantly outperforms existing methods, achieving 61.1% MOTA on SDM-Car-SU and 47.3% MOTA on real satellite video data.
https://arxiv.org/abs/2601.09240
Recent advances in transformer-based lightweight object tracking have established new standards across benchmarks, leveraging the global receptive field and powerful feature extraction capabilities of attention mechanisms. Despite these achievements, existing methods universally employ sparse sampling during training--utilizing only one template and one search image per sequence--which fails to comprehensively explore spatiotemporal information in videos. This limitation constrains performance and causes the gap between lightweight and high-performance trackers. To bridge this divide while maintaining real-time efficiency, we propose STDTrack, a framework that pioneers the integration of reliable spatiotemporal dependencies into lightweight trackers. Our approach implements dense video sampling to maximize spatiotemporal information utilization. We introduce a temporally propagating spatiotemporal token to guide per-frame feature extraction. To ensure comprehensive target state representation, we design the Multi-frame Information Fusion Module (MFIFM), which augments current dependencies using historical context. The MFIFM operates on features stored in our constructed Spatiotemporal Token Maintainer (STM), where a quality-based update mechanism ensures information reliability. Considering the scale variation among tracking targets, we develop a multi-scale prediction head to dynamically adapt to objects of different sizes. Extensive experiments demonstrate state-of-the-art results across six benchmarks. Notably, on GOT-10k, STDTrack rivals certain high-performance non-real-time trackers (e.g., MixFormer) while operating at 192 FPS (GPU) and 41 FPS (CPU).
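The STM's quality-based update is described only at a high level. One minimal reading is a bounded token store gated by a confidence threshold; the capacity and threshold values below are assumptions, not details from the paper.

```python
from collections import deque

class SpatiotemporalTokenMaintainer:
    """Bounded store of per-frame spatiotemporal tokens. A new token is
    admitted only when its quality score (e.g. the tracker's confidence)
    clears a threshold, so unreliable frames never enter the memory that
    a fusion module would later read. Oldest tokens are evicted first.
    """
    def __init__(self, capacity=8, min_quality=0.5):
        self.tokens = deque(maxlen=capacity)
        self.min_quality = min_quality

    def update(self, token, quality):
        # Reject low-quality frames instead of letting them pollute memory.
        if quality >= self.min_quality:
            self.tokens.append(token)
            return True
        return False

    def history(self):
        return list(self.tokens)
```

The gate means a string of occluded or blurry frames leaves the memory untouched rather than degrading it, which is the reliability property the abstract claims.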
https://arxiv.org/abs/2601.09078
Traditional Multi-Object Tracking (MOT) systems have achieved remarkable precision in localization and association, effectively answering \textit{where} and \textit{who}. However, they often function as autistic observers, capable of tracing geometric paths but blind to the semantic \textit{what} and \textit{why} behind object behaviors. To bridge the gap between geometric perception and cognitive reasoning, we propose \textbf{LLMTrack}, a novel end-to-end framework for Semantic Multi-Object Tracking (SMOT). We adopt a bionic design philosophy that decouples strong localization from deep understanding, utilizing Grounding DINO as the eyes and the LLaVA-OneVision multimodal large model as the brain. We introduce a Spatio-Temporal Fusion Module that aggregates instance-level interaction features and video-level contexts, enabling the Large Language Model (LLM) to comprehend complex trajectories. Furthermore, we design a progressive three-stage training strategy, Visual Alignment, Temporal Fine-tuning, and Semantic Injection via LoRA to efficiently adapt the massive model to the tracking domain. Extensive experiments on the BenSMOT benchmark demonstrate that LLMTrack achieves state-of-the-art performance, significantly outperforming existing methods in instance description, interaction recognition, and video summarization while maintaining robust tracking stability.
https://arxiv.org/abs/2601.06550
The Third Perception Test challenge was organised as a full-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2025. Its primary goal is to benchmark state-of-the-art video models and measure the progress in multimodal perception. This year, the workshop featured 2 guest tracks as well: KiVA (an image understanding challenge) and Physic-IQ (a video generation challenge). In this report, we summarise the results from the main Perception Test challenge, detailing both the existing tasks as well as novel additions to the benchmark. In this iteration, we placed an emphasis on task unification, as this poses a more challenging test for current SOTA multimodal models. The challenge included five consolidated tracks: unified video QA, unified object and point tracking, unified action and sound localisation, grounded video QA, and hour-long video QA, alongside an analysis and interpretability track that is still open for submissions. Notably, the unified video QA track introduced a novel subset that reformulates traditional perception tasks (such as point tracking and temporal action localisation) as multiple-choice video QA questions that video-language models can natively tackle. The unified object and point tracking merged the original object tracking and point tracking tasks, whereas the unified action and sound localisation merged the original temporal action localisation and temporal sound localisation tracks. Accordingly, we required competitors to use unified approaches rather than engineered pipelines with task-specific models. By proposing such a unified challenge, Perception Test 2025 highlights the significant difficulties existing models face when tackling diverse perception tasks through unified interfaces.
https://arxiv.org/abs/2601.06287
Tracking objects that move within dynamic environments is a core challenge in robotics. Recent research has advanced this topic significantly; however, many existing approaches remain inefficient due to their reliance on heavy foundation models. To address this limitation, we propose LOST-3DSG, a lightweight open-vocabulary 3D scene graph designed to track dynamic objects in real-world environments. Our method adopts a semantic approach to entity tracking based on word2vec and sentence embeddings, enabling an open-vocabulary representation while avoiding the necessity of storing dense CLIP visual features. As a result, LOST-3DSG achieves superior performance compared to approaches that rely on high-dimensional visual embeddings. We evaluate our method through qualitative and quantitative experiments conducted in a real 3D environment using a TIAGo robot. The results demonstrate the effectiveness and efficiency of LOST-3DSG in dynamic object tracking. Code and supplementary material are publicly available on the project website at this https URL.
https://arxiv.org/abs/2601.02905
Automated analysis of volumetric medical imaging on edge devices is severely constrained by the high memory and computational demands of 3D Convolutional Neural Networks (CNNs). This paper develops a lightweight computer vision framework that reconciles the efficiency of 2D detection with the necessity of 3D context by reformulating volumetric Computed Tomography (CT) data as sequential video streams. This video-viewpoint paradigm is applied to the time-sensitive task of Intracranial Hemorrhage (ICH) detection using the Hemorica dataset. To ensure operational efficiency, we benchmarked multiple generations of the YOLO architecture (v8, v10, v11 and v12) in their Nano configurations, selecting the version with the highest mAP@50 to serve as the slice-level backbone. The ByteTrack algorithm is then employed to enforce anatomical consistency along the $z$-axis. To address the initialization lag inherent in video trackers, a hybrid inference strategy and a spatiotemporal consistency filter are proposed to distinguish true pathology from transient prediction noise. Experimental results on independent test data demonstrate that the proposed framework serves as a rigorous temporal validator, increasing detection Precision from 0.703 to 0.779 compared to the baseline 2D detector, while maintaining high sensitivity. By approximating 3D contextual reasoning at a fraction of the computational cost, this method provides a scalable solution for real-time patient prioritization in resource-constrained environments, such as mobile stroke units and IoT-enabled remote clinics.
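The abstract does not specify the spatiotemporal consistency filter. One plausible minimal form confirms a hemorrhage track only when its ID persists over a run of consecutive slices; the `min_len` parameter and the `(track_id, confidence)` input shape are hypothetical.

```python
from collections import defaultdict

def consistency_filter(slice_detections, min_len=3):
    """Confirm a track only if its ID appears in at least `min_len`
    consecutive slices; shorter runs are treated as transient
    prediction noise and discarded.

    slice_detections: list indexed by slice (z) position, each item a
    list of (track_id, confidence) pairs from the per-slice tracker.
    """
    runs = defaultdict(list)  # track_id -> sorted slice indices
    for z, dets in enumerate(slice_detections):
        for track_id, _conf in dets:
            runs[track_id].append(z)

    confirmed = set()
    for track_id, zs in runs.items():
        # Longest run of consecutive slice indices for this track.
        best, cur = 1, 1
        for a, b in zip(zs, zs[1:]):
            cur = cur + 1 if b == a + 1 else 1
            best = max(best, cur)
        if best >= min_len:
            confirmed.add(track_id)
    return confirmed
```

A single-slice false positive never reaches `min_len` consecutive slices, which is how a filter of this shape would trade a small amount of recall for the reported precision gain.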
https://arxiv.org/abs/2601.02521
As multi-object tracking (MOT) tasks continue to evolve toward more general and multi-modal scenarios, the rigid and task-specific architectures of existing MOT methods increasingly hinder their applicability across diverse tasks and limit flexibility in adapting to new tracking formulations. Most approaches rely on fixed output heads and bespoke tracking pipelines, making them difficult to extend to more complex or instruction-driven tasks. To address these limitations, we propose AR-MOT, a novel autoregressive paradigm that formulates MOT as a sequence generation task within a large language model (LLM) framework. This design enables the model to output structured results through flexible sequence construction, without requiring any task-specific heads. To enhance region-level visual perception, we introduce an Object Tokenizer based on a pretrained detector. To mitigate the misalignment between global and regional features, we propose a Region-Aware Alignment (RAA) module, and to support long-term tracking, we design a Temporal Memory Fusion (TMF) module that caches historical object tokens. AR-MOT offers strong potential for extensibility, as new modalities or instructions can be integrated by simply modifying the output sequence format without altering the model architecture. Extensive experiments on MOT17 and DanceTrack validate the feasibility of our approach, achieving performance comparable to state-of-the-art methods while laying the foundation for more general and flexible MOT systems.
https://arxiv.org/abs/2601.01925
Existing RGB-Event visual object tracking approaches primarily rely on conventional feature-level fusion, failing to fully exploit the unique advantages of event cameras. In particular, the high dynamic range and motion-sensitive nature of event cameras are often overlooked, while low-information regions are processed uniformly, leading to unnecessary computational overhead for the backbone network. To address these issues, we propose a novel tracking framework that performs early fusion in the frequency domain, enabling effective aggregation of high-frequency information from the event modality. Specifically, RGB and event modalities are transformed from the spatial domain to the frequency domain via the Fast Fourier Transform, with their amplitude and phase components decoupled. High-frequency event information is selectively fused into RGB modality through amplitude and phase attention, enhancing feature representation while substantially reducing backbone computation. In addition, a motion-guided spatial sparsification module leverages the motion-sensitive nature of event cameras to capture the relationship between target motion cues and spatial probability distribution, filtering out low-information regions and enhancing target-relevant features. Finally, a sparse set of target-relevant features is fed into the backbone network for learning, and the tracking head predicts the final target position. Extensive experiments on three widely used RGB-Event tracking benchmark datasets, including FE108, FELT, and COESOT, demonstrate the high performance and efficiency of our method. The source code of this paper will be released on this https URL
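A toy sketch of the described amplitude/phase decoupling: both modalities go through a 2-D FFT, amplitude and phase are separated, and only high-frequency event amplitudes are blended into RGB. A fixed high-pass mask and a scalar blend weight stand in for the paper's learned amplitude and phase attention; `alpha` and `hf_radius` are assumptions.

```python
import numpy as np

def frequency_fuse(rgb, event, alpha=0.5, hf_radius=0.25):
    """Frequency-domain early fusion sketch for single-channel 2-D maps.
    High-frequency event amplitude is blended into the RGB amplitude;
    the RGB phase is kept, since phase carries most spatial structure.
    """
    F_rgb = np.fft.fft2(rgb)
    F_evt = np.fft.fft2(event)

    amp_rgb, pha_rgb = np.abs(F_rgb), np.angle(F_rgb)
    amp_evt = np.abs(F_evt)

    # High-frequency mask: True far from the zero-frequency component.
    h, w = rgb.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    hf_mask = np.sqrt(fy**2 + fx**2) > hf_radius

    # Blend event amplitude into RGB amplitude at high frequencies only.
    amp = np.where(hf_mask, (1 - alpha) * amp_rgb + alpha * amp_evt, amp_rgb)
    fused = np.fft.ifft2(amp * np.exp(1j * pha_rgb)).real
    return fused
```

With `alpha=0` the function reduces to the identity on the RGB input, which is a quick sanity check that the decoupling round-trips correctly.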
https://arxiv.org/abs/2601.01022
Multiple-object tracking (MOT) in agricultural environments presents major challenges due to repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions. Contemporary trackers in this domain rely on the motion of objects rather than appearance for association. Nevertheless, they struggle to maintain object identities when targets undergo frequent and strong occlusions. The high similarity of object appearances makes integrating appearance-based association nontrivial for agricultural scenarios. To solve this problem we propose CropTrack, a novel MOT framework based on the combination of appearance and motion information. CropTrack integrates reranking-enhanced appearance association, a one-to-many association with an appearance-based conflict-resolution strategy, and an exponential moving average prototype feature bank to improve appearance-based association. Evaluated on publicly available agricultural MOT datasets, CropTrack demonstrates consistent identity preservation, outperforming traditional motion-based tracking methods. Compared to the state of the art, CropTrack achieves significant gains in identification F1 and association accuracy scores with a lower number of identity switches.
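An exponential-moving-average prototype feature bank can be sketched as below. The momentum value and the L2 normalization of features are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

class PrototypeBank:
    """One appearance prototype per track ID, updated by exponential
    moving average on each confident match. Smoothing over many frames
    makes the prototype robust to single-frame appearance glitches
    (occlusion, lighting change).
    """
    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.protos = {}  # track_id -> L2-normalised prototype vector

    def update(self, track_id, feature):
        feature = feature / (np.linalg.norm(feature) + 1e-12)
        if track_id not in self.protos:
            self.protos[track_id] = feature
        else:
            p = self.momentum * self.protos[track_id] + (1 - self.momentum) * feature
            self.protos[track_id] = p / (np.linalg.norm(p) + 1e-12)

    def similarity(self, track_id, feature):
        """Cosine similarity between a detection feature and the prototype."""
        feature = feature / (np.linalg.norm(feature) + 1e-12)
        return float(self.protos[track_id] @ feature)
```

A high momentum keeps the prototype stable across the near-duplicate appearances typical of crops, while still drifting slowly as viewpoint or lighting changes.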
https://arxiv.org/abs/2512.24838
Memory has become the central mechanism enabling robust visual object tracking in modern segmentation-based frameworks. Recent methods built upon Segment Anything Model 2 (SAM2) have demonstrated strong performance by refining how past observations are stored and reused. However, existing approaches address memory limitations in a method-specific manner, leaving the broader design principles of memory in SAM-based tracking poorly understood. Moreover, it remains unclear how these memory mechanisms transfer to stronger, next-generation foundation models such as Segment Anything Model 3 (SAM3). In this work, we present a systematic memory-centric study of SAM-based visual object tracking. We first analyze representative SAM2-based trackers and show that most methods primarily differ in how short-term memory frames are selected, while sharing a common object-centric representation. Building on this insight, we faithfully reimplement these memory mechanisms within the SAM3 framework and conduct large-scale evaluations across ten diverse benchmarks, enabling a controlled analysis of memory design independent of backbone strength. Guided by our empirical findings, we propose a unified hybrid memory framework that explicitly decomposes memory into short-term appearance memory and long-term distractor-resolving memory. This decomposition enables the integration of existing memory policies in a modular and principled manner. Extensive experiments demonstrate that the proposed framework consistently improves robustness under long-term occlusion, complex motion, and distractor-heavy scenarios on both SAM2 and SAM3 backbones. Code is available at: this https URL. \textbf{This is a preprint. Some results are being finalized and may be updated in a future revision.}
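The proposed short-term/long-term decomposition can be illustrated with a two-tier store. The FIFO eviction policy, the capacity, and the boolean distractor flag are simplifications of whatever signal the paper derives from the model's memory features.

```python
from collections import deque

class HybridMemory:
    """Two-tier tracking memory: a FIFO of recent frames for short-term
    appearance, plus an append-only store of frames flagged as
    distractor-resolving (e.g. frames where a near-duplicate object was
    disambiguated), which are never evicted.
    """
    def __init__(self, short_capacity=6):
        self.short = deque(maxlen=short_capacity)  # recent appearance frames
        self.long = []                             # kept for the whole video

    def add(self, frame_feat, resolves_distractor=False):
        if resolves_distractor:
            self.long.append(frame_feat)
        self.short.append(frame_feat)

    def context(self):
        # Long-term distractor evidence first, then recent appearance.
        return self.long + list(self.short)
```

The point of the split is that existing memory policies only need to decide which tier a frame belongs to, which is what makes them composable in one framework.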
https://arxiv.org/abs/2512.22624
Multi-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos. Our method rapidly selects and manages a set of images as keyframes to map a scene or object via $\pi^3$ with full bidirectional attention. We then cache the global self-attention block's key-value (KV) pairs and use them as the sole scene representation for online tracking. This allows for up to $15\times$ speedup during inference without the fear of drift or catastrophic forgetting. Our caching strategy is model-agnostic and can be applied to other off-the-shelf multi-view networks without retraining. We demonstrate KV-Tracker on both scene-level tracking and the more challenging task of on-the-fly object tracking and reconstruction without depth measurements or object priors. Experiments on the TUM RGB-D, 7-Scenes, Arctic and OnePose datasets show the strong performance of our system while maintaining high frame-rates up to ${\sim}27$ FPS.
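The core caching idea, encoding keyframes once and letting each new frame's queries cross-attend to the stored key/value pairs, can be sketched with plain arrays. This is a single-head softmax attention with no learned projections, purely to show the data flow; it is not the paper's architecture.

```python
import numpy as np

def cached_cross_attention(q, kv_cache):
    """Attend new-frame queries to cached keyframe key/value pairs.

    q: (n_q, d) queries from the incoming frame.
    kv_cache: list of (K, V) tuples, one per keyframe, each of shape
    (n_kv_i, d). The keyframes were encoded once; no re-encoding here.
    """
    K = np.concatenate([k for k, _ in kv_cache], axis=0)
    V = np.concatenate([v for _, v in kv_cache], axis=0)
    scores = q @ K.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the cached tokens.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

The speedup comes from the asymmetry: the expensive bidirectional pass over keyframes happens once at mapping time, while per-frame tracking only pays for this cheap cross-attention against a frozen cache, which also removes any mechanism for drift in the scene representation.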
https://arxiv.org/abs/2512.22581
Multi-object tracking aims to maintain object identities over time by associating detections across video frames. Two dominant paradigms exist in literature: tracking-by-detection methods, which are computationally efficient but rely on handcrafted association heuristics, and end-to-end approaches, which learn association from data at the cost of higher computational complexity. We propose Track-Detection Link Prediction (TDLP), a tracking-by-detection method that performs per-frame association via link prediction between tracks and detections, i.e., by predicting the correct continuation of each track at every frame. TDLP is architecturally designed primarily for geometric features such as bounding boxes, while optionally incorporating additional cues, including pose and appearance. Unlike heuristic-based methods, TDLP learns association directly from data without handcrafted rules, while remaining modular and computationally efficient compared to end-to-end trackers. Extensive experiments on multiple benchmarks demonstrate that TDLP consistently surpasses state-of-the-art performance across both tracking-by-detection and end-to-end methods. Finally, we provide a detailed analysis comparing link prediction with metric learning-based association and show that link prediction is more effective, particularly when handling heterogeneous features such as detection bounding boxes. Our code is available at \href{this https URL}{this https URL}.
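A runnable sketch of per-frame track-detection association: score every track-detection pair, then link pairs one-to-one. Here IoU stands in for TDLP's learned link-prediction score, and greedy matching stands in for whatever assignment the method actually uses; both substitutions are for illustration only.

```python
import numpy as np

def associate(track_boxes, det_boxes, threshold=0.3):
    """Link each track to at most one detection in the current frame.
    Boxes are (x1, y1, x2, y2). Returns {track_index: det_index}.
    """
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-12)

    scores = np.array([[iou(t, d) for d in det_boxes] for t in track_boxes])
    matches = {}
    # Greedily take the highest-scoring remaining pair above threshold,
    # then remove that track row and detection column from contention.
    while scores.size and scores.max() > threshold:
        ti, di = np.unravel_index(scores.argmax(), scores.shape)
        matches[int(ti)] = int(di)
        scores[ti, :] = -1.0
        scores[:, di] = -1.0
    return matches
```

The contrast the abstract draws is in where the score matrix comes from: a heuristic tracker computes it with hand-set rules like the IoU above, whereas TDLP learns it from data while keeping this same lightweight per-frame assignment structure.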
https://arxiv.org/abs/2512.22105
Understanding natural-language references to objects in dynamic 3D driving scenes is essential for interactive autonomous systems. In practice, many referring expressions describe targets through recent motion or short-term interactions, which cannot be resolved from static appearance or geometry alone. We study temporal language-based 3D grounding, where the objective is to identify the referred object in the current frame by leveraging multi-frame observations. We propose TrackTeller, a temporal multimodal grounding framework that integrates LiDAR-image fusion, language-conditioned decoding, and temporal reasoning in a unified architecture. TrackTeller constructs a shared UniScene representation aligned with textual semantics, generates language-aware 3D proposals, and refines grounding decisions using motion history and short-term dynamics. Experiments on the NuPrompt benchmark demonstrate that TrackTeller consistently improves language-grounded tracking performance, outperforming strong baselines with a 70% relative improvement in Average Multi-Object Tracking Accuracy and a 3.15-3.4 times reduction in False Alarm Frequency.
https://arxiv.org/abs/2512.21641
CCTV-based vehicle tracking systems face structural limitations in continuously connecting the trajectories of the same vehicle across multiple camera environments. In particular, blind spots occur due to the intervals between CCTVs and limited Fields of View (FOV), which leads to object ID switching and trajectory loss, thereby reducing the reliability of real-time path prediction. This paper proposes SPOT (Spatial Prediction Over Trajectories), a map-guided LLM agent capable of tracking vehicles even in blind spots of multi-CCTV environments without prior training. The proposed method represents road structures (Waypoints) and CCTV placement information as documents based on 2D spatial coordinates and organizes them through chunking techniques to enable real-time querying and inference. Furthermore, it transforms the vehicle's position into the actual world coordinate system using the relative position and FOV information of objects observed in CCTV images. By combining map spatial information with the vehicle's moving direction, speed, and driving patterns, a beam search is performed at the intersection level to derive candidate CCTV locations where the vehicle is most likely to enter after the blind spot. Experimental results based on the CARLA simulator in a virtual city environment confirmed that the proposed method accurately predicts the next appearing CCTV even in blind spot sections, maintaining continuous vehicle trajectories more effectively than existing techniques.
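The intersection-level beam search can be illustrated on a toy road graph. Edge weights here are placeholder scores for how compatible a turn is with the vehicle's observed heading, speed, and driving pattern; in the described system those scores would come from the map documents and the LLM agent's reasoning.

```python
import heapq

def beam_search_next_cctv(graph, start, cctv_nodes, beam_width=3, max_hops=4):
    """Rank CCTV-covered nodes a vehicle may reach after a blind spot.

    graph: {node: [(next_node, transition_score), ...]} with scores in
    (0, 1]; path score is the product of its transition scores. Only
    the `beam_width` best partial paths are expanded at each hop.
    """
    beam = [(-1.0, [start])]  # (negated path score, path)
    candidates = {}
    for _ in range(max_hops):
        next_beam = []
        for neg_score, path in beam:
            for nxt, w in graph.get(path[-1], []):
                if nxt in path:
                    continue  # avoid cycling through the same intersection
                score = -neg_score * w
                if nxt in cctv_nodes:
                    candidates[nxt] = max(candidates.get(nxt, 0.0), score)
                heapq.heappush(next_beam, (-score, path + [nxt]))
        beam = heapq.nsmallest(beam_width, next_beam)
        if not beam:
            break
    # Most likely re-appearance cameras first.
    return sorted(candidates, key=candidates.get, reverse=True)
```

Pruning to a fixed beam width is what keeps the prediction real-time even when the road graph fans out at every intersection.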
https://arxiv.org/abs/2512.20975
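The intersection-level beam search described above can be sketched in miniature. This is an illustrative reconstruction, not SPOT's implementation: the graph format (node → outgoing edges with 2D direction vectors) and the heading-alignment score are assumptions standing in for the paper's map documents and driving-pattern features:

```python
import math

def beam_search_next_cctv(graph, start, heading, cctv_nodes,
                          beam_width=2, max_depth=3):
    """Expand paths outward from the last observed intersection, scoring
    each road edge by its alignment with the vehicle's heading, and
    return candidate CCTV locations ranked by accumulated path score."""
    def align(vec):  # cosine between an edge direction and the heading
        n = math.hypot(*vec) * math.hypot(*heading) or 1.0
        return (vec[0] * heading[0] + vec[1] * heading[1]) / n

    beam = [(0.0, start, [start])]
    found = {}
    for _ in range(max_depth):
        expanded = []
        for score, node, path in beam:
            for nb, direction in graph.get(node, []):
                if nb in path:  # avoid revisiting intersections
                    continue
                s = score + align(direction)
                expanded.append((s, nb, path + [nb]))
                if nb in cctv_nodes:
                    found[nb] = max(found.get(nb, float("-inf")), s)
        beam = sorted(expanded, reverse=True)[:beam_width]
        if not beam:
            break
    return sorted(found, key=found.get, reverse=True)
```

A vehicle last seen heading east at intersection `A` would rank an eastward CCTV above a northward one, which matches the paper's intuition of combining map structure with moving direction.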
We present a system for learning generalizable hand-object tracking controllers purely from synthetic data, without requiring any human demonstrations. Our approach makes two key contributions: (1) HOP, a Hand-Object Planner, which can synthesize diverse hand-object trajectories; and (2) HOT, a Hand-Object Tracker that bridges synthetic-to-physical transfer through reinforcement learning and interaction imitation learning, delivering a generalizable controller conditioned on target hand-object states. Our method extends to diverse object shapes and hand morphologies. Through extensive evaluations, we show that our approach enables dexterous hands to track challenging, long-horizon sequences including object re-arrangement and agile in-hand reorientation. These results represent a significant step toward scalable foundation controllers for manipulation that can learn entirely from synthetic data, breaking the data bottleneck that has long constrained progress in dexterous manipulation.
https://arxiv.org/abs/2512.19583
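A core ingredient of a tracker like HOT is a reward that conditions the controller on target hand-object states. As a minimal sketch under assumed conventions (the weights, scale, and exponentiated-error form are illustrative, not the paper's actual reward):

```python
import numpy as np

def tracking_reward(hand_state, obj_state, ref_hand, ref_obj,
                    w_hand=0.5, w_obj=0.5, scale=5.0):
    """Exponentiated tracking error between the simulated hand/object
    states and the reference (planned) states: 1.0 for perfect
    tracking, decaying toward 0 as the controller drifts."""
    e_hand = np.linalg.norm(np.asarray(hand_state) - np.asarray(ref_hand))
    e_obj = np.linalg.norm(np.asarray(obj_state) - np.asarray(ref_obj))
    return w_hand * np.exp(-scale * e_hand) + w_obj * np.exp(-scale * e_obj)
```

Rewards of this shape are common in RL-based motion tracking; here the reference states would come from HOP's synthesized trajectories.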
Medical ultrasound videos are widely used for medical inspections, disease diagnosis and surgical planning. High-fidelity lesion area and target organ segmentation constitutes a key component of the computer-assisted surgery workflow. The low contrast levels and noisy backgrounds of ultrasound videos cause missegmentation of organ boundaries, which may lead to small object losses and increase boundary segmentation errors. Object tracking in long videos also remains a significant research challenge. To overcome these challenges, we propose a memory bank-based wavelet filtering and fusion network, which adopts an encoder-decoder structure to effectively extract fine-grained spatial features and integrate high-frequency (HF) information. Specifically, memory-based wavelet convolution is presented to simultaneously capture category and detail information while utilizing adjacent-frame information in the encoder. Cascaded wavelet compression is used to fuse multiscale frequency-domain features and expand the receptive field within each convolutional layer. A long short-term memory bank using cross-attention and memory compression mechanisms is designed to track objects in long videos. To fully utilize the boundary-sensitive HF details of feature maps, an HF-aware feature fusion module is designed via adaptive wavelet filters in the decoder. In extensive benchmark tests on four ultrasound video datasets (two thyroid nodule datasets, one thyroid gland dataset, and one heart dataset), our method demonstrates marked improvements in segmentation metrics over state-of-the-art methods. In particular, our method can more accurately segment small thyroid nodules, demonstrating its effectiveness for cases involving small ultrasound objects in long videos. The code is available at this https URL.
https://arxiv.org/abs/2512.15066
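The wavelet machinery above rests on splitting a feature map into one low-frequency approximation and three boundary-sensitive high-frequency detail bands. A self-contained single-level 2D Haar transform (the simplest wavelet; the paper's learned adaptive filters generalize this idea) illustrates the decomposition:

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar transform of a 2D array with even sides:
    returns the low-frequency approximation (LL) and the three
    high-frequency detail bands (LH, HL, HH) that carry
    boundary-sensitive information."""
    a = (x[0::2] + x[1::2]) / 2.0   # row-pair averages
    d = (x[0::2] - x[1::2]) / 2.0   # row-pair differences
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return LL, LH, HL, HH
```

On a flat (boundary-free) region all three detail bands are zero, which is exactly why HF bands isolate the organ boundaries the decoder wants to preserve. In practice a library such as PyWavelets (`pywt.dwt2`) would be used instead of this hand-rolled version.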
Precision livestock farming requires objective assessment of social behavior to support herd welfare monitoring, yet most existing approaches infer interactions using static proximity thresholds that cannot distinguish affiliative from agonistic behaviors in complex barn environments. This limitation constrains the interpretability of automated social network analysis in commercial settings. We present a pose-based computational framework for interaction classification that moves beyond proximity heuristics by modeling the spatiotemporal geometry of anatomical keypoints. Rather than relying on pixel-level appearance or simple distance measures, the proposed method encodes interaction-specific motion signatures from keypoint trajectories, enabling differentiation of social interaction valence. The framework is implemented as an end-to-end computer vision pipeline integrating YOLOv11 for object detection (mAP@0.50: 96.24%), supervised individual identification (98.24% accuracy), ByteTrack for multi-object tracking (81.96% accuracy), ZebraPose for 27-point anatomical keypoint estimation, and a support vector machine classifier trained on pose-derived distance dynamics. On annotated interaction clips collected from a commercial dairy barn, the classifier achieved 77.51% accuracy in distinguishing affiliative and agonistic behaviors using pose information alone. Comparative evaluation against a proximity-only baseline shows substantial gains in behavioral discrimination, particularly for affiliative interactions. The results establish a proof-of-concept for automated, vision-based inference of social interactions suitable for constructing interaction-aware social networks, with near-real-time performance on commodity hardware.
https://arxiv.org/abs/2512.14998
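The key move above is replacing a single proximity threshold with features computed from keypoint-distance *dynamics*. A hypothetical minimal feature extractor (the specific statistics — mean distance, closest approach, net change — are illustrative choices, not the paper's exact feature set) might look like:

```python
import numpy as np

def distance_dynamics_features(traj_a, traj_b):
    """Summarize the inter-individual distance signal from two keypoint
    trajectories of shape (frames, keypoints, 2): mean distance,
    closest approach, and net change (approach vs. retreat)."""
    # per-frame mean distance across matched keypoints
    d = np.linalg.norm(traj_a - traj_b, axis=-1).mean(axis=-1)
    return np.array([d.mean(), d.min(), d[-1] - d[0]])
```

Feature vectors of this kind would then be fed to the SVM classifier (e.g., `sklearn.svm.SVC`): a sustained close approach with small net change suggests affiliative contact, while a rapid approach followed by retreat suggests an agonistic interaction.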
In Intelligent Transportation Systems (ITS), multi-object tracking primarily relies on frame-based cameras. However, these cameras tend to perform poorly under dim lighting and high-speed motion. Event cameras, characterized by low latency, high dynamic range and high temporal resolution, have considerable potential to mitigate these issues, yet event-based vision has been studied far less than frame-based vision. To address this gap, we introduce an initial pilot dataset tailored for event-based ITS, covering vehicle and pedestrian detection and tracking. On this dataset we establish a tracking-by-detection benchmark with a specialized feature extractor, achieving excellent performance.
https://arxiv.org/abs/2512.14595
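The tracking-by-detection paradigm named above reduces, at its core, to associating per-frame detections with existing tracks. A generic greedy IoU-matching step (a textbook baseline, not the paper's specialized feature-based association) can be sketched as:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def associate(tracks, detections, thresh=0.3):
    """Greedily match existing track boxes to new detection boxes in
    descending IoU order; returns (track_idx, det_idx) pairs."""
    pairs = sorted(((iou(t, d), ti, di)
                    for ti, t in enumerate(tracks)
                    for di, d in enumerate(detections)), reverse=True)
    matched_t, matched_d, out = set(), set(), []
    for score, ti, di in pairs:
        if score < thresh or ti in matched_t or di in matched_d:
            continue
        matched_t.add(ti)
        matched_d.add(di)
        out.append((ti, di))
    return out
```

A benchmark tracker would layer appearance features from the specialized extractor on top of this geometric association; unmatched detections spawn new tracks and unmatched tracks age out.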