Traditional Multi-Object Tracking (MOT) systems have achieved remarkable precision in localization and association, effectively answering \textit{where} and \textit{who}. However, they often function as mute observers, capable of tracing geometric paths but blind to the semantic \textit{what} and \textit{why} behind object behaviors. To bridge the gap between geometric perception and cognitive reasoning, we propose \textbf{LLMTrack}, a novel end-to-end framework for Semantic Multi-Object Tracking (SMOT). We adopt a bionic design philosophy that decouples strong localization from deep understanding, utilizing Grounding DINO as the eyes and the LLaVA-OneVision multimodal large model as the brain. We introduce a Spatio-Temporal Fusion Module that aggregates instance-level interaction features and video-level contexts, enabling the Large Language Model (LLM) to comprehend complex trajectories. Furthermore, we design a progressive three-stage training strategy (Visual Alignment, Temporal Fine-tuning, and Semantic Injection via LoRA) to efficiently adapt the massive model to the tracking domain. Extensive experiments on the BenSMOT benchmark demonstrate that LLMTrack achieves state-of-the-art performance, significantly outperforming existing methods in instance description, interaction recognition, and video summarization while maintaining robust tracking stability.
https://arxiv.org/abs/2601.06550
The Third Perception Test challenge was organised as a full-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2025. Its primary goal is to benchmark state-of-the-art video models and measure the progress in multimodal perception. This year, the workshop also featured two guest tracks: KiVA (an image understanding challenge) and Physic-IQ (a video generation challenge). In this report, we summarise the results from the main Perception Test challenge, detailing both the existing tasks as well as novel additions to the benchmark. In this iteration, we placed an emphasis on task unification, as this poses a more challenging test for current SOTA multimodal models. The challenge included five consolidated tracks: unified video QA, unified object and point tracking, unified action and sound localisation, grounded video QA, and hour-long video QA, alongside an analysis and interpretability track that is still open for submissions. Notably, the unified video QA track introduced a novel subset that reformulates traditional perception tasks (such as point tracking and temporal action localisation) as multiple-choice video QA questions that video-language models can natively tackle. The unified object and point tracking track merged the original object tracking and point tracking tasks, whereas the unified action and sound localisation track merged the original temporal action localisation and temporal sound localisation tracks. Accordingly, we required competitors to use unified approaches rather than engineered pipelines with task-specific models. By proposing such a unified challenge, Perception Test 2025 highlights the significant difficulties existing models face when tackling diverse perception tasks through unified interfaces.
https://arxiv.org/abs/2601.06287
Tracking objects that move within dynamic environments is a core challenge in robotics. Recent research has advanced this topic significantly; however, many existing approaches remain inefficient due to their reliance on heavy foundation models. To address this limitation, we propose LOST-3DSG, a lightweight open-vocabulary 3D scene graph designed to track dynamic objects in real-world environments. Our method adopts a semantic approach to entity tracking based on word2vec and sentence embeddings, enabling an open-vocabulary representation while avoiding the necessity of storing dense CLIP visual features. As a result, LOST-3DSG achieves superior performance compared to approaches that rely on high-dimensional visual embeddings. We evaluate our method through qualitative and quantitative experiments conducted in a real 3D environment using a TIAGo robot. The results demonstrate the effectiveness and efficiency of LOST-3DSG in dynamic object tracking. Code and supplementary material are publicly available on the project website at this https URL.
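The label-embedding matching at the heart of LOST-3DSG's semantic tracking can be illustrated with a toy sketch: each scene-graph node stores an embedding of its label, and a new observation is associated with the node whose embedding is most similar, with no dense visual features stored. The 3-d vectors below are fabricated stand-ins; the paper uses word2vec and sentence embeddings.

```python
import math

EMB = {   # made-up label embeddings for illustration only
    "mug":    [0.9, 0.1, 0.0],
    "cup":    [0.8, 0.2, 0.1],
    "laptop": [0.0, 0.1, 0.9],
}

def cos(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def match(observed_label, graph_nodes):
    """Associate an observed label with the most similar graph node."""
    q = EMB[observed_label]
    return max(graph_nodes, key=lambda n: cos(q, EMB[n]))

print(match("cup", ["mug", "laptop"]))  # → mug
```

Because only low-dimensional text embeddings are kept per node, the memory footprint stays small compared with caching CLIP visual features for every entity.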
https://arxiv.org/abs/2601.02905
Automated analysis of volumetric medical imaging on edge devices is severely constrained by the high memory and computational demands of 3D Convolutional Neural Networks (CNNs). This paper develops a lightweight computer vision framework that reconciles the efficiency of 2D detection with the necessity of 3D context by reformulating volumetric Computed Tomography (CT) data as sequential video streams. This video-viewpoint paradigm is applied to the time-sensitive task of Intracranial Hemorrhage (ICH) detection using the Hemorica dataset. To ensure operational efficiency, we benchmarked multiple generations of the YOLO architecture (v8, v10, v11 and v12) in their Nano configurations, selecting the version with the highest mAP@50 to serve as the slice-level backbone. The ByteTrack algorithm is then introduced to enforce anatomical consistency across the $z$-axis. To address the initialization lag inherent in video trackers, a hybrid inference strategy and a spatiotemporal consistency filter are proposed to distinguish true pathology from transient prediction noise. Experimental results on independent test data demonstrate that the proposed framework serves as a rigorous temporal validator, increasing detection Precision from 0.703 to 0.779 compared to the baseline 2D detector, while maintaining high sensitivity. By approximating 3D contextual reasoning at a fraction of the computational cost, this method provides a scalable solution for real-time patient prioritization in resource-constrained environments, such as mobile stroke units and IoT-enabled remote clinics.
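The spatiotemporal consistency filter described above can be sketched minimally: a track along the z-axis is accepted as pathology only if it persists over several consecutive slices, which suppresses transient single-slice detections. The persistence rule, threshold, and names below are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch: accept a track only if it is detected on at
# least `min_slices` consecutive CT slices along the z-axis.

def filter_tracks(tracks, min_slices=3):
    """tracks: dict mapping track_id -> sorted slice indices with detections.
    Returns the set of track ids considered anatomically consistent."""
    consistent = set()
    for tid, slices in tracks.items():
        run = 1            # length of the current consecutive run
        best = 1
        for prev, cur in zip(slices, slices[1:]):
            run = run + 1 if cur == prev + 1 else 1
            best = max(best, run)
        if best >= min_slices:
            consistent.add(tid)
    return consistent

tracks = {
    1: [10, 11, 12, 13],   # persists over 4 consecutive slices
    2: [5, 9, 20],         # isolated, transient detections (noise)
}
print(filter_tracks(tracks, min_slices=3))  # → {1}
```

This is the sense in which the tracker acts as a "temporal validator": precision rises because isolated false positives rarely persist across adjacent slices.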
https://arxiv.org/abs/2601.02521
As multi-object tracking (MOT) tasks continue to evolve toward more general and multi-modal scenarios, the rigid and task-specific architectures of existing MOT methods increasingly hinder their applicability across diverse tasks and limit flexibility in adapting to new tracking formulations. Most approaches rely on fixed output heads and bespoke tracking pipelines, making them difficult to extend to more complex or instruction-driven tasks. To address these limitations, we propose AR-MOT, a novel autoregressive paradigm that formulates MOT as a sequence generation task within a large language model (LLM) framework. This design enables the model to output structured results through flexible sequence construction, without requiring any task-specific heads. To enhance region-level visual perception, we introduce an Object Tokenizer based on a pretrained detector. To mitigate the misalignment between global and regional features, we propose a Region-Aware Alignment (RAA) module, and to support long-term tracking, we design a Temporal Memory Fusion (TMF) module that caches historical object tokens. AR-MOT offers strong potential for extensibility, as new modalities or instructions can be integrated by simply modifying the output sequence format without altering the model architecture. Extensive experiments on MOT17 and DanceTrack validate the feasibility of our approach, achieving performance comparable to state-of-the-art methods while laying the foundation for more general and flexible MOT systems.
https://arxiv.org/abs/2601.01925
Existing RGB-Event visual object tracking approaches primarily rely on conventional feature-level fusion, failing to fully exploit the unique advantages of event cameras. In particular, the high dynamic range and motion-sensitive nature of event cameras are often overlooked, while low-information regions are processed uniformly, leading to unnecessary computational overhead for the backbone network. To address these issues, we propose a novel tracking framework that performs early fusion in the frequency domain, enabling effective aggregation of high-frequency information from the event modality. Specifically, RGB and event modalities are transformed from the spatial domain to the frequency domain via the Fast Fourier Transform, with their amplitude and phase components decoupled. High-frequency event information is selectively fused into RGB modality through amplitude and phase attention, enhancing feature representation while substantially reducing backbone computation. In addition, a motion-guided spatial sparsification module leverages the motion-sensitive nature of event cameras to capture the relationship between target motion cues and spatial probability distribution, filtering out low-information regions and enhancing target-relevant features. Finally, a sparse set of target-relevant features is fed into the backbone network for learning, and the tracking head predicts the final target position. Extensive experiments on three widely used RGB-Event tracking benchmark datasets, including FE108, FELT, and COESOT, demonstrate the high performance and efficiency of our method. The source code of this paper will be released on this https URL
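The amplitude/phase decoupling described above can be shown with a toy 1-D example: both signals are transformed to the frequency domain, the event amplitude is blended into the RGB spectrum while the RGB phase is kept, and the result is transformed back. The paper operates on 2-D feature maps with the FFT and learned amplitude/phase attention; a naive DFT and a fixed blend weight stand in for both here, purely to show the mechanics.

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def fuse(rgb, event, w=0.5):
    """Blend event amplitude into the RGB spectrum; keep the RGB phase."""
    R, E = dft(rgb), dft(event)
    fused = [cmath.rect((1 - w) * abs(r) + w * abs(e), cmath.phase(r))
             for r, e in zip(R, E)]
    return idft(fused)

rgb = [1.0, 2.0, 3.0, 4.0]
event = [4.0, 0.0, 4.0, 0.0]   # strong high-frequency content
print([round(v, 3) for v in fuse(rgb, event)])
```

With `w=0`, `fuse` reproduces the RGB signal exactly; raising `w` injects progressively more of the event modality's high-frequency energy without disturbing the RGB phase structure.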
https://arxiv.org/abs/2601.01022
Multiple-object tracking (MOT) in agricultural environments presents major challenges due to repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions. Contemporary trackers in this domain rely on the motion of objects rather than appearance for association. Nevertheless, they struggle to maintain object identities when targets undergo frequent and strong occlusions. The high similarity of object appearances makes integrating appearance-based association nontrivial for agricultural scenarios. To solve this problem, we propose CropTrack, a novel MOT framework based on the combination of appearance and motion information. CropTrack integrates a reranking-enhanced appearance association, a one-to-many association with appearance-based conflict resolution strategy, and an exponential moving average prototype feature bank to improve appearance-based association. Evaluated on publicly available agricultural MOT datasets, CropTrack demonstrates consistent identity preservation, outperforming traditional motion-based tracking methods. Compared to the state of the art, CropTrack achieves significant gains in identification F1 and association accuracy scores with a lower number of identity switches.
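The exponential moving average prototype feature bank mentioned above can be sketched in a few lines: each track keeps a slowly drifting prototype vector, and detections are scored against it by cosine similarity. The smoothing factor `alpha` and the scoring choice are assumptions; vector math is done in plain Python to keep the sketch self-contained.

```python
import math

class PrototypeBank:
    """EMA prototype per track: robust to per-frame appearance noise."""

    def __init__(self, alpha=0.9):
        self.alpha = alpha
        self.prototypes = {}          # track_id -> feature vector

    def update(self, track_id, feature):
        if track_id not in self.prototypes:
            self.prototypes[track_id] = list(feature)
        else:
            p = self.prototypes[track_id]
            self.prototypes[track_id] = [
                self.alpha * pi + (1 - self.alpha) * fi
                for pi, fi in zip(p, feature)
            ]

    def score(self, track_id, feature):
        """Cosine similarity between a track prototype and a detection."""
        p = self.prototypes[track_id]
        dot = sum(a * b for a, b in zip(p, feature))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(
            sum(b * b for b in feature))
        return dot / norm if norm else 0.0

bank = PrototypeBank(alpha=0.9)
bank.update(7, [1.0, 0.0])
bank.update(7, [0.0, 1.0])            # prototype drifts only slightly
print(round(bank.score(7, [1.0, 0.0]), 3))  # → 0.994
```

Because the prototype moves slowly, a single occluded or blurred frame barely perturbs it, which is exactly what appearance association needs when all targets look alike.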
https://arxiv.org/abs/2512.24838
\noindent Memory has become the central mechanism enabling robust visual object tracking in modern segmentation-based frameworks. Recent methods built upon Segment Anything Model 2 (SAM2) have demonstrated strong performance by refining how past observations are stored and reused. However, existing approaches address memory limitations in a method-specific manner, leaving the broader design principles of memory in SAM-based tracking poorly understood. Moreover, it remains unclear how these memory mechanisms transfer to stronger, next-generation foundation models such as Segment Anything Model 3 (SAM3). In this work, we present a systematic memory-centric study of SAM-based visual object tracking. We first analyze representative SAM2-based trackers and show that most methods primarily differ in how short-term memory frames are selected, while sharing a common object-centric representation. Building on this insight, we faithfully reimplement these memory mechanisms within the SAM3 framework and conduct large-scale evaluations across ten diverse benchmarks, enabling a controlled analysis of memory design independent of backbone strength. Guided by our empirical findings, we propose a unified hybrid memory framework that explicitly decomposes memory into short-term appearance memory and long-term distractor-resolving memory. This decomposition enables the integration of existing memory policies in a modular and principled manner. Extensive experiments demonstrate that the proposed framework consistently improves robustness under long-term occlusion, complex motion, and distractor-heavy scenarios on both SAM2 and SAM3 backbones. Code is available at: this https URL. \textbf{This is a preprint. Some results are being finalized and may be updated in a future revision.}
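The short-/long-term decomposition proposed above can be illustrated with a hypothetical sketch: a FIFO of recent frames provides appearance memory, while a small long-term store keeps frames flagged as distractor-resolving (e.g. frames where the target was confidently re-identified). The selection policy and sizes below are illustrative assumptions, not SAM2/SAM3 code.

```python
from collections import deque

class HybridMemory:
    def __init__(self, short_size=6, long_size=4):
        self.short = deque(maxlen=short_size)  # recent appearance frames
        self.long_size = long_size
        self.long = []                         # distractor-resolving anchors

    def add(self, frame_id, is_anchor=False):
        self.short.append(frame_id)
        if is_anchor:
            self.long.append(frame_id)
            self.long = self.long[-self.long_size:]

    def context(self):
        """Frame ids conditioned on when segmenting the next frame."""
        return sorted(set(self.long) | set(self.short))

mem = HybridMemory()
for f in range(10):
    mem.add(f, is_anchor=(f % 5 == 0))   # frames 0 and 5 flagged as anchors
print(mem.context())  # → [0, 4, 5, 6, 7, 8, 9]
```

The modularity claimed in the abstract corresponds to swapping the `is_anchor` policy: different published memory strategies become different rules for promoting frames into the long-term store.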
https://arxiv.org/abs/2512.22624
Multi-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos. Our method rapidly selects and manages a set of images as keyframes to map a scene or object via $\pi^3$ with full bidirectional attention. We then cache the global self-attention block's key-value (KV) pairs and use them as the sole scene representation for online tracking. This allows for up to $15\times$ speedup during inference without the fear of drift or catastrophic forgetting. Our caching strategy is model-agnostic and can be applied to other off-the-shelf multi-view networks without retraining. We demonstrate KV-Tracker on both scene-level tracking and the more challenging task of on-the-fly object tracking and reconstruction without depth measurements or object priors. Experiments on the TUM RGB-D, 7-Scenes, Arctic and OnePose datasets show the strong performance of our system while maintaining high frame-rates up to ${\sim}27$ FPS.
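The KV-caching idea above can be made concrete with a toy single-head attention: keys and values are computed once for the keyframe map, stored, and reused by every subsequent tracking query, so the expensive mapping cost is paid once. Shapes and numbers are illustrative only; the real system caches the global self-attention block's KV pairs inside $\pi^3$.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    """Scaled dot-product attention over a fixed (cached) KV set."""
    scores = softmax([sum(q * k for q, k in zip(query, key)) /
                      math.sqrt(len(query)) for key in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(scores, values))
            for d in range(dim)]

# Build the cache once from the keyframes...
kv_cache = {
    "keys":   [[1.0, 0.0], [0.0, 1.0]],
    "values": [[10.0, 0.0], [0.0, 10.0]],
}
# ...then every online query reuses it (no re-encoding of the scene).
out = attend([1.0, 0.0], kv_cache["keys"], kv_cache["values"])
print([round(v, 2) for v in out])  # → [6.7, 3.3]
```

Because the cache is fixed once the keyframe set is chosen, per-frame tracking cost is a single cross-attention pass, which is where the reported speedup plausibly comes from.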
https://arxiv.org/abs/2512.22581
Multi-object tracking aims to maintain object identities over time by associating detections across video frames. Two dominant paradigms exist in literature: tracking-by-detection methods, which are computationally efficient but rely on handcrafted association heuristics, and end-to-end approaches, which learn association from data at the cost of higher computational complexity. We propose Track-Detection Link Prediction (TDLP), a tracking-by-detection method that performs per-frame association via link prediction between tracks and detections, i.e., by predicting the correct continuation of each track at every frame. TDLP is architecturally designed primarily for geometric features such as bounding boxes, while optionally incorporating additional cues, including pose and appearance. Unlike heuristic-based methods, TDLP learns association directly from data without handcrafted rules, while remaining modular and computationally efficient compared to end-to-end trackers. Extensive experiments on multiple benchmarks demonstrate that TDLP consistently surpasses state-of-the-art performance across both tracking-by-detection and end-to-end methods. Finally, we provide a detailed analysis comparing link prediction with metric learning-based association and show that link prediction is more effective, particularly when handling heterogeneous features such as detection bounding boxes. Our code is available at \href{this https URL}{this https URL}.
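The per-frame track-detection association described above can be sketched as scoring plus one-to-one matching. In TDLP the scores come from a learned link predictor over geometric (and optionally pose/appearance) features; here a simple IoU stub stands in for the predictor, and a greedy pass replaces a full assignment solver. Threshold and names are assumptions.

```python
def iou(a, b):
    """IoU of axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def associate(tracks, detections, min_score=0.3):
    """Greedy one-to-one matching by descending link score."""
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True)
    used_t, used_d, matches = set(), set(), {}
    for score, ti, di in pairs:
        if score >= min_score and ti not in used_t and di not in used_d:
            matches[ti] = di
            used_t.add(ti)
            used_d.add(di)
    return matches

tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(21, 19, 31, 29), (1, 1, 11, 11)]
print(sorted(associate(tracks, dets).items()))  # → [(0, 1), (1, 0)]
```

Replacing `iou` with a trained link-prediction score is the paper's move: the matching machinery stays the same, but the scores are learned from data rather than handcrafted.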
https://arxiv.org/abs/2512.22105
Understanding natural-language references to objects in dynamic 3D driving scenes is essential for interactive autonomous systems. In practice, many referring expressions describe targets through recent motion or short-term interactions, which cannot be resolved from static appearance or geometry alone. We study temporal language-based 3D grounding, where the objective is to identify the referred object in the current frame by leveraging multi-frame observations. We propose TrackTeller, a temporal multimodal grounding framework that integrates LiDAR-image fusion, language-conditioned decoding, and temporal reasoning in a unified architecture. TrackTeller constructs a shared UniScene representation aligned with textual semantics, generates language-aware 3D proposals, and refines grounding decisions using motion history and short-term dynamics. Experiments on the NuPrompt benchmark demonstrate that TrackTeller consistently improves language-grounded tracking performance, outperforming strong baselines with a 70% relative improvement in Average Multi-Object Tracking Accuracy and a 3.15-3.4 times reduction in False Alarm Frequency.
https://arxiv.org/abs/2512.21641
CCTV-based vehicle tracking systems face structural limitations in continuously connecting the trajectories of the same vehicle across multiple camera environments. In particular, blind spots occur due to the intervals between CCTVs and limited Fields of View (FOV), which leads to object ID switching and trajectory loss, thereby reducing the reliability of real-time path prediction. This paper proposes SPOT (Spatial Prediction Over Trajectories), a map-guided LLM agent capable of tracking vehicles even in blind spots of multi-CCTV environments without prior training. The proposed method represents road structures (Waypoints) and CCTV placement information as documents based on 2D spatial coordinates and organizes them through chunking techniques to enable real-time querying and inference. Furthermore, it transforms the vehicle's position into the actual world coordinate system using the relative position and FOV information of objects observed in CCTV images. By combining map spatial information with the vehicle's moving direction, speed, and driving patterns, a beam search is performed at the intersection level to derive candidate CCTV locations where the vehicle is most likely to enter after the blind spot. Experimental results based on the CARLA simulator in a virtual city environment confirmed that the proposed method accurately predicts the next appearing CCTV even in blind spot sections, maintaining continuous vehicle trajectories more effectively than existing techniques.
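The intersection-level beam search described above can be sketched as follows: starting from the last observed waypoint, expand the road graph a few hops, keep only the `beam` highest-scoring partial paths, and collect the CCTV nodes reached. The graph, edge scores, and heading model below are made-up stand-ins for the map documents the LLM agent queries.

```python
ROAD = {  # waypoint -> list of (next_waypoint, heading/speed score)
    "A": [("B", 0.9), ("C", 0.2)],
    "B": [("cctv_3", 0.8), ("D", 0.3)],
    "C": [("cctv_7", 0.9)],
    "D": [("cctv_7", 0.5)],
}

def beam_search(start, beam=2, hops=2):
    paths = [(1.0, [start])]                  # (score, path)
    for _ in range(hops):
        expanded = []
        for score, path in paths:
            for nxt, s in ROAD.get(path[-1], []):
                expanded.append((score * s, path + [nxt]))
        paths = sorted(expanded, reverse=True)[:beam]
    # rank CCTV nodes reached, best score first
    return [(p[-1], round(sc, 3)) for sc, p in paths
            if p[-1].startswith("cctv")]

print(beam_search("A"))  # → [('cctv_3', 0.72)]
```

The pruning step is what keeps blind-spot prediction tractable in a dense camera network: only the few most plausible exits from each intersection are carried forward.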
https://arxiv.org/abs/2512.20975
We present a system for learning generalizable hand-object tracking controllers purely from synthetic data, without requiring any human demonstrations. Our approach makes two key contributions: (1) HOP, a Hand-Object Planner, which can synthesize diverse hand-object trajectories; and (2) HOT, a Hand-Object Tracker that bridges synthetic-to-physical transfer through reinforcement learning and interaction imitation learning, delivering a generalizable controller conditioned on target hand-object states. Our method extends to diverse object shapes and hand morphologies. Through extensive evaluations, we show that our approach enables dexterous hands to track challenging, long-horizon sequences including object re-arrangement and agile in-hand reorientation. These results represent a significant step toward scalable foundation controllers for manipulation that can learn entirely from synthetic data, breaking the data bottleneck that has long constrained progress in dexterous manipulation.
我们提出了一种系统,该系统能够仅通过合成数据学习通用的手部-物体跟踪控制器,无需任何人类演示。我们的方法做出了两个关键贡献:(1)HOP(手部-物体规划器),可以生成多样化的手部-物体轨迹;(2)HOT(手部-物体跟踪器),它通过强化学习和交互模仿学习实现从合成数据到物理世界的转换,提供一个根据目标手部-物体状态条件的通用控制器。我们的方法适用于各种形状的对象和不同的手形结构。经过广泛的评估,我们展示了该方法使灵巧的手能够追踪包括对象重新排列和灵巧的手内再定向在内的复杂、长时间序列的能力。这些结果代表了向可以完全从合成数据中学习的大规模基础控制方案发展的重大一步,打破了长期以来限制灵巧操作进步的数据瓶颈。
https://arxiv.org/abs/2512.19583
Medical ultrasound videos are widely used for medical inspections, disease diagnosis and surgical planning. High-fidelity lesion area and target organ segmentation constitutes a key component of the computer-assisted surgery workflow. The low contrast levels and noisy backgrounds of ultrasound videos cause missegmentation of organ boundaries, which may lead to small object losses and increased boundary segmentation errors. Object tracking in long videos also remains a significant research challenge. To overcome these challenges, we propose a memory bank-based wavelet filtering and fusion network, which adopts an encoder-decoder structure to effectively extract fine-grained detailed spatial features and integrate high-frequency (HF) information. Specifically, memory-based wavelet convolution is presented to simultaneously capture category and detailed information and utilize adjacent information in the encoder. Cascaded wavelet compression is used to fuse multiscale frequency-domain features and expand the receptive field within each convolutional layer. A long short-term memory bank using cross-attention and memory compression mechanisms is designed to track objects in long videos. To fully utilize the boundary-sensitive HF details of feature maps, an HF-aware feature fusion module is designed via adaptive wavelet filters in the decoder. In extensive benchmark tests on four ultrasound video datasets (two thyroid nodule datasets, a thyroid gland dataset, and a heart dataset), our method demonstrates marked improvements in segmentation metrics over state-of-the-art methods. In particular, our method can more accurately segment small thyroid nodules, demonstrating its effectiveness for cases involving small ultrasound objects in long videos. The code is available at this https URL.
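The low/high-frequency separation underlying the wavelet filtering above can be illustrated with a one-level Haar split. The paper uses learned wavelet convolutions on 2-D feature maps; this 1-D sketch only shows how a signal decomposes into a smooth approximation and the boundary-sensitive detail coefficients that the decoder re-injects.

```python
import math

def haar_split(signal):
    """One-level Haar transform: return (approximation, detail)."""
    s = math.sqrt(2)
    approx = [(a + b) / s for a, b in zip(signal[::2], signal[1::2])]
    detail = [(a - b) / s for a, b in zip(signal[::2], signal[1::2])]
    return approx, detail

def haar_merge(approx, detail):
    """Exact inverse of haar_split."""
    s = math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out += [(a + d) / s, (a - d) / s]
    return out

x = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0]
lo, hi = haar_split(x)
# `hi` carries the high-frequency (edge) content; `lo` the coarse shape.
assert all(abs(u - v) < 1e-9 for u, v in zip(haar_merge(lo, hi), x))
```

Suppressing `hi` before merging yields a smoothed signal, which is why an HF-aware fusion module that preserves these coefficients helps with fine organ boundaries.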
https://arxiv.org/abs/2512.15066
Precision livestock farming requires objective assessment of social behavior to support herd welfare monitoring, yet most existing approaches infer interactions using static proximity thresholds that cannot distinguish affiliative from agonistic behaviors in complex barn environments. This limitation constrains the interpretability of automated social network analysis in commercial settings. We present a pose-based computational framework for interaction classification that moves beyond proximity heuristics by modeling the spatiotemporal geometry of anatomical keypoints. Rather than relying on pixel-level appearance or simple distance measures, the proposed method encodes interaction-specific motion signatures from keypoint trajectories, enabling differentiation of social interaction valence. The framework is implemented as an end-to-end computer vision pipeline integrating YOLOv11 for object detection (mAP@0.50: 96.24%), supervised individual identification (98.24% accuracy), ByteTrack for multi-object tracking (81.96% accuracy), ZebraPose for 27-point anatomical keypoint estimation, and a support vector machine classifier trained on pose-derived distance dynamics. On annotated interaction clips collected from a commercial dairy barn, the classifier achieved 77.51% accuracy in distinguishing affiliative and agonistic behaviors using pose information alone. Comparative evaluation against a proximity-only baseline shows substantial gains in behavioral discrimination, particularly for affiliative interactions. The results establish a proof-of-concept for automated, vision-based inference of social interactions suitable for constructing interaction-aware social networks, with near-real-time performance on commodity hardware.
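The pose-derived distance dynamics fed to the classifier above can be sketched directly: from two animals' keypoint trajectories, compute the inter-animal distance per frame and summarize it (mean, minimum, approach rate). This feature set is an assumed simplification of the paper's pose features, shown only to make the idea concrete.

```python
import math

def distance_dynamics(traj_a, traj_b):
    """traj_*: list of (x, y) keypoint positions, one per frame."""
    d = [math.dist(p, q) for p, q in zip(traj_a, traj_b)]
    deltas = [b - a for a, b in zip(d, d[1:])]
    return {
        "mean_dist": sum(d) / len(d),
        "min_dist": min(d),
        "approach_rate": sum(deltas) / len(deltas),  # < 0 means closing in
    }

# Two cows approaching each other head-to-head over five frames:
a = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
b = [(10, 0), (9, 0), (8, 0), (7, 0), (6, 0)]
feats = distance_dynamics(a, b)
print(feats)  # → {'mean_dist': 6.0, 'min_dist': 2.0, 'approach_rate': -2.0}
```

A fast, sustained approach followed by contact looks very different in these trajectories for head-butting versus allogrooming, which is what lets motion signatures separate agonistic from affiliative interactions where raw proximity cannot.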
https://arxiv.org/abs/2512.14998
In Intelligent Transportation Systems (ITS), multi-object tracking is primarily based on frame-based cameras. However, these cameras tend to perform poorly under dim lighting and high-speed motion conditions. Event cameras, characterized by low latency, high dynamic range and high temporal resolution, have considerable potential to mitigate these issues. Compared to frame-based vision, there are far fewer studies on event-based vision. To address this research gap, we introduce an initial pilot dataset tailored for event-based ITS, covering vehicle and pedestrian detection and tracking. On this dataset, we establish a tracking-by-detection benchmark with a specialized feature extractor, achieving excellent performance.
https://arxiv.org/abs/2512.14595
Extended object tracking involves estimating both the physical extent and kinematic parameters of a target object, where typically multiple measurements are observed per time step. In this article, we propose a deterministic closed-form elliptical extended object tracker, based on decoupling of the kinematics, orientation, and axis lengths. By disregarding potential correlations between these state components, fewer approximations are required for the individual estimators than for an overall joint solution. The resulting algorithm outperforms existing algorithms, reaching the accuracy of sampling-based procedures. Additionally, a batch-based variant is introduced, yielding highly efficient computation while outperforming all comparable state-of-the-art algorithms. This is validated both by a simulation study using common models from literature, as well as an extensive quantitative evaluation on real automotive radar data.
https://arxiv.org/abs/2512.14426
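The key idea in the extended-object tracker above is decoupling: centre kinematics, orientation, and axis lengths are estimated separately rather than jointly. A toy illustration of that separation for one batch of 2D measurements, using the sample mean for the centre and the covariance eigendecomposition for orientation and axes (the paper derives proper closed-form recursive estimators; this is only a caricature of the decoupled structure):

```python
import math

def decoupled_ellipse_update(points):
    """Estimate centre, orientation, and axis lengths independently
    from one batch of measurements of an elliptical target."""
    n = len(points)
    # (1) kinematic component: centre from the sample mean
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # (2) shape component: 2x2 sample covariance
    sxx = sum((x - cx) ** 2 for x, _ in points) / n
    syy = sum((y - cy) ** 2 for _, y in points) / n
    sxy = sum((x - cx) * (y - cy) for x, y in points) / n
    # eigenvalues of the covariance give the squared semi-axes (up to
    # a scattering-dependent scale factor, taken as 2 here)
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    root = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + root, tr / 2 - root
    # (3) orientation component, decoupled from the axis lengths
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (cx, cy), theta, (2 * math.sqrt(l1), 2 * math.sqrt(l2))
```

Because each component is estimated on its own, each individual estimator needs fewer approximations than a joint solution over the full state, which is the motivation the abstract gives.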
We present Recurrent Video Masked-Autoencoders (RVM): a novel video representation learning approach that uses a transformer-based recurrent neural network to aggregate dense image features over time, effectively capturing the spatio-temporal structure of natural video data. RVM learns via an asymmetric masked prediction task requiring only a standard pixel reconstruction objective. This design yields a highly efficient ``generalist'' encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action recognition and point/object tracking, while also performing favorably against image models (e.g. DINOv2) on tasks that test geometric and dense spatial understanding. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Moreover, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based architectures. Finally, we use qualitative visualizations to highlight that RVM learns rich representations of scene semantics, structure, and motion.
https://arxiv.org/abs/2512.13684
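The linear-cost property claimed for RVM comes from recurrence: each frame updates a fixed-size state instead of attending over all previous frames. A deliberately minimal caricature of that aggregation pattern (RVM's actual update is a learned transformer-based recurrent network; the scalar blending weight `alpha` here is a hypothetical stand-in):

```python
def recurrent_aggregate(frame_features, alpha=0.5):
    """Blend each frame's feature vector into a running state.

    Cost is linear in video length and memory is constant, unlike full
    spatio-temporal attention, which grows with the number of frames.
    """
    state = [0.0] * len(frame_features[0])
    for feat in frame_features:
        state = [(1 - alpha) * s + alpha * f for s, f in zip(state, feat)]
    return state
```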
High resolution phenotyping at the level of individual leaves offers fine-grained insights into plant development and stress responses. However, the full potential of accurate leaf tracking over time remains largely unexplored due to the absence of robust tracking methods, particularly for structurally complex crops such as canola. Existing plant-specific tracking methods are typically limited to small-scale species or rely on constrained imaging conditions. In contrast, generic multi-object tracking (MOT) methods are not designed for dynamic biological scenes. Progress in the development of accurate leaf tracking models has also been hindered by a lack of large-scale datasets captured under realistic conditions. In this work, we introduce CanolaTrack, a new benchmark dataset comprising 5,704 RGB images with 31,840 annotated leaf instances spanning the early growth stages of 184 canola plants. To enable accurate leaf tracking over time, we introduce LeafTrackNet, an efficient framework that combines a YOLOv10-based leaf detector with a MobileNetV3-based embedding network. During inference, leaf identities are maintained over time through an embedding-based memory association strategy. LeafTrackNet outperforms both plant-specific trackers and state-of-the-art MOT baselines, achieving a 9% HOTA improvement on CanolaTrack. Our work sets a new standard for leaf-level tracking under realistic conditions and provides CanolaTrack, the largest dataset for leaf tracking in agricultural crops, which will support future research in plant phenotyping. Our code and dataset are publicly available at this https URL.
https://arxiv.org/abs/2512.13130
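LeafTrackNet maintains leaf identities via "embedding-based memory association". A small sketch of what such a step could look like, assuming a memory bank mapping leaf IDs to their latest embeddings and cosine-similarity matching (the threshold value and the overwrite-on-match memory update are assumptions, not the paper's exact rule):

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return num / (nu * nv)

def assign_leaf_ids(memory, embeddings, thresh=0.8):
    """Match each new leaf embedding to the most similar stored identity;
    below `thresh`, mint a new identity. `memory` maps id -> embedding
    and is updated in place."""
    ids = []
    for emb in embeddings:
        best_id, best_sim = None, thresh
        for lid, mem in memory.items():
            sim = cosine(emb, mem)
            if sim > best_sim:
                best_id, best_sim = lid, sim
        if best_id is None:
            best_id = max(memory, default=-1) + 1  # new leaf appeared
        memory[best_id] = emb  # refresh memory with latest appearance
        ids.append(best_id)
    return ids
```

Refreshing the stored embedding lets the memory follow gradual appearance changes as a leaf grows, which is presumably why a memory is preferred over matching against the first observation only.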
Object tracking is an important step in robotics and autonomous driving pipelines, and it must generalize to previously unseen and complex objects. Existing high-performing methods often rely on pre-captured object views to build explicit reference models, which restricts them to a fixed set of known objects. Moreover, such reference models can struggle with visually complex appearance, reducing tracking quality. In this work, we introduce an object tracking method based on light field images that does not depend on a pre-trained model while being robust to complex visual behavior, such as reflections. We extract semantic and geometric features from light field inputs using vision foundation models and convert them into view-dependent Gaussian splats. These splats serve as a unified object representation, supporting differentiable rendering and pose optimization. We further introduce a light field object tracking dataset containing challenging reflective objects with precise ground-truth poses. Experiments demonstrate that our method is competitive with state-of-the-art model-based trackers in these difficult cases, paving the way toward universal object tracking in robotic systems. Code/data available at this https URL.
https://arxiv.org/abs/2512.13007