Recent state-of-the-art object segmentation mechanisms, such as the Segment Anything Model (SAM) and FastSAM, first encode the full image over several layers and then focus on generating the mask for one particular object or area. We present an off-grid Fovea-Like Input Patching (FLIP) approach, which selects image input and encodes it from the beginning in an object-focused manner. While doing so, it separates locational encoding from an object-centric perceptual code. FLIP is more data-efficient and yields improved segmentation performance when masking relatively small objects in high-resolution visual scenes. On standard benchmarks such as Hypersim, KITTI-360, and OpenImages, FLIP achieves Intersection over Union (IoU) scores that approach the performance of SAM with far less computational effort, and it surpasses FastSAM in all IoU measurements. We also introduce an additional semi-natural but highly intuitive dataset on which FLIP outperforms SAM and FastSAM overall, and particularly on relatively small objects. Because FLIP is an end-to-end object-centric segmentation approach, it has high potential, particularly for applications that benefit from computationally efficient, spatially highly selective object tracking.
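To make the fovea-like sampling idea concrete, the sketch below crops patches on concentric rings around a fixation point, with coarser patches toward the periphery, and returns the patch contents and their normalized locations as two separate arrays (mirroring the split between perceptual and locational codes). The ring layout, patch sizes, and helper name fovea_patches are illustrative assumptions, not the paper's actual sampling scheme.

```python
import numpy as np

def fovea_patches(image, fix_xy, rings=(0, 16, 40, 96), n_per_ring=8, patch=16):
    """Sample fovea-like patches around a fixation point (illustrative sketch).

    Patches are taken on concentric rings; outer rings use larger crops that are
    resized to a common resolution, mimicking coarser peripheral sampling.
    Returns the patch stack (perceptual code) and the normalized patch centers
    (locational code) as two separate arrays.
    """
    H, W = image.shape[:2]
    fx, fy = fix_xy
    patches, centers = [], []
    for ring_idx, radius in enumerate(rings):
        scale = 2 ** ring_idx                      # peripheral patches cover more area
        size = patch * scale
        angles = np.linspace(0, 2 * np.pi, n_per_ring, endpoint=False) if radius > 0 else [0.0]
        for a in angles:
            cx = int(np.clip(fx + radius * np.cos(a), size // 2, W - size // 2 - 1))
            cy = int(np.clip(fy + radius * np.sin(a), size // 2, H - size // 2 - 1))
            crop = image[cy - size // 2: cy + size // 2, cx - size // 2: cx + size // 2]
            crop = crop[::scale, ::scale]          # cheap resize down to patch x patch
            patches.append(crop)
            centers.append([cx / W, cy / H])
    return np.stack(patches), np.array(centers)

img = np.random.rand(480, 640, 3)
p, c = fovea_patches(img, fix_xy=(320, 240))
print(p.shape, c.shape)   # (25, 16, 16, 3) perceptual code and (25, 2) locational code
```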
https://arxiv.org/abs/2502.02763
In this work, we present INTACT, a novel two-phase framework designed to enhance the robustness of deep neural networks (DNNs) against noisy LiDAR data in safety-critical perception tasks. INTACT combines meta-learning with adversarial curriculum training (ACT) to systematically address challenges posed by data corruption and sparsity in 3D point clouds. The meta-learning phase equips a teacher network with task-agnostic priors, enabling it to generate robust saliency maps that identify critical data regions. The ACT phase leverages these saliency maps to progressively expose a student network to increasingly complex noise patterns, ensuring targeted perturbation and improved noise resilience. INTACT's effectiveness is demonstrated through comprehensive evaluations on object detection, tracking, and classification benchmarks using diverse datasets, including KITTI, Argoverse, and ModelNet40. Results indicate that INTACT improves model robustness by up to 20% across all tasks, outperforming standard adversarial and curriculum training methods. This framework not only addresses the limitations of conventional training strategies but also offers a scalable and efficient solution for real-world deployment in resource-constrained safety-critical systems. INTACT's principled integration of meta-learning and adversarial training establishes a new paradigm for noise-tolerant 3D perception in safety-critical applications. INTACT improved KITTI Multiple Object Tracking Accuracy (MOTA) by 9.6% (64.1% -> 75.1%) and by 12.4% under Gaussian noise (52.5% -> 73.7%). Similarly, KITTI mean Average Precision (mAP) rose from 59.8% to 69.8% under a 50% point drop and from 49.3% to 70.9% under Gaussian noise, highlighting the framework's ability to enhance deep learning model resilience in safety-critical object tracking scenarios.
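As a rough illustration of saliency-guided curriculum corruption, the sketch below ramps up point-cloud jitter and dropout with the training epoch and biases both toward points a teacher marks as salient. The linear schedule, the way saliency gates the perturbations, and the function name curriculum_corrupt are assumptions for illustration only.

```python
import numpy as np

def curriculum_corrupt(points, saliency, epoch, max_epochs,
                       max_sigma=0.05, max_drop=0.3, rng=np.random.default_rng(0)):
    """Apply curriculum-scheduled corruption to a LiDAR point cloud (sketch).

    points   : (N, 3) xyz coordinates
    saliency : (N,) importance weights in [0, 1], e.g. from a teacher network
    Severity ramps up linearly with the training epoch; perturbations are
    biased toward salient points so the student is stressed where it matters.
    """
    severity = (epoch + 1) / max_epochs                  # 0 -> 1 curriculum schedule
    # Gaussian jitter, scaled per point by its saliency
    sigma = max_sigma * severity * saliency[:, None]
    noisy = points + rng.normal(0.0, 1.0, points.shape) * sigma
    # Random dropout, more likely for salient points at higher severity
    drop_prob = max_drop * severity * saliency
    keep = rng.random(len(points)) > drop_prob
    return noisy[keep]

pts = np.random.randn(1000, 3)
sal = np.random.rand(1000)
print(curriculum_corrupt(pts, sal, epoch=5, max_epochs=10).shape)
```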
https://arxiv.org/abs/2502.01896
Accurate 3D multi-object tracking (MOT) is vital for autonomous vehicles, yet LiDAR and camera-based methods degrade in adverse weather. Meanwhile, Radar-based solutions remain robust but often suffer from limited vertical resolution and simplistic motion models. Existing Kalman filter-based approaches also rely on fixed noise covariance, hampering adaptability when objects make sudden maneuvers. We propose Bayes-4DRTrack, a 4D Radar-based MOT framework that adopts a transformer-based motion prediction network to capture nonlinear motion dynamics and employs Bayesian approximation in both detection and prediction steps. Moreover, our two-stage data association leverages Doppler measurements to better distinguish closely spaced targets. Evaluated on the K-Radar dataset (including adverse weather scenarios), Bayes-4DRTrack demonstrates a 5.7% gain in Average Multi-Object Tracking Accuracy (AMOTA) over methods with traditional motion models and fixed noise covariance. These results showcase enhanced robustness and accuracy in demanding, real-world conditions.
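The two-stage association idea, using Doppler to separate closely spaced targets before matching on position, can be sketched as below. The state layout, gate values, and the use of a plain Hungarian match are illustrative choices, not the paper's exact pipeline.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def two_stage_associate(tracks, dets, doppler_gate=2.0, dist_gate=4.0):
    """Two-stage association sketch: Doppler gating first, then position cost.

    tracks, dets: arrays of shape (M, 4) / (N, 4) holding [x, y, z, radial_velocity].
    Pairs whose Doppler (radial velocity) difference exceeds the gate are made
    infeasible before Hungarian matching on Euclidean distance.
    """
    if len(tracks) == 0 or len(dets) == 0:
        return []
    pos_cost = np.linalg.norm(tracks[:, None, :3] - dets[None, :, :3], axis=-1)
    dop_diff = np.abs(tracks[:, None, 3] - dets[None, :, 3])
    cost = np.where(dop_diff > doppler_gate, 1e6, pos_cost)    # stage 1: Doppler gate
    rows, cols = linear_sum_assignment(cost)                    # stage 2: position match
    return [(r, c) for r, c in zip(rows, cols)
            if cost[r, c] < dist_gate]                          # reject gated / distant pairs

tracks = np.array([[0, 0, 0, 5.0], [10, 0, 0, -3.0]])
dets   = np.array([[0.5, 0.2, 0, 4.8], [9.7, 0.1, 0, 2.5]])
print(two_stage_associate(tracks, dets))   # only (0, 0) survives; the Doppler gate rejects the rest
```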
https://arxiv.org/abs/2502.01357
Next-frame prediction in videos is crucial for applications such as autonomous driving, object tracking, and motion prediction. The primary challenge in next-frame prediction lies in effectively capturing and processing both spatial and temporal information from previous video sequences. The transformer architecture, known for its prowess in handling sequence data, has made remarkable progress in this domain. However, transformer-based next-frame prediction models face notable issues: (a) The multi-head self-attention (MHSA) mechanism requires the input embedding to be split into $N$ chunks, where $N$ is the number of heads. Each segment captures only a fraction of the original embedding's information, which distorts the representation of the embedding in the latent space, resulting in a semantic dilution problem; (b) These models predict the embeddings of the next frames rather than the frames themselves, but the loss function is based on the errors of the reconstructed frames, not the predicted embeddings -- this creates a discrepancy between the training objective and the model output. We propose a Semantic Concentration Multi-Head Self-Attention (SCMHSA) architecture, which effectively mitigates semantic dilution in transformer-based next-frame prediction. Additionally, we introduce a loss function that optimizes SCMHSA in the latent space, aligning the training objective more closely with the model output. Our method demonstrates superior performance compared to the original transformer-based predictors.
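A toy example of the head-splitting issue in (a): in standard MHSA each head works with only d/N of the embedding width, whereas a "semantic concentration" style head keeps the full width per head. The sketch below contrasts the two; it is a didactic stand-in, not the SCMHSA implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(x, n_heads, full_dim_heads=False, rng=np.random.default_rng(0)):
    """Toy self-attention contrasting head-split vs full-dimension heads.

    In the standard setting each head is projected to d/n_heads dimensions, so it
    works with only a fraction of the embedding width. With full_dim_heads=True
    every head keeps its own d-dimensional Q/K/V projection, which is the kind of
    'semantic concentration' the abstract argues for.
    """
    T, d = x.shape
    head_dim = d if full_dim_heads else d // n_heads
    outs = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.standard_normal((d, head_dim)) / np.sqrt(d) for _ in range(3))
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        attn = softmax(q @ k.T / np.sqrt(head_dim))
        outs.append(attn @ v)                       # (T, head_dim) per head
    return np.concatenate(outs, axis=-1)            # a real model would project back to d

x = np.random.randn(10, 64)                         # 10 frame embeddings, d = 64
print(mhsa(x, n_heads=8).shape)                     # (10, 64): each head saw 8 dims
print(mhsa(x, n_heads=8, full_dim_heads=True).shape)  # (10, 512): each head saw all 64
```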
https://arxiv.org/abs/2501.16753
The analysis of extended video content poses unique challenges in artificial intelligence, particularly when dealing with the complexity of tracking and understanding visual elements across time. Current methodologies that process video frames sequentially struggle to maintain coherent tracking of objects, especially when these objects temporarily vanish and later reappear in the footage. A critical limitation of these approaches is their inability to effectively identify crucial moments in the video, largely due to their limited grasp of temporal relationships. To overcome these obstacles, we present GraphVideoAgent, a cutting-edge system that leverages the power of graph-based object tracking in conjunction with large language model capabilities. At its core, our framework employs a dynamic graph structure that maps and monitors the evolving relationships between visual entities throughout the video sequence. This innovative approach enables more nuanced understanding of how objects interact and transform over time, facilitating improved frame selection through comprehensive contextual awareness. Our approach demonstrates remarkable effectiveness when tested against industry benchmarks. In evaluations on the EgoSchema dataset, GraphVideoAgent achieved a 2.2-point improvement over existing methods while requiring analysis of only 8.2 frames on average. Similarly, testing on the NExT-QA benchmark yielded a 2.0-point performance increase with an average frame requirement of 8.1. These results underscore the efficiency of our graph-guided methodology in enhancing both accuracy and computational performance in long-form video understanding tasks.
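A minimal sketch of a dynamic object graph updated frame by frame: nodes are object identities and edge weights count co-occurrence, so an object that vanishes and reappears simply strengthens its existing edges. The co-occurrence relation and the helper name update_scene_graph are simplifying assumptions; the paper's graph tracks richer relations.

```python
from collections import defaultdict

def update_scene_graph(graph, frame_objects):
    """Update a dynamic object-relationship graph with one frame's detections (sketch).

    graph         : dict mapping (obj_a, obj_b) -> co-occurrence count
    frame_objects : list of object identities visible in the current frame
    """
    objs = sorted(set(frame_objects))
    for i, a in enumerate(objs):
        for b in objs[i + 1:]:
            graph[(a, b)] += 1
    return graph

graph = defaultdict(int)
for frame in [["person", "cup"], ["person", "cup", "laptop"], ["person", "laptop"]]:
    update_scene_graph(graph, frame)
print(dict(graph))   # {('cup', 'person'): 2, ('cup', 'laptop'): 1, ('laptop', 'person'): 2}
```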
https://arxiv.org/abs/2501.15953
Inspection systems utilizing unmanned aerial vehicles (UAVs) equipped with thermal cameras are increasingly popular for the maintenance of photovoltaic (PV) power plants. However, automation of the inspection task is a challenging problem as it requires precise navigation to capture images from optimal distances and viewing angles. This paper presents a novel localization pipeline that directly integrates PV module detection with UAV navigation, allowing precise positioning during inspection. Detections are used to identify the power plant structures in the image and associate these with the power plant model. We define visually recognizable anchor points for the initial association and use object tracking to discern global associations. We present three distinct methods for visual segmentation of PV modules based on traditional computer vision, deep learning, and their fusion, and we evaluate their performance in relation to the proposed localization pipeline. The presented methods were verified and evaluated using custom aerial inspection data sets, demonstrating their robustness and applicability for real-time navigation. Additionally, we evaluate the influence of the power plant model's precision on the localization methods.
https://arxiv.org/abs/2501.14587
We introduce YOLO11-JDE, a fast and accurate multi-object tracking (MOT) solution that combines real-time object detection with self-supervised Re-Identification (Re-ID). By incorporating a dedicated Re-ID branch into YOLO11s, our model performs Joint Detection and Embedding (JDE), generating appearance features for each detection. The Re-ID branch is trained in a fully self-supervised setting while simultaneously training for detection, eliminating the need for costly identity-labeled datasets. The triplet loss, with hard positive and semi-hard negative mining strategies, is used for learning discriminative embeddings. Data association is enhanced with a custom tracking implementation that successfully integrates motion, appearance, and location cues. YOLO11-JDE achieves competitive results on MOT17 and MOT20 benchmarks, surpassing existing JDE methods in terms of FPS and using up to ten times fewer parameters, making it a highly attractive solution for real-world applications.
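The mining strategy for the Re-ID branch can be illustrated with a standard batch-wise triplet loss: for each anchor, take the farthest positive and the closest negative that is still farther than that positive. This is a generic sketch of hard-positive / semi-hard-negative mining, not the paper's exact training code.

```python
import numpy as np

def triplet_loss_mined(emb, ids, margin=0.3):
    """Batch-wise triplet loss with hard-positive / semi-hard-negative mining (sketch).

    emb : (B, D) L2-normalized embeddings,  ids : (B,) identity labels.
    """
    dist = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)   # (B, B) pairwise distances
    losses = []
    for i in range(len(emb)):
        pos = (ids == ids[i]) & (np.arange(len(emb)) != i)
        neg = ids != ids[i]
        if not pos.any() or not neg.any():
            continue
        d_pos = dist[i][pos].max()                       # hard positive: farthest same-ID sample
        harder = dist[i][neg][dist[i][neg] > d_pos]      # negatives farther than the hard positive
        d_neg = harder.min() if len(harder) else dist[i][neg].min()  # semi-hard, else hardest negative
        losses.append(max(0.0, d_pos - d_neg + margin))
    return float(np.mean(losses)) if losses else 0.0

emb = np.random.randn(8, 32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
ids = np.array([0, 0, 1, 1, 2, 2, 3, 3])
print(triplet_loss_mined(emb, ids))
```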
https://arxiv.org/abs/2501.13710
Object Tracking is essential for many computer vision applications, such as autonomous navigation, surveillance, and robotics. Unlike Passive Object Tracking (POT), which relies on static camera viewpoints to detect and track objects across consecutive frames, Active Object Tracking (AOT) requires a controller agent to actively adjust its viewpoint to maintain visual contact with a moving target in complex environments. Existing AOT solutions are predominantly single-agent-based, which struggle in dynamic and complex scenarios due to limited information gathering and processing capabilities, often resulting in suboptimal decision-making. Alleviating these limitations necessitates the development of a multi-agent system where different agents perform distinct roles and collaborate to enhance learning and robustness in dynamic and complex environments. Although some multi-agent approaches exist for AOT, they typically rely on external auxiliary agents, which require additional devices, making them costly. In contrast, we introduce the Collaborative System for Active Object Tracking (CSAOT), a method that leverages multi-agent deep reinforcement learning (MADRL) and a Mixture of Experts (MoE) framework to enable multiple agents to operate on a single device, thereby improving tracking performance and reducing costs. Our approach enhances robustness against occlusions and rapid motion while optimizing camera movements to extend tracking duration. We validated the effectiveness of CSAOT on various interactive maps with dynamic and stationary obstacles.
https://arxiv.org/abs/2501.13994
This paper aims to improve the performance of video multimodal large language models (MLLM) via long and rich context (LRC) modeling. As a result, we develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details and capture long-form temporal structure in videos. Specifically, our approach incorporates dense vision task annotations into MLLMs using direct preference optimization and develops compact spatiotemporal representations through adaptive hierarchical token compression. Experimental results demonstrate this unique design of LRC greatly improves the results of video MLLM in mainstream video understanding benchmarks (short & long), enabling the MLLM to memorize significantly longer video inputs (at least 6x longer than the original), and master specialized vision capabilities like object tracking and segmentation. Our work highlights the importance of multimodal context richness (length and fineness) in empowering MLLM's innate abilities (focus and memory), providing new insights for future research on video MLLM. Code and models are available at this https URL
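A simplified view of token compression for long videos: temporally adjacent tokens that are nearly identical are merged into one representative, shortening the sequence the MLLM must attend over. The cosine-similarity threshold and greedy grouping below are illustrative assumptions, not the paper's adaptive hierarchical scheme.

```python
import numpy as np

def compress_tokens(tokens, threshold=0.9):
    """Merge temporally adjacent, highly similar visual tokens (illustrative sketch).

    tokens : (T, D) visual tokens in temporal order. A token is averaged into the
    current group when its cosine similarity to the group mean exceeds the
    threshold; otherwise a new group starts.
    """
    groups = [[tokens[0]]]
    for tok in tokens[1:]:
        rep = np.mean(groups[-1], axis=0)
        cos = tok @ rep / (np.linalg.norm(tok) * np.linalg.norm(rep) + 1e-9)
        if cos > threshold:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    return np.stack([np.mean(g, axis=0) for g in groups])

toks = np.repeat(np.random.randn(8, 256), 16, axis=0)      # 8 distinct frames x 16 near-duplicates
toks += 0.01 * np.random.randn(*toks.shape)
print(compress_tokens(toks).shape)                          # roughly (8, 256)
```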
https://arxiv.org/abs/2501.12386
Multi-object tracking (MOT) is a rising topic in video processing technologies and has important application value in consumer electronics. Currently, tracking-by-detection (TBD) is the dominant paradigm for MOT, which performs target detection and association frame by frame. However, the association performance of TBD methods degrades in complex scenes with heavy occlusions, which hinders the application of such methods in real-world scenarios. To this end, we incorporate pseudo-depth cues to enhance the association performance and propose Pseudo-Depth SORT (PD-SORT). First, we extend the Kalman filter state vector with pseudo-depth states. Second, we introduce a novel depth volume IoU (DVIoU) by combining the conventional 2D IoU with pseudo-depth. Furthermore, we develop a quantized pseudo-depth measurement (QPDM) strategy for more robust data association. Besides, we also integrate camera motion compensation (CMC) to handle dynamic camera situations. With the above designs, PD-SORT significantly alleviates the occlusion-induced ambiguous associations and achieves leading performances on DanceTrack, MOT17, and MOT20. Note that the improvement is especially obvious on DanceTrack, where objects show complex motions, similar appearances, and frequent occlusions. The code is available at this https URL.
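To illustrate how a pseudo-depth cue can disambiguate overlapping boxes, the sketch below treats each detection as a slab along the depth axis and scales the ordinary 2D IoU by the slabs' 1D overlap. The slab model and the multiplicative combination are assumptions made for illustration; DVIoU's actual formulation may differ.

```python
import numpy as np

def iou_2d(a, b):
    """IoU of axis-aligned boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def dv_iou(box_a, depth_a, box_b, depth_b, depth_extent=1.0):
    """Depth-volume IoU sketch: 2D IoU combined with pseudo-depth overlap.

    Each box is treated as a slab [depth - extent/2, depth + extent/2] along the
    pseudo-depth axis; the 1D overlap ratio of the slabs scales the 2D IoU.
    """
    lo = max(depth_a, depth_b) - depth_extent / 2
    hi = min(depth_a, depth_b) + depth_extent / 2
    depth_overlap = max(0.0, hi - lo) / depth_extent
    return iou_2d(box_a, box_b) * depth_overlap

print(dv_iou([0, 0, 10, 10], 5.0, [2, 2, 12, 12], 5.3))   # similar depth  -> high score
print(dv_iou([0, 0, 10, 10], 5.0, [2, 2, 12, 12], 9.0))   # distant depth  -> 0.0
```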
https://arxiv.org/abs/2501.11288
In the realm of multi-object tracking, the challenge of accurately capturing the spatial and temporal relationships between objects in video sequences remains a significant hurdle. This is further complicated by frequent occurrences of mutual occlusions among objects, which can lead to tracking errors and reduced performance in existing methods. Motivated by these challenges, we propose a novel adaptive key frame mining strategy that addresses the limitations of current tracking approaches. Specifically, we introduce a Key Frame Extraction (KFE) module that leverages reinforcement learning to adaptively segment videos, thereby guiding the tracker to exploit the intrinsic logic of the video content. This approach allows us to capture structured spatial relationships between different objects as well as the temporal relationships of objects across frames. To tackle the issue of object occlusions, we have developed an Intra-Frame Feature Fusion (IFF) module. Unlike traditional graph-based methods that primarily focus on inter-frame feature fusion, our IFF module uses a Graph Convolutional Network (GCN) to facilitate information exchange between the target and surrounding objects within a frame. This innovation significantly enhances target distinguishability and mitigates tracking loss and appearance similarity due to occlusions. By combining the strengths of both long and short trajectories and considering the spatial relationships between objects, our proposed tracker achieves impressive results on the MOT17 dataset, i.e., 68.6 HOTA, 81.0 IDF1, 66.6 AssA, and 893 IDS, proving its effectiveness and accuracy.
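The intra-frame fusion step can be pictured as one graph-convolution layer over the detections of a single frame, linking detections whose box centers are close and mixing their features. The distance-threshold adjacency and single propagation step below are a generic GCN sketch, not the paper's IFF module.

```python
import numpy as np

def intra_frame_gcn(features, positions, radius=50.0, rng=np.random.default_rng(0)):
    """One GCN layer over the detections of a single frame (illustrative sketch).

    features  : (N, D) appearance features of the N detections in the frame
    positions : (N, 2) box centers; detections within `radius` pixels are linked.
    Implements the standard propagation  H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
    """
    N, D = features.shape
    dist = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    A = (dist < radius).astype(float)               # adjacency incl. self-loops (diagonal dist = 0)
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_hat = D_inv_sqrt @ A @ D_inv_sqrt             # symmetric normalization
    W = rng.standard_normal((D, D)) / np.sqrt(D)
    return np.maximum(A_hat @ features @ W, 0.0)    # fused, neighborhood-aware features

feats = np.random.randn(5, 128)
boxes = np.array([[10, 10], [30, 20], [300, 300], [320, 310], [600, 50]], float)
print(intra_frame_gcn(feats, boxes).shape)          # (5, 128)
```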
https://arxiv.org/abs/2501.10129
Video editing models have advanced significantly, but evaluating their performance remains challenging. Traditional metrics, such as CLIP text and image scores, often fall short: text scores are limited by inadequate training data and hierarchical dependencies, while image scores fail to assess temporal consistency. We present SST-EM (Semantic, Spatial, and Temporal Evaluation Metric), a novel evaluation framework that leverages modern Vision-Language Models (VLMs), Object Detection, and Temporal Consistency checks. SST-EM comprises four components: (1) semantic extraction from frames using a VLM, (2) primary object tracking with Object Detection, (3) focused object refinement via an LLM agent, and (4) temporal consistency assessment using a Vision Transformer (ViT). These components are integrated into a unified metric with weights derived from human evaluations and regression analysis. The name SST-EM reflects its focus on Semantic, Spatial, and Temporal aspects of video evaluation. SST-EM provides a comprehensive evaluation of semantic fidelity and temporal smoothness in video editing. The source code is available in the GitHub repository (this https URL).
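A minimal sketch of how component scores can be folded into one metric with regression-derived weights: fit weights against human ratings, then report the weighted sum. The least-squares fit, the normalization, and the four-component layout are assumptions made for illustration.

```python
import numpy as np

def fit_metric_weights(component_scores, human_scores):
    """Fit combination weights for four per-video component scores (sketch).

    component_scores : (M, 4) scores [semantic, object, refinement, temporal]
    human_scores     : (M,)   human preference ratings
    A least-squares regression stands in for the paper's weight-derivation step.
    """
    w, *_ = np.linalg.lstsq(component_scores, human_scores, rcond=None)
    return w / w.sum()                              # normalize so the weights sum to 1

def unified_score(components, weights):
    """Unified metric value: weighted sum of the component scores."""
    return float(np.dot(components, weights))

scores = np.random.rand(50, 4)                      # 50 edited videos, 4 components each
human  = scores @ np.array([0.4, 0.2, 0.1, 0.3]) + 0.02 * np.random.randn(50)
w = fit_metric_weights(scores, human)
print(w, unified_score(np.array([0.9, 0.8, 0.7, 0.95]), w))
```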
https://arxiv.org/abs/2501.07554
Timber represents an increasingly valuable and versatile resource. However, forestry operations such as harvesting, handling and measuring logs still require substantial human labor in remote environments posing significant safety risks. Progressively automating these tasks has the potential of increasing their efficiency as well as safety, but requires an accurate detection of individual logs as well as live trees and their context. Although initial approaches have been proposed for this challenging application domain, specialized data and algorithms are still too scarce to develop robust solutions. To mitigate this gap, we introduce the TimberVision dataset, consisting of more than 2k annotated RGB images containing a total of 51k trunk components including cut and lateral surfaces, thereby surpassing any existing dataset in this domain in terms of both quantity and detail by a large margin. Based on this data, we conduct a series of ablation experiments for oriented object detection and instance segmentation and evaluate the influence of multiple scene parameters on model performance. We introduce a generic framework to fuse the components detected by our models for both tasks into unified trunk representations. Furthermore, we automatically derive geometric properties and apply multi-object tracking to further enhance robustness. Our detection and tracking approach provides highly descriptive and accurate trunk representations solely from RGB image data, even under challenging environmental conditions. Our solution is suitable for a wide range of application scenarios and can be readily combined with other sensor modalities.
https://arxiv.org/abs/2501.07360
3D single object tracking (3DSOT) in LiDAR point clouds is a critical task for outdoor perception, enabling real-time perception of object location, orientation, and motion. Despite the impressive performance of current 3DSOT methods, evaluating them on clean datasets inadequately reflects their comprehensive performance, as the adverse weather conditions in real-world surroundings have not been considered. One of the main obstacles is the lack of adverse weather benchmarks for the evaluation of 3DSOT. To this end, this work proposes a challenging benchmark for LiDAR-based 3DSOT in adverse weather, which comprises two synthetic datasets (KITTI-A and nuScenes-A) and one real-world dataset (CADC-SOT) spanning three weather types: rain, fog, and snow. Based on this benchmark, we conduct a robustness evaluation of five representative 3D trackers from different tracking frameworks and observe significant performance degradation. This prompts the question: What are the factors that cause current advanced methods to fail on such adverse weather samples? Consequently, we explore the impacts of adverse weather and answer the above question from three perspectives: 1) target distance; 2) template shape corruption; and 3) target shape corruption. Finally, based on domain randomization and contrastive learning, we design a dual-branch tracking framework for adverse weather, named DRCT, which achieves excellent performance on the benchmark.
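Domain randomization for adverse weather can be approximated by randomly dropping points and jittering coordinates with per-sample severities; two corrupted views of the same cloud then form a positive pair for contrastive learning. The corruption types and ranges below are illustrative, not the paper's weather simulation.

```python
import numpy as np

def randomize_weather(points, rng=np.random.default_rng()):
    """Domain-randomized weather-like corruption of a LiDAR point cloud (sketch).

    Randomly drops points (fog/snow attenuation) and jitters coordinates
    (range noise under precipitation). Severities are sampled per call, so every
    training sample sees a different synthetic 'weather'; two calls on the same
    cloud yield a positive pair for contrastive learning.
    """
    drop_rate = rng.uniform(0.0, 0.5)
    sigma = rng.uniform(0.0, 0.05)
    keep = rng.random(len(points)) > drop_rate
    return points[keep] + rng.normal(0.0, sigma, (keep.sum(), 3))

cloud = np.random.randn(2048, 3)
view_a, view_b = randomize_weather(cloud), randomize_weather(cloud)   # positive pair
print(view_a.shape, view_b.shape)
```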
https://arxiv.org/abs/2501.07133
We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at this https URL
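The training objective reduces to next-token prediction over discrete visual tokens; the sketch below computes that shifted cross-entropy for a single sequence. The vocabulary size and sequence length are placeholders, not the models' actual configuration.

```python
import numpy as np

def next_token_loss(logits, tokens):
    """Autoregressive next-token objective over visual tokens (sketch).

    logits : (T, V) model outputs for positions 0..T-1
    tokens : (T,)   discrete visual token ids
    Position t is trained to predict token t+1 (teacher forcing).
    """
    pred, target = logits[:-1], tokens[1:]
    logp = pred - np.log(np.exp(pred).sum(axis=-1, keepdims=True))   # log-softmax
    return float(-logp[np.arange(len(target)), target].mean())

vocab, T = 1024, 16
print(next_token_loss(np.random.randn(T, vocab), np.random.randint(0, vocab, T)))
```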
https://arxiv.org/abs/2501.05453
Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the "Mind Palace", which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping, allowing natural language parsing by LLMs to provide grounded insights on spatio-temporal and 3D context. In addition, we propose the Video MindPalace Benchmark (VMB), to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, VideoMindPalace demonstrates notable gains in spatio-temporal coherence and human-aligned reasoning, advancing long-form video analysis capabilities in VLMs.
https://arxiv.org/abs/2501.04336
Tracking and acquiring simultaneous optical images of randomly moving targets obscured by scattering media remains a challenging problem of importance to many applications that require precise object localization and identification. In this work we develop an end-to-end neuromorphic optical engineering and computational approach to demonstrate how to track and image normally invisible objects by combining an event detecting camera with a multistage neuromorphic deep learning strategy. Photons emerging from dense scattering media are detected by the event camera and converted to pixel-wise asynchronized spike trains - a first step in isolating object-specific information from the dominant uninformative background. Spiking data is fed into a deep spiking neural network (SNN) engine where object tracking and image reconstruction are performed by two separate yet interconnected modules running in parallel in discrete time steps over the event duration. Through benchtop experiments we demonstrate tracking and imaging randomly moving objects in dense turbid media as well as image reconstruction of spatially stationary but optically dynamic objects. Standardized character sets serve as representative proxies for geometrically complex objects, underscoring the method's generality. The results highlight the advantages of a fully neuromorphic approach in meeting a major imaging-technology challenge with high computational efficiency and low power consumption.
https://arxiv.org/abs/2501.03874
Previous visual object tracking methods employ image-feature regression models or coordinate autoregression models for bounding box prediction. Image-feature regression methods heavily depend on matching results and do not utilize a positional prior, while the autoregressive approach can only be trained using bounding boxes available in the training set, potentially resulting in suboptimal performance during testing with unseen data. Inspired by the diffusion model, denoising learning enhances the model's robustness to unseen data. Therefore, we introduce noise to bounding boxes, generating noisy boxes for training and thus enhancing model robustness on testing data. We propose a new paradigm to formulate the visual object tracking problem as a denoising learning process. However, tracking algorithms are usually required to run in real time, and directly applying the diffusion model to object tracking would severely impair tracking speed. Therefore, we decompose the denoising learning process into every denoising block within a model, rather than running the model multiple times, and thus we summarize the proposed paradigm as an in-model latent denoising learning process. Specifically, we propose a denoising Vision Transformer (ViT), which is composed of multiple denoising blocks. Template and search embeddings are projected into every denoising block as conditions. A denoising block is responsible for removing the noise in a predicted bounding box, and multiple stacked denoising blocks cooperate to accomplish the whole denoising process. Subsequently, we utilize image features and trajectory information to refine the denoised bounding box. Besides, we also utilize trajectory memory and visual memory to improve tracking stability. Experimental results validate the effectiveness of our approach, achieving competitive performance on several challenging datasets.
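The "noisy boxes for training" step can be sketched as jittering a ground-truth box in center and scale; the denoising blocks are then trained to map each perturbed box back to the ground truth. The noise model and magnitudes below are assumptions for illustration only.

```python
import numpy as np

def make_noisy_boxes(gt_box, n=16, scale_xy=0.1, scale_wh=0.2, rng=np.random.default_rng(0)):
    """Generate noisy training boxes around a ground-truth box (sketch).

    gt_box is [cx, cy, w, h]; centers are jittered relative to the box size and
    width/height are scaled log-normally. During training, denoising blocks learn
    to map such perturbed boxes back to the ground truth.
    """
    cx, cy, w, h = gt_box
    noisy = np.empty((n, 4))
    noisy[:, 0] = cx + rng.normal(0, scale_xy, n) * w
    noisy[:, 1] = cy + rng.normal(0, scale_xy, n) * h
    noisy[:, 2] = w * np.exp(rng.normal(0, scale_wh, n))
    noisy[:, 3] = h * np.exp(rng.normal(0, scale_wh, n))
    return noisy

boxes = make_noisy_boxes([320.0, 240.0, 80.0, 60.0])
targets = np.tile([320.0, 240.0, 80.0, 60.0], (len(boxes), 1))   # denoising regression targets
print(boxes.shape, np.abs(boxes - targets).mean())
```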
https://arxiv.org/abs/2501.02467
The evolution of Advanced Driver Assistance Systems (ADAS) has increased the need for robust and generalizable algorithms for multi-object tracking. Traditional statistical model-based tracking methods rely on predefined motion models and assumptions about system noise distributions. Although computationally efficient, they often lack adaptability to varying traffic scenarios and require extensive manual design and parameter tuning. To address these issues, we propose a novel 3D multi-object tracking approach for vehicles, HybridTrack, which integrates a data-driven Kalman Filter (KF) within a tracking-by-detection paradigm. In particular, it learns the transition residual and Kalman gain directly from data, which eliminates the need for manual motion and stochastic parameter modeling. Validated on the real-world KITTI dataset, HybridTrack achieves 82.08% HOTA accuracy, significantly outperforming state-of-the-art methods. We also evaluate our method under different configurations, achieving the fastest processing speed of 112 FPS. Consequently, HybridTrack eliminates the dependency on scene-specific designs while improving performance and maintaining real-time efficiency. The code will be publicly available at the time of publishing: this https URL.
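A way to picture a data-driven Kalman filter: keep the standard predict/update structure but allow a network to supply the transition residual and the gain instead of the analytic ones. The constant-velocity model and the override hooks below are a sketch under that assumption, not HybridTrack's architecture.

```python
import numpy as np

def kf_step(x, P, z, learned_gain=None, learned_residual=None, dt=0.1, q=1e-2, r=1e-1):
    """One constant-velocity Kalman step with optional learned components (sketch).

    x : (4,) state [px, py, vx, vy],  P : (4, 4) covariance,  z : (2,) position measurement.
    In a classical KF the gain K follows from P, H, R; a data-driven variant would
    instead predict K (and a transition residual) from features, so both are
    exposed here as optional overrides.
    """
    F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]], float)
    H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)
    Q, R = q * np.eye(4), r * np.eye(2)

    x_pred = F @ x
    if learned_residual is not None:                # learned correction to the motion model
        x_pred = x_pred + learned_residual
    P_pred = F @ P @ F.T + Q

    if learned_gain is None:                        # analytic Kalman gain
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)
    else:                                           # gain predicted by a network
        K = learned_gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(4) - K @ H) @ P_pred
    return x_new, P_new

x, P = np.array([0.0, 0.0, 1.0, 0.0]), np.eye(4)
print(kf_step(x, P, z=np.array([0.12, 0.01]))[0])
```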
https://arxiv.org/abs/2501.01275
Changes in room acoustics, such as modifications to surface absorption or the insertion of a scattering object, significantly impact measured room impulse responses (RIRs). These changes can affect the performance of systems used in echo cancellation and active acoustics and support tasks such as navigation and object tracking. Recognizing and quantifying such changes is, therefore, critical for advancing technologies based on room acoustics. This study introduces a method for analyzing acoustic environment changes by evaluating the similarity of consecutively recorded RIRs. Short-time coherence is employed to characterize modifications, including changes in wall absorption or the presence of a moving person in the room. A sensitivity rating is further used to quantify the magnitude of these changes. The results clearly differentiate between types of modifications -- atmospheric variation, changes in absorption, and human presence. The methods described provide a novel approach to analyzing and interpreting room acoustics, emphasizing RIR similarity and extracting information from temporal and spectral signal properties.
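Short-time coherence between two RIRs can be sketched by sliding a window along both responses and averaging the magnitude-squared coherence inside each window, yielding a time profile of similarity. The window lengths and the synthetic RIR below are illustrative choices, not the study's measurement setup.

```python
import numpy as np
from scipy.signal import coherence

def short_time_coherence(rir_a, rir_b, fs=48000, win=256, hop=128, nperseg=64):
    """Short-time coherence profile between two room impulse responses (sketch).

    Slides a window along both RIRs and computes the mean magnitude-squared
    coherence within each window, giving a time-resolved similarity measure.
    """
    profile = []
    for start in range(0, min(len(rir_a), len(rir_b)) - win + 1, hop):
        f, c = coherence(rir_a[start:start + win], rir_b[start:start + win],
                         fs=fs, nperseg=nperseg)
        profile.append(c.mean())
    return np.array(profile)

fs = 48000
t = np.arange(0, 0.2, 1 / fs)
rir = np.exp(-30 * t) * np.random.randn(len(t))              # synthetic decaying RIR
rir_changed = rir.copy()
rir_changed[4000:] += 0.2 * np.random.randn(len(t) - 4000)   # simulated late-field change
print(short_time_coherence(rir, rir_changed, fs)[:5])        # early windows stay near 1.0
```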
https://arxiv.org/abs/2501.01206