Existing single-modal RGB trackers often face performance bottlenecks in complex dynamic scenes, while the introduction of event sensors offers new potential for enhancing tracking capabilities. However, most current RGB-event fusion methods, primarily designed in the spatial domain using convolutional, Transformer, or Mamba architectures, fail to fully exploit the unique temporal response and high-frequency characteristics of event data. To address this, we propose FreqTrack, a frequency-aware RGBE tracking framework that establishes complementary inter-modal correlations through frequency-domain transformations for more robust feature fusion. We design a Spectral Enhancement Transformer (SET) layer that incorporates multi-head dynamic Fourier filtering to adaptively enhance and select frequency-domain features. Additionally, we develop a Wavelet Edge Refinement (WER) module, which leverages learnable wavelet transforms to explicitly extract multi-scale edge structures from event data, effectively improving modeling capability in high-speed and low-light scenarios. Extensive experiments on the COESOT and FE108 datasets demonstrate that FreqTrack achieves highly competitive performance, notably attaining a leading precision of 76.6% on the COESOT benchmark and validating the effectiveness of frequency-domain modeling for RGBE tracking.
https://arxiv.org/abs/2604.14526
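As a rough illustration of the frequency-domain feature selection idea described in the FreqTrack abstract above, the sketch below applies a learnable complex filter to feature maps via FFT in PyTorch; the class name, tensor shapes, and filter parameterization are assumptions, not the authors' SET layer.
```python
# A minimal sketch of frequency-domain feature filtering, assuming a fixed
# feature resolution; not the paper's implementation.
import torch
import torch.nn as nn

class FourierFilter(nn.Module):
    """Applies a learnable complex-valued filter to feature maps in the frequency domain."""
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # rfft2 keeps width // 2 + 1 frequency bins along the last axis
        self.weight = nn.Parameter(torch.randn(channels, height, width // 2 + 1, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) fused RGB-event features
        freq = torch.fft.rfft2(x, norm="ortho")            # complex spectrum
        filt = torch.view_as_complex(self.weight)          # learnable per-bin filter
        freq = freq * filt                                  # adaptive spectral selection
        return torch.fft.irfft2(freq, s=x.shape[-2:], norm="ortho")

feats = torch.randn(2, 64, 32, 32)
enhanced = FourierFilter(64, 32, 32)(feats)
print(enhanced.shape)  # torch.Size([2, 64, 32, 32])
```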
Semantic Multi-Object Tracking (SMOT) extends multi-object tracking with semantic outputs such as video summaries, instance-level captions, and interaction labels, aiming to move from trajectories to human-interpretable descriptions of dynamic scenes. Existing SMOT systems are trained end-to-end, coupling progress to expensive supervision and limiting their ability to adapt rapidly to new foundation models and new interactions. We propose TF-SMOT, a training-free SMOT pipeline that composes pretrained components for detection, mask-based tracking, and video-language generation. TF-SMOT combines D-FINE and the promptable SAM2 segmentation tracker to produce temporally consistent tracklets, uses contour grounding to generate video summaries and instance captions with InternVideo2.5, and aligns extracted interaction predicates to BenSMOT WordNet synsets via gloss-based semantic retrieval with LLM disambiguation. On BenSMOT, TF-SMOT achieves state-of-the-art tracking performance within the SMOT setting and improves summary and caption quality compared to prior art. Interaction recognition, however, remains challenging under strict exact-match evaluation on the fine-grained and long-tailed WordNet label space; our analysis and ablations indicate that semantic overlap and label granularity substantially affect measured performance.
https://arxiv.org/abs/2604.14074
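To make the gloss-based alignment step above concrete, here is a minimal sketch that retrieves WordNet verb synsets for an interaction predicate by crude token overlap with their glosses; the scoring and the example predicate are assumptions, and the paper's LLM disambiguation is not reproduced.
```python
# A minimal sketch of gloss-based WordNet retrieval; requires nltk with the
# "wordnet" corpus downloaded (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def retrieve_synsets(predicate: str, top_k: int = 3):
    query_tokens = set(predicate.lower().split())
    scored = []
    for synset in wn.synsets(predicate.split()[0], pos=wn.VERB):
        gloss_tokens = set(synset.definition().lower().split())
        overlap = len(query_tokens & gloss_tokens)   # crude gloss similarity
        scored.append((overlap, synset.name(), synset.definition()))
    return sorted(scored, reverse=True)[:top_k]

for score, name, gloss in retrieve_synsets("hold a cup"):
    print(score, name, gloss)
```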
3D Single Object Tracking (3D-SOT) aims to localize a target object across a sequence of LiDAR point clouds, given its 3D bounding box in the first frame. Recent methods have adopted a memory-based approach to utilize previously observed features of the target object, but remain limited to only a few recent frames. This work reveals that their temporal capacity is fundamentally constrained to short-term context due to severe temporal feature inconsistency and excessive memory overhead. To this end, we propose a robust long-term 3D-SOT framework, ChronoTrack, which preserves temporal feature consistency while efficiently aggregating diverse target features via long-term memory. Based on a compact set of learnable memory tokens, ChronoTrack leverages long-term information through two complementary objectives: a temporal consistency loss and a memory cycle consistency loss. The former enforces feature alignment across frames, alleviating temporal drift and improving the reliability of the proposed long-term memory. In parallel, the latter encourages each token to encode diverse and discriminative target representations observed throughout the sequence via memory-point-memory cyclic walks. As a result, ChronoTrack achieves new state-of-the-art performance on multiple 3D-SOT benchmarks, demonstrating its effectiveness in long-term target modeling with compact memory while running at a real-time speed of 42 FPS on a single RTX 4090 GPU. The code is available at this https URL
https://arxiv.org/abs/2604.13789
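A small sketch of a temporal feature-consistency objective of the kind the ChronoTrack abstract describes, pulling target descriptors from adjacent frames toward each other; the cosine formulation is an assumption rather than the paper's exact loss.
```python
# A minimal sketch of a temporal consistency loss over per-frame target features.
import torch
import torch.nn.functional as F

def temporal_consistency_loss(feats: torch.Tensor) -> torch.Tensor:
    # feats: (T, D) -- one target descriptor per frame over a short clip
    prev, curr = feats[:-1], feats[1:]
    cos = F.cosine_similarity(prev, curr, dim=-1)   # alignment of adjacent frames
    return (1.0 - cos).mean()                       # 0 when perfectly aligned

feats = torch.randn(8, 256, requires_grad=True)
loss = temporal_consistency_loss(feats)
loss.backward()
print(float(loss))
```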
A central challenge of 3D multi-object tracking (3D MOT) is achieving robustness in real-world applications, for example under adverse conditions and when maintaining consistency as distance increases. To overcome these challenges, sensor fusion approaches that combine LiDAR, cameras, and radar have emerged. However, existing multi-modal fusion methods usually treat radar as another learned feature inside the network. When the overall model degrades in difficult environmental conditions, the robustness advantages that radar could provide are also reduced. We propose RadarMOT, a radar-informed 3D MOT framework that explicitly uses radar point cloud data as an additional observation to refine state estimation and recover detector misses at long ranges. Evaluations on the MAN-TruckScenes dataset show that RadarMOT consistently improves Average Multi-Object Tracking Accuracy (AMOTA) by an absolute 12.7% at long range and 10.3% in adverse weather. The code will be available at this https URL
https://arxiv.org/abs/2604.13571
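The sketch below illustrates one way radar points could serve as an additional observation to recover detector misses, by gating radar returns around a track's predicted position; the gate radius and centroid pseudo-measurement are assumptions, not the paper's update rule.
```python
# A minimal sketch of radar-based miss recovery via gating around a prediction.
import numpy as np

def recover_from_radar(predicted_xy: np.ndarray, radar_xy: np.ndarray,
                       gate_radius: float = 3.0):
    """Return a pseudo-measurement from radar points near the predicted position, or None."""
    dists = np.linalg.norm(radar_xy - predicted_xy, axis=1)
    inside = radar_xy[dists < gate_radius]
    if len(inside) == 0:
        return None                       # no radar support: keep coasting the track
    return inside.mean(axis=0)            # centroid of gated radar returns

track_pred = np.array([52.0, 4.0])                     # long-range track prediction (m)
radar_points = np.array([[51.2, 4.3], [52.5, 3.8], [80.0, -10.0]])
print(recover_from_radar(track_pred, radar_points))    # centroid of the two nearby points
```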
Existing Vision Mamba-based RGB-Event (RGBE) tracking methods suffer from using static state transition matrices, which fail to adapt to variations in event sparsity. This rigidity leads to imbalanced modeling (underfitting sparse event streams and overfitting dense ones), thus degrading cross-modal fusion robustness. To address these limitations, we propose MambaTrack, a multimodal and efficient tracking framework built upon a Dynamic State Space Model (DSSM). Our contributions are twofold. First, we introduce an event-adaptive state transition mechanism that dynamically modulates the state transition matrix based on event stream density. A learnable scalar governs the state evolution rate, enabling differentiated modeling of sparse and dense event flows. Second, we develop a Gated Projection Fusion (GPF) module for robust cross-modal integration. This module projects RGB features into the event feature space and generates adaptive gates from event density and RGB confidence scores. These gates precisely control the fusion intensity, suppressing noise while preserving complementary information. Experiments show that MambaTrack achieves state-of-the-art performance on the FE108 and FELT datasets. Its lightweight design suggests potential for real-time embedded deployment.
https://arxiv.org/abs/2604.13426
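As a hedged illustration of an event-density-adaptive state transition, the sketch below rescales the discretization step of a diagonal state space model with a learnable scalar times the event density; shapes, the sequential scan, and the modulation form are assumptions rather than the MambaTrack implementation.
```python
# A minimal sketch of a density-modulated diagonal SSM step.
import torch
import torch.nn as nn

class DensityAdaptiveSSM(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(dim))   # diagonal state matrix, log-parameterized
        self.alpha = nn.Parameter(torch.tensor(1.0))  # learnable rate scalar
        self.proj = nn.Linear(dim, dim, bias=False)   # input projection (B-matrix stand-in)

    def forward(self, x: torch.Tensor, density: torch.Tensor) -> torch.Tensor:
        # x: (B, L, D) event tokens; density: (B,) fraction of active pixels in [0, 1]
        dt = self.alpha * density.clamp(min=1e-3).view(-1, 1)   # per-sequence step size
        a_bar = torch.exp(-torch.exp(self.log_a) * dt)          # (B, D) per-step decay
        h = x.new_zeros(x.size(0), x.size(2))
        outputs = []
        for t in range(x.size(1)):                              # plain sequential scan
            h = a_bar * h + self.proj(x[:, t])
            outputs.append(h)
        return torch.stack(outputs, dim=1)

y = DensityAdaptiveSSM(32)(torch.randn(2, 16, 32), torch.tensor([0.05, 0.8]))
print(y.shape)  # torch.Size([2, 16, 32])
```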
Motion reasoning serves as the cornerstone of multi-object tracking (MOT), as it enables consistent association of targets across frames. However, existing motion estimation approaches face two major limitations: (1) instability caused by noisy or probabilistic predictions, and (2) vulnerability under occlusion, where trajectories often fragment once visual cues disappear. To overcome these issues, we propose a collaborative reasoning framework that enhances motion estimation through joint inference among multiple correlated objects. By allowing objects with similar motion states to mutually constrain and refine each other, our framework stabilizes noisy trajectories and infers plausible motion continuity even when the target is occluded. To realize this concept, we design HyperSSM, an architecture that integrates Hypergraph computation and a State Space Model (SSM) for unified spatial-temporal reasoning. The Hypergraph module captures spatial motion correlations through dynamic hyperedges, while the SSM enforces temporal smoothness via structured state transitions. This synergistic design enables simultaneous optimization of spatial consensus and temporal coherence, resulting in robust and stable motion estimation. Extensive experiments on four mainstream and diverse benchmarks (MOT17, MOT20, DanceTrack, and SportsMOT), covering various motion patterns and scene complexities, demonstrate that our approach achieves state-of-the-art performance across a wide range of tracking scenarios.
https://arxiv.org/abs/2604.12665
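A toy sketch of the collaborative-reasoning intuition above: objects whose velocities are similar form a group (a crude stand-in for a dynamic hyperedge) and each member is smoothed toward the group consensus; the threshold and blending weight are illustrative assumptions.
```python
# A minimal sketch of mutual refinement among objects with similar motion states.
import numpy as np

def refine_velocities(velocities: np.ndarray, sim_thresh: float = 1.0,
                      blend: float = 0.5) -> np.ndarray:
    refined = velocities.copy()
    for i, v in enumerate(velocities):
        dists = np.linalg.norm(velocities - v, axis=1)
        group = velocities[dists < sim_thresh]          # objects with similar motion
        if len(group) > 1:
            refined[i] = (1 - blend) * v + blend * group.mean(axis=0)
    return refined

vels = np.array([[1.0, 0.1], [1.1, 0.0], [0.9, 0.2], [-3.0, 4.0]])  # last one moves differently
print(refine_velocities(vels))
```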
Referring multi-object tracking (RMOT) is the task of associating all the objects in a video that semantically match given textual queries or referring expressions. Existing RMOT approaches decompose object grounding and tracking into separate modules and exhibit limited performance due to the scarcity of training videos, ambiguous annotations, and restricted domains. In this work, we introduce STORM, an end-to-end MLLM that jointly performs grounding and tracking within a unified framework, eliminating external detectors and enabling coherent reasoning over appearance, motion, and language. To improve data efficiency, we propose a task-composition learning (TCL) strategy that decomposes RMOT into image grounding and object tracking, allowing STORM to leverage data-rich sub-tasks and learn structured spatial-temporal reasoning. We further construct STORM-Bench, a new RMOT dataset with accurate trajectories and diverse, unambiguous referring expressions generated through a bottom-up annotation pipeline. Extensive experiments show that STORM achieves state-of-the-art performance on image grounding, single-object tracking, and RMOT benchmarks, demonstrating strong generalization and robust spatial-temporal grounding in complex real-world scenarios. STORM-Bench is released at this https URL.
https://arxiv.org/abs/2604.10527
We present Point2Pose, a model-free method for causal 6D pose tracking of multiple rigid objects from monocular RGB-D video. Initialized only from sparse image points on the objects to be tracked, our approach tracks multiple unseen objects without requiring object CAD models or category priors. Point2Pose leverages a 2D point tracker to obtain long-range correspondences, enabling instant recovery after complete occlusion. Simultaneously, the system incrementally reconstructs an online Truncated Signed Distance Function (TSDF) representation of the tracked targets. Alongside the method, we introduce a new multi-object tracking dataset comprising both simulation and real-world sequences, with motion-capture ground truth for evaluation. Experiments show that Point2Pose achieves performance comparable to the state-of-the-art methods on a severe-occlusion benchmark, while additionally supporting multi-object tracking and recovery from complete occlusion, capabilities that are not supported by previous model-free tracking approaches.
https://arxiv.org/abs/2604.10415
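Correspondence-driven pose tracking ultimately needs a rigid transform from matched 3D points; the sketch below shows the generic Kabsch/SVD solution for that sub-step, not the Point2Pose pipeline itself.
```python
# A minimal sketch of recovering a rigid transform from 3D-3D correspondences.
import numpy as np

def rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Least-squares R, t with dst ~= R @ src + t; src, dst are (N, 3)."""
    src_c, dst_c = src.mean(0), dst.mean(0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # fix a possible reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

src = np.random.rand(20, 3)
theta = np.deg2rad(30)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
dst = src @ R_true.T + np.array([0.1, -0.2, 0.3])
R_est, t_est = rigid_transform(src, dst)
print(np.allclose(R_est, R_true, atol=1e-6))  # True
```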
The disintegration of liquid sheets into ligaments and droplets involves highly transient, multi-scale dynamics that are difficult to quantify from high-speed shadowgraphy images. Identifying droplets, ligaments, and blobs formed during breakup, along with tracking across frames, is essential for spray analysis. However, conventional multi-object tracking frameworks impose strict one-to-one temporal associations and cannot represent one-to-many fragmentation events. In this study, we present a two-stage deep learning framework for object detection and temporal relationship modeling across frames. The framework captures ligament deformation, fragmentation, and parent-child lineage during liquid sheet disintegration. In the first stage, a Faster R-CNN with a ResNet-50 backbone and Feature Pyramid Network detects and classifies ligaments and droplets in high-speed shadowgraphy recordings of an impinging Carbopol gel jet. A morphology-preserving synthetic data generation strategy augments the training set without introducing physically implausible configurations, achieving a held-out F1 score of up to 0.872 across fourteen original-to-synthetic configurations. In the second stage, a Transformer-augmented multilayer perceptron classifies inter-frame associations into continuation, fragmentation (one-to-many), and non-association using physics-informed geometric features. Despite severe class imbalance, the model achieves 86.1% accuracy, 93.2% precision, and perfect recall (1.00) for fragmentation events. Together, the framework enables automated reconstruction of fragmentation trees, preservation of parent-child lineage, and extraction of breakup statistics such as fragment multiplicity and droplet size distributions. By explicitly identifying child droplets formed from ligament fragmentation, the framework provides automated analysis of the primary atomization mode.
https://arxiv.org/abs/2604.08711
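To illustrate what physics-informed geometric features for an inter-frame association candidate might look like, the sketch below computes area ratio, centroid distance, and IoU for a parent-child box pair; the specific feature set is an assumption, not the paper's.
```python
# A minimal sketch of geometric features for a candidate parent-child pair.
import numpy as np

def pair_features(box_t, box_t1):
    """Boxes are (x1, y1, x2, y2) in pixels for frame t and frame t+1."""
    def area(b): return max(b[2] - b[0], 0) * max(b[3] - b[1], 0)
    def centroid(b): return np.array([(b[0] + b[2]) / 2, (b[1] + b[3]) / 2])
    ix1, iy1 = max(box_t[0], box_t1[0]), max(box_t[1], box_t1[1])
    ix2, iy2 = min(box_t[2], box_t1[2]), min(box_t[3], box_t1[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    return {
        "area_ratio": area(box_t1) / (area(box_t) + 1e-6),   # < 1 hints at fragmentation
        "centroid_dist": float(np.linalg.norm(centroid(box_t1) - centroid(box_t))),
        "iou": inter / (area(box_t) + area(box_t1) - inter + 1e-6),
    }

print(pair_features((10, 10, 50, 90), (12, 15, 40, 60)))
```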
Physical adversarial attacks are increasingly studied in settings that resemble deployed surveillance systems rather than isolated image benchmarks. In these settings, person detection, multi-object tracking, visible-infrared sensing, and the practical form of the attack carrier all matter at once. This changes how the literature should be read. A perturbation that suppresses a detector in one frame may have limited practical effect if identity is recovered over time; an RGB-only result may say little about night-time systems that rely on visible and thermal inputs together; and a conspicuous patch can imply a different threat model from a wearable or selectively activated carrier. This paper reviews physical attacks from that surveillance-oriented viewpoint. Rather than attempting a complete catalogue of all physical attacks in computer vision, we focus on the technical questions that become central in surveillance: temporal persistence, sensing modality, carrier realism, and system-level objective. We organize prior work through a four-part taxonomy and discuss how recent results on multi-object tracking, dual-modal visible-infrared evasion, and controllable clothing reflect a broader change in the field. We also summarize evaluation practices and unresolved gaps, including distance robustness, camera-pipeline variation, identity-level metrics, and activation-aware testing. The resulting picture is that surveillance robustness cannot be judged reliably from isolated per-frame benchmarks alone; it has to be examined as a system problem unfolding over time, across sensors, and under realistic physical deployment constraints.
https://arxiv.org/abs/2604.06865
Recent advancements in computer vision, driven by artificial intelligence, have significantly enhanced monitoring systems. One notable application is traffic monitoring, which leverages computer vision alongside deep learning-based object detection and counting. We present an offline, real-time traffic monitoring system that couples a pre-trained YOLOv11 detector with BoT-SORT/ByteTrack for multi-object tracking, implemented in PyTorch/OpenCV and wrapped in a Qt-based desktop UI. The CNN pipeline enables efficient vehicle detection and counting from video streams without cloud dependencies. Across diverse scenes, the system achieves 66.67-95.83% counting accuracy. Class-wise detection yields high precision (cars: 0.97-1.00; trucks: 1.00) with strong recall (cars: 0.82-1.00; trucks: 0.70-1.00), resulting in F1 scores of 0.90-1.00 for cars and 0.82-1.00 for trucks. While adverse weather conditions may negatively impact this performance, results remain robust in typical conditions. By integrating lightweight models with an accessible, cloud-independent interface, this paper contributes to the modernization and development of future smart cities by showing the capacity of AI-driven traffic monitoring systems.
https://arxiv.org/abs/2604.04080
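A minimal sketch of the detect-track-count loop described above using the Ultralytics tracking API with a ByteTrack configuration; the weights file, video path, counting line, and result fields are assumptions to verify against the installed Ultralytics version.
```python
# A hedged sketch of tracking-based vehicle counting with a YOLO detector
# and ByteTrack; file names and the counting line are hypothetical.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")                       # assumed pretrained detector weights
counted_ids = set()
line_y = 400                                     # hypothetical counting line (pixels)

for result in model.track(source="traffic.mp4", tracker="bytetrack.yaml",
                          stream=True, persist=True):
    if result.boxes.id is None:
        continue
    for box, track_id in zip(result.boxes.xyxy, result.boxes.id.int().tolist()):
        _, y1, _, y2 = box.tolist()
        if (y1 + y2) / 2 > line_y and track_id not in counted_ids:
            counted_ids.add(track_id)            # count each track once as it crosses the line

print("vehicles counted:", len(counted_ids))
```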
Adaptive robots in dynamic production environments require robust perception capabilities, including 6D pose estimation and multi-object tracking. To address limitations in real-world data dependency, noise robustness, and spatiotemporal consistency, we propose a LiDAR framework based on the Robot Operating System that integrates a synthetic-data-trained Transformation-Equivariant 3D Detection with multi-object tracking leveraging center poses. Validated across 72 scenarios with motion capture technology, the framework yields an Intersection over Union of 62.6% for standalone pose estimation, rising to 83.12% with multi-object tracking integration. Our LiDAR-based framework achieves a Higher Order Tracking Accuracy of 91.12%, advancing the robustness and versatility of LiDAR-based perception systems for industrial mobile manipulators.
https://arxiv.org/abs/2604.02109
Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, yet existing datasets rarely capture real-world indoor complexity at scale. We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four campus locations (ACS-EC, ACS-EG, IE-Central, R-Central). It comprises 31 videos (9,913 frames at 5 fps) with human-verified, per-instance segmentation masks. A 620-frame control subset benchmarks three foundation-model auto-annotators (SAM3, GroundingSAM, and EfficientGroundingSAM) against human labels using Cohen's kappa, AP, precision, recall, and mask IoU. A further 2,552-frame subset supports multi-object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. Per-scene analysis reveals substantial difficulty variation driven by crowd density, scale, and occlusion: ACS-EC, with 79.3% dense frames and a mean instance scale of 60.8 px, is the most challenging scene. The project page is available at this https URL.
https://arxiv.org/abs/2604.02032
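For the agreement evaluation mentioned above, the sketch below computes Cohen's kappa between hypothetical auto-annotator decisions and human labels with scikit-learn; the binary label encoding is an assumption.
```python
# A minimal sketch of annotator-agreement scoring with Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

human      = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # human-verified per-instance decisions (toy data)
auto_model = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]   # e.g. an auto-annotator on the control subset

print("Cohen's kappa:", round(cohen_kappa_score(human, auto_model), 3))
```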
Associating measurements with tracks is a crucial step in Multi-Object Tracking (MOT) to guarantee the safety of autonomous vehicles. To manage the exponentially growing number of track hypotheses, truncation becomes necessary. In the $\delta$-Generalized Labeled Multi-Bernoulli ($\delta$-GLMB) filter, this truncation typically involves the ranked assignment problem, solved by Murty's algorithm or the Gibbs sampling approach, which suffer from limitations in complexity and accuracy, respectively. Motivated by these limitations, this paper addresses the ranked assignment problem arising from data association tasks with an approach that employs Graph Neural Networks (GNNs). The proposed Ranked Assignment Prediction Graph Neural Network (RAPNet) uses bipartite graphs to model the problem, harnessing the computational capabilities of deep learning. The concluding evaluation compares RAPNet with Murty's algorithm and the Gibbs sampler, showing accuracy improvements over the Gibbs sampler.
https://arxiv.org/abs/2604.01696
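For context on the ranked assignment problem discussed above, the sketch below solves the single-best measurement-to-track assignment with SciPy's Hungarian solver; Murty's algorithm (and the proposed GNN) target the k best assignments instead, and the cost matrix here is illustrative.
```python
# A minimal sketch of the best single assignment underlying ranked assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j]: assumed negative log-likelihood of assigning measurement j to track i
cost = np.array([[0.2, 1.5, 3.0],
                 [1.8, 0.4, 2.2],
                 [2.5, 2.0, 0.6]])
rows, cols = linear_sum_assignment(cost)
print(list(zip(rows.tolist(), cols.tolist())), "total cost:", cost[rows, cols].sum())
```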
Recent Tracking-by-Query-Propagation (TBP) methods have advanced Multi-Object Tracking (MOT) by enabling end-to-end (E2E) pipelines with long-range temporal modeling. However, this reliance on query propagation introduces unexplored architectural vulnerabilities to adversarial attacks. We present FADE, a novel attack framework designed to exploit these specific vulnerabilities. FADE employs two attack strategies targeting core TBP mechanisms: (i) Temporal Query Flooding: Generates spurious temporally consistent track queries to exhaust the tracker's limited query budget, forcing it to terminate valid tracks. (ii) Temporal Memory Corruption: Directly attacks the query updater's memory by severing temporal links via state de-correlation and erasing the learned feature identity of matched tracks. Furthermore, we introduce a differentiable pipeline to optimize these attacks for physical-world realizability by leveraging simulations of advanced perception sensor spoofing. Experiments on MOT17 and MOT20 benchmarks demonstrate that FADE is highly effective against state-of-the-art TBP trackers, causing significant identity switches and track terminations.
https://arxiv.org/abs/2604.00452
Robust visual object tracking (VOT) remains challenging in high-speed motion scenarios, where conventional RGB sensors suffer from severe motion blur and performance degradation. Event cameras, with microsecond temporal resolution and high dynamic range, provide complementary structural cues that can potentially compensate for these limitations. However, existing RGB-Event fusion methods typically treat event data as dense intensity representations and adopt black-box fusion strategies, failing to explicitly leverage the directional geometric priors inherently encoded in event streams to rectify degraded RGB features. To address this limitation, we propose SOR-Track, a streamlined framework for robust RGB-Event tracking based on Spatial Orthogonal Refinement (SOR). The core SOR module employs a set of orthogonal directional filters that are dynamically guided by local motion orientations to extract sharp and motion-consistent structural responses from event streams. These responses serve as geometric anchors to modulate and refine aliased RGB textures through an asymmetric structural modulation mechanism, thereby explicitly bridging structural discrepancies between the two modalities. Extensive experiments on the large-scale FE108 benchmark demonstrate that SOR-Track consistently outperforms existing fusion-based trackers, particularly under motion blur and low-light conditions. Despite its simplicity, the proposed method offers a principled and physics-grounded approach to multi-modal feature alignment and texture rectification. The source code of this paper will be released at this https URL
https://arxiv.org/abs/2603.27913
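As a rough illustration of orientation-steered structural responses from event data, the sketch below rotates a Sobel derivative pair to a given motion orientation; this is an assumed stand-in, not the SOR module.
```python
# A minimal sketch of orthogonal directional responses from an event frame.
import math
import torch
import torch.nn.functional as F

def directional_responses(event_frame: torch.Tensor, theta: float):
    # event_frame: (1, 1, H, W) accumulated event frame; theta: local motion orientation (radians)
    gx = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]]])  # Sobel x kernel
    gy = gx.transpose(-1, -2)                                             # Sobel y kernel
    dx = F.conv2d(event_frame, gx, padding=1)
    dy = F.conv2d(event_frame, gy, padding=1)
    along = math.cos(theta) * dx + math.sin(theta) * dy        # response along the motion
    across = -math.sin(theta) * dx + math.cos(theta) * dy      # orthogonal response
    return along, across

ev = torch.rand(1, 1, 64, 64)
a, c = directional_responses(ev, theta=0.5)
print(a.shape, c.shape)
```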
Accurate observation of dynamic environments traditionally relies on synthesizing raw, signal-level information from multiple distributed sensors. This work investigates an alternative approach: performing geospatial inference using only encrypted packet-level information, without access to the raw sensory data. We further explore how this indirect information can be fused with directly available sensory data to extend overall inference capabilities. We introduce GraySense, a learning-based framework that performs geospatial object tracking by analyzing encrypted wireless video transmission traffic, such as packet sizes, from cameras with inaccessible streams. GraySense leverages the inherent relationship between scene dynamics and transmitted packet sizes to infer object motion. The framework consists of two stages: (1) a Packet Grouping module that identifies frame boundaries and estimates frame sizes from encrypted network traffic, and (2) a Tracker module, based on a Transformer encoder with a recurrent state, which fuses indirect packet-based inputs with optional direct camera-based inputs to estimate the object's position. Extensive experiments with realistic videos from the CARLA simulator and emulated networks under varying conditions show that GraySense achieves a tracking error of 2.33 meters (Euclidean distance) without raw signal access, within the dimensions of the tracked objects (4.61 m x 1.93 m). To our knowledge, this capability has not been previously demonstrated, expanding the use of latent signals for sensing.
https://arxiv.org/abs/2603.27811
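The packet-grouping stage above can be illustrated with a simple heuristic: split packets into frames at large inter-arrival gaps and sum the per-frame bytes; the gap threshold and toy trace below are assumptions.
```python
# A minimal sketch of grouping encrypted packets into video frames by timing gaps.
def group_packets(timestamps, sizes, gap_threshold=0.008):
    """timestamps in seconds, sizes in bytes; returns one total size per inferred frame."""
    frames, current = [], [sizes[0]]
    for t_prev, t_cur, size in zip(timestamps, timestamps[1:], sizes[1:]):
        if t_cur - t_prev > gap_threshold:   # gap larger than intra-frame spacing: new frame
            frames.append(sum(current))
            current = []
        current.append(size)
    frames.append(sum(current))
    return frames

ts    = [0.000, 0.001, 0.002, 0.040, 0.041, 0.080, 0.081, 0.082]
sizes = [1400, 1400, 600, 1400, 900, 1400, 1400, 200]
print(group_packets(ts, sizes))   # -> [3400, 2300, 3000]
```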
Event-based cameras capture visual information as asynchronous streams of per-pixel brightness changes, generating sparse, temporally precise data. Compared to conventional frame-based sensors, they offer significant advantages in capturing high-speed dynamics while consuming substantially less power. Predicting future event representations from past observations is an important problem, enabling downstream tasks such as future semantic segmentation or object tracking without requiring access to future sensor measurements. While recent state-of-the-art approaches achieve strong performance, they often rely on computationally heavy backbones and, in some cases, large-scale pretraining, limiting their applicability in resource-constrained scenarios. In this work, we introduce E-TIDE, a lightweight, end-to-end trainable architecture for event-tensor prediction that is designed to operate efficiently without large-scale pretraining. Our approach employs the TIDE module (Temporal Interaction for Dynamic Events), motivated by efficient spatiotemporal interaction design for sparse event tensors, to capture temporal dependencies via large-kernel mixing and activity-aware gating while maintaining low computational complexity. Experiments on standard event-based datasets demonstrate that our method achieves competitive performance with significantly reduced model size and training requirements, making it well-suited for real-time deployment under tight latency and memory budgets.
https://arxiv.org/abs/2603.27757
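A small sketch of large-kernel depthwise mixing combined with an activity-aware gate, the two ingredients named in the E-TIDE abstract; the kernel size and gating form are assumptions rather than the TIDE module.
```python
# A minimal sketch of large-kernel mixing with activity-aware gating.
import torch
import torch.nn as nn

class LargeKernelGatedMixer(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 13):
        super().__init__()
        self.mix = nn.Conv2d(channels, channels, kernel_size,
                             padding=kernel_size // 2, groups=channels)  # depthwise, large kernel
        self.gate = nn.Sequential(nn.Conv2d(1, channels, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        activity = x.abs().mean(dim=1, keepdim=True)   # per-pixel event activity
        return self.mix(x) * self.gate(activity)       # suppress inactive regions

y = LargeKernelGatedMixer(16)(torch.randn(2, 16, 64, 64))
print(y.shape)  # torch.Size([2, 16, 64, 64])
```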
Panoramic multi-object tracking is important for industrial safety monitoring, wide-area robotic perception, and infrastructure-light deployment in large workspaces. In these settings, the sensing system must provide full-surround coverage, metric geometric cues, and stable target association under wide field-of-view distortion and occlusion. Existing image-plane trackers are tightly coupled to the camera projection and become unreliable in panoramic imagery, while conventional Euclidean 3D formulations introduce redundant directional parameters and do not naturally unify angular, scale, and depth estimation. In this paper, we present $\mathbf{S^3KF}$, a panoramic 3D multi-object tracking framework built on a motorized rotating LiDAR and a quad-fisheye camera rig. The key idea is a geometry-consistent state representation on the unit sphere $\mathbb{S}^2$, where object bearing is modeled by a two-degree-of-freedom tangent-plane parameterization and jointly estimated with box scale and depth dynamics. Based on this state, we derive an extended spherical Kalman filtering pipeline that fuses panoramic camera detections with LiDAR depth observations for multimodal tracking. We further establish a map-based ground-truth generation pipeline using wearable localization devices registered to a shared global LiDAR map, enabling quantitative evaluation without motion-capture infrastructure. Experiments on self-collected real-world sequences show decimeter-level planar tracking accuracy, improved identity continuity over a 2D panoramic baseline in dynamic scenes, and real-time onboard operation on a Jetson AGX Orin platform. These results indicate that the proposed framework is a practical solution for panoramic perception and industrial-scale multi-object tracking. The project page can be found at this https URL.
https://arxiv.org/abs/2603.27534
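To make the two-degree-of-freedom tangent-plane parameterization above concrete, the sketch below maps a small tangent-plane offset at a reference bearing back onto the unit sphere with the exponential map; the basis construction is an assumption, and the spherical Kalman filter itself is not reproduced.
```python
# A minimal sketch of a tangent-plane parameterization of bearing on the unit sphere.
import numpy as np

def tangent_basis(mu: np.ndarray):
    """Two orthonormal vectors spanning the tangent plane at unit vector mu."""
    helper = np.array([0.0, 0.0, 1.0]) if abs(mu[2]) < 0.9 else np.array([1.0, 0.0, 0.0])
    e1 = np.cross(mu, helper); e1 /= np.linalg.norm(e1)
    e2 = np.cross(mu, e1)
    return e1, e2

def exp_map(mu: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Map a 2D tangent-plane offset delta back onto the unit sphere."""
    e1, e2 = tangent_basis(mu)
    v = delta[0] * e1 + delta[1] * e2
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return mu
    return np.cos(norm) * mu + np.sin(norm) * (v / norm)

mu = np.array([1.0, 0.0, 0.0])                   # current bearing estimate
new_bearing = exp_map(mu, np.array([0.05, -0.02]))
print(new_bearing, np.linalg.norm(new_bearing))  # still unit length
```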
Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer that further improves upon the baseline's speed on edge and desktop GPUs. We demonstrate LowFormer's wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at this https URL.
https://arxiv.org/abs/2603.26551
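The MACs-versus-latency argument above rests on measuring wall-clock execution time; the sketch below shows a simple CPU latency benchmark for a torchvision backbone, with the model choice, warmup, and repetition counts as illustrative assumptions.
```python
# A minimal sketch of measuring per-image inference latency for a backbone.
import time
import torch
import torchvision

model = torchvision.models.mobilenet_v3_small().eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    for _ in range(10):                 # warmup iterations
        model(x)
    start = time.perf_counter()
    for _ in range(50):
        model(x)
    latency_ms = (time.perf_counter() - start) / 50 * 1000

print(f"mean latency: {latency_ms:.2f} ms per image")
```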