We present SpatialTrackerV2, a feed-forward 3D point tracking method for monocular videos. Going beyond modular 3D tracking pipelines built on off-the-shelf components, our approach unifies the intrinsic connections between point tracking, monocular depth, and camera pose estimation into a single high-performing, feed-forward 3D point tracker. It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion, with a fully differentiable, end-to-end architecture that allows scalable training across a wide range of datasets, including synthetic sequences, posed RGB-D videos, and unlabeled in-the-wild footage. By learning geometry and motion jointly from such heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30% and matches the accuracy of leading dynamic 3D reconstruction approaches while running 50$\times$ faster.
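As a concrete illustration of the decomposition described above, the sketch below (our own illustration, not the authors' code; the helper names, array shapes, and the additive object-motion residual are assumptions) composes the world-space 3D trajectory of one tracked point from per-frame depth (scene geometry), camera ego-motion, and a pixel-wise object-motion term.

```python
import numpy as np

def backproject(uv, depth, K):
    """Lift a pixel (u, v) with its depth into camera-space 3D coordinates."""
    u, v = uv
    d = depth[int(v), int(u)]
    x = (u - K[0, 2]) / K[0, 0] * d
    y = (v - K[1, 2]) / K[1, 1] * d
    return np.array([x, y, d])

def world_track(track_2d, depths, cam_to_world, K, object_motion):
    """Compose a world-space 3D trajectory for one tracked point.

    track_2d:      (T, 2) pixel positions over time
    depths:        list of T depth maps (scene geometry)
    cam_to_world:  list of T 4x4 camera poses (ego-motion)
    object_motion: (T, 3) residual per-point motion in world space
                   (zero for points on the static scene)
    """
    points = []
    for t, uv in enumerate(track_2d):
        p_cam = backproject(uv, depths[t], K)
        R, trans = cam_to_world[t][:3, :3], cam_to_world[t][:3, 3]
        points.append(R @ p_cam + trans + object_motion[t])
    return np.stack(points)  # (T, 3) world-space trajectory
```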
https://arxiv.org/abs/2507.12462
Calisthenics skill classification is the computer vision task of inferring the skill performed by an athlete from images, enabling automatic performance assessment and personalized analytics. Traditional approaches to calisthenics skill recognition rely on human pose estimation to extract skeletal keypoints from images, which are then fed to a classification algorithm to infer the performed skill. Despite progress in human pose estimation, such algorithms still involve high computational costs, long inference times, and complex setups, which limits their applicability in real-time applications or on mobile devices. This work proposes a direct approach to calisthenics skill recognition that leverages depth estimation and athlete patch retrieval to avoid the computationally expensive human pose estimation module. Using Depth Anything V2 for depth estimation and YOLOv10 for athlete localization, we segment the subject from the background rather than relying on traditional pose estimation techniques. This strategy increases efficiency, reduces inference time, and improves classification accuracy. Our approach significantly outperforms skeleton-based methods, achieving 38.3x faster inference with RGB image patches and improved classification accuracy with depth patches (0.837 vs. 0.815). Beyond these performance gains, the modular design of our pipeline allows components to be replaced flexibly, enabling future enhancements and adaptation to real-world applications.
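A minimal sketch of the patch-based pipeline described above; `detect_athlete`, `estimate_depth`, and `classifier` are hypothetical stand-ins for the YOLOv10 detector, Depth Anything V2, and the skill classifier rather than their real APIs.

```python
import cv2
import numpy as np

def classify_skill(frame, detect_athlete, estimate_depth, classifier,
                   patch_size=(224, 224)):
    """Localize the athlete, crop the corresponding depth patch, and classify
    the skill directly from the patch, skipping pose estimation entirely."""
    x1, y1, x2, y2 = detect_athlete(frame)        # athlete bounding box (pixels)
    depth = estimate_depth(frame)                 # dense per-pixel depth map
    patch = depth[y1:y2, x1:x2]                   # athlete-centred depth patch
    patch = cv2.resize(patch, patch_size)         # fixed-size classifier input
    patch = patch[None, ..., None].astype(np.float32)
    return classifier(patch)                      # predicted skill label
```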
https://arxiv.org/abs/2507.12292
Accurate 3D reconstruction of vehicles is vital for applications such as vehicle inspection, predictive maintenance, and urban planning. Existing methods like Neural Radiance Fields and Gaussian Splatting have shown impressive results but remain limited by their reliance on dense input views, which hinders real-world applicability. This paper addresses the challenge of reconstructing vehicles from sparse-view inputs, leveraging depth maps and a robust pose estimation architecture to synthesize novel views and augment training data. Specifically, we enhance Gaussian Splatting by integrating a selective photometric loss, applied only to high-confidence pixels, and replacing standard Structure-from-Motion pipelines with the DUSt3R architecture to improve camera pose estimation. Furthermore, we present a novel dataset featuring both synthetic and real-world public transportation vehicles, enabling extensive evaluation of our approach. Experimental results demonstrate state-of-the-art performance across multiple benchmarks, showcasing the method's ability to achieve high-quality reconstructions even under constrained input conditions.
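The selective photometric loss can be summarized with a short sketch like the one below (an illustrative PyTorch version; it assumes a per-pixel confidence map in [0, 1], e.g. exported by the DUSt3R front-end, and the paper's exact weighting and threshold may differ).

```python
import torch

def selective_photometric_loss(rendered, target, confidence, tau=0.5):
    """L1 photometric loss restricted to high-confidence pixels.

    rendered, target: (3, H, W) images; confidence: (H, W) map in [0, 1].
    Only pixels with confidence > tau contribute to the loss."""
    mask = (confidence > tau).float().unsqueeze(0)   # (1, H, W) validity mask
    diff = (rendered - target).abs() * mask          # masked residual
    n_valid = mask.sum() * rendered.shape[0]         # contributing entries
    return diff.sum() / n_valid.clamp(min=1.0)
```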
https://arxiv.org/abs/2507.12095
We propose SGLoc, a novel localization system that directly regresses camera poses from a 3D Gaussian Splatting (3DGS) representation by leveraging semantic information. Our method exploits the semantic relationship between the 2D image and the 3D scene representation to estimate the 6DoF pose without prior pose information. In this system, we introduce a multi-level pose regression strategy that progressively estimates and refines the pose of the query image from the global 3DGS map, without requiring initial pose priors. Moreover, we introduce a semantic-based global retrieval algorithm that establishes correspondences between the 2D image and the 3D (3DGS) map. By matching the scene semantic descriptors extracted from the 2D query image with the 3DGS semantic representation, we align the image with the corresponding local region of the global 3DGS map, thereby obtaining a coarse pose estimate. Subsequently, we refine the coarse pose by iteratively optimizing the difference between the query image and the image rendered from the 3DGS map. SGLoc demonstrates superior performance over baselines on the 12scenes and 7scenes datasets, showing excellent global localization capabilities without an initial pose prior. Code will be available at this https URL.
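The final render-and-compare refinement admits a compact sketch (ours, not SGLoc's code): starting from the coarse pose returned by semantic retrieval, a differentiable 3DGS renderer is compared against the query image and the pose is updated by gradient descent. `render_fn` and the pose parameterization are assumptions here.

```python
import torch

def refine_pose(query_img, render_fn, pose_init, iters=100, lr=1e-2):
    """Iteratively adjust the pose so that the 3DGS rendering matches the query.

    query_img: target image tensor; render_fn(pose): differentiable renderer
    stand-in; pose_init: coarse pose from the semantic retrieval stage."""
    pose = pose_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(iters):
        loss = (render_fn(pose) - query_img).abs().mean()  # photometric error
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pose.detach()
```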
https://arxiv.org/abs/2507.12027
Event-based sensors have emerged as a promising solution for addressing challenging conditions in pedestrian and traffic monitoring systems. Their low latency and high dynamic range allow for improved response times in safety-critical situations caused by distracted walking or other unusual movements. However, the availability of data covering such scenarios remains limited. To address this gap, we present SEPose, a comprehensive synthetic event-based human pose estimation dataset for fixed pedestrian perception, generated using dynamic vision sensors in the CARLA simulator. With nearly 350K pedestrians annotated with body pose keypoints from the perspective of fixed traffic cameras, SEPose is a synthetic multi-person pose estimation dataset that spans light to busy crowds and traffic across diverse lighting and weather conditions at four-way intersections in urban, suburban, and rural environments. We train existing state-of-the-art models such as RVT and YOLOv8 on our dataset and evaluate them on real event-based data to demonstrate the sim-to-real generalization capabilities of the proposed dataset.
https://arxiv.org/abs/2507.11910
Monocular pose estimation of non-cooperative spacecraft is significant for on-orbit servicing (OOS) tasks such as satellite maintenance, space debris removal, and station assembly. Given the high demands on pose estimation accuracy, mainstream monocular pose estimation methods typically consist of a keypoint detector and a PnP solver. However, current keypoint detectors remain vulnerable to the structural symmetry and partial occlusion of non-cooperative spacecraft. To this end, we propose GKNet, a graph-based keypoint network for the monocular pose estimation of non-cooperative spacecraft that leverages the geometric constraints of the keypoint graph. To better validate keypoint detectors, we present a moderate-scale dataset for spacecraft keypoint detection, named SKD, which consists of 3 spacecraft targets, 90,000 simulated images, and corresponding high-precision keypoint annotations. Extensive experiments and an ablation study demonstrate the high accuracy and effectiveness of GKNet compared to state-of-the-art spacecraft keypoint detectors. The code for GKNet and the SKD dataset is available at this https URL.
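For context, the keypoints-plus-PnP stage that GKNet feeds into can be sketched with OpenCV as below; this is the generic pipeline component, not GKNet itself, and the EPnP/RANSAC settings are illustrative choices.

```python
import cv2
import numpy as np

def estimate_pose(model_points, image_points, K, dist=None):
    """Recover the 6-DoF spacecraft pose from detected keypoints.

    model_points: (N, 3) keypoints in the spacecraft model frame
    image_points: (N, 2) corresponding 2D detections
    K:            3x3 camera intrinsic matrix"""
    dist = np.zeros(5) if dist is None else dist
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        model_points.astype(np.float64), image_points.astype(np.float64),
        K, dist, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)   # axis-angle to rotation matrix
    return R, tvec, inliers
```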
https://arxiv.org/abs/2507.11077
Marker-free human pose estimation (HPE) has found increasing applications in various fields. Current HPE suffers from occasional keypoint recognition errors and random fluctuations in keypoint trajectories when analyzing kinematic human poses. The performance of existing deep learning-based models for HPE refinement is considerably limited by inaccurate training datasets in which the keypoints are manually annotated. This paper proposes a novel method to overcome this difficulty through joint angle-based modeling. The key techniques include: (i) a joint angle-based model of human pose that robustly describes kinematic human poses; (ii) approximating the temporal variation of joint angles with high-order Fourier series to obtain reliable "ground truth"; (iii) a bidirectional recurrent network designed as a post-processing module to refine the estimates of the well-established HRNet. Trained on the high-quality dataset constructed with our method, the network demonstrates outstanding performance in correcting wrongly recognized joints and smoothing their spatiotemporal trajectories. Tests show that joint angle-based refinement (JAR) outperforms the state-of-the-art HPE refinement network in challenging cases such as figure skating and breaking.
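Step (ii) is essentially a least-squares fit of a truncated Fourier series to each joint-angle trajectory; a minimal sketch is given below (the series order, period handling, and interface are assumptions, not the paper's exact procedure).

```python
import numpy as np

def fit_fourier_series(t, theta, order=8, period=None):
    """Fit a truncated Fourier series to a joint-angle trajectory theta(t)
    and return a callable that evaluates the smoothed "ground truth".

    t, theta: 1D arrays of time stamps and joint angles."""
    T = period if period is not None else t[-1] - t[0]
    w = 2 * np.pi / T
    cols = [np.ones_like(t)]
    for k in range(1, order + 1):
        cols += [np.cos(k * w * t), np.sin(k * w * t)]
    A = np.stack(cols, axis=1)                     # design matrix
    coef, *_ = np.linalg.lstsq(A, theta, rcond=None)

    def smooth(tq):
        basis = [np.ones_like(tq)]
        for k in range(1, order + 1):
            basis += [np.cos(k * w * tq), np.sin(k * w * tq)]
        return np.stack(basis, axis=1) @ coef

    return smooth
```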
https://arxiv.org/abs/2507.11075
Autonomous driving systems depend heavily on sensors such as cameras, LiDAR, and inertial measurement units (IMUs) to perceive the environment and estimate their motion. Among these, perception-based sensors are not protected from harsh weather and technical failures. Although existing methods show robustness against common technical issues like rotational misalignment and disconnection, they often degrade when faced with dynamic environmental factors such as weather conditions. To address these problems, this research introduces a novel deep learning-based motion estimator that integrates visual, inertial, and millimeter-wave radar data, exploiting each sensor's strengths to improve odometry estimation accuracy and reliability under adverse environmental conditions such as snow, rain, and varying light. The proposed model uses advanced sensor fusion techniques that dynamically adjust the contribution of each sensor based on the current environmental conditions, with radar compensating for visual sensor limitations in poor visibility. This work explores recent advancements in radar-based odometry and highlights that radar's robustness across weather conditions makes it a valuable component of pose estimation systems, especially when visual sensors are degraded. Experimental results on the Boreas dataset showcase the robustness and effectiveness of the model in both clear and degraded environments.
https://arxiv.org/abs/2507.10376
Camera pose estimation is a fundamental computer vision task that is essential for applications like visual localization and multi-view stereo reconstruction. In object-centric scenarios with sparse inputs, the accuracy of pose estimation can be significantly influenced by background textures that occupy major portions of the images across different viewpoints. In light of this, we introduce the Kaleidoscopic Background Attack (KBA), which uses identical segments to form discs with multi-fold radial symmetry. These discs maintain high similarity across different viewpoints, enabling effective attacks on pose estimation models even with natural texture segments. Additionally, a projected orientation consistency loss is proposed to optimize the kaleidoscopic segments, leading to a significant enhancement in attack effectiveness. Experimental results show that optimized adversarial kaleidoscopic backgrounds can effectively attack various camera pose estimation models.
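The construction of a disc with multi-fold radial symmetry can be sketched as follows (an illustrative rendering of the idea only; `segment_fn`, the resolution, and the fold count are assumptions, and the actual attack additionally optimizes the segments).

```python
import numpy as np

def kaleidoscopic_disc(segment_fn, size=512, folds=8):
    """Build a disc texture with `folds`-fold radial symmetry by repeating
    one angular segment.

    segment_fn(r, phi): returns pixel values for radius r in [0, 1] and angle
    phi in [0, 2*pi/folds), e.g. sampled from a natural texture patch."""
    yy, xx = np.mgrid[0:size, 0:size]
    cx = cy = (size - 1) / 2.0
    r = np.hypot(xx - cx, yy - cy) / (size / 2.0)
    phi = np.mod(np.arctan2(yy - cy, xx - cx), 2 * np.pi)
    phi_seg = np.mod(phi, 2 * np.pi / folds)   # fold every angle into one wedge
    img = np.array(segment_fn(r, phi_seg))     # identical content in each wedge
    img[r > 1.0] = 0                           # keep only the disc
    return img
```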
https://arxiv.org/abs/2507.10265
Prosthetic legs play a pivotal role in clinical rehabilitation, allowing individuals with lower-limb amputations to regain mobility and improve their quality of life. Gait analysis is fundamental for optimizing prosthesis design and alignment, directly impacting the mobility and quality of life of individuals with lower-limb amputations. Vision-based machine learning (ML) methods offer a scalable and non-invasive solution to gait analysis, but they face challenges in correctly detecting and analyzing prostheses due to their unique appearance and novel movement patterns. In this paper, we aim to bridge this gap by introducing a multi-purpose dataset, ProGait, to support multiple vision tasks including Video Object Segmentation, 2D Human Pose Estimation, and Gait Analysis (GA). ProGait provides 412 video clips from four above-knee amputees recorded while testing multiple newly fitted prosthetic legs in walking trials, and depicts the presence, contours, poses, and gait patterns of human subjects with transfemoral prosthetic legs. Alongside the dataset itself, we present benchmark tasks and fine-tuned baseline models to illustrate the practical application and performance of the ProGait dataset. We compare our baseline models against pre-trained vision models, demonstrating improved generalizability when applying the ProGait dataset to prosthesis-specific tasks. Our code is available at this https URL and the dataset at this https URL.
https://arxiv.org/abs/2507.10223
WiFi-based human pose estimation has emerged as a promising non-visual alternative due to its penetrability and privacy advantages. This paper presents VST-Pose, a novel deep learning framework for accurate and continuous pose estimation from WiFi channel state information. The proposed method introduces ViSTA-Former, a spatiotemporal attention backbone with a dual-stream architecture that separately captures temporal dependencies and structural relationships among body joints. To enhance sensitivity to subtle human motions, a velocity modeling branch is integrated into the framework; it learns short-term keypoint displacement patterns and improves fine-grained motion representation. We construct a 2D pose dataset specifically designed for smart home care scenarios and show that our method achieves 92.2% accuracy on the PCK@50 metric, outperforming existing methods by 8.3% on this self-collected dataset. Further evaluation on the public MMFi dataset confirms the model's robustness and effectiveness in 3D pose estimation tasks. The proposed system provides a reliable and privacy-aware solution for continuous human motion analysis in indoor environments. Our code is available at this https URL.
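For reference, the PCK@50 numbers above follow the usual percentage-of-correct-keypoints recipe; a generic sketch is below (the reference length used for normalization, e.g. torso or bounding-box size, is an assumption here, not taken from the paper).

```python
import numpy as np

def pck(pred, gt, ref_len, alpha=0.5):
    """Percentage of Correct Keypoints: a joint counts as correct when its
    error is below alpha * ref_len (alpha = 0.5 corresponds to PCK@50).

    pred, gt: (N, J, 2) keypoints; ref_len: scalar or (N, 1) reference length."""
    dist = np.linalg.norm(pred - gt, axis=-1)        # (N, J) per-joint error
    return float((dist < alpha * ref_len).mean())
```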
https://arxiv.org/abs/2507.09672
3D hand pose estimation has garnered great attention in recent years due to its critical applications in human-computer interaction, virtual reality, and related fields. Accurate estimation of the hand joints is essential for high-quality hand pose estimation. However, existing methods neglect the importance of the distal phalanx tips (TIP) and the wrist when predicting the hand joints as a whole, and often fail to account for error accumulation at distal joints during gesture estimation, which can cause certain joints to incur larger errors, resulting in misalignments and artifacts in the estimated pose and degrading the overall reconstruction quality. To address this challenge, we propose a novel segmented architecture for enhanced hand pose estimation (EHPE). We extract the TIP and wrist locally, alleviating the effect of error accumulation on TIP prediction, and on this basis further reduce the prediction errors for all joints. EHPE consists of two key stages: in the TIP and Wrist Joints Extraction stage (TW-stage), the positions of the TIP and wrist joints are estimated to provide an initial, accurate joint configuration; in the Prior Guided Joints Estimation stage (PG-stage), a dual-branch interaction network refines the positions of the remaining joints. Extensive experiments on two widely used benchmarks demonstrate that EHPE achieves state-of-the-art performance. Code is available at this https URL.
https://arxiv.org/abs/2507.09560
The brain can only be fully understood through the lens of the behavior it generates, a guiding principle in modern neuroscience research that nevertheless presents significant technical challenges. Many studies capture behavior with cameras, but video analysis approaches typically rely on specialized models requiring extensive labeled data. We address this limitation with BEAST (BEhavioral Analysis via Self-supervised pretraining of Transformers), a novel and scalable framework that pretrains experiment-specific vision transformers for diverse neuro-behavioral analyses. BEAST combines masked autoencoding with temporal contrastive learning to effectively leverage unlabeled video data. Through comprehensive evaluation across multiple species, we demonstrate improved performance in three critical neuro-behavioral tasks: extracting behavioral features that correlate with neural activity, pose estimation, and action segmentation, in both single- and multi-animal settings. Our method establishes a powerful and versatile backbone model that accelerates behavioral analysis in scenarios where labeled data remains scarce.
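The temporal contrastive component can be illustrated with a standard InfoNCE-style sketch (our illustration under generic assumptions; BEAST's exact loss, choice of positives, and temperature may differ).

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(emb_t, emb_next, temperature=0.1):
    """Pull embeddings of temporally adjacent frames together while treating
    other clips in the batch as negatives.

    emb_t, emb_next: (B, D) frame embeddings from the transformer encoder."""
    z1 = F.normalize(emb_t, dim=-1)
    z2 = F.normalize(emb_next, dim=-1)
    logits = z1 @ z2.T / temperature                  # (B, B) similarities
    labels = torch.arange(z1.shape[0], device=z1.device)
    return F.cross_entropy(logits, labels)
```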
https://arxiv.org/abs/2507.09513
Human pose estimation traditionally relies on architectures that encode keypoint priors, limiting their generalization to novel poses or unseen keypoints. Recent language-guided approaches such as LocLLM reformulate keypoint localization as a vision-language task, enabling zero-shot generalization through textual descriptions. However, LocLLM's linear projector fails to capture the complex spatial-textual interactions critical for high-precision localization. To address this, we propose PoseLLM, the first Large Language Model (LLM)-based pose estimation framework that replaces the linear projector with a nonlinear MLP vision-language connector. This lightweight two-layer MLP with GELU activation enables hierarchical cross-modal feature transformation, enhancing the fusion of visual patches and textual keypoint descriptions. Trained exclusively on COCO data, PoseLLM achieves 77.8 AP on the COCO validation set, outperforming LocLLM by +0.4 AP, while maintaining strong zero-shot generalization on Human-Art and MPII. Our work demonstrates that a simple yet powerful nonlinear connector significantly boosts localization accuracy without sacrificing generalization, advancing the state of the art in language-guided pose estimation. Code is available at this https URL.
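The described connector is small enough to sketch directly; the layer dimensions below are illustrative assumptions rather than the released configuration.

```python
import torch.nn as nn

class MLPConnector(nn.Module):
    """Two-layer MLP with GELU that maps visual patch features into the
    LLM embedding space, replacing a single linear projector."""
    def __init__(self, vision_dim=1024, hidden_dim=4096, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, visual_patches):      # (B, N_patches, vision_dim)
        return self.proj(visual_patches)    # (B, N_patches, llm_dim)
```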
https://arxiv.org/abs/2507.09139
With the advancement of vision-based autonomous driving technology, pedestrian detection has become an important component for improving traffic safety and driving system robustness. Nevertheless, in complex traffic scenarios, conventional pose estimation approaches frequently fail to accurately reconstruct occluded keypoints, primarily due to obstructions caused by vehicles, vegetation, or architectural elements. To address this issue, we propose a novel real-time occluded pedestrian pose completion framework termed Separation and Dimensionality Reduction-based Generative Adversarial Imputation Nets (SDR-GAIN). Unlike previous approaches that train visual models to distinguish occlusion patterns, SDR-GAIN learns human pose directly from the numerical distribution of keypoint coordinates and interpolates the missing positions. It employs a self-supervised adversarial learning paradigm to train lightweight generators with residual structures for imputing missing pose keypoints. Additionally, it integrates multiple pose standardization techniques to ease the learning process. Experiments conducted on the COCO and JAAD datasets demonstrate that SDR-GAIN surpasses conventional machine learning and Transformer-based missing-data interpolation algorithms in accurately recovering occluded pedestrian keypoints, while achieving microsecond-level real-time inference.
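A minimal GAIN-style generator for keypoint imputation is sketched below to make the idea concrete; it is an illustrative simplification (SDR-GAIN additionally uses separation, dimensionality reduction, residual generators, and pose standardization, none of which are reproduced here).

```python
import torch
import torch.nn as nn

class PoseImputer(nn.Module):
    """Impute occluded keypoints from the observed ones and an observation mask.

    Input:  kps  (B, J*2) flattened coordinates with missing entries zeroed,
            mask (B, J*2) with 1 for observed and 0 for occluded entries.
    Output: a complete pose in which observed coordinates are kept as-is."""
    def __init__(self, n_joints=17, hidden=128):
        super().__init__()
        d = n_joints * 2
        self.net = nn.Sequential(
            nn.Linear(2 * d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, d),
        )

    def forward(self, kps, mask):
        imputed = self.net(torch.cat([kps * mask, mask], dim=-1))
        return kps * mask + imputed * (1 - mask)   # trust observed coordinates
```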
https://arxiv.org/abs/2306.03538
3D Gaussian Splatting (3DGS) has demonstrated its potential in reconstructing scenes from unposed images. However, optimization-based 3DGS methods struggle with sparse views due to limited prior knowledge. Meanwhile, feed-forward Gaussian approaches are constrained by their input formats, making it challenging to incorporate more input views. To address these challenges, we propose RegGS, a 3D Gaussian registration-based framework for reconstructing unposed sparse views. RegGS aligns local 3D Gaussians generated by a feed-forward network into a globally consistent 3D Gaussian representation. Technically, we implement an entropy-regularized Sinkhorn algorithm to efficiently compute the optimal-transport Mixture 2-Wasserstein $(\text{MW}_2)$ distance, which serves as an alignment metric for Gaussian mixture models (GMMs) in $\mathrm{Sim}(3)$ space. Furthermore, we design a joint 3DGS registration module that integrates the $\text{MW}_2$ distance, photometric consistency, and depth geometry. This enables a coarse-to-fine registration process while accurately estimating camera poses and aligning the scene. Experiments on the RE10K and ACID datasets demonstrate that RegGS effectively registers local Gaussians with high fidelity, achieving precise pose estimation and high-quality novel-view synthesis. Project page: this https URL.
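At its core, the entropy-regularized alignment reduces to Sinkhorn iterations over a cost matrix between Gaussian components; a generic sketch follows (the cost would be the pairwise Gaussian-to-Gaussian 2-Wasserstein cost under a Sim(3) hypothesis, and the regularization strength and iteration count here are assumptions, not RegGS's implementation).

```python
import numpy as np

def sinkhorn_plan(cost, a, b, eps=0.05, iters=200):
    """Entropy-regularized optimal transport plan between component weights
    a and b, given a pairwise cost matrix between Gaussian components.

    cost: (M, N) costs; a: (M,) and b: (N,) non-negative weights summing to 1."""
    K = np.exp(-cost / eps)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)                    # scale columns to match b
        u = a / (K @ v)                      # scale rows to match a
    return u[:, None] * K * v[None, :]       # transport plan (M, N)
```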
https://arxiv.org/abs/2507.08136
Autonomous flight in GPS-denied indoor spaces requires trajectories that keep visual localization error tightly bounded across varied missions. Whereas visual-inertial odometry (VIO) accumulates drift over time, scene coordinate regression (SCR) yields drift-free, high-accuracy absolute pose estimates. We present a perception-aware framework that couples an evidential-learning-based SCR pose estimator with a receding-horizon trajectory optimizer. The optimizer steers the onboard camera toward pixels whose uncertainty predicts reliable scene coordinates, while a fixed-lag smoother fuses the low-rate SCR stream with high-rate IMU data to close the perception-control loop in real time. In simulation, our planner reduces mean translation (rotation) error by 54% / 15% (40% / 31%) relative to yaw-fixed and forward-looking baselines, respectively. Moreover, a hardware-in-the-loop experiment validates the feasibility of the proposed framework.
https://arxiv.org/abs/2507.07467
Category-level object pose estimation, which predicts the pose of objects within a known category without prior knowledge of individual instances, is essential in applications like warehouse automation and manufacturing. Existing methods relying on RGB images or point cloud data often struggle with object occlusion and with generalization across different instances and categories. This paper proposes a multimodal keypoint learning framework (MK-Pose) that integrates RGB images, point clouds, and category-level textual descriptions. The model uses a self-supervised keypoint detection module enhanced with attention-based query generation, soft heatmap matching, and graph-based relational modeling. Additionally, a graph-enhanced feature fusion module is designed to integrate local geometric information and global context. MK-Pose is evaluated on the CAMERA25 and REAL275 datasets and further tested for cross-dataset capability on the HouseCat6D dataset. The results demonstrate that MK-Pose outperforms existing state-of-the-art methods in both IoU and average precision without shape priors. Code will be released at this https URL.
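The "soft heatmap matching" ingredient is commonly realized as a differentiable soft-argmax over per-keypoint heatmaps; the sketch below shows that generic operation (an illustration, not MK-Pose's exact module).

```python
import torch

def soft_argmax(heatmaps, temperature=1.0):
    """Expected (x, y) keypoint locations from raw heatmap scores.

    heatmaps: (B, K, H, W); returns (B, K, 2) sub-pixel coordinates."""
    B, K, H, W = heatmaps.shape
    probs = torch.softmax(heatmaps.view(B, K, -1) / temperature, dim=-1)
    probs = probs.view(B, K, H, W)
    xs = torch.arange(W, dtype=probs.dtype, device=probs.device)
    ys = torch.arange(H, dtype=probs.dtype, device=probs.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)   # expectation over columns
    y = (probs.sum(dim=3) * ys).sum(dim=-1)   # expectation over rows
    return torch.stack([x, y], dim=-1)
```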
https://arxiv.org/abs/2507.06662
Robust 6D object pose estimation under cluttered or occluded conditions using monocular RGB images remains a challenging task. One reason is that current pose estimation networks struggle to extract discriminative, pose-aware features with 2D feature backbones, especially when the available RGB information is limited by target occlusion in cluttered scenes. To mitigate this, we propose a novel pose estimation-specific pre-training strategy named Mask6D. Our approach incorporates pose-aware 2D-3D correspondence maps and visible mask maps as additional modalities, which are combined with RGB images for reconstruction-based model pre-training. Essentially, each 2D-3D correspondence map projects the transformed 3D object model onto 2D pixels, reflecting the target's pose in the camera coordinate system. Meanwhile, the integrated visible mask map effectively guides our model to disregard cluttered background information. In addition, an object-focused pre-training loss function is designed to further help the network suppress background interference. Finally, we fine-tune the pre-trained, pose-prior-aware network with a conventional pose training strategy to achieve reliable pose prediction. Extensive experiments verify that our method outperforms previous end-to-end pose estimation methods.
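The 2D-3D correspondence map can be pictured with the simplified sketch below, which projects the posed model vertices into the image and stores their model-frame coordinates per pixel (occlusion handling and dense rasterization are omitted, and the interface is an assumption, not Mask6D's code).

```python
import numpy as np

def correspondence_map(vertices, R, t, K, hw):
    """Write the model-frame XYZ of each in-front vertex into the pixel it
    projects to, yielding a sparse 2D-3D correspondence map.

    vertices: (N, 3) model points; R, t: object pose; K: 3x3 intrinsics."""
    H, W = hw
    cam = vertices @ R.T + t                     # model -> camera coordinates
    pix = cam @ K.T
    uv = pix[:, :2] / pix[:, 2:3]                # perspective division
    corr = np.zeros((H, W, 3), dtype=np.float32)
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (cam[:, 2] > 0)
    corr[v[ok], u[ok]] = vertices[ok]            # store model coordinates
    return corr
```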
https://arxiv.org/abs/2507.06486
Recent advances in 6D object-pose estimation have achieved high performance on representative benchmarks such as LM-O, YCB-V, and T-Less. However, these datasets were captured under fixed illumination and camera settings, leaving the impact of real-world variations in illumination, exposure, gain, or depth-sensor mode, and the potential of test-time sensor control to mitigate such variations, largely unexplored. To bridge this gap, we introduce SenseShift6D, the first RGB-D dataset that physically sweeps 13 RGB exposures, 9 RGB gains, auto-exposure, 4 depth-capture modes, and 5 illumination levels. For three common household objects (spray, pringles, and tincase), we acquire 101.9k RGB and 10k depth images, providing 1,380 unique sensor-lighting permutations per object pose. Experiments with state-of-the-art models on our dataset show that applying sensor control at test time yields greater performance improvements than digital data augmentation, achieving performance comparable to or better than costly increases in the quantity and diversity of real-world training data. Adapting either the RGB or the depth sensor individually is effective, while jointly adapting multimodal RGB-D configurations yields even greater improvements. SenseShift6D extends the 6D-pose evaluation paradigm from data-centered to sensor-aware robustness, laying a foundation for adaptive, self-tuning perception systems capable of operating robustly in uncertain real-world environments. Our dataset is available at this http URL. Associated scripts can be found at this http URL.
https://arxiv.org/abs/2507.05751