Surveillance footage is a valuable resource and an opportunity for conducting gait analysis. However, the typical low quality and high noise levels in such footage can severely impact the accuracy of pose estimation algorithms, which are foundational for reliable gait analysis. Existing literature suggests a direct correlation between the efficacy of pose estimation and the subsequent gait analysis results. A common mitigation strategy involves fine-tuning pose estimation models on noisy data to improve robustness. However, this approach may degrade the model's performance on the original high-quality data, leading to a trade-off that is undesirable in practice. We propose a processing pipeline that incorporates a task-targeted artifact correction model specifically designed to pre-process and enhance surveillance footage before pose estimation. Our artifact correction model is optimized to work alongside a state-of-the-art pose estimation network, HRNet, without requiring repeated fine-tuning of the pose estimation model. Furthermore, we propose a simple and robust method for automatically obtaining low-quality videos annotated with poses, for the purpose of training the artifact correction model. We systematically evaluate our artifact correction model on a range of noisy surveillance data and demonstrate that our approach not only improves pose estimation on low-quality surveillance footage but also preserves the integrity of pose estimation on high-resolution footage. Our experiments show a clear enhancement in gait analysis performance, supporting the viability of the proposed method as a superior alternative to direct fine-tuning strategies. Our contributions pave the way for more reliable gait analysis using surveillance data in real-world applications, regardless of data quality.
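A minimal sketch of how such a task-targeted correction model could be trained against a frozen pose estimator, so that enhancement is optimized for keypoint accuracy rather than raw pixel fidelity. The module names, the loss weighting, and the stand-in pose network are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArtifactCorrector(nn.Module):
    """Toy residual network standing in for the artifact correction model."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, x):
        return x + self.net(x)  # predict a residual correction of the frame

corrector = ArtifactCorrector()
# Stand-in for a frozen, pretrained HRNet producing 17 joint heatmaps.
pose_net = nn.Conv2d(3, 17, 3, padding=1)
for p in pose_net.parameters():
    p.requires_grad_(False)

def training_step(noisy, clean, heatmaps_gt, w_pixel=0.1):
    restored = corrector(noisy)
    # Task-targeted term: the frozen pose network's output on the restored
    # frame should match the pose annotation; the pose model never changes.
    pose_loss = F.mse_loss(pose_net(restored), heatmaps_gt)
    pixel_loss = F.l1_loss(restored, clean)
    return pose_loss + w_pixel * pixel_loss
```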
https://arxiv.org/abs/2404.12183
Modern agricultural applications increasingly rely on deep learning solutions. However, training well-performing deep networks requires a large amount of annotated data, which may not be available and, in the case of 3D annotation, may not even be feasible for human annotators. In this work, we develop a deep learning approach to segment mushrooms and estimate their pose on 3D data, in the form of point clouds acquired by depth sensors. To circumvent the annotation problem, we create a synthetic dataset of mushroom scenes for which we have full knowledge of the 3D information, such as the pose of each mushroom. The proposed network has a fully convolutional backbone that parses sparse 3D data and predicts pose information that implicitly defines both the instance segmentation and pose estimation tasks. We validate the effectiveness of the proposed implicit approach on a synthetic test set, and provide qualitative results for a small set of real point clouds acquired with depth sensors. Code is publicly available at this https URL.
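A hedged illustration of the synthetic-data idea: composing scenes from a canonical mushroom model at random poses yields point clouds whose instance labels and poses come for free. The pose parameterization (yaw plus a planar translation) and the sampling ranges are assumptions made for the sketch:

```python
import numpy as np

def random_pose():
    """Random yaw rotation plus a planar translation (assumed parameterization)."""
    a = np.random.uniform(0, 2 * np.pi)
    R = np.array([[np.cos(a), -np.sin(a), 0.0],
                  [np.sin(a),  np.cos(a), 0.0],
                  [0.0, 0.0, 1.0]])
    t = np.array([np.random.uniform(-0.5, 0.5),
                  np.random.uniform(-0.5, 0.5), 0.0])
    return R, t

def make_scene(template_pts, n_mushrooms=10):
    """template_pts: (N, 3) points of a canonical mushroom model.
    Returns a scene cloud plus free per-point instance ids and poses."""
    pts, ids, poses = [], [], []
    for i in range(n_mushrooms):
        R, t = random_pose()
        pts.append(template_pts @ R.T + t)
        ids.append(np.full(len(template_pts), i))
        poses.append((R, t))
    return np.concatenate(pts), np.concatenate(ids), poses
```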
https://arxiv.org/abs/2404.12144
The Indian classical dance-drama Kathakali has a set of hand gestures called Mudras, which form the fundamental units of all its dance moves and postures. Recognizing the depicted mudra is one of the first steps in its digital processing. This work treats the problem as a 24-class classification task and proposes a vector-similarity-based approach using pose estimation, eliminating the need for further training or fine-tuning. This approach overcomes the challenge of data scarcity that limits the application of AI in similar domains. The method attains 92% accuracy, performance similar to or better than other model-training-based works in the domain, with the added advantage that it can still work with as few as 1 or 5 samples, at slightly reduced performance. The method works with images, videos, and even real-time streams, and handles hand-cropped and full-body images alike. As part of this work, we have developed and made public a dataset for Kathakali Mudra Recognition.
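A minimal sketch of the training-free, vector-similarity approach as described: keypoints from an off-the-shelf pose estimator are normalized for translation and scale, then matched against stored class exemplars by cosine similarity. The normalization scheme and the 21-keypoint hand layout are assumptions:

```python
import numpy as np

def normalize(kpts: np.ndarray) -> np.ndarray:
    """kpts: (21, 2) hand keypoints -> translation/scale-invariant vector."""
    centered = kpts - kpts.mean(axis=0)
    scale = np.linalg.norm(centered) + 1e-8
    return (centered / scale).ravel()

def classify(query_kpts, exemplars):
    """exemplars: dict mapping mudra name -> list of (21, 2) keypoint arrays.
    Works even with 1-5 exemplars per class, as the abstract reports."""
    q = normalize(query_kpts)
    def score(cls):
        # Best cosine similarity against any stored exemplar of the class
        # (unit vectors, so the dot product is the cosine similarity).
        return max(float(q @ normalize(e)) for e in exemplars[cls])
    return max(exemplars, key=score)
```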
https://arxiv.org/abs/2404.11205
Object pose refinement is essential for robust object pose estimation. Previous work has made significant progress on instance-level object pose refinement. Yet category-level pose refinement is a more challenging problem, due to large shape variations within a category and discrepancies between the target object and the shape prior. To address these challenges, we introduce a novel architecture for category-level object pose refinement. Our approach integrates an HS-layer and learnable affine transformations, which aim to enhance the extraction and alignment of geometric information. Additionally, we introduce a cross-cloud transformation mechanism that efficiently merges diverse data sources. Finally, we push the limits of our model by incorporating shape prior information into translation and size error prediction. Extensive quantitative experiments demonstrate that the proposed framework improves on the baseline method by a large margin across all metrics.
https://arxiv.org/abs/2404.11139
Understanding human motion from video is essential for a range of applications, including pose estimation, mesh recovery, and action recognition. While state-of-the-art methods predominantly rely on transformer-based architectures, these approaches have limitations in practical scenarios. Transformers are slower when sequentially predicting on a continuous stream of frames in real time, and do not generalize to new frame rates. In light of these constraints, we propose a novel attention-free spatiotemporal model for human motion understanding, building upon recent advancements in state space models. Our model not only matches the performance of transformer-based models in various motion understanding tasks but also brings added benefits, such as adaptability to different video frame rates and faster training on longer sequences of keypoints. Moreover, the proposed model supports both offline and real-time applications. For real-time sequential prediction, our model is both memory-efficient and several times faster than transformer-based approaches while maintaining their high accuracy.
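Why a state space model streams cheaply: each new frame updates a fixed-size hidden state with one recurrence, rather than re-attending over a growing window as a transformer must. The sketch below is a generic discretized linear SSM step (h_t = A h_{t-1} + B x_t, y_t = C h_t), not the paper's exact parameterization:

```python
import numpy as np

class SSMStream:
    """Streaming linear state space model with a fixed-size hidden state."""
    def __init__(self, d_in, d_state, d_out, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(scale=0.1, size=(d_state, d_state))
        self.B = rng.normal(scale=0.1, size=(d_state, d_in))
        self.C = rng.normal(scale=0.1, size=(d_out, d_state))
        self.h = np.zeros(d_state)

    def step(self, x):
        """O(d_state^2) per frame, independent of how long the stream is."""
        self.h = self.A @ self.h + self.B @ x
        return self.C @ self.h

stream = SSMStream(d_in=34, d_state=64, d_out=34)  # e.g. 17 keypoints x 2
for frame_kpts in np.random.randn(100, 34):        # stand-in keypoint stream
    y = stream.step(frame_kpts)                    # one cheap update per frame
```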
https://arxiv.org/abs/2404.10880
Gait is a behavioral biometric modality that can be used to recognize individuals from a distance by the way they walk. Most existing gait recognition approaches rely on either silhouettes or skeletons, while their joint use is underexplored. Features from silhouettes and skeletons can provide complementary information for recognition that is more robust to appearance changes or pose estimation errors. To exploit the benefits of both silhouette and skeleton features, we propose a new gait recognition network, referred to as GaitPoint+. Our approach models skeleton key points as a 3D point cloud and employs a computational-complexity-conscious 3D point processing approach to extract skeleton features, which are then combined with silhouette features for improved accuracy. Since silhouette- or CNN-based methods already require a considerable amount of computational resources, it is preferable for the key point learning module to be fast and lightweight. We present a detailed analysis of how every human key point is utilized after traditional max-pooling, and show that while elbow and ankle points are used most commonly, many useful points are discarded by max-pooling. We therefore present a Recycling Max-Pooling module that recycles some of the discarded points during the processing of skeleton point clouds, achieving a further performance improvement. We provide a comprehensive set of experimental results showing that (i) incorporating skeleton features obtained by a point-based 3D point cloud processing approach boosts the performance of three different state-of-the-art silhouette- and CNN-based baselines; (ii) recycling the discarded points increases the accuracy further. Ablation studies are also provided to show the effectiveness and contribution of the different components of our approach.
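A hedged sketch of the Recycling Max-Pooling idea as described: standard max-pooling keeps only the points that win at least one feature channel, so a second pooling pass over the discarded points recovers information that would otherwise be thrown away. Fusing the two descriptors by concatenation is an assumption:

```python
import torch

def recycling_max_pool(feats: torch.Tensor) -> torch.Tensor:
    """feats: (N_points, C) per-point skeleton features -> (2C,) descriptor."""
    pooled, argmax = feats.max(dim=0)      # first pass: per-channel winners
    used = torch.zeros(feats.shape[0], dtype=torch.bool)
    used[argmax.unique()] = True           # points that won at least one channel
    discarded = feats[~used]
    if discarded.numel() == 0:             # degenerate case: every point used
        recycled = pooled
    else:
        recycled, _ = discarded.max(dim=0)  # second pass over the leftovers
    return torch.cat([pooled, recycled])
```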
https://arxiv.org/abs/2404.10213
Human pose estimation faces hurdles in real-world applications due to factors like lighting changes, occlusions, and cluttered environments. We introduce a unique RGB-Thermal Nearly Paired and Annotated 2D Pose Dataset, comprising over 2,400 high-quality LWIR (thermal) images. Each image is meticulously annotated with 2D human poses, offering a valuable resource for researchers and practitioners. This dataset, captured from seven actors performing diverse everyday activities like sitting, eating, and walking, facilitates pose estimation under occlusion and in other challenging scenarios. We benchmark state-of-the-art pose estimation methods on the dataset to showcase its potential, establishing a strong baseline for future research. Our results demonstrate the dataset's effectiveness in promoting advancements in pose estimation for various applications, including surveillance, healthcare, and sports analytics. The dataset and code are available at this https URL
https://arxiv.org/abs/2404.10212
Large garages are ubiquitous yet intricate scenes in our daily lives, posing challenges characterized by monotonous colors, repetitive patterns, reflective surfaces, and transparent vehicle glass. Conventional Structure from Motion (SfM) methods for camera pose estimation and 3D reconstruction fail in these environments due to poor correspondence construction. To address these challenges, this paper introduces LetsGo, a LiDAR-assisted Gaussian splatting approach for large-scale garage modeling and rendering. We develop a handheld scanner, Polar, equipped with an IMU, LiDAR, and a fisheye camera, to facilitate accurate LiDAR and image data scanning. With this Polar device, we present a GarageWorld dataset consisting of five expansive garage scenes with diverse geometric structures, and we will release the dataset to the community for further research. We demonstrate that the LiDAR point clouds collected by the Polar device enhance a suite of 3D Gaussian splatting algorithms for garage scene modeling and rendering. We also propose a novel depth regularizer for training 3D Gaussian splatting, effectively eliminating floating artifacts in rendered images, and a lightweight Level of Detail (LOD) Gaussian renderer for real-time viewing on web-based devices. Additionally, we explore a hybrid representation that combines the advantages of traditional meshes in depicting simple geometry and colors (e.g., walls and the ground) with modern 3D Gaussian representations capturing complex details and high-frequency textures. This strategy achieves an optimal balance between memory performance and rendering quality. Experimental results on our dataset, along with ScanNet++ and KITTI-360, demonstrate the superiority of our method in rendering quality and resource efficiency.
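For intuition, a minimal sketch of what a LiDAR-based depth regularizer for Gaussian splatting training could look like: floaters render at depths inconsistent with the projected LiDAR returns and are therefore penalized. This illustrates the idea only; it is not LetsGo's actual formulation:

```python
import torch

def depth_regularizer(rendered_depth: torch.Tensor,
                      lidar_depth: torch.Tensor) -> torch.Tensor:
    """Both (H, W); lidar_depth holds 0 where no LiDAR return projects.
    An L1 penalty on valid pixels pulls floaters toward the measured surface."""
    valid = lidar_depth > 0
    if not valid.any():
        return rendered_depth.new_zeros(())
    return torch.abs(rendered_depth[valid] - lidar_depth[valid]).mean()
```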
https://arxiv.org/abs/2404.09748
Action recognition is essential for egocentric video understanding, allowing automatic and continuous monitoring of Activities of Daily Living (ADLs) without user effort. Existing literature focuses on 3D hand pose input, which requires computationally intensive depth estimation networks or wearing an uncomfortable depth sensor. In contrast, there has been insufficient research into understanding 2D hand pose for egocentric action recognition, despite the availability of user-friendly smart glasses on the market capable of capturing a single RGB image. Our study aims to fill this research gap by exploring 2D hand pose estimation for egocentric action recognition, making two contributions. Firstly, we introduce two novel approaches for 2D hand pose estimation, namely EffHandNet for single-hand estimation and EffHandEgoNet, tailored for an egocentric perspective and capturing interactions between hands and objects. Both methods outperform state-of-the-art models on the H2O and FPHA public benchmarks. Secondly, we present a robust action recognition architecture built on 2D hand and object poses, incorporating EffHandEgoNet and a transformer-based action recognition method. Evaluated on the H2O and FPHA datasets, our architecture has a faster inference time and achieves accuracies of 91.32% and 94.43%, respectively, surpassing the state of the art, including 3D-based methods. Our work demonstrates that using 2D skeletal data is a robust approach for egocentric action understanding. Extensive evaluation and ablation studies show the impact of the hand pose estimation approach and how each input affects overall performance.
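A plausible sketch of the second contribution's structure: per-frame 2D keypoints are embedded as tokens and a temporal transformer classifies the action. The layer sizes, the 42-keypoint two-hand layout, and mean pooling are assumptions, not EffHandEgoNet's actual design:

```python
import torch
import torch.nn as nn

class Pose2DActionNet(nn.Module):
    """Action classification from sequences of 2D keypoints."""
    def __init__(self, n_kpts=42, n_actions=36, d=128):
        super().__init__()
        self.embed = nn.Linear(n_kpts * 2, d)   # one token per frame
        enc = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(enc, num_layers=2)
        self.cls = nn.Linear(d, n_actions)

    def forward(self, kpts):                    # kpts: (B, T, n_kpts, 2)
        B, T = kpts.shape[:2]
        tokens = self.embed(kpts.reshape(B, T, -1))
        return self.cls(self.temporal(tokens).mean(dim=1))

model = Pose2DActionNet()
logits = model(torch.randn(2, 64, 42, 2))       # stand-in keypoint sequences
```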
https://arxiv.org/abs/2404.09308
In this paper, we analyze and improve the recently proposed DeDoDe keypoint detector, focusing on several key issues. First, we find that DeDoDe keypoints tend to cluster together, which we fix by performing non-max suppression on the target distribution of the detector during training. Second, we address issues related to data augmentation. In particular, the DeDoDe detector is sensitive to large rotations; we fix this by including 90-degree rotations as well as horizontal flips. Finally, the decoupled nature of the DeDoDe detector makes evaluation of downstream usefulness problematic. We fix this by matching the keypoints with a pretrained dense matcher (RoMa) and evaluating two-view pose estimates. We find that the original long training is detrimental to performance, and therefore propose a much shorter training schedule. We integrate all these improvements into our proposed detector, DeDoDe v2, and evaluate it with the original DeDoDe descriptor on the MegaDepth-1500 and IMC2022 benchmarks. Our proposed detector significantly improves pose estimation results, notably from 75.9 to 78.3 mAA on the IMC2022 challenge. Code and weights are available at this https URL
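Sketches of the two training fixes described: non-max suppression applied to the detector's target distribution, so supervision no longer encourages clustered keypoints, and augmentation with 90-degree rotations and horizontal flips. The window size and implementation details are assumptions:

```python
import torch
import torch.nn.functional as F

def nms_target(target: torch.Tensor, k: int = 3) -> torch.Tensor:
    """target: (1, 1, H, W) detection distribution -> keep local maxima only,
    so nearby high-probability pixels stop reinforcing keypoint clusters."""
    local_max = F.max_pool2d(target, k, stride=1, padding=k // 2)
    return target * (target == local_max)

def augment(img: torch.Tensor) -> torch.Tensor:
    """Random 90-degree rotation plus horizontal flip of a (C, H, W) image."""
    img = torch.rot90(img, k=int(torch.randint(0, 4, (1,))), dims=(1, 2))
    if torch.rand(()) < 0.5:
        img = torch.flip(img, dims=(2,))
    return img
```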
https://arxiv.org/abs/2404.08928
Capturing the 3D human body is one of the important tasks in computer vision, with a wide range of applications such as virtual reality and sports analysis. However, conventional frame cameras are limited by their temporal resolution and dynamic range, which imposes constraints in real-world application setups. Event cameras have the advantages of high temporal resolution and high dynamic range (HDR), but the development of event-based methods is necessary to handle data with different characteristics. This paper proposes a novel event-based method for 3D pose estimation and human mesh recovery. Prior work on event-based human mesh recovery requires frames (images) as well as event data. The proposed method relies solely on events; it carves 3D voxels by moving the event camera around a stationary body, reconstructs the human pose and mesh via attenuated rays, and fits statistical body models, preserving high-frequency details. The experimental results show that the proposed method outperforms conventional frame-based methods in the estimation accuracy of both pose and body mesh. We also demonstrate results in challenging situations where a conventional camera suffers from motion blur. This is the first work to demonstrate event-only human mesh recovery, and we hope it is a first step toward achieving robust and accurate 3D human body scanning from vision sensors.
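An illustrative, simplified take on the event-based carving step: from each camera pose, voxels that project onto pixels with no event activity are attenuated, so only the volume consistent with the moving camera's event observations survives. The projection model and attenuation factor are assumptions made for the sketch:

```python
import numpy as np

def carve(occupancy, voxel_xyz, event_mask, K, R, t, atten=0.5):
    """occupancy: (N,) voxel scores; voxel_xyz: (N, 3) world coordinates;
    event_mask: (H, W) bool map of pixels that fired events at this pose;
    K: (3, 3) intrinsics; R, t: world-to-camera rotation and translation."""
    cam = R @ voxel_xyz.T + t[:, None]            # world -> camera, (3, N)
    uvw = K @ cam
    z = uvw[2]
    safe_z = np.where(z > 0, z, 1.0)              # avoid divide-by-zero noise
    u = np.round(uvw[0] / safe_z).astype(int)
    v = np.round(uvw[1] / safe_z).astype(int)
    H, W = event_mask.shape
    inside = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    hit = np.zeros_like(occupancy, dtype=bool)
    hit[inside] = event_mask[v[inside], u[inside]]
    occupancy[~hit] *= atten                      # attenuate unsupported voxels
    return occupancy
```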
https://arxiv.org/abs/2404.08504
In this paper, we present an improved CycleGAN-based model for underwater image enhancement. We utilize the cycle-consistent learning technique of the state-of-the-art CycleGAN model and modify the loss function with a depth-oriented attention term, which enhances the contrast of the overall image while keeping global content, color, local texture, and style information intact. We trained the CycleGAN model with the modified loss functions on the benchmark Enhancing Underwater Visual Perception (EUVP) dataset, a large dataset including paired and unpaired sets of underwater images (of poor and good quality) taken with seven distinct cameras in a range of visibility situations during research on ocean exploration and human-robot cooperation. In addition, we perform qualitative and quantitative evaluations that support the applied technique and show improved contrast enhancement of underwater imagery. More significantly, the enhanced images yield better results than conventional models for underwater navigation, pose estimation, saliency prediction, and object detection and tracking. The results validate the appropriateness of the model for autonomous underwater vehicles (AUVs) in visual navigation.
https://arxiv.org/abs/2404.07649
This paper proposes a GeneraLIst encoder-Decoder (GLID) pre-training method for better handling various downstream computer vision tasks. While self-supervised pre-training approaches, e.g., Masked Autoencoder, have shown success in transfer learning, task-specific sub-architectures still need to be appended for different downstream tasks, and these cannot enjoy the benefits of large-scale pre-training. GLID overcomes this challenge by allowing the pre-trained generalist encoder-decoder to be fine-tuned on various vision tasks with minimal task-specific architecture modifications. In the GLID training scheme, the pre-training pretext task and the downstream tasks are all modeled as "query-to-answer" problems. We pre-train a task-agnostic encoder-decoder with query-mask pairs. During fine-tuning, GLID maintains the pre-trained encoder-decoder and queries, only replacing the topmost linear transformation layer with task-specific linear heads. This minimizes the pretrain-finetune architecture inconsistency and enables the pre-trained model to better adapt to downstream tasks. GLID achieves competitive performance on various vision tasks, including object detection, image segmentation, pose estimation, and depth estimation, outperforming or matching specialist models such as Mask2Former, DETR, ViTPose, and BinsFormer.
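A hedged sketch of the pretrain/fine-tune recipe as described: the encoder-decoder and the learned queries are retained, and only the topmost linear layer is swapped for a task-specific head. The modules and dimensions below are illustrative stand-ins, not GLID's actual architecture:

```python
import torch
import torch.nn as nn

class QueryToAnswer(nn.Module):
    """Task-agnostic encoder-decoder that answers a set of learned queries."""
    def __init__(self, d_model=256, n_queries=100, out_dim=256):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), 2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), 2)
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.head = nn.Linear(d_model, out_dim)  # the only part replaced later

    def forward(self, tokens):  # tokens: (B, L, d_model) image/patch features
        memory = self.encoder(tokens)
        q = self.queries.unsqueeze(0).expand(tokens.shape[0], -1, -1)
        return self.head(self.decoder(q, memory))

model = QueryToAnswer()          # pre-trained with query-mask pairs...
model.head = nn.Linear(256, 17)  # ...then only the head is swapped per task
```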
https://arxiv.org/abs/2404.07603
This paper introduces a novel pipeline designed to bring ultrasound (US) plane pose estimation closer to clinical use, enabling more effective navigation to the standard planes (SPs) in the fetal brain. We propose a semi-supervised segmentation model utilizing both labeled SPs and unlabeled 3D US volume slices. Our model enables reliable segmentation across a diverse set of fetal brain images. Furthermore, the model incorporates a classification mechanism to identify the fetal brain precisely. Our model not only filters out frames lacking the brain but also generates masks for those containing it, enhancing the relevance of plane pose regression in clinical settings. We focus on fetal brain navigation from 2D ultrasound (US) video analysis and combine this model with a US plane pose regression network to provide sensorless proximity detection to SP and non-SP planes; we emphasize the importance of proximity detection to SPs for guiding sonographers, offering a substantial advantage over traditional methods by allowing earlier and more precise adjustments during scanning. We demonstrate the practical applicability of our approach through validation on real fetal scan videos obtained from sonographers of varying expertise levels. Our findings demonstrate the potential of our approach to complement existing fetal US technologies and advance prenatal diagnostic practices.
https://arxiv.org/abs/2404.07124
Dense scene reconstruction for photo-realistic view synthesis has various applications, such as VR/AR and autonomous vehicles. However, most existing methods have difficulties with large-scale scenes due to three core challenges: \textit{(a) inaccurate depth input.} Accurate depth input is impossible to obtain in real-world large-scale scenes. \textit{(b) inaccurate pose estimation.} Most existing approaches rely on accurate pre-estimated camera poses. \textit{(c) insufficient scene representation capability.} A single global radiance field lacks the capacity to effectively scale to large-scale scenes. To this end, we propose an incremental joint learning framework that achieves accurate depth estimation, pose estimation, and large-scale scene reconstruction. A vision-transformer-based network is adopted as the backbone to enhance performance in scale information estimation. For pose estimation, a feature-metric bundle adjustment (FBA) method is designed for accurate and robust camera tracking in large-scale scenes. For implicit scene representation, we propose an incremental scene representation method that constructs the entire large-scale scene as multiple local radiance fields, enhancing the scalability of the 3D scene representation. Extensive experiments have been conducted to demonstrate the effectiveness and accuracy of our method in depth estimation, pose estimation, and large-scale scene reconstruction.
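For intuition, a generic feature-metric residual of the kind an FBA method minimizes: instead of comparing raw pixel intensities, the bundle adjustment compares learned feature maps sampled at reprojected locations, which tends to be more robust to lighting and texture changes. This is an assumed formulation, not the paper's exact FBA:

```python
import torch
import torch.nn.functional as F

def fba_residual(feat_ref, feat_src, pts_ref, depth, K, T):
    """feat_*: (1, C, H, W) learned feature maps; pts_ref: (N, 2) pixel
    coordinates in the reference frame; depth: (N,) depths; K: (3, 3)
    intrinsics; T: (4, 4) relative pose. Returns (N, C) feature residuals."""
    H, W = feat_ref.shape[-2:]
    ones = torch.ones(pts_ref.shape[0], 1)
    rays = torch.linalg.inv(K) @ torch.cat([pts_ref, ones], dim=1).T  # (3, N)
    p_cam = rays * depth                      # back-project to 3D
    p_src = T[:3, :3] @ p_cam + T[:3, 3:]     # move into the source frame
    uv = K @ p_src
    uv = (uv[:2] / uv[2]).T                   # (N, 2) reprojected pixels

    def sample(feat, pix):                    # bilinear feature lookup
        grid = (pix / pix.new_tensor([W - 1, H - 1])) * 2 - 1
        out = F.grid_sample(feat, grid.view(1, -1, 1, 2), align_corners=True)
        return out.squeeze(-1)[0].T           # (N, C)

    # Minimizing this residual over the pose T (and optionally the depths)
    # is the feature-metric counterpart of a photometric error.
    return sample(feat_src, uv) - sample(feat_ref, pts_ref)
```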
https://arxiv.org/abs/2404.06050
Collecting accurate camera poses for training images has been shown to serve the learning of 3D-aware generative adversarial networks (GANs) well, yet it can be quite expensive in practice. This work targets learning 3D-aware GANs from unposed images, for which we propose to perform on-the-fly pose estimation of training images with a learned template feature field (TeFF). Concretely, in addition to a generative radiance field as in previous approaches, we ask the generator to also learn a field of 2D semantic features while sharing the density from the radiance field. Such a framework allows us to acquire a canonical 3D feature template by leveraging the dataset mean discovered by the generative model, and further to efficiently estimate the pose parameters on real data. Experimental results on various challenging datasets demonstrate the superiority of our approach over state-of-the-art alternatives from both the qualitative and the quantitative perspectives.
https://arxiv.org/abs/2404.05705
3D object pose estimation is a challenging task. Previous works always require thousands of object images with annotated poses for learning the 3D pose correspondence, which makes labeling laborious and time-consuming. In this paper, we propose to learn a category-level 3D object pose estimator without pose annotations. Instead of using manually annotated images, we leverage diffusion models (e.g., Zero-1-to-3) to generate sets of images under controlled pose differences and propose to learn our object pose estimator from those images. Directly using the original diffusion model leads to images with noisy poses and artifacts. To tackle this issue, we first exploit an image encoder, learned via a specially designed contrastive pose learning scheme, to filter out unreasonable details and extract image feature maps. Additionally, we propose a novel learning strategy that allows the model to learn object poses from those generated image sets without knowing the alignment of their canonical poses. Experimental results show that our method is capable of category-level object pose estimation in a single-shot setting (as the pose definition), while significantly outperforming other state-of-the-art methods on few-shot category-level object pose estimation benchmarks.
https://arxiv.org/abs/2404.05626
Accurately distinguishing each object is a fundamental goal of multi-object tracking (MOT) algorithms. However, achieving this goal remains challenging, primarily because: (i) in crowded scenes with occluded objects, the high overlap of object bounding boxes leads to confusion among closely located objects. Nevertheless, humans naturally perceive the depth of elements in a scene when observing 2D videos. Inspired by this, even though the bounding boxes of objects are close on the camera plane, we can differentiate them in the depth dimension, thereby establishing a 3D perception of the objects. (ii) For videos with rapid, irregular camera motion, abrupt changes in object positions can result in ID switches. However, if the camera pose is known, we can compensate for the errors of linear motion models. In this paper, we propose \textit{DepthMOT}, which (i) detects objects and estimates the scene depth map \textit{end-to-end}, and (ii) compensates for irregular camera motion by camera pose estimation. Extensive experiments demonstrate the superior performance of DepthMOT on the VisDrone-MOT and UAVDT datasets. The code will be available at \url{this https URL}.
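A simple illustration of the depth-aware association idea: two targets whose boxes overlap heavily on the image plane can still be separated if their estimated depths differ, so depth enters the matching cost. The cost form and weighting are assumptions, not DepthMOT's exact design:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(ious, track_depth, det_depth, w_depth=0.5):
    """ious: (T, D) box IoU between tracks and detections;
    track_depth / det_depth: per-track and per-detection depth estimates.
    Returns matched (track_idx, det_idx) pairs via Hungarian matching."""
    ddiff = np.abs(track_depth[:, None] - det_depth[None, :])
    # Overlapping boxes (high IoU) at very different depths get a high
    # depth penalty, keeping closely located objects apart.
    cost = (1.0 - ious) + w_depth * ddiff / (ddiff.max() + 1e-8)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))
```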
https://arxiv.org/abs/2404.05518
3D hand pose estimation from images has seen considerable interest from the literature, with new methods improving overall 3D accuracy. One current challenge is hand-to-hand interaction, where self-occlusions and finger articulation pose a significant problem for estimation. Little work has applied physical constraints that minimize the hand intersections that occur as a result of noisy estimation. This work addresses the intersection of hands by exploiting an occupancy network that represents the hand's volume as a continuous manifold, allowing us to model the probability distribution of points being inside a hand. We design an intersection loss function to minimize the likelihood of hand-to-point intersections. Moreover, we propose a new hand mesh parameterization that is superior to the commonly used MANO model in many respects, including lower mesh complexity, underlying 3D skeleton extraction, watertightness, etc. On the benchmark InterHand2.6M dataset, models trained with our intersection loss achieve better results than the state of the art, significantly decreasing the number of hand intersections while lowering the mean per-joint positional error. Additionally, we demonstrate superior performance for 3D hand uplift on the Re:InterHand and SMILE datasets, and show reduced hand-to-hand intersections in complex domains such as sign-language pose estimation.
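A hedged sketch of the intersection loss: an occupancy network o(p) gives the probability that a point lies inside one hand's volume, so penetration can be penalized by pushing down the occupancy of the other hand's surface points. The untrained stand-in network and the 0.5 threshold are assumptions:

```python
import torch
import torch.nn as nn

# Untrained stand-in for the occupancy network o: R^3 -> [0, 1], which in
# the paper's setting would be trained to represent the hand's volume.
occ_net = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)

def intersection_loss(points_other_hand: torch.Tensor) -> torch.Tensor:
    """points_other_hand: (N, 3) surface points of the second hand, expressed
    in the first hand's canonical frame. Penalizes points predicted inside."""
    occ = occ_net(points_other_hand).squeeze(-1)  # (N,) inside-probability
    return torch.relu(occ - 0.5).mean()           # zero when nothing penetrates
```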
https://arxiv.org/abs/2404.05414
We present STITCH: an augmented dexterity pipeline that performs Suture Throws Including Thread Coordination and Handoffs. STITCH iteratively performs needle insertion, thread sweeping, needle extraction, suture cinching, needle handover, and needle pose correction with failure recovery policies. We introduce a novel visual 6D needle pose estimation framework using a stereo camera pair, together with new suturing motion primitives. We compare STITCH to baselines, including a proprioception-only policy and a policy without visual servoing. In physical experiments across 15 trials, STITCH achieves an average of 2.93 sutures without human intervention and 4.47 sutures with human intervention. See this https URL for code and supplemental materials.
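An illustrative stereo step toward 6D needle pose: matched needle keypoints (e.g., tip and tail) in the left and right images are triangulated with the calibrated projection matrices, and the resulting 3D points constrain the needle's position and orientation. Keypoint detection itself is assumed to come from elsewhere in the pipeline:

```python
import cv2
import numpy as np

def triangulate_needle(P_left, P_right, kpts_left, kpts_right):
    """P_*: (3, 4) camera projection matrices; kpts_*: (N, 2) matched
    keypoints. Returns (N, 3) needle points in the reference frame."""
    X_h = cv2.triangulatePoints(P_left, P_right,
                                kpts_left.T.astype(np.float64),
                                kpts_right.T.astype(np.float64))
    return (X_h[:3] / X_h[3]).T   # homogeneous -> Euclidean

def needle_axis(points3d):
    """Unit direction from the first to the last triangulated point,
    a crude stand-in for the needle's orientation estimate."""
    d = points3d[-1] - points3d[0]
    return d / np.linalg.norm(d)
```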
https://arxiv.org/abs/2404.05151