Generating human motions from textual descriptions has gained growing research interest due to its wide range of applications. However, only a few works consider human-scene interactions together with text conditions, which is crucial for visual and physical realism. This paper focuses on the task of generating human motions in 3D indoor scenes given text descriptions of the human-scene interactions. This task presents challenges due to the multi-modal nature of text, scene, and motion, as well as the need for spatial reasoning. To address these challenges, we propose a new approach that decomposes the complex problem into two more manageable sub-problems: (1) language grounding of the target object and (2) object-centric motion generation. For language grounding of the target object, we leverage the power of large language models. For motion generation, we design an object-centric scene representation for the generative model to focus on the target object, thereby reducing the scene complexity and facilitating the modeling of the relationship between human motions and the object. Experiments demonstrate that our approach produces better motion quality than baselines and validate our design choices.
https://arxiv.org/abs/2405.07784
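To make the object-centric idea above concrete, here is a minimal sketch (not the paper's actual code) of re-expressing a scene point cloud and a human joint trajectory in a coordinate frame attached to the grounded target object; the function name, array shapes, and yaw-only rotation are assumptions for illustration.

```python
import numpy as np

def object_centric_frame(scene_points, motion_joints, obj_center, obj_yaw):
    """Re-express scene points and human joints in a frame centered on the
    grounded target object (illustrative; the paper's exact representation
    may differ).

    scene_points:  (N, 3) scene point cloud
    motion_joints: (T, J, 3) human joint positions over T frames
    obj_center:    (3,) target object center in world coordinates
    obj_yaw:       object heading angle in radians (rotation about z)
    """
    c, s = np.cos(-obj_yaw), np.sin(-obj_yaw)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    scene_local = (scene_points - obj_center) @ R.T
    joints_local = (motion_joints - obj_center) @ R.T
    return scene_local, joints_local

# Toy usage with random data
scene = np.random.rand(1024, 3)
motion = np.random.rand(60, 22, 3)
scene_l, motion_l = object_centric_frame(scene, motion, np.array([1.0, 2.0, 0.0]), np.pi / 4)
```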
Autonomous driving systems require a quick and robust perception of the nearby environment to carry out their routines effectively. With the aim of avoiding collisions and driving safely, autonomous driving systems rely heavily on object detection. However, 2D object detections alone are insufficient; more information, such as relative velocity and distance, is required for safer planning. Monocular 3D object detectors try to solve this problem by directly predicting 3D bounding boxes and object velocities given a camera image. Recent research estimates time-to-contact in a per-pixel manner and suggests that it is a more effective measure than velocity and depth combined. However, per-pixel time-to-contact requires object detection to serve its purpose effectively and hence increases overall computational requirements, as two different models need to run. To address this issue, we propose per-object time-to-contact estimation by extending object detection models to additionally predict the time-to-contact attribute for each object. We compare our proposed approach with existing time-to-contact methods and provide benchmarking results on well-known datasets. Our proposed approach achieves higher precision compared to prior art while using a single image.
https://arxiv.org/abs/2405.07698
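A rough sketch of the per-object idea described above: a detection head extended with one extra regression output per object for time-to-contact. The feature dimension, box parameterization, and loss below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DetectionHeadWithTTC(nn.Module):
    """Toy per-object detection head extended with a time-to-contact output."""

    def __init__(self, feat_dim=256, num_classes=8):
        super().__init__()
        self.cls_head = nn.Linear(feat_dim, num_classes)
        self.box_head = nn.Linear(feat_dim, 4)   # 2D box (cx, cy, w, h)
        self.ttc_head = nn.Linear(feat_dim, 1)   # per-object time-to-contact

    def forward(self, object_features):          # (num_objects, feat_dim)
        return {
            "logits": self.cls_head(object_features),
            "boxes": self.box_head(object_features),
            # softplus keeps the predicted time-to-contact positive
            "ttc": nn.functional.softplus(self.ttc_head(object_features)).squeeze(-1),
        }

head = DetectionHeadWithTTC()
out = head(torch.randn(5, 256))
# supervise the extra output with ground-truth time-to-contact values
loss_ttc = nn.functional.l1_loss(out["ttc"], torch.rand(5) * 10)
```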
Monocular 3D object detection aims for precise 3D localization and identification of objects from a single-view image. Despite its recent progress, it often struggles while handling pervasive object occlusions that tend to complicate and degrade the prediction of object dimensions, depths, and orientations. We design MonoMAE, a monocular 3D detector inspired by Masked Autoencoders that addresses the object occlusion issue by masking and reconstructing objects in the feature space. MonoMAE consists of two novel designs. The first is depth-aware masking that selectively masks certain parts of non-occluded object queries in the feature space for simulating occluded object queries for network training. It masks non-occluded object queries by balancing the masked and preserved query portions adaptively according to the depth information. The second is lightweight query completion that works with the depth-aware masking to learn to reconstruct and complete the masked object queries. With the proposed object occlusion and completion, MonoMAE learns enriched 3D representations that achieve superior monocular 3D detection performance qualitatively and quantitatively for both occluded and non-occluded objects. Additionally, MonoMAE learns generalizable representations that can work well in new domains.
https://arxiv.org/abs/2405.07696
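The depth-aware masking idea can be sketched as follows; the specific mapping from depth to mask ratio and the channel-wise masking are assumptions for illustration, since the abstract does not spell them out.

```python
import torch

def depth_aware_mask(queries, depths, base_ratio=0.3, max_ratio=0.7, max_depth=60.0):
    """Mask a depth-dependent fraction of each non-occluded object query to
    simulate occlusion during training. The depth-to-ratio mapping used here
    (deeper objects are masked more heavily) is an assumption.

    queries: (num_queries, dim) object query features
    depths:  (num_queries,) object depths in meters
    """
    num_queries, dim = queries.shape
    ratios = base_ratio + (max_ratio - base_ratio) * (depths / max_depth).clamp(0, 1)
    masked = queries.clone()
    for i in range(num_queries):
        k = int(ratios[i].item() * dim)
        idx = torch.randperm(dim)[:k]   # randomly chosen channels to drop
        masked[i, idx] = 0.0
    return masked

masked_q = depth_aware_mask(torch.randn(10, 256), torch.rand(10) * 60)
```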
Deep Neural Networks (DNNs) require large amounts of annotated training data for good performance. Often this data is generated using manual labeling (error-prone and time-consuming) or rendering (requiring geometry and material information). Both approaches make it difficult or uneconomic to apply them to many small-scale applications. A fast and straightforward approach to acquiring the necessary training data would allow the adoption of deep learning for even the smallest of applications. Chroma keying is the process of replacing a color (usually blue or green) with another background. Instead of chroma keying, we propose luminance keying for fast and straightforward training image acquisition. We deploy a black screen with high light absorption (99.99%) to record roughly 1-minute-long videos of our target objects, circumventing typical problems of chroma keying, such as color bleeding or color overlap between background color and object color. Next, we automatically mask our objects using simple brightness thresholding, removing the need for manual annotation. Finally, we automatically place the objects on random backgrounds and train a 2D object detector. We extensively evaluate performance on the widely used YCB-V object set and compare favourably to other conventional techniques such as rendering, without needing 3D meshes, materials, or any other information about our target objects, and in a fraction of the time needed by other approaches. Our work demonstrates highly accurate training data acquisition that allows training of state-of-the-art networks to start within minutes.
https://arxiv.org/abs/2405.07653
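The core of the pipeline above, brightness thresholding followed by compositing onto random backgrounds, can be sketched in a few lines of NumPy; the threshold value and image sizes are placeholders.

```python
import numpy as np

def luminance_key(frame, threshold=30):
    """Segment the object from a black-screen recording by simple brightness
    thresholding (the threshold value is an assumption; in practice it is tuned).

    frame: (H, W, 3) uint8 image recorded in front of the light-absorbing screen
    returns a boolean foreground mask of shape (H, W)
    """
    luminance = frame.astype(np.float32).mean(axis=2)
    return luminance > threshold

def composite(frame, mask, background):
    """Paste the masked object onto a background image of the same size."""
    out = background.copy()
    out[mask] = frame[mask]
    return out

# Toy usage with synthetic images
frame = np.zeros((480, 640, 3), dtype=np.uint8)
frame[100:200, 200:300] = 200                      # bright "object" on dark screen
background = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
train_img = composite(frame, luminance_key(frame), background)
```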
Deep neural network (DNN) models are widely used for object detection in automated driving systems (ADS). Yet, such models are prone to errors which can have serious safety implications. Introspection and self-assessment models that aim to detect such errors are therefore of paramount importance for the safe deployment of ADS. Current research on this topic has focused on techniques to monitor the integrity of the perception mechanism in ADS. Existing introspection models in the literature, however, largely concentrate on detecting perception errors by assigning equal importance to all parts of the input data frame to the perception module. This generic approach overlooks the varying safety significance of different objects within a scene, which obscures the recognition of safety-critical errors and poses challenges in assessing the reliability of perception in specific, crucial instances. Motivated by this shortcoming of the state of the art, this paper proposes a novel method that integrates the analysis of the raw activation patterns of the underlying DNNs employed by the perception module with spatial filtering techniques. This novel approach enhances the accuracy of runtime introspection of DNN-based 3D object detections by selectively focusing on an area of interest in the data, thereby contributing to the safety and efficacy of ADS perception self-assessment processes.
https://arxiv.org/abs/2405.07600
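One way to picture the combination of raw activation patterns with spatial filtering is to pool activations only inside a region of interest before feeding them to an introspection classifier, as in the hedged sketch below; the ROI definition and summary statistics are assumptions, not the paper's exact features.

```python
import numpy as np

def filtered_activation_summary(activation_map, roi):
    """Summarize raw backbone activations inside a spatial region of interest only,
    instead of over the whole frame (illustrative sketch; the paper's actual
    introspection features and classifier are not specified in the abstract).

    activation_map: (C, H, W) raw activations from the perception DNN
    roi: (x0, y0, x1, y1) region of interest in feature-map coordinates
    """
    x0, y0, x1, y1 = roi
    patch = activation_map[:, y0:y1, x0:x1]
    # per-channel mean and max inside the ROI as a compact introspection feature
    return np.concatenate([patch.mean(axis=(1, 2)), patch.max(axis=(1, 2))])

features = filtered_activation_summary(np.random.rand(64, 50, 50), (10, 10, 30, 30))
# `features` would then be passed to a runtime error detector (e.g., a small MLP).
```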
In recent years, there has been an increasing demand for customizable 3D virtual spaces. Due to the significant human effort required to create these virtual spaces, there is a need for efficiency in virtual space creation. While existing studies have proposed methods for automatically generating layouts such as floor plans and furniture arrangements, these methods only generate text indicating the layout structure based on user instructions, without utilizing the information obtained during the generation process. In this study, we propose an agent-driven layout generation system using the GPT-4V multimodal large language model and validate its effectiveness. Specifically, the language model manipulates agents to sequentially place objects in the virtual space, thus generating layouts that reflect user instructions. Experimental results confirm that our proposed method can generate virtual spaces reflecting user instructions with a high success rate. Additionally, we identify the elements contributing to the improvement in behavior generation performance through an ablation study.
https://arxiv.org/abs/2405.08037
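A minimal sketch of such an agent-driven placement loop is shown below; `propose_next_placement` is a hypothetical stand-in for the GPT-4V query (which would also receive a rendering of the current scene), so the loop is runnable without the actual API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Placement:
    name: str
    x: float
    y: float
    rotation_deg: float

@dataclass
class VirtualScene:
    placements: List[Placement] = field(default_factory=list)

    def add(self, p: Placement):
        self.placements.append(p)

def propose_next_placement(instruction: str, scene: VirtualScene) -> Placement:
    """Hypothetical stand-in for querying a multimodal LLM with the user
    instruction plus the current scene state; returns a dummy placement here."""
    return Placement(name=f"object_{len(scene.placements)}", x=0.5, y=0.5, rotation_deg=0.0)

def generate_layout(instruction: str, num_objects: int = 5) -> VirtualScene:
    scene = VirtualScene()
    for _ in range(num_objects):
        # the agent places one object at a time, conditioning on what is already placed
        scene.add(propose_next_placement(instruction, scene))
    return scene

layout = generate_layout("a cozy living room with a sofa facing a TV")
```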
Robust 3D object detection remains a pivotal concern in the domain of autonomous field robotics. Despite notable enhancements in detection accuracy across standard datasets, real-world urban environments, characterized by their unstructured and dynamic nature, frequently precipitate an elevated incidence of false positives, thereby undermining the reliability of existing detection paradigms. In this context, our study introduces an advanced post-processing algorithm that modulates detection thresholds dynamically relative to the distance from the ego object. Traditional perception systems typically utilize a uniform threshold, which often leads to decreased efficacy in detecting distant objects. In contrast, our proposed methodology employs a Neural Network with a self-adaptive thresholding mechanism that significantly attenuates false negatives while concurrently diminishing false positives, particularly in complex urban settings. Empirical results substantiate that our algorithm not only augments the performance of 3D object detection models in diverse urban and adverse weather scenarios but also establishes a new benchmark for adaptive thresholding techniques in field robotics.
https://arxiv.org/abs/2405.07479
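A toy version of distance-dependent thresholding might look like the following, with a small network mapping distance to a confidence threshold; the network size and inputs are assumptions, since the abstract does not describe them.

```python
import torch
import torch.nn as nn

class DistanceAdaptiveThreshold(nn.Module):
    """Tiny network mapping an object's distance from the ego frame to a
    confidence threshold (illustrative; not the paper's actual architecture)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

    def forward(self, distances):          # (N, 1) distances in meters
        return self.net(distances)         # (N, 1) thresholds in [0, 1]

def filter_detections(scores, distances, thresh_net):
    """Keep detections whose confidence exceeds the distance-dependent threshold."""
    thresholds = thresh_net(distances.unsqueeze(-1)).squeeze(-1)
    return scores > thresholds

net = DistanceAdaptiveThreshold()
keep = filter_detections(torch.rand(20), torch.rand(20) * 80, net)
```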
The increasing prominence of e-commerce has underscored the importance of Virtual Try-On (VTON). However, previous studies predominantly focus on the 2D realm and rely heavily on extensive data for training. Research on 3D VTON primarily centers on garment-body shape compatibility, a topic extensively covered in 2D VTON. Thanks to advances in 3D scene editing, a 2D diffusion model has now been adapted for 3D editing via multi-viewpoint editing. In this work, we propose GaussianVTON, an innovative 3D VTON pipeline integrating Gaussian Splatting (GS) editing with 2D VTON. To facilitate a seamless transition from 2D to 3D VTON, we propose, for the first time, the use of only images as editing prompts for 3D editing. To further address issues, e.g., face blurring, garment inaccuracy, and degraded viewpoint quality during editing, we devise a three-stage refinement strategy to gradually mitigate potential issues. Furthermore, we introduce a new editing strategy termed Edit Recall Reconstruction (ERR) to tackle the limitations of previous editing strategies in leading to complex geometric changes. Our comprehensive experiments demonstrate the superiority of GaussianVTON, offering a novel perspective on 3D VTON while also establishing a novel starting point for image-prompting 3D scene editing.
https://arxiv.org/abs/2405.07472
In the high-stakes world of baseball, every nuance of a pitcher's mechanics holds the key to maximizing performance and minimizing runs. Traditional analysis methods often rely on pre-recorded offline numerical data, hindering their application in the dynamic environment of live games. Broadcast video analysis, while seemingly ideal, faces significant challenges due to factors like motion blur and low resolution. To address these challenges, we introduce PitcherNet, an end-to-end automated system that analyzes pitcher kinematics directly from live broadcast video, thereby extracting valuable pitch statistics including velocity, release point, pitch position, and release extension. This system leverages three key components: (1) player tracking and identification by decoupling actions from player kinematics; (2) distribution- and depth-aware 3D human modeling; and (3) kinematic-driven pitch statistics. Experimental validation demonstrates that PitcherNet achieves robust analysis results, with 96.82% accuracy in pitcher tracklet identification, a 1.8 mm reduction in joint position error, and superior analytics compared to baseline methods. By enabling performance-critical kinematic analysis from broadcast video, PitcherNet paves the way for the future of baseball analytics by optimizing pitching strategies, preventing injuries, and unlocking a deeper understanding of pitcher mechanics, forever transforming the game.
https://arxiv.org/abs/2405.07407
Animatable clothing transfer, aiming at dressing and animating garments across characters, is a challenging problem. Most human avatar works entangle the representations of the human body and clothing together, which leads to difficulties for virtual try-on across identities. What's worse, the entangled representations usually fail to exactly track the sliding motion of garments. To overcome these limitations, we present Layered Gaussian Avatars (LayGA), a new representation that formulates body and clothing as two separate layers for photorealistic animatable clothing transfer from multi-view videos. Our representation is built upon the Gaussian map-based avatar for its excellent representation power of garment details. However, the Gaussian map produces unstructured 3D Gaussians distributed around the actual surface. The absence of a smooth explicit surface raises challenges in accurate garment tracking and collision handling between body and garments. Therefore, we propose two-stage training involving single-layer reconstruction and multi-layer fitting. In the single-layer reconstruction stage, we propose a series of geometric constraints to reconstruct smooth surfaces and simultaneously obtain the segmentation between body and clothing. Next, in the multi-layer fitting stage, we train two separate models to represent body and clothing and utilize the reconstructed clothing geometries as 3D supervision for more accurate garment tracking. Furthermore, we propose geometry and rendering layers for both high-quality geometric reconstruction and high-fidelity rendering. Overall, the proposed LayGA realizes photorealistic animations and virtual try-on, and outperforms other baseline methods. Our project page is this https URL.
https://arxiv.org/abs/2405.07319
In NeRF-aided editing tasks, object movement presents difficulties in supervision generation due to the introduction of variability in object positions. Moreover, the removal operations of certain scene objects often lead to empty regions, presenting challenges for NeRF models in inpainting them effectively. We propose an implicit ray transformation strategy, allowing for direct manipulation of the 3D object's pose by operating on the neural-point in NeRF rays. To address the challenge of inpainting potential empty regions, we present a plug-and-play inpainting module, dubbed differentiable neural-point resampling (DNR), which interpolates those regions in 3D space at the original ray locations within the implicit space, thereby facilitating object removal & scene inpainting tasks. Importantly, employing DNR effectively narrows the gap between ground truth and predicted implicit features, potentially increasing the mutual information (MI) of the features across rays. Then, we leverage DNR and ray transformation to construct a point-based editable NeRF pipeline PR^2T-NeRF. Results primarily evaluated on 3D object removal & inpainting tasks indicate that our pipeline achieves state-of-the-art performance. In addition, our pipeline supports high-quality rendering visualization for diverse editing operations without necessitating extra supervision.
https://arxiv.org/abs/2405.07306
An effective pre-training framework with universal 3D representations is highly desirable for perceiving large-scale dynamic scenes. However, establishing such an ideal framework that is both task-generic and label-efficient poses a challenge in unifying the representation of the same primitive across diverse scenes. Current contrastive 3D pre-training methods typically follow a frame-level consistency, which focuses on the 2D-3D relationships in each detached image. Such limited consistency greatly hampers the promising path toward a universal pre-training framework: (1) the cross-scene semantic self-conflict, i.e., the intense collision between primitive segments of the same semantics from different scenes; (2) the lack of a globally unified bond that pushes cross-scene semantic consistency into 3D representation learning. To address the above challenges, we propose a CSC framework that puts scene-level semantic consistency at its heart, bridging the connection of similar semantic segments across various scenes. To achieve this goal, we combine the coherent semantic cues provided by the vision foundation model and the knowledge-rich cross-scene prototypes derived from complementary multi-modality information. These allow us to train a universal 3D pre-training model that facilitates various downstream tasks with less fine-tuning effort. Empirically, we achieve consistent improvements over SOTA pre-training approaches in semantic segmentation (+1.4% mIoU), object detection (+1.0% mAP), and panoptic segmentation (+3.0% PQ) using their task-specific 3D network on nuScenes. Code is released at this https URL, hoping to inspire future research.
https://arxiv.org/abs/2405.07201
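To illustrate how cross-scene prototypes could enter the training objective, here is a minimal InfoNCE-style consistency loss that pulls point features toward shared semantic prototypes; CSC's actual losses and prototype construction are richer than this sketch.

```python
import torch
import torch.nn.functional as F

def prototype_consistency_loss(point_feats, pseudo_labels, prototypes, temperature=0.1):
    """Pull each 3D point feature toward its cross-scene semantic prototype
    (a minimal sketch; the foundation-model cues and multi-modal prototype
    derivation described in the abstract are not modeled here).

    point_feats:   (N, D) per-point features from the 3D backbone
    pseudo_labels: (N,)  semantic indices, e.g., derived from foundation-model cues
    prototypes:    (K, D) one feature prototype per semantic class, shared across scenes
    """
    point_feats = F.normalize(point_feats, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)
    logits = point_feats @ prototypes.t() / temperature   # (N, K) similarities
    return F.cross_entropy(logits, pseudo_labels)

loss = prototype_consistency_loss(torch.randn(4096, 64),
                                  torch.randint(0, 10, (4096,)),
                                  torch.randn(10, 64))
```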
Guided by the hologram technology of the infamous Star Wars franchise, I present an application that creates real-time holographic overlays using LiDAR-augmented 3D reconstruction. Prior attempts involve SLAM or NeRFs, which either require highly calibrated scenes, incur steep computation costs, or fail to render dynamic scenes. I propose 3 high-fidelity reconstruction tools that can run on a portable device, such as an iPhone 14 Pro, and allow for metric-accurate facial reconstructions. My systems enable interactive and immersive holographic experiences that can be used for a wide range of applications, including augmented reality, telepresence, and entertainment.
https://arxiv.org/abs/2405.07178
With the rapid advancement of technologies such as virtual reality, augmented reality, and gesture control, users expect interactions with computer interfaces to be more natural and intuitive. Existing visual algorithms often struggle to accomplish advanced human-computer interaction tasks, necessitating accurate and reliable absolute spatial prediction methods. Moreover, dealing with complex scenes and occlusions in monocular images poses entirely new challenges. This study proposes a network model that performs parallel processing of root-relative grids and root recovery tasks. The model enables the recovery of 3D hand meshes in camera space from monocular RGB images. To facilitate end-to-end training, we utilize an implicit learning approach for 2D heatmaps, enhancing the compatibility of 2D cues across different subtasks. We incorporate the Inception concept into a spectral graph convolutional network to explore the root-relative mesh, and integrate it with the locally detailed and globally attentive method designed for root recovery. This approach improves the model's predictive performance in complex environments and self-occluded scenes. Through evaluation on the large-scale hand dataset FreiHAND, we demonstrate that our proposed model is comparable to state-of-the-art models. This study contributes to the advancement of techniques for accurate and reliable absolute spatial prediction in various human-computer interaction applications.
https://arxiv.org/abs/2405.07167
The reliance on accurate camera poses is a significant barrier to the widespread deployment of Neural Radiance Fields (NeRF) models for 3D reconstruction and SLAM tasks. Existing methods introduce monocular depth priors to jointly optimize the camera poses and NeRF, but they fail to fully exploit the depth priors and neglect the impact of their inherent noise. In this paper, we propose Truncated Depth NeRF (TD-NeRF), a novel approach that enables training NeRF from unknown camera poses by jointly optimizing the learnable parameters of the radiance field and the camera poses. Our approach explicitly utilizes monocular depth priors through three key advancements: 1) we propose a novel depth-based ray sampling strategy based on the truncated normal distribution, which improves the convergence speed and accuracy of pose estimation; 2) to circumvent local minima and refine depth geometry, we introduce a coarse-to-fine training strategy that progressively improves the depth precision; 3) we propose a more robust inter-frame point constraint that enhances robustness against depth noise during training. Experimental results on three datasets demonstrate that TD-NeRF achieves superior performance in the joint optimization of camera pose and NeRF, surpassing prior works, and generates more accurate depth geometry. The implementation of our method has been released at this https URL.
https://arxiv.org/abs/2405.07027
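The depth-based ray sampling strategy built on the truncated normal distribution can be sketched directly; the standard deviation, near/far bounds, and sample count below are illustrative choices.

```python
import numpy as np
from scipy.stats import truncnorm

def depth_guided_samples(depth_prior, num_samples=64, sigma=0.2, near=0.1, far=10.0):
    """Sample distances along a ray from a normal distribution centered at the
    monocular depth prior and truncated to [near, far] (a sketch of depth-guided
    sampling; TD-NeRF's exact parameterization and scheduling may differ).

    depth_prior: scalar monocular depth estimate for this ray (meters)
    returns sorted sample distances of shape (num_samples,)
    """
    a = (near - depth_prior) / sigma   # standardized truncation bounds
    b = (far - depth_prior) / sigma
    t = truncnorm.rvs(a, b, loc=depth_prior, scale=sigma, size=num_samples)
    return np.sort(t)

samples = depth_guided_samples(depth_prior=2.5)
```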
We address the problem of robot-guided assembly tasks by using a learning-based approach to identify contact model parameters for known and novel parts. First, a Variational Autoencoder (VAE) is used to extract geometric features of assembly parts. Then, we combine the extracted features with physical knowledge to derive the parameters of a contact model using our newly proposed neural network structure. The measured force from real experiments is used to supervise the predicted forces, thus avoiding the need for ground-truth model parameters. Although the network is trained only on a small set of assembly parts, it achieves good contact model estimation for unknown objects. Our main contribution is the network structure that allows us to estimate contact models of assembly tasks depending on the geometry of the part to be joined. Where current system identification processes have to record new data for a new assembly process, our method only requires the 3D model of the assembly part. We evaluate our method by estimating contact models for robot-guided assembly tasks of pin connectors as well as electronic plugs and compare the results with real experiments.
https://arxiv.org/abs/2405.06991
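A hedged sketch of the force-supervised setup: a small network maps VAE geometry features to contact parameters, which are plugged into a stand-in spring-damper force model and supervised with measured forces; the actual contact model and network structure in the paper are not specified in the abstract.

```python
import torch
import torch.nn as nn

class ContactParamNet(nn.Module):
    """Map VAE geometry features of an assembly part to contact-model parameters.
    The spring-damper force model below is an illustrative stand-in only."""

    def __init__(self, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, z):
        params = nn.functional.softplus(self.net(z))   # positive stiffness and damping
        return params[..., 0], params[..., 1]

def predicted_force(stiffness, damping, penetration, velocity):
    return stiffness * penetration + damping * velocity

# Supervise predicted forces with measured forces, avoiding ground-truth parameters
model = ContactParamNet()
z = torch.randn(16, 32)                     # VAE features of 16 training samples
k, d = model(z)
f_pred = predicted_force(k, d, torch.rand(16) * 1e-3, torch.rand(16) * 1e-2)
loss = nn.functional.mse_loss(f_pred, torch.rand(16))   # measured forces would go here
```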
Accurately reconstructing a 3D scene including explicit geometry information is both attractive and challenging. Geometry reconstruction can benefit from incorporating differentiable appearance models, such as Neural Radiance Fields and 3D Gaussian Splatting (3DGS). In this work, we propose a learnable scene model that incorporates 3DGS with an explicit geometry representation, namely a mesh. Our model learns the mesh and appearance in an end-to-end manner, where we bind 3D Gaussians to the mesh faces and perform differentiable rendering of 3DGS to obtain photometric supervision. The model creates an effective information pathway to supervise the learning of the scene, including the mesh. Experimental results demonstrate that the learned scene model not only achieves state-of-the-art rendering quality but also supports manipulation using the explicit mesh. In addition, our model has a unique advantage in adapting to scene updates, thanks to the end-to-end learning of both mesh and appearance.
https://arxiv.org/abs/2405.06945
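Binding Gaussians to mesh faces can be pictured as sampling barycentric coordinates on each triangle, so the Gaussian centers follow the mesh during optimization; the sketch below omits scales, rotations, and opacities and is not the paper's implementation.

```python
import numpy as np

def bind_gaussians_to_faces(vertices, faces, per_face=1, rng=None):
    """Attach Gaussian centers to mesh faces via barycentric coordinates so that
    the Gaussians move with the mesh as it is optimized (minimal sketch).

    vertices: (V, 3) mesh vertices
    faces:    (F, 3) vertex indices per triangle
    """
    rng = rng or np.random.default_rng(0)
    w = rng.dirichlet(np.ones(3), size=(len(faces), per_face))   # barycentric weights
    tri = vertices[faces]                                        # (F, 3, 3)
    centers = np.einsum("fpk,fkd->fpd", w, tri)                  # (F, per_face, 3)
    return centers.reshape(-1, 3), w

# Toy mesh: a single triangle
verts = np.array([[0.0, 0, 0], [1.0, 0, 0], [0.0, 1, 0]])
faces = np.array([[0, 1, 2]])
centers, bary = bind_gaussians_to_faces(verts, faces, per_face=4)
```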
Recognizing human actions from point cloud sequence has attracted tremendous attention from both academia and industry due to its wide applications. However, most previous studies on point cloud action recognition typically require complex networks to extract intra-frame spatial features and inter-frame temporal features, resulting in an excessive number of redundant computations. This leads to high latency, rendering them impractical for real-world applications. To address this problem, we propose a Plane-Fit Redundancy Encoding point cloud sequence network named PRENet. The primary concept of our approach involves the utilization of plane fitting to mitigate spatial redundancy within the sequence, concurrently encoding the temporal redundancy of the entire sequence to minimize redundant computations. Specifically, our network comprises two principal modules: a Plane-Fit Embedding module and a Spatio-Temporal Consistency Encoding module. The Plane-Fit Embedding module capitalizes on the observation that successive point cloud frames exhibit unique geometric features in physical space, allowing for the reuse of spatially encoded data for temporal stream encoding. The Spatio-Temporal Consistency Encoding module amalgamates the temporal structure of the temporally redundant part with its corresponding spatial arrangement, thereby enhancing recognition accuracy. We have done numerous experiments to verify the effectiveness of our network. The experimental results demonstrate that our method achieves almost identical recognition accuracy while being nearly four times faster than other state-of-the-art methods.
https://arxiv.org/abs/2405.06929
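The plane-fitting ingredient can be sketched with a least-squares fit via SVD and a simple redundancy mask; PRENet's embedding module works on learned features rather than this hard geometric test, so treat this purely as an illustration of the idea.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane fit via SVD: returns (unit normal, centroid)."""
    centroid = points.mean(axis=0)
    _, _, vh = np.linalg.svd(points - centroid)
    normal = vh[-1]                      # direction of smallest variance
    return normal, centroid

def planar_redundancy_mask(points, tol=0.02):
    """Mark points lying within `tol` of the fitted plane as spatially redundant
    (a rough sketch of the plane-fit idea only)."""
    normal, centroid = fit_plane(points)
    dist = np.abs((points - centroid) @ normal)
    return dist < tol

pts = np.random.rand(2048, 3) * np.array([1.0, 1.0, 0.01])   # near-planar toy cloud
redundant = planar_redundancy_mask(pts)
```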
It is now possible to estimate 3D human pose from monocular images with off-the-shelf 3D pose estimators. However, many practical applications require fine-grained absolute pose information for which multi-view cues and camera calibration are necessary. Such multi-view recordings are laborious because they require manual calibration, and are expensive when using dedicated hardware. Our goal is full automation, which includes temporal synchronization, as well as intrinsic and extrinsic camera calibration. This is done by using persons in the scene as the calibration objects. Existing methods either address only synchronization or calibration, assume one of the two as input, or have significant limitations. A common limitation is that they only consider single persons, which eases correspondence finding. We attain this generality by partitioning the high-dimensional time and calibration space into a cascade of subspaces and introduce tailored algorithms to optimize each efficiently and robustly. The outcome is an easy-to-use, flexible, and robust motion capture toolbox that we release to enable scientific applications, which we demonstrate on diverse multi-view benchmarks. Project website: this https URL.
https://arxiv.org/abs/2405.06845
This paper proposes a novel task named "3D part grouping". Suppose there is a mixed set containing scattered parts from various shapes. This task requires algorithms to find every possible combination among all the parts. To address this challenge, we propose the Gradient Field-based Auto-Regressive Sampling framework (G-FARS), tailored specifically for the 3D part grouping task. In our framework, we design a gradient-field-based selection graph neural network (GNN) to learn the gradients of a log conditional probability density in terms of part selection, where the condition is the given mixed part set. This innovative approach, implemented through the gradient-field-based selection GNN, effectively captures complex relationships among all the parts in the input. Upon completion of the training process, our framework becomes capable of autonomously grouping 3D parts by iteratively selecting them from the mixed part set, leveraging the knowledge acquired by the trained gradient-field-based selection GNN. Our code is available at: this https URL.
https://arxiv.org/abs/2405.06828
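The autoregressive selection procedure can be sketched as an iterative loop in which a scorer, standing in for the gradient-field-based selection GNN, rates each remaining part conditioned on the partial group selected so far; all module names and shapes below are illustrative.

```python
import torch
import torch.nn as nn

class PartScorer(nn.Module):
    """Stand-in for the gradient-field-based selection GNN: scores how well each
    remaining part fits the partial group assembled so far (illustrative only)."""

    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim * 2, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, part_feats, group_feat):
        ctx = group_feat.expand(part_feats.shape[0], -1)
        return self.net(torch.cat([part_feats, ctx], dim=-1)).squeeze(-1)

def autoregressive_grouping(part_feats, scorer, group_size=4):
    """Iteratively pick the highest-scoring part from the mixed set, conditioning
    each step on the mean feature of the parts already selected."""
    remaining = list(range(part_feats.shape[0]))
    selected = []
    for _ in range(group_size):
        group_feat = (part_feats[selected].mean(dim=0, keepdim=True)
                      if selected else torch.zeros(1, part_feats.shape[1]))
        scores = scorer(part_feats[remaining], group_feat)
        pick = remaining[int(scores.argmax())]
        selected.append(pick)
        remaining.remove(pick)
    return selected

group = autoregressive_grouping(torch.randn(20, 64), PartScorer())
```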