Decomposing an object's appearance into representations of its materials and the surrounding illumination is difficult, even when the object's 3D shape is known beforehand. This problem is ill-conditioned because diffuse materials severely blur incoming light, and is ill-posed because diffuse materials under high-frequency lighting can be indistinguishable from shiny materials under low-frequency lighting. We show that it is possible to recover precise materials and illumination -- even from diffuse objects -- by exploiting unintended shadows, like the ones cast onto an object by the photographer who moves around it. These shadows are a nuisance in most previous inverse rendering pipelines, but here we exploit them as signals that improve conditioning and help resolve material-lighting ambiguities. We present a method based on differentiable Monte Carlo ray tracing that uses images of an object to jointly recover its spatially-varying materials, the surrounding illumination environment, and the shapes of the unseen light occluders who inadvertently cast shadows upon it.
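A minimal, hypothetical sketch of the kind of joint optimization this implies, assuming a differentiable renderer and toy stand-ins for the data (the function and tensor shapes below are illustrative, not the authors' code): spatially-varying materials, the environment map, and the occluder parameters are all fit to a photometric loss by backpropagating through the renderer.

import torch

def toy_render(materials, envmap, occluder):
    # Stand-in for a differentiable Monte Carlo renderer that shades the known
    # geometry under the estimated illumination and casts the occluder's shadow.
    shading = envmap.mean() * materials              # diffuse shading under the envmap
    shadow = torch.sigmoid(-occluder).mean()         # occluder parameters darken the image
    return shading * shadow

materials = torch.rand(64, 3, requires_grad=True)    # spatially-varying reflectance
envmap = torch.rand(16, 32, 3, requires_grad=True)   # surrounding illumination
occluder = torch.zeros(10, requires_grad=True)       # shape of the unseen light occluder

target = torch.rand(64, 3)                           # one observed photo (toy stand-in)
opt = torch.optim.Adam([materials, envmap, occluder], lr=1e-2)
for _ in range(200):
    loss = (toy_render(materials, envmap, occluder) - target).abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()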
https://arxiv.org/abs/2305.16321
We propose Neural 3D Articulation Prior (NAP), the first 3D deep generative model to synthesize 3D articulated object models. Despite the extensive research on generating 3D objects, compositions, or scenes, there remains a lack of focus on capturing the distribution of articulated objects, a common object category for human and robot interaction. To generate articulated objects, we first design a novel articulation tree/graph parameterization and then apply a diffusion-denoising probabilistic model over this representation where articulated objects can be generated via denoising from random complete graphs. In order to capture both the geometry and the motion structure whose distribution will affect each other, we design a graph-attention denoising network for learning the reverse diffusion process. We propose a novel distance that adapts widely used 3D generation metrics to our novel task to evaluate generation quality, and experiments demonstrate our high performance in articulated object generation. We also demonstrate several conditioned generation applications, including Part2Motion, PartNet-Imagination, Motion2Part, and GAPart2Object.
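One plausible reading of such an articulation tree/graph parameterization, sketched as a data structure (field names are illustrative, not the paper's exact layout):

from dataclasses import dataclass
import numpy as np

@dataclass
class PartNode:
    shape_latent: np.ndarray   # latent code for the part's geometry
    bbox: np.ndarray           # part placement: center (3) + extents (3)

@dataclass
class JointEdge:
    exists: float              # soft edge indicator; denoising a complete graph prunes absent joints
    joint_type: float          # e.g. a continuous code between revolute and prismatic
    axis: np.ndarray           # joint axis direction (3) and pivot point (3)
    limits: np.ndarray         # (lower, upper) motion range

# An articulated object with K parts is a complete graph: K PartNode entries plus
# K*(K-1)/2 candidate JointEdge entries; the diffusion model denoises node and edge
# attributes jointly so that geometry and motion structure remain consistent.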
https://arxiv.org/abs/2305.16315
The analysis and use of egocentric videos for robotic tasks is made challenging by occlusion due to the hand and the visual mismatch between the human hand and a robot end-effector. In this sense, the human hand presents a nuisance. However, often hands also provide a valuable signal, e.g. the hand pose may suggest what kind of object is being held. In this work, we propose to extract a factored representation of the scene that separates the agent (human hand) and the environment. This alleviates both occlusion and mismatch while preserving the signal, thereby easing the design of models for downstream robotics tasks. At the heart of this factorization is our proposed Video Inpainting via Diffusion Model (VIDM) that leverages both a prior on real-world images (through a large-scale pre-trained diffusion model) and the appearance of the object in earlier frames of the video (through attention). Our experiments demonstrate the effectiveness of VIDM at improving inpainting quality on egocentric videos and the power of our factored representation for numerous tasks: object detection, 3D reconstruction of manipulated objects, and learning of reward functions, policies, and affordances from videos.
https://arxiv.org/abs/2305.16301
Controllable scene synthesis aims to create interactive environments for various industrial use cases. Scene graphs provide a highly suitable interface to facilitate these applications by abstracting the scene context in a compact manner. Existing methods, reliant on retrieval from extensive databases or pre-trained shape embeddings, often overlook scene-object and object-object relationships, leading to inconsistent results due to their limited generation capacity. To address this issue, we present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes, which are semantically realistic and conform to commonsense. Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes via latent diffusion, capturing global scene-object and local inter-object relationships while preserving shape diversity. The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model. Due to lacking a scene graph dataset offering high-quality object-level meshes with relations, we also construct SG-FRONT, enriching the off-the-shelf indoor dataset 3D-FRONT with additional scene graph labels. Extensive experiments are conducted on SG-FRONT where CommonScenes shows clear advantages over other methods regarding generation consistency, quality, and diversity. Codes and the dataset will be released upon acceptance.
Controllable scene synthesis aims to create interactive environments for various industrial use cases. Scene graphs provide a highly suitable interface to facilitate these applications by abstracting the scene context in a compact manner. Existing methods, reliant on retrieval from extensive databases or pre-trained shape embeddings, often overlook scene-object and object-object relationships, leading to inconsistent results due to their limited generation capacity. To address this issue, we present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes, which are semantically realistic and conform to commonsense. Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes via latent diffusion, capturing global scene-object and local inter-object relationships while preserving shape diversity. The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model. Because no existing scene graph dataset offers high-quality object-level meshes with relations, we also construct SG-FRONT, enriching the off-the-shelf indoor dataset 3D-FRONT with additional scene graph labels. Extensive experiments are conducted on SG-FRONT, where CommonScenes shows clear advantages over other methods regarding generation consistency, quality, and diversity. Code and the dataset will be released upon acceptance.
https://arxiv.org/abs/2305.16283
This paper investigates the potential of enhancing Neural Radiance Fields (NeRF) with semantics to expand their applications. Although NeRF has been proven useful in real-world applications like VR and digital creation, the lack of semantics hinders interaction with objects in complex scenes. We propose to imitate the backbone feature of off-the-shelf perception models to achieve zero-shot semantic segmentation with NeRF. Our framework reformulates the segmentation process by directly rendering semantic features and only applying the decoder from perception models. This eliminates the need for expensive backbones and benefits 3D consistency. Furthermore, we can project the learned semantics onto extracted mesh surfaces for real-time interaction. With the state-of-the-art Segment Anything Model (SAM), our framework accelerates segmentation by 16 times with comparable mask quality. The experimental results demonstrate the efficacy and computational advantages of our approach. Project page: \url{https://me.kiui.moe/san/}.
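A minimal sketch of the rendering side of this idea, assuming standard NeRF-style volume rendering applied to per-sample feature vectors (shapes and names are illustrative); the rendered feature map would then be passed to the frozen decoder of the perception model instead of running its backbone on each view.

import torch

def volume_render_features(feats, density, deltas):
    # feats: (rays, samples, C) per-sample features trained to imitate the 2D backbone
    # density: (rays, samples) volume density; deltas: (rays, samples) sample spacing
    alpha = 1.0 - torch.exp(-density * deltas)
    ones = torch.ones_like(alpha[:, :1])
    trans = torch.cumprod(torch.cat([ones, 1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * trans                                  # standard NeRF compositing weights
    return (weights.unsqueeze(-1) * feats).sum(dim=1)        # (rays, C) rendered feature map

feats = torch.rand(1024, 64, 256)        # 1024 rays, 64 samples per ray, 256-d features
density = torch.rand(1024, 64)
deltas = torch.full((1024, 64), 0.01)
feature_map = volume_render_features(feats, density, deltas)  # fed to the frozen mask decoder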
https://arxiv.org/abs/2305.16233
Score distillation sampling (SDS) has shown great promise in text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models, but suffers from over-saturation, over-smoothing, and low-diversity problems. In this work, we propose to model the 3D parameter as a random variable instead of a constant as in SDS and present variational score distillation (VSD), a principled particle-based variational framework to explain and address the aforementioned issues in text-to-3D generation. We show that SDS is a special case of VSD and leads to poor samples with both small and large CFG weights. In comparison, VSD works well with various CFG weights as ancestral sampling from diffusion models and simultaneously improves the diversity and sample quality with a common CFG weight (i.e., $7.5$). We further present various improvements in the design space for text-to-3D such as distillation time schedule and density initialization, which are orthogonal to the distillation algorithm yet not well explored. Our overall approach, dubbed ProlificDreamer, can generate high rendering resolution (i.e., $512\times512$) and high-fidelity NeRF with rich structure and complex effects (e.g., smoke and drops). Further, initialized from NeRF, meshes fine-tuned by VSD are meticulously detailed and photo-realistic. Project page: this https URL
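In schematic form (hedged; the notation follows common write-ups of SDS/VSD rather than quoting the paper), with $g(\theta, c)$ the rendering at camera $c$ and $x_t$ its noised version, the two gradients differ only in what is subtracted from the pretrained score:
\[
\nabla_\theta \mathcal{L}_{\mathrm{SDS}} \approx \mathbb{E}_{t,\epsilon,c}\Big[ w(t)\,\big(\epsilon_{\mathrm{pretrain}}(x_t; y, t) - \epsilon\big)\,\frac{\partial g(\theta, c)}{\partial \theta} \Big],
\qquad
\nabla_\theta \mathcal{L}_{\mathrm{VSD}} \approx \mathbb{E}_{t,\epsilon,c}\Big[ w(t)\,\big(\epsilon_{\mathrm{pretrain}}(x_t; y, t) - \epsilon_\phi(x_t; y, t, c)\big)\,\frac{\partial g(\theta, c)}{\partial \theta} \Big],
\]
where $\epsilon_\phi$ is an auxiliary network (e.g. a LoRA-tuned copy of the diffusion model) estimating the score of images rendered from the current 3D particles; replacing $\epsilon_\phi$ with the injected noise $\epsilon$ recovers SDS as the single-particle special case.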
https://arxiv.org/abs/2305.16213
Along with the recent development of deep neural networks, appearance-based gaze estimation has succeeded considerably when training and testing within the same domain. Compared to the within-domain task, the variance of different domains makes the cross-domain performance drop severely, preventing gaze estimation deployment in real-world applications. Among all the factors, ranges of head pose and gaze are believed to play a significant role in the final performance of gaze estimation, while collecting large ranges of data is expensive. This work proposes an effective model training pipeline consisting of a training data synthesis and a gaze estimation model for unsupervised domain adaptation. The proposed data synthesis leverages the single-image 3D reconstruction to expand the range of the head poses from the source domain without requiring a 3D facial shape dataset. To bridge the inevitable gap between synthetic and real images, we further propose an unsupervised domain adaptation method suitable for synthetic full-face data. We propose a disentangling autoencoder network to separate gaze-related features and introduce background augmentation consistency loss to utilize the characteristics of the synthetic source domain. Through comprehensive experiments, we show that the model only using monocular-reconstructed synthetic training data can perform comparably to real data with a large label range. Our proposed domain adaptation approach further improves the performance on multiple target domains. The code and data will be available at \url{this https URL}.
https://arxiv.org/abs/2305.16140
Semantic occupancy prediction aims to infer dense geometry and semantics of surroundings for an autonomous agent to operate safely in the 3D environment. Existing occupancy prediction methods are almost entirely trained on human-annotated volumetric data. Although of high quality, the generation of such 3D annotations is laborious and costly, restricting them to a few specific object categories in the training dataset. To address this limitation, this paper proposes Open Vocabulary Occupancy (OVO), a novel approach that allows semantic occupancy prediction of arbitrary classes but without the need for 3D annotations during training. Keys to our approach are (1) knowledge distillation from a pre-trained 2D open-vocabulary segmentation model to the 3D occupancy network, and (2) pixel-voxel filtering for high-quality training data generation. The resulting framework is simple, compact, and compatible with most state-of-the-art semantic occupancy prediction models. On NYUv2 and SemanticKITTI datasets, OVO achieves competitive performance compared to supervised semantic occupancy prediction approaches. Furthermore, we conduct extensive analyses and ablation studies to offer insights into the design of the proposed framework.
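A hypothetical sketch of the distillation step with a simple pixel-voxel filter (shapes, the cosine objective, and the threshold are illustrative assumptions, not the paper's exact recipe):

import torch
import torch.nn.functional as F

def distillation_loss(voxel_feat, pixel_feat, confidence, tau=0.5):
    # voxel_feat: (N, C) embeddings from the 3D occupancy network for visible voxels
    # pixel_feat: (N, C) features of the pixels those voxels project to, taken from the
    #             pre-trained 2D open-vocabulary segmentation model
    # confidence: (N,) 2D confidence used to filter noisy voxel-pixel pairs
    keep = confidence > tau
    v = F.normalize(voxel_feat[keep], dim=-1)
    p = F.normalize(pixel_feat[keep], dim=-1)
    return (1.0 - (v * p).sum(dim=-1)).mean()        # align 3D embeddings with 2D features

loss = distillation_loss(torch.rand(500, 512), torch.rand(500, 512), torch.rand(500))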
https://arxiv.org/abs/2305.16133
Obtaining accurate 3D object poses is vital for numerous computer vision applications, such as 3D reconstruction and scene understanding. However, annotating real-world objects is time-consuming and challenging. While synthetically generated training data is a viable alternative, the domain shift between real and synthetic data is a significant challenge. In this work, we aim to narrow the performance gap between models trained on synthetic data and few real images and fully supervised models trained on large-scale data. We achieve this by approaching the problem from two perspectives: 1) We introduce SyntheticP3D, a new synthetic dataset for object pose estimation generated from CAD models and enhanced with a novel algorithm. 2) We propose a novel approach (CC3D) for training neural mesh models that perform pose estimation via inverse rendering. In particular, we exploit the spatial relationships between features on the mesh surface and a contrastive learning scheme to guide the domain adaptation process. Combined, these two approaches enable our models to perform competitively with state-of-the-art models using only 10% of the respective real training images, while outperforming the SOTA model by 10.4% with a threshold of pi/18 using only 50% of the real training data. Our trained model further demonstrates robust generalization to out-of-distribution scenarios despite being trained with minimal real data.
https://arxiv.org/abs/2305.16124
Generative modeling has experienced substantial progress in recent years, particularly in text-to-image and text-to-video synthesis. However, the medical field has not yet fully exploited the potential of large-scale foundational models for synthetic data generation. In this paper, we introduce GenerateCT, the first method for text-conditional computed tomography (CT) generation, addressing the limitations in 3D medical imaging research and making our entire framework open-source. GenerateCT consists of a pre-trained large language model, a transformer-based text-conditional 3D chest CT generation architecture, and a text-conditional spatial super-resolution diffusion model. We also propose CT-ViT, which efficiently compresses CT volumes while preserving auto-regressiveness in-depth, enabling the generation of 3D CT volumes with variable numbers of axial slices. Our experiments demonstrate that GenerateCT can produce realistic, high-resolution, and high-fidelity 3D chest CT volumes consistent with medical language text prompts. We further investigate the potential of GenerateCT by training a model using generated CT volumes for multi-abnormality classification of chest CT volumes. Our contributions provide a valuable foundation for future research in text-conditional 3D medical image generation and have the potential to accelerate advancements in medical imaging research. Our code, pre-trained models, and generated data are available at this https URL.
https://arxiv.org/abs/2305.16037
Large pre-trained models have had a significant impact on computer vision by enabling multi-modal learning, where the CLIP model has achieved impressive results in image classification, object detection, and semantic segmentation. However, the model's performance on 3D point cloud processing tasks is limited due to the domain gap between depth maps from 3D projection and training images of CLIP. This paper proposes DiffCLIP, a new pre-training framework that incorporates stable diffusion with ControlNet to minimize the domain gap in the visual branch. Additionally, a style-prompt generation module is introduced for few-shot tasks in the textual branch. Extensive experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong abilities for 3D understanding. By using stable diffusion and style-prompt generation, DiffCLIP achieves an accuracy of 43.2\% for zero-shot classification on OBJ\_BG of ScanObjectNN, which is state-of-the-art performance, and an accuracy of 80.6\% for zero-shot classification on ModelNet10, which is comparable to state-of-the-art performance.
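The zero-shot step presumably reduces to standard CLIP scoring once the point cloud has been projected to a depth map and translated toward the image domain; a minimal stand-in sketch (the prompt wording and the random tensor replacing the diffusion-translated rendering are illustrative):

import torch
import clip

model, preprocess = clip.load("ViT-B/32", device="cpu")
classes = ["chair", "table", "bed", "sofa", "desk"]
text = clip.tokenize([f"a photo of a {c}" for c in classes])

image = torch.rand(1, 3, 224, 224)       # stand-in for the ControlNet/diffusion-translated depth map
with torch.no_grad():
    img = model.encode_image(image)
    txt = model.encode_text(text)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    logits = img @ txt.T
pred = classes[logits.argmax(dim=-1).item()]          # zero-shot class prediction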
https://arxiv.org/abs/2305.15957
Detecting 3D mask attacks to a face recognition system is challenging. Although genuine faces and 3D face masks show significantly different remote photoplethysmography (rPPG) signals, rPPG-based face anti-spoofing methods often suffer from performance degradation due to unstable face alignment in the video sequence and weak rPPG signals. To enhance the rPPG signal in a motion-robust way, a landmark-anchored face stitching method is proposed to align the faces robustly and precisely at the pixel-wise level by using both SIFT keypoints and facial landmarks. To better encode the rPPG signal, a weighted spatial-temporal representation is proposed, which emphasizes the face regions with rich blood vessels. In addition, characteristics of rPPG signals in different color spaces are jointly utilized. To improve the generalization capability, a lightweight EfficientNet with a Gated Recurrent Unit (GRU) is designed to extract both spatial and temporal features from the rPPG spatial-temporal representation for classification. The proposed method is compared with the state-of-the-art methods on five benchmark datasets under both intra-dataset and cross-dataset evaluations. The proposed method shows a significant and consistent improvement in performance over other state-of-the-art rPPG-based methods for face spoofing detection.
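A hypothetical sketch of a weighted spatial-temporal representation of the kind described: per-region mean colors over time, re-weighted toward vessel-rich facial regions (the region grid and weights are illustrative assumptions):

import numpy as np

def weighted_st_map(frames, weights, grid=5):
    # frames: (T, H, W, 3) pixel-aligned face crops after landmark-anchored stitching
    # weights: (grid, grid) emphasis on regions with rich blood vessels (e.g. forehead, cheeks)
    T, H, W, _ = frames.shape
    stmap = np.zeros((grid * grid, T, 3))
    for gy in range(grid):
        for gx in range(grid):
            roi = frames[:, gy * H // grid:(gy + 1) * H // grid, gx * W // grid:(gx + 1) * W // grid]
            stmap[gy * grid + gx] = roi.mean(axis=(1, 2)) * weights[gy, gx]
    return stmap        # (regions, T, 3); analogous maps in other color spaces can be stacked

stmap = weighted_st_map(np.random.rand(150, 100, 100, 3), np.ones((5, 5)))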
https://arxiv.org/abs/2305.15940
Radars and cameras belong to the most frequently used sensors for advanced driver assistance systems and automated driving research. However, there has been surprisingly little research on radar-camera fusion with neural networks. One of the reasons is a lack of large-scale automotive datasets with radar and unmasked camera data, with the exception of the nuScenes dataset. Another reason is the difficulty of effectively fusing the sparse radar point cloud on the bird's eye view (BEV) plane with the dense images on the perspective plane. The recent trend of camera-based 3D object detection using BEV features has enabled a new type of fusion, which is better suited for radars. In this work, we present RC-BEVFusion, a modular radar-camera fusion network on the BEV plane. We propose BEVFeatureNet, a novel radar encoder branch, and show that it can be incorporated into several state-of-the-art camera-based architectures. We show significant performance gains of up to 28% increase in the nuScenes detection score, which is an important step in radar-camera fusion research. Without tuning our model for the nuScenes benchmark, we achieve the best result among all published methods in the radar-camera fusion category.
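A hypothetical sketch of encoding radar points onto the BEV plane so they can be fused with camera BEV features (the grid size, feature channels, and point attributes are illustrative, not the BEVFeatureNet design):

import torch
import torch.nn as nn

class ToyRadarBEVEncoder(nn.Module):
    def __init__(self, in_ch=5, out_ch=64, grid=128, extent=50.0):
        super().__init__()
        self.grid, self.extent = grid, extent
        self.mlp = nn.Sequential(nn.Linear(in_ch, out_ch), nn.ReLU(), nn.Linear(out_ch, out_ch))

    def forward(self, points):                        # points: (N, 5) = x, y, z, RCS, Doppler
        feats = self.mlp(points)                      # (N, C) per-point features
        ix = ((points[:, 0] + self.extent) / (2 * self.extent) * self.grid).long().clamp(0, self.grid - 1)
        iy = ((points[:, 1] + self.extent) / (2 * self.extent) * self.grid).long().clamp(0, self.grid - 1)
        bev = torch.zeros(self.grid * self.grid, feats.shape[1])
        bev.index_add_(0, iy * self.grid + ix, feats)  # scatter-add point features into BEV cells
        return bev.t().reshape(-1, self.grid, self.grid)

radar_bev = ToyRadarBEVEncoder()(torch.rand(300, 5) * 20.0)   # (64, 128, 128)
camera_bev = torch.rand(64, 128, 128)                         # from any BEV-based camera detector
fused = torch.cat([radar_bev, camera_bev], dim=0)             # shared BEV plane -> detection head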
https://arxiv.org/abs/2305.15883
Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently-introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available at this https URL.
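A minimal sketch of divided space-time attention over skeleton joint tokens, in the spirit of what is described (the layer layout, dimensions, and the absence of norms/MLPs are simplifications, not MoT's exact architecture):

import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                              # x: (B, T, J, C) skeleton joint tokens
        B, T, J, C = x.shape
        s = x.reshape(B * T, J, C)                     # attend across joints within each frame
        s = s + self.spatial(s, s, s)[0]
        t = s.reshape(B, T, J, C).permute(0, 2, 1, 3).reshape(B * J, T, C)
        t = t + self.temporal(t, t, t)[0]              # attend across time for each joint
        return t.reshape(B, J, T, C).permute(0, 2, 1, 3)

out = DividedSpaceTimeBlock()(torch.rand(2, 30, 21, 128))   # 2 clips, 30 frames, 21 joints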
https://arxiv.org/abs/2305.15842
Pseudo-labels are widely employed in weakly supervised 3D segmentation tasks where only sparse ground-truth labels are available for learning. Existing methods often rely on empirical label selection strategies, such as confidence thresholding, to generate beneficial pseudo-labels for model training. This approach may, however, hinder the comprehensive exploitation of unlabeled data points. We hypothesize that this selective usage arises from the noise in pseudo-labels generated on unlabeled data. The noise in pseudo-labels may result in significant discrepancies between pseudo-labels and model predictions, thus confusing and affecting the model training greatly. To address this issue, we propose a novel learning strategy to regularize the generated pseudo-labels and effectively narrow the gaps between pseudo-labels and model predictions. More specifically, our method introduces an Entropy Regularization loss and a Distribution Alignment loss for weakly supervised learning in 3D segmentation tasks, resulting in an ERDA learning strategy. Interestingly, by using KL distance to formulate the distribution alignment loss, it reduces to a deceptively simple cross-entropy-based loss which optimizes both the pseudo-label generation network and the 3D segmentation network simultaneously. Despite the simplicity, our method promisingly improves the performance. We validate the effectiveness through extensive experiments on various baselines and large-scale datasets. Results show that ERDA effectively enables the effective usage of all unlabeled data points for learning and achieves state-of-the-art performance under different settings. Remarkably, our method can outperform fully-supervised baselines using only 1% of true annotations. Code and model will be made publicly available.
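The reduction mentioned here is easy to verify in the simplest setting (the paper's exact weighting may differ). With pseudo-label distribution $p$ and model prediction $q$ over classes,
\[
\mathcal{L}_{\mathrm{ER}} + \mathcal{L}_{\mathrm{DA}} = H(p) + \mathrm{KL}(p \,\|\, q)
= -\sum_c p_c \log p_c + \sum_c p_c \log \frac{p_c}{q_c}
= -\sum_c p_c \log q_c ,
\]
so the entropy term cancels the $\log p$ part of the KL divergence, and the combined objective collapses to a cross-entropy between the pseudo-label distribution and the prediction, through which both the pseudo-label generation network and the segmentation network receive gradients.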
https://arxiv.org/abs/2305.15832
Generating and editing a 3D scene guided by natural language poses a challenge, primarily due to the complexity of specifying the positional relations and volumetric changes within the 3D space. Recent advancements in Large Language Models (LLMs) have demonstrated impressive reasoning, conversational, and zero-shot generation abilities across various domains. Surprisingly, these models also show great potential in realizing and interpreting the 3D space. In light of this, we propose a novel language-guided interactive 3D generation system, dubbed LI3D, that integrates LLMs as a 3D layout interpreter into the off-the-shelf layout-to-3D generative models, allowing users to flexibly and interactively generate visual content. Specifically, we design a versatile layout structure based on the bounding boxes and semantics to prompt the LLMs to model the spatial generation and reasoning from language. Our system also incorporates LLaVA, a large language and vision assistant, to provide generative feedback from the visual aspect for improving the visual quality of generated content. We validate the effectiveness of LI3D, primarily in 3D generation and editing through multi-round interactions, which can be flexibly extended to 2D generation and editing. Various experiments demonstrate the potential benefits of incorporating LLMs in generative AI for applications such as the metaverse. Moreover, we benchmark the layout reasoning performance of LLMs with neural visual artist tasks, revealing their emergent ability in the spatial layout domain.
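As an illustration of what such a bounding-box-plus-semantics layout could look like when emitted by the LLM interpreter and handed to a layout-to-3D generator (the keys, units, and relation vocabulary are hypothetical, not LI3D's exact schema):

layout = {
    "scene": "a cozy reading corner",
    "objects": [
        {"label": "armchair",   "center": [0.0, 0.0, 0.4],  "size": [0.9, 0.9, 0.8], "yaw": 0.0},
        {"label": "floor lamp", "center": [0.7, 0.2, 0.9],  "size": [0.3, 0.3, 1.8], "yaw": 0.0},
        {"label": "side table", "center": [0.6, -0.4, 0.3], "size": [0.5, 0.5, 0.6], "yaw": 0.3},
    ],
    "relations": [
        ["floor lamp", "behind", "armchair"],
        ["side table", "next to", "armchair"],
    ],
}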
https://arxiv.org/abs/2305.15808
A successful tactic that is followed by the scientific community for advancing AI is to treat games as problems, which has been proven to lead to various breakthroughs. We adapt this strategy in order to study Rocket League, a widely popular but rather under-explored 3D multiplayer video game with a distinct physics engine and complex dynamics that pose a significant challenge in developing efficient and high-performance game-playing agents. In this paper, we present Lucy-SKG, a Reinforcement Learning-based model that learned how to play Rocket League in a sample-efficient manner, outperforming by a notable margin the two highest-ranking bots in this game, namely Necto (2022 bot champion) and its successor Nexto, thus becoming a state-of-the-art agent. Our contributions include: a) the development of a reward analysis and visualization library, b) novel parameterizable reward shape functions that capture the utility of complex reward types via our proposed Kinesthetic Reward Combination (KRC) technique, and c) design of auxiliary neural architectures for training on reward prediction and state representation tasks in an on-policy fashion for enhanced efficiency in learning speed and performance. By performing thorough ablation studies for each component of Lucy-SKG, we showed their independent effectiveness in overall performance. In doing so, we demonstrate the prospects and challenges of using sample-efficient Reinforcement Learning techniques for controlling complex dynamical systems under competitive team-based multiplayer conditions.
https://arxiv.org/abs/2305.15801
This paper addresses the problem of 3D referring expression comprehension (REC) in autonomous driving scenario, which aims to ground a natural language to the targeted region in LiDAR point clouds. Previous approaches for REC usually focus on the 2D or 3D-indoor domain, which is not suitable for accurately predicting the location of the queried 3D region in an autonomous driving scene. In addition, the upper-bound limitation and the heavy computation cost motivate us to explore a better solution. In this work, we propose a new multi-modal visual grounding task, termed LiDAR Grounding. Then we devise a Multi-modal Single Shot Grounding (MSSG) approach with an effective token fusion strategy. It jointly learns the LiDAR-based object detector with the language features and predicts the targeted region directly from the detector without any post-processing. Moreover, the image feature can be flexibly integrated into our approach to provide rich texture and color information. The cross-modal learning enforces the detector to concentrate on important regions in the point cloud by considering the informative language expressions, thus leading to much better accuracy and efficiency. Extensive experiments on the Talk2Car dataset demonstrate the effectiveness of the proposed methods. Our work offers a deeper insight into the LiDAR-based grounding task and we expect it presents a promising direction for the autonomous driving community.
https://arxiv.org/abs/2305.15765
In recent years, 3D models have been utilized in many applications, such as auto-driver, 3D reconstruction, VR, and AR. However, the scarcity of 3D model data does not meet its practical demands. Thus, generating high-quality 3D models efficiently from textual descriptions is a promising but challenging way to solve this problem. In this paper, inspired by the ability of human beings to complement visual information details from ambiguous descriptions based on their own experience, we propose a novel text-3D generation model (T2TD), which introduces the related shapes or textual information as the prior knowledge to improve the performance of the 3D generation model. In this process, we first introduce the text-3D knowledge graph to save the relationship between 3D models and textual semantic information, which can provide the related shapes to guide the target 3D model generation. Second, we integrate an effective causal inference model to select useful feature information from these related shapes, which removes the unrelated shape information and only maintains feature information that is strongly relevant to the textual description. Meanwhile, to effectively integrate multi-modal prior knowledge into textual information, we adopt a novel multi-layer transformer structure to progressively fuse related shape and textual information, which can effectively compensate for the lack of structural information in the text and enhance the final performance of the 3D generation model. The final experimental results demonstrate that our approach significantly improves 3D model generation quality and outperforms the SOTA methods on the text2shape datasets.
In recent years, 3D models have been utilized in many applications, such as autonomous driving, 3D reconstruction, VR, and AR. However, the available 3D model data falls far short of practical demand. Thus, efficiently generating high-quality 3D models from textual descriptions is a promising but challenging way to address this problem. In this paper, inspired by the human ability to fill in visual details from ambiguous descriptions based on prior experience, we propose a novel text-3D generation model (T2TD), which introduces related shapes and textual information as prior knowledge to improve the performance of the 3D generation model. Specifically, we first introduce a text-3D knowledge graph that stores the relationships between 3D models and textual semantic information and provides related shapes to guide the generation of the target 3D model. Second, we integrate an effective causal inference model to select useful features from these related shapes, discarding unrelated shape information and retaining only features strongly relevant to the textual description. Meanwhile, to effectively integrate multi-modal prior knowledge into the textual information, we adopt a novel multi-layer transformer structure that progressively fuses related shape and textual information, compensating for the lack of structural information in the text and enhancing the final performance of the 3D generation model. Experimental results demonstrate that our approach significantly improves 3D model generation quality and outperforms the SOTA methods on the Text2Shape datasets.
https://arxiv.org/abs/2305.15753
When virtual agents interact with humans, gestures are crucial to delivering their intentions with speech. Previous multimodal co-speech gesture generation models required encoded features of all modalities to generate gestures. If some input modalities are removed or contain noise, the model may not generate the gestures properly. To acquire robust and generalized encodings, we propose a novel framework with a multimodal pre-trained encoder for co-speech gesture generation. In the proposed method, the multi-head-attention-based encoder is trained with self-supervised learning to contain the information on each modality. Moreover, we collect full-body gestures that consist of 3D joint rotations to improve visualization and apply gestures to the extensible body model. Through the series of experiments and human evaluation, the proposed method renders realistic co-speech gestures not only when all input modalities are given but also when the input modalities are missing or noisy.
https://arxiv.org/abs/2305.15740