In recent years, extensive research has focused on 3D natural scene generation, but the domain of 3D city generation has not received as much exploration. This is due to the greater challenges posed by 3D city generation, mainly because humans are more sensitive to structural distortions in urban environments. Additionally, generating 3D cities is more complex than 3D natural scenes since buildings, as objects of the same class, exhibit a wider range of appearances compared to the relatively consistent appearance of objects like trees in natural scenes. To address these challenges, we propose CityDreamer, a compositional generative model designed specifically for unbounded 3D cities, which separates the generation of building instances from other background objects, such as roads, green lands, and water areas, into distinct modules. Furthermore, we construct two datasets, OSM and GoogleEarth, containing a vast amount of real-world city imagery to enhance the realism of the generated 3D cities both in their layouts and appearances. Through extensive experiments, CityDreamer has proven its superiority over state-of-the-art methods in generating a wide range of lifelike 3D cities.
https://arxiv.org/abs/2309.00610
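To make the compositional idea above concrete, here is a minimal sketch of rendering a city layout by dispatching background classes and individual building instances to separate generators. The layout grid, class IDs, and random-colour generator stubs are hypothetical stand-ins, not CityDreamer's actual modules or data.

```python
# Toy illustration of compositional city generation: background classes and
# building instances are rendered by separate modules and then composited.
# All generators here are random-colour stand-ins, not the CityDreamer networks.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
H, W = 64, 64
ROAD, GREEN, WATER, BUILDING = 1, 2, 3, 4

# Hypothetical city layout: a coarse semantic map such as one derived from OSM data.
layout = np.full((H, W), GREEN, dtype=np.int32)
layout[:, 30:34] = ROAD                      # a vertical road
layout[48:, :] = WATER                       # a waterfront
for _ in range(6):                           # scatter rectangular building footprints
    y, x = rng.integers(0, 40), rng.integers(0, 24)
    layout[y:y + 8, x:x + 6] = BUILDING

def render_background(layout, class_id, base_rgb):
    """Stand-in for a background generator: flat colour plus noise inside the mask."""
    mask = (layout == class_id)[..., None]
    tex = base_rgb + 0.05 * rng.standard_normal((H, W, 3))
    return mask * np.clip(tex, 0, 1)

def render_buildings(layout):
    """Stand-in for a building-instance generator: each footprint gets its own style."""
    instances, n = ndimage.label(layout == BUILDING)
    out = np.zeros((H, W, 3))
    for i in range(1, n + 1):
        out[instances == i] = rng.uniform(0.3, 0.9, size=3)   # per-instance appearance
    return out

image = (render_background(layout, ROAD, np.array([0.4, 0.4, 0.4]))
         + render_background(layout, GREEN, np.array([0.2, 0.6, 0.2]))
         + render_background(layout, WATER, np.array([0.2, 0.4, 0.8]))
         + render_buildings(layout))
print(image.shape, image.min(), image.max())
```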
Thanks to the rapid development of diffusion models, unprecedented progress has been witnessed in image synthesis. Prior works mostly rely on pre-trained linguistic models, but a text is often too abstract to properly specify all the spatial properties of an image, e.g., the layout configuration of a scene, leading to sub-optimal results in complex scene generation. In this paper, we achieve accurate complex scene generation by proposing a semantically controllable Layout-AWare diffusion model, termed LAW-Diffusion. Distinct from the previous Layout-to-Image generation (L2I) methods that only explore category-aware relationships, LAW-Diffusion introduces a spatial dependency parser to encode the location-aware semantic coherence across objects as a layout embedding and produces a scene with perceptually harmonious object styles and contextual relations. To be specific, we delicately instantiate each object's regional semantics as an object region map and leverage a location-aware cross-object attention module to capture the spatial dependencies among those disentangled representations. We further propose an adaptive guidance schedule for our layout guidance to mitigate the trade-off between the regional semantic alignment and the texture fidelity of generated objects. Moreover, LAW-Diffusion allows for instance reconfiguration while maintaining the other regions in a synthesized image by introducing a layout-aware latent grafting mechanism to recompose its local regional semantics. To better verify the plausibility of generated scenes, we propose a new evaluation metric for the L2I task, dubbed Scene Relation Score (SRS), to measure how well the images preserve rational and harmonious relations among contextual objects. Comprehensive experiments demonstrate that our LAW-Diffusion yields state-of-the-art generative performance, especially with coherent object relations.
https://arxiv.org/abs/2308.06713
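A rough sketch of the location-aware cross-object attention idea: per-object tokens attend to each other with a bias derived from their spatial layout. The centre-distance bias and dimensions below are illustrative assumptions, not LAW-Diffusion's exact spatial dependency parser.

```python
# Minimal sketch of "location-aware" cross-object attention: object tokens attend to
# each other with a bias derived from the layout (here, distance between box centres).
# The distance-based bias is an assumption for illustration, not the paper's formulation.
import torch

torch.manual_seed(0)
N, D = 5, 32                                  # number of objects, embedding dim
obj_tokens = torch.randn(N, D)                # per-object regional semantics
boxes = torch.rand(N, 4)                      # (x0, y0, x1, y1) in [0, 1]

centres = torch.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                       (boxes[:, 1] + boxes[:, 3]) / 2], dim=-1)   # (N, 2)
dist = torch.cdist(centres, centres)          # pairwise centre distances, (N, N)

Wq, Wk, Wv = (torch.nn.Linear(D, D) for _ in range(3))
q, k, v = Wq(obj_tokens), Wk(obj_tokens), Wv(obj_tokens)
logits = q @ k.T / D ** 0.5 - 2.0 * dist      # nearby objects attend more strongly
attn = torch.softmax(logits, dim=-1)
layout_embedding = attn @ v                   # location-aware object representations
print(layout_embedding.shape)                 # torch.Size([5, 32])
```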
An event-based camera outputs an event whenever a change in scene brightness of a preset magnitude is detected at a particular pixel location in the sensor plane. The resulting sparse and asynchronous output, coupled with the high dynamic range and temporal resolution of this novel camera, motivates the study of event-based cameras for navigation and landing applications. However, the lack of real-world and synthetic datasets to support this line of research has limited its consideration for onboard use. This paper presents a methodology and a software pipeline for generating event-based vision datasets from optimal landing trajectories during the approach of a target body. We construct sequences of photorealistic images of the lunar surface with the Planet and Asteroid Natural Scene Generation Utility at different viewpoints along a set of optimal descent trajectories obtained by varying the boundary conditions. The generated image sequences are then converted into event streams by means of an event-based camera emulator. We demonstrate that the pipeline can generate realistic event-based representations of surface features by constructing a dataset of 500 trajectories, complete with event streams and motion field ground truth data. We anticipate that novel event-based vision datasets can be generated using this pipeline to support various spacecraft pose reconstruction problems given events as input, and we hope that the proposed methodology will attract the attention of researchers working at the intersection of neuromorphic vision and guidance, navigation, and control.
https://arxiv.org/abs/2308.00394
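The image-to-event conversion step can be illustrated with the standard contrast-threshold event-camera model: a pixel fires whenever its log intensity drifts from a per-pixel reference by more than a threshold. This is a generic emulator sketch under an assumed threshold, not the specific emulator used in the paper.

```python
# Generic event-camera emulation: emit an event when the log intensity at a pixel
# changes by more than a contrast threshold C since the last event at that pixel.
import numpy as np

def images_to_events(frames, timestamps, C=0.2, eps=1e-6):
    """frames: (T, H, W) grayscale in [0, 1]; returns a list of (t, y, x, polarity)."""
    log_ref = np.log(frames[0] + eps)          # per-pixel reference log intensity
    events = []
    for t, frame in zip(timestamps[1:], frames[1:]):
        log_i = np.log(frame + eps)
        diff = log_i - log_ref
        for pol, mask in ((+1, diff >= C), (-1, diff <= -C)):
            ys, xs = np.nonzero(mask)
            events.extend((t, y, x, pol) for y, x in zip(ys, xs))
            # move the reference by an integer number of thresholds, as a real sensor would
            log_ref[mask] += pol * C * np.floor(np.abs(diff[mask]) / C)
    return events

# Tiny synthetic example: a brightening ramp triggers positive events.
frames = np.stack([np.full((4, 4), v) for v in (0.2, 0.3, 0.6)])
print(len(images_to_events(frames, timestamps=[0.0, 0.01, 0.02])))
```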
Extracting object-level representations for downstream reasoning tasks is an emerging area in AI. Learning object-centric representations in an unsupervised setting presents multiple challenges, a key one being binding an arbitrary number of object instances to a specialized object slot. Recent object-centric representation methods like Slot Attention utilize iterative attention to learn composable representations with dynamic inference-level binding but fail to achieve specialized slot-level binding. To address this, in this paper we propose Unsupervised Conditional Slot Attention using a novel Probabilistic Slot Dictionary (PSD). We define PSD with (i) abstract object-level property vectors as keys and (ii) parametric Gaussian distributions as the corresponding values. We demonstrate the benefits of the learnt object-level conditioning distributions in multiple downstream tasks, namely object discovery, compositional scene generation, and compositional visual reasoning. We show that our method provides scene composition capabilities and a significant boost in few-shot adaptability tasks of compositional visual reasoning, while performing similarly to or better than slot attention in object discovery tasks.
https://arxiv.org/abs/2307.09437
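A minimal sketch, under assumed shapes and dot-product addressing, of a probabilistic slot dictionary whose keys are abstract property vectors and whose values are Gaussian parameters used to draw conditional slot initializations.

```python
# Sketch of a Probabilistic Slot Dictionary: keys are abstract property vectors and
# values are Gaussian parameters; slots are initialized by attending over the keys
# and sampling from the mixed Gaussians (reparameterization trick).
# Dimensions and the dot-product addressing are illustrative assumptions.
import torch

torch.manual_seed(0)
K, D = 8, 64                                   # dictionary entries, slot dimension
keys = torch.nn.Parameter(torch.randn(K, D))
means = torch.nn.Parameter(torch.zeros(K, D))
logvars = torch.nn.Parameter(torch.zeros(K, D))

def init_slots(queries):
    """queries: (num_slots, D) summaries of the input; returns sampled slot inits."""
    attn = torch.softmax(queries @ keys.T / D ** 0.5, dim=-1)   # (num_slots, K)
    mu = attn @ means                                           # mix Gaussian means
    std = (0.5 * (attn @ logvars)).exp()                        # and (log-)variances
    return mu + std * torch.randn_like(std)                     # reparameterized sample

slots = init_slots(torch.randn(4, D))
print(slots.shape)        # torch.Size([4, 64]) -> fed to a Slot Attention module
```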
Simulation forms the backbone of modern self-driving development. Simulators help develop, test, and improve driving systems without putting humans, vehicles, or their environment at risk. However, simulators face a major challenge: They rely on realistic, scalable, yet interesting content. While recent advances in rendering and scene reconstruction make great strides in creating static scene assets, modeling their layout, dynamics, and behaviors remains challenging. In this work, we turn to language as a source of supervision for dynamic traffic scene generation. Our model, LCTGen, combines a large language model with a transformer-based decoder architecture that selects likely map locations from a dataset of maps, and produces an initial traffic distribution, as well as the dynamics of each vehicle. LCTGen outperforms prior work in both unconditional and conditional traffic scene generation in terms of realism and fidelity. Code and video will be available at this https URL.
https://arxiv.org/abs/2307.07947
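A toy sketch of the retrieve-then-decode structure described above: score candidate maps against a caption embedding, then emit structured agent states. The `AgentInit` fields and the cosine-similarity scoring are hypothetical placeholders, not LCTGen's actual representation.

```python
# Toy sketch of an LCTGen-style interface: pick a likely map for a described scenario,
# then emit structured agent states. The scoring and the AgentInit fields are
# hypothetical placeholders.
import numpy as np
from dataclasses import dataclass

@dataclass
class AgentInit:
    position: tuple      # (x, y) in map frame, metres
    heading: float       # radians
    speed: float         # m/s

def select_map(text_embedding, map_embeddings):
    """Cosine similarity between a caption embedding and per-map feature vectors."""
    sims = map_embeddings @ text_embedding
    sims /= (np.linalg.norm(map_embeddings, axis=1) * np.linalg.norm(text_embedding) + 1e-8)
    return int(np.argmax(sims))

rng = np.random.default_rng(0)
text_emb = rng.standard_normal(16)            # stand-in for an LLM-produced embedding
maps = rng.standard_normal((50, 16))          # stand-in for a dataset of map features
chosen = select_map(text_emb, maps)
agents = [AgentInit(position=(rng.uniform(0, 50), rng.uniform(-3, 3)),
                    heading=0.0, speed=rng.uniform(5, 15)) for _ in range(4)]
print(chosen, agents[0])
```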
Recent progress in 3D scene understanding enables scalable learning of representations across large datasets of diverse scenes. As a consequence, generalization to unseen scenes and objects, rendering novel views from just a single or a handful of input images, and controllable scene generation that supports editing are now possible. However, training jointly on a large number of scenes typically compromises rendering quality when compared to single-scene optimized models such as NeRFs. In this paper, we leverage recent progress in diffusion models to equip 3D scene representation learning models with the ability to render high-fidelity novel views, while retaining benefits such as object-level scene editing to a large degree. In particular, we propose DORSal, which adapts a video diffusion architecture for 3D scene generation conditioned on object-centric slot-based representations of scenes. On both complex synthetic multi-object scenes and on the real-world large-scale Street View dataset, we show that DORSal enables scalable neural rendering of 3D scenes with object-level editing and improves upon existing approaches.
https://arxiv.org/abs/2306.08068
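A minimal sketch of conditioning a denoiser on slot representations via cross-attention; the single attention layer and the shapes are illustrative assumptions, not DORSal's actual video architecture.

```python
# Minimal sketch of conditioning a diffusion backbone on object slots: spatial features
# of the denoiser cross-attend to slot vectors. Shapes and the single attention layer
# are illustrative only.
import torch

torch.manual_seed(0)
B, HW, D, S = 2, 256, 128, 8                 # batch, flattened pixels, dim, num slots
features = torch.randn(B, HW, D)             # intermediate denoiser features
slots = torch.randn(B, S, D)                 # object-centric scene representation

cross_attn = torch.nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
conditioned, _ = cross_attn(query=features, key=slots, value=slots)
features = features + conditioned            # residual injection of slot information
print(features.shape)                        # torch.Size([2, 256, 128])
```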
Text-to-image generative models have attracted rising attention for flexible image editing via user-specified descriptions. However, text descriptions alone are not enough to elaborate the details of subjects, often compromising the subjects' identity or requiring additional per-subject fine-tuning. We introduce a new framework called \textit{Paste, Inpaint and Harmonize via Denoising} (PhD), which leverages an exemplar image in addition to text descriptions to specify user intentions. In the pasting step, an off-the-shelf segmentation model is employed to identify a user-specified subject within an exemplar image which is subsequently inserted into a background image to serve as an initialization capturing both scene context and subject identity in one. To guarantee the visual coherence of the generated or edited image, we introduce an inpainting and harmonizing module to guide the pre-trained diffusion model to seamlessly blend the inserted subject into the scene naturally. As we keep the pre-trained diffusion model frozen, we preserve its strong image synthesis ability and text-driven ability, thus achieving high-quality results and flexible editing with diverse texts. In our experiments, we apply PhD to both subject-driven image editing tasks and explore text-driven scene generation given a reference subject. Both quantitative and qualitative comparisons with baseline methods demonstrate that our approach achieves state-of-the-art performance in both tasks. More qualitative results can be found at \url{this https URL}.
https://arxiv.org/abs/2306.07596
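A sketch of the paste step and of the border mask that would be handed to the frozen inpainting model; `harmonize_with_diffusion` is a placeholder stub, not a real API.

```python
# Sketch of the "paste" step: cut a subject out of an exemplar with its mask, paste it
# into a background, and build a border mask for the frozen diffusion model to
# inpaint/harmonize. `harmonize_with_diffusion` is a stub, not a real API.
import numpy as np

def paste_subject(background, exemplar, subject_mask, top_left):
    """All images are (H, W, 3) float arrays; subject_mask is (H, W) in {0, 1}."""
    out = background.copy()
    ys, xs = np.nonzero(subject_mask)
    oy, ox = top_left
    out[ys + oy, xs + ox] = exemplar[ys, xs]           # naive paste of subject pixels
    pasted_mask = np.zeros(background.shape[:2], dtype=bool)
    pasted_mask[ys + oy, xs + ox] = True
    return out, pasted_mask

def border_mask(mask, width=3):
    """Dilate-minus-erode band around the pasted region, to be re-synthesized."""
    from scipy.ndimage import binary_dilation, binary_erosion
    return binary_dilation(mask, iterations=width) & ~binary_erosion(mask, iterations=width)

def harmonize_with_diffusion(image, mask):
    return image            # placeholder for the frozen inpainting diffusion model

bg = np.zeros((64, 64, 3)); ex = np.ones((32, 32, 3))
m = np.zeros((32, 32)); m[8:24, 8:24] = 1
pasted, region = paste_subject(bg, ex, m, top_left=(10, 10))
result = harmonize_with_diffusion(pasted, border_mask(region))
print(result.shape)
```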
Slot attention has shown remarkable object-centric representation learning performance in computer vision tasks without requiring any supervision. Despite its object-centric binding ability brought by compositional modelling, as a deterministic module, slot attention lacks the ability to generate novel scenes. In this paper, we propose the Slot-VAE, a generative model that integrates slot attention with the hierarchical VAE framework for object-centric structured scene generation. For each image, the model simultaneously infers a global scene representation to capture high-level scene structure and object-centric slot representations to embed individual object components. During generation, slot representations are generated from the global scene representation to ensure coherent scene structures. Our extensive evaluation of the scene generation ability indicates that Slot-VAE outperforms slot representation-based generative baselines in terms of sample quality and scene structure accuracy.
https://arxiv.org/abs/2306.06997
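A minimal sketch, with illustrative linear decoders, of the hierarchical sampling path: a global scene latent is mapped to slot latents, each slot is decoded, and the results are composited with alpha masks.

```python
# Sketch of Slot-VAE-style generation: sample a global scene latent, map it to K slot
# latents, decode each slot into an RGB patch plus an alpha mask, and composite.
# The linear decoders and dimensions are illustrative stand-ins.
import torch

torch.manual_seed(0)
Z, K, D, H, W = 32, 4, 32, 16, 16

scene_to_slots = torch.nn.Linear(Z, K * D)                 # slot prior conditioned on scene
slot_decoder = torch.nn.Linear(D, H * W * 4)               # per-slot RGB (3) + alpha logit (1)

z_scene = torch.randn(1, Z)                                # global scene structure
slots = scene_to_slots(z_scene).view(K, D)                 # coherent slot latents
rgba = slot_decoder(slots).view(K, 4, H, W)
rgb, alpha = rgba[:, :3], torch.softmax(rgba[:, 3:], dim=0)  # masks compete across slots
image = (alpha * rgb).sum(dim=0)                           # (3, H, W) composited scene
print(image.shape)
```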
Capturing and labeling real-world 3D data is laborious and time-consuming, which makes it costly to train strong 3D models. To address this issue, previous works generate randomized 3D scenes and pre-train models on generated data. Although the pre-trained models gain promising performance boosts, previous works have two major shortcomings. First, they focus on only one downstream task (i.e., object detection). Second, a fair comparison of generated data is still lacking. In this work, we systematically compare data generation methods using a unified setup. To clarify the generalization of the pre-trained models, we evaluate their performance in multiple tasks (e.g., object detection and semantic segmentation) and with different pre-training methods (e.g., masked autoencoder and contrastive learning). Moreover, we propose a new method to generate 3D scenes with spherical harmonics. It surpasses the previous formula-driven method by a clear margin and achieves on-par results with methods using real-world scans and CAD models.
https://arxiv.org/abs/2306.04237
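A small example of the kind of formula-driven geometry such a method builds on: a unit sphere whose radius is perturbed by a random combination of spherical harmonics, sampled as a point cloud. Degrees, coefficients, and resolution are arbitrary choices, not the paper's recipe.

```python
# Formula-driven geometry from spherical harmonics: perturb a unit sphere's radius with
# a random combination of low-degree harmonics and sample it as a point cloud.
import numpy as np
from scipy.special import sph_harm

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 5000)       # azimuth
phi = np.arccos(rng.uniform(-1, 1, 5000))     # polar angle, uniform on the sphere

radius = np.ones_like(theta)
for n in range(1, 5):                          # degrees 1..4
    for m in range(-n, n + 1):
        coeff = 0.1 * rng.standard_normal()
        radius += coeff * np.real(sph_harm(m, n, theta, phi))

points = np.stack([radius * np.sin(phi) * np.cos(theta),
                   radius * np.sin(phi) * np.sin(theta),
                   radius * np.cos(phi)], axis=-1)
print(points.shape)                            # (5000, 3) synthetic surface samples
```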
In this paper we describe a learned method of traffic scene generation designed to simulate the output of the perception system of a self-driving car. In our "Scene Diffusion" system, inspired by latent diffusion, we use a novel combination of diffusion and object detection to directly create realistic and physically plausible arrangements of discrete bounding boxes for agents. We show that our scene generation model is able to adapt to different regions in the US, producing scenarios that capture the intricacies of each region.
https://arxiv.org/abs/2305.18452
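A tiny illustration of diffusing agent bounding-box parameters: the standard DDPM forward process applied to per-agent vectors, with the usual noise-prediction training target noted in a comment. The schedule and parameterization are generic, not the paper's.

```python
# Diffusion over agent boxes: each agent is a parameter vector (x, y, heading, length,
# width); the standard DDPM forward process noises them, and a denoiser would be trained
# to predict the added noise. Schedule and dimensions are generic illustrative choices.
import torch

torch.manual_seed(0)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) for agent-box parameters x0 of shape (N, 5)."""
    noise = torch.randn_like(x0)
    a = alpha_bars[t].sqrt().view(-1, 1)
    s = (1.0 - alpha_bars[t]).sqrt().view(-1, 1)
    return a * x0 + s * noise, noise

boxes = torch.tensor([[10.0, 2.0, 0.0, 4.5, 2.0],      # two agents in a toy scene
                      [25.0, -2.0, 3.14, 4.5, 2.0]])
t = torch.randint(0, T, (boxes.shape[0],))
x_t, eps = q_sample(boxes, t)
# Training loss for a denoiser eps_hat = model(x_t, t, map_features):
# loss = torch.nn.functional.mse_loss(eps_hat, eps)
print(x_t.shape, t.tolist())
```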
Text-driven 3D scene generation is widely applicable to video gaming, film industry, and metaverse applications that have a large demand for 3D scenes. However, existing text-to-3D generation methods are limited to producing 3D objects with simple geometries and dreamlike styles that lack realism. In this work, we present Text2NeRF, which is able to generate a wide range of 3D scenes with complicated geometric structures and high-fidelity textures purely from a text prompt. To this end, we adopt NeRF as the 3D representation and leverage a pre-trained text-to-image diffusion model to constrain the 3D reconstruction of the NeRF to reflect the scene description. Specifically, we employ the diffusion model to infer the text-related image as the content prior and use a monocular depth estimation method to offer the geometric prior. Both content and geometric priors are utilized to update the NeRF model. To guarantee textured and geometric consistency between different views, we introduce a progressive scene inpainting and updating strategy for novel view synthesis of the scene. Our method requires no additional training data but only a natural language description of the scene as the input. Extensive experiments demonstrate that our Text2NeRF outperforms existing methods in producing photo-realistic, multi-view consistent, and diverse 3D scenes from a variety of natural language prompts.
https://arxiv.org/abs/2305.11588
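A short example of how a diffusion-generated reference view plus monocular depth can yield 3D supervision for the NeRF: back-project pixels into a coloured point cloud under an assumed pinhole camera.

```python
# Back-projecting the diffusion-generated reference view with its estimated depth gives
# 3D points (with colours) that can supervise a NeRF. A simple pinhole camera with
# assumed intrinsics is used here for illustration.
import numpy as np

def unproject(depth, rgb, fx, fy, cx, cy):
    """depth: (H, W) metres, rgb: (H, W, 3); returns (H*W, 3) points and colours."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points, rgb.reshape(-1, 3)

H, W = 48, 64
rgb = np.random.rand(H, W, 3)                 # stand-in for the text-conditioned image
depth = 2.0 + np.random.rand(H, W)            # stand-in for monocular depth estimation
pts, cols = unproject(depth, rgb, fx=60.0, fy=60.0, cx=W / 2, cy=H / 2)
print(pts.shape, cols.shape)                  # geometric + content priors for NeRF fitting
```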
This document serves as a position paper that outlines the authors' vision for a potential pathway towards generalist robots. The purpose of this document is to share the excitement of the authors with the community and highlight a promising research direction in robotics and AI. The authors believe the proposed paradigm is a feasible path towards accomplishing the long-standing goal of robotics research: deploying robots, or embodied AI agents more broadly, in various non-factory real-world settings to perform diverse tasks. This document presents a specific idea for mining knowledge in the latest large-scale foundation models for robotics research. Instead of directly adapting these models or using them to guide low-level policy learning, it advocates for using them to generate diversified tasks and scenes at scale, thereby scaling up low-level skill learning and ultimately leading to a foundation model for robotics that empowers generalist robots. The authors are actively pursuing this direction, but in the meantime, they recognize that the ambitious goal of building generalist robots with large-scale policy training demands significant resources such as computing power and hardware, and research groups in academia alone may face severe resource constraints in implementing the entire vision. Therefore, the authors believe sharing their thoughts at this early stage could foster discussions, attract interest towards the proposed pathway and related topics from industry groups, and potentially spur significant technical advancements in the field.
https://arxiv.org/abs/2305.10455
A promise of Generative Adversarial Networks (GANs) is to provide cheap photorealistic data for training and validating AI models in autonomous driving. Despite their huge success, their performance on complex images featuring multiple objects is understudied. While some frameworks produce high-quality street scenes with little to no control over the image content, others offer more control at the expense of high-quality generation. A common limitation of both approaches is the use of global latent codes for the whole image, which hinders the learning of independent object distributions. Motivated by SemanticStyleGAN (SSG), a recent work on latent space disentanglement in human face generation, we propose a novel framework, Urban-StyleGAN, for urban scene generation and manipulation. We find that a straightforward application of SSG leads to poor results because urban scenes are more complex than human faces. To provide a more compact yet disentangled latent representation, we develop a class grouping strategy wherein individual classes are grouped into super-classes. Moreover, we employ an unsupervised latent exploration algorithm in the $\mathcal{S}$-space of the generator and show that it is more efficient than the conventional $\mathcal{W}^{+}$-space in controlling the image content. Results on the Cityscapes and Mapillary datasets show the proposed approach achieves significantly more controllability and improved image quality than previous approaches on urban scenes and is on par with general-purpose non-controllable generative models (like StyleGAN2) in terms of quality.
https://arxiv.org/abs/2305.09602
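A tiny example of the class-grouping strategy: collapse fine semantic labels into super-classes so that each super-class gets its own local latent code. The particular grouping below is an illustrative assumption, not the paper's exact mapping.

```python
# Class grouping for a compact, disentangled latent space: fine Cityscapes-style labels
# are collapsed into super-classes, and one local latent code is kept per super-class.
# The grouping shown here is an illustrative assumption.
import numpy as np

SUPER_CLASSES = {
    "flat":         ["road", "sidewalk"],
    "construction": ["building", "wall", "fence"],
    "object":       ["pole", "traffic light", "traffic sign"],
    "nature":       ["vegetation", "terrain", "sky"],
    "vehicle":      ["car", "truck", "bus", "bicycle", "motorcycle"],
    "human":        ["person", "rider"],
}
fine_to_super = {fine: i for i, (_, fines) in enumerate(SUPER_CLASSES.items()) for fine in fines}

def group_label_map(fine_labels, fine_names):
    """Remap an (H, W) array of fine label indices to super-class indices."""
    lut = np.array([fine_to_super[name] for name in fine_names])
    return lut[fine_labels]

fine_names = [n for fines in SUPER_CLASSES.values() for n in fines]
fine_labels = np.random.randint(0, len(fine_names), size=(8, 8))
latents = np.random.randn(len(SUPER_CLASSES), 512)   # one latent code per super-class
print(group_label_map(fine_labels, fine_names).shape, latents.shape)
```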
Generative AI (AIGC, a.k.a. AI generated content) has made remarkable progress in the past few years, among which text-guided content generation is the most practical one since it enables the interaction between human instruction and AIGC. Due to the development in text-to-image as well as 3D modeling technologies (like NeRF), text-to-3D has become a newly emerging yet highly active research field. Our work conducts the first yet comprehensive survey on text-to-3D to help readers interested in this direction quickly catch up with its fast development. First, we introduce 3D data representations, including both Euclidean data and non-Euclidean data. On top of that, we introduce various foundation technologies as well as summarize how recent works combine those foundation technologies to realize satisfactory text-to-3D. Moreover, we summarize how text-to-3D technology is used in various applications, including avatar generation, texture generation, shape transformation, and scene generation.
https://arxiv.org/abs/2305.06131
Neural Radiance Fields (NeRF) have become a prominent method of scene generation via view synthesis. A critical requirement for the original algorithm to learn a meaningful scene representation is camera pose information for each image in a dataset. Current approaches try to circumvent this requirement with moderate success by learning approximate camera positions alongside the neural representation of the scene. This requires complicated camera models, leading to a long and involved training process, or results in a lack of texture and sharp detail in rendered scenes. In this work we introduce Hash Color Correction (HashCC) -- a lightweight method for improving the rendered image quality of Neural Radiance Fields, applicable also in situations where camera positions for a given set of images are unknown.
https://arxiv.org/abs/2305.04296
We present a continuation of our previous work, in which we developed the MR-CKR framework to reason with knowledge overriding across contexts organized in multi-relational hierarchies. Reasoning is realized via ASP with algebraic measures, allowing for flexible definitions of preferences. In this paper, we show how to apply our theoretical work to real autonomous-vehicle scene data. The goal of this work is to apply MR-CKR to the problem of generating challenging scenes for autonomous-vehicle learning. In practice, most scene data for AV learning models common situations, so it can be difficult to capture cases where a particular situation occurs (e.g., partial occlusion of a crossing pedestrian). The MR-CKR model allows for data organization exploiting the multi-dimensionality of such data (e.g., temporal and spatial). Reasoning over multiple contexts enables the verification and configuration of scenes, using the combination of different scene ontologies. We describe a framework for semantically guided data generation, based on a combination of MR-CKR and algebraic measures. The framework is implemented in a proof-of-concept prototype exemplifying some cases of scene generation.
https://arxiv.org/abs/2305.02255
Despite the growing adoption of mixed reality and interactive AI agents, it remains challenging for these systems to generate high-quality 2D/3D scenes in unseen environments. The common practice requires deploying an AI agent to collect large amounts of data for model training for every new task. This process is costly, or even impossible, for many domains. In this study, we develop an infinite agent that learns to transfer knowledge memory from general foundation models (e.g., GPT4, DALLE) to novel domains or scenarios for scene understanding and generation in the physical or virtual world. The heart of our approach is an emerging mechanism, dubbed Augmented Reality with Knowledge Inference Interaction (ArK), which leverages knowledge memory to generate scenes in unseen physical-world and virtual-reality environments. This knowledge-interactive emergent ability (Figure 1) is demonstrated in two ways: i) cross-modality micro-actions, where multi-modality models collect a large amount of relevant knowledge-memory data for each interaction task (e.g., unseen scene understanding) from physical reality; and ii) reality-agnostic macro-behavior, where interactions in mixed-reality environments are improved and tailored to different characterized roles, target variables, collaborative information, and so on. We validate the effectiveness of ArK on scene generation and editing tasks. We show that our ArK approach, combined with large foundation models, significantly improves the quality of generated 2D/3D scenes compared to baselines, demonstrating the potential benefit of incorporating ArK in generative AI for applications such as metaverse and gaming simulation.
https://arxiv.org/abs/2305.00970
We target a 3D generative model for general natural scenes that are typically unique and intricate. The lack of the necessary volumes of training data, along with the difficulty of devising ad hoc designs in the presence of varying scene characteristics, renders existing setups intractable. Inspired by classical patch-based image models, we advocate for synthesizing 3D scenes at the patch level, given a single example. At the core of this work lie important algorithmic designs w.r.t. the scene representation and the generative patch nearest-neighbor module, which address the unique challenges arising from lifting a classical 2D patch-based framework to 3D generation. These design choices, on a collective level, contribute to a robust, effective, and efficient model that can generate high-quality general natural scenes with both realistic geometric structure and visual appearance, in large quantities and varieties, as demonstrated on a variety of exemplar scenes.
https://arxiv.org/abs/2304.12670
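A minimal sketch of a generative patch nearest-neighbour step on a voxel grid: each patch of a coarse target is replaced by its closest exemplar patch. Grid size, patch size, and the plain L2 distance are illustrative choices, not the paper's full scene representation.

```python
# Minimal generative patch nearest-neighbour step on a voxel grid: every patch of a
# coarse target is replaced by its closest patch from the single exemplar scene.
import numpy as np

rng = np.random.default_rng(0)
P = 4                                            # cubic patch size
exemplar = rng.random((16, 16, 16))              # stand-in for an exemplar scene volume
target = exemplar + 0.3 * rng.standard_normal(exemplar.shape)   # coarse/noisy target

def extract_patches(vol, stride):
    patches, coords = [], []
    for z in range(0, vol.shape[0] - P + 1, stride):
        for y in range(0, vol.shape[1] - P + 1, stride):
            for x in range(0, vol.shape[2] - P + 1, stride):
                patches.append(vol[z:z+P, y:y+P, x:x+P].ravel())
                coords.append((z, y, x))
    return np.stack(patches), coords

ex_patches, _ = extract_patches(exemplar, stride=2)
tg_patches, tg_coords = extract_patches(target, stride=P)       # non-overlapping targets

out = np.zeros_like(target)
for patch, (z, y, x) in zip(tg_patches, tg_coords):
    d = np.sum((ex_patches - patch) ** 2, axis=1)               # L2 distance to exemplar patches
    out[z:z+P, y:y+P, x:x+P] = ex_patches[np.argmin(d)].reshape(P, P, P)
print(out.shape)
```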
Automatically generating high-quality real world 3D scenes is of enormous interest for applications such as virtual reality and robotics simulation. Towards this goal, we introduce NeuralField-LDM, a generative model capable of synthesizing complex 3D environments. We leverage Latent Diffusion Models that have been successfully utilized for efficient high-quality 2D content creation. We first train a scene auto-encoder to express a set of image and pose pairs as a neural field, represented as density and feature voxel grids that can be projected to produce novel views of the scene. To further compress this representation, we train a latent-autoencoder that maps the voxel grids to a set of latent representations. A hierarchical diffusion model is then fit to the latents to complete the scene generation pipeline. We achieve a substantial improvement over existing state-of-the-art scene generation models. Additionally, we show how NeuralField-LDM can be used for a variety of 3D content creation applications, including conditional scene generation, scene inpainting and scene style manipulation.
https://arxiv.org/abs/2304.09787
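A small sketch of the scene auto-encoder's lifting operation as described above: voxel centres are projected into a posed image and the sampled features are accumulated into a feature voxel grid. The pinhole intrinsics, identity pose, and grid size are assumptions for illustration.

```python
# Sketch of lifting a posed image into a feature voxel grid: voxel centres are projected
# into the image and the sampled features are gathered (averaged over views in the full
# model). Pinhole intrinsics, identity pose, and grid size are assumptions.
import numpy as np

G, H, W, C = 16, 32, 32, 3                         # voxel grid size, image size, channels
fx = fy = 30.0; cx, cy = W / 2, H / 2

# Voxel centres in camera coordinates of a single frontal view (identity pose assumed).
zs, ys, xs = np.meshgrid(np.linspace(1.0, 3.0, G),
                         np.linspace(-1.0, 1.0, G),
                         np.linspace(-1.0, 1.0, G), indexing="ij")
centres = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)

image = np.random.rand(H, W, C)                    # stand-in for per-pixel features
u = np.round(centres[:, 0] / centres[:, 2] * fx + cx).astype(int)
v = np.round(centres[:, 1] / centres[:, 2] * fy + cy).astype(int)
valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)

grid = np.zeros((G * G * G, C))
grid[valid] = image[v[valid], u[valid]]            # gather features along each ray
grid = grid.reshape(G, G, G, C)
print(grid.shape)
```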
We address the important problem of generalizing robotic rearrangement to clutter without any explicit object models. We first generate over 650K cluttered scenes - orders of magnitude more than prior work - in diverse everyday environments, such as cabinets and shelves. We render synthetic partial point clouds from this data and use it to train our CabiNet model architecture. CabiNet is a collision model that accepts object and scene point clouds, captured from a single-view depth observation, and predicts collisions for SE(3) object poses in the scene. Our representation has a fast inference speed of 7 microseconds per query, with nearly 20% higher performance than baseline approaches in challenging environments. We use this collision model in conjunction with a Model Predictive Path Integral (MPPI) planner to generate collision-free trajectories for picking and placing in clutter. CabiNet also predicts waypoints, computed from the scene's signed distance field (SDF), that allow the robot to navigate tight spaces during rearrangement. This improves rearrangement performance by nearly 35% compared to baselines. We systematically evaluate our approach, procedurally generate simulated experiments, and demonstrate that it directly transfers to the real world, despite training exclusively in simulation. Robot experiment demos in completely unknown scenes and objects can be found at this https URL.
https://arxiv.org/abs/2304.09302
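A compact MPPI sketch in which the running cost adds a collision penalty from a stand-in collision model playing CabiNet's role; the dynamics, costs, and collision stub are illustrative, not the paper's planner.

```python
# Compact MPPI sketch: sample perturbed control sequences, roll out simple dynamics,
# score trajectories with a goal cost plus a collision cost from a stand-in collision
# model, and average controls with softmax weights.
import numpy as np

rng = np.random.default_rng(0)
HORIZON, SAMPLES, LAM = 20, 128, 1.0
goal = np.array([0.5, 0.3, 0.4])

def dynamics(x, u):
    return x + 0.05 * u                                    # toy single-integrator gripper

def collision_cost(x):
    """Stand-in for a learned collision model queried with the scene point cloud."""
    return 10.0 * (np.linalg.norm(x - np.array([0.25, 0.15, 0.2])) < 0.1)

def mppi(x0, u_nominal):
    noise = 0.2 * rng.standard_normal((SAMPLES, HORIZON, 3))
    costs = np.zeros(SAMPLES)
    for k in range(SAMPLES):
        x = x0.copy()
        for t in range(HORIZON):
            x = dynamics(x, u_nominal[t] + noise[k, t])
            costs[k] += np.sum((x - goal) ** 2) + collision_cost(x)
    w = np.exp(-(costs - costs.min()) / LAM)
    w /= w.sum()
    return u_nominal + np.einsum("k,kto->to", w, noise)     # weighted control update

u = mppi(np.zeros(3), np.zeros((HORIZON, 3)))
print(u.shape)                                             # (20, 3) refined control sequence
```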