Integrating aerial imagery-based scene generation into applications like autonomous driving and gaming enhances realism in 3D environments, but challenges remain in creating detailed content for occluded areas and ensuring real-time, consistent rendering. In this paper, we introduce Skyeyes, a novel framework that can generate photorealistic sequences of ground view images using only aerial view inputs, thereby creating a ground roaming experience. More specifically, we combine a 3D representation with a view-consistent generation model, which ensures coherence between generated images. This method allows for the creation of geometrically consistent ground view images, even with large view gaps. The images maintain improved spatiotemporal coherence and realism, enhancing scene comprehension and visualization from aerial perspectives. To the best of our knowledge, there are no publicly available datasets that contain pairwise geo-aligned aerial and ground view imagery. Therefore, we build a large, synthetic, and geo-aligned dataset using Unreal Engine. Both qualitative and quantitative analyses on this synthetic dataset show superior results compared to other leading synthesis approaches. See the project page for more results: this https URL.
https://arxiv.org/abs/2409.16685
There is increased interest in using generative AI to create 3D spaces for Virtual Reality (VR) applications. However, today's models produce artificial environments, falling short of supporting collaborative tasks that benefit from incorporating the user's physical context. To generate environments that support VR telepresence, we introduce SpaceBlender, a novel pipeline that utilizes generative AI techniques to blend users' physical surroundings into unified virtual spaces. This pipeline transforms user-provided 2D images into context-rich 3D environments through an iterative process consisting of depth estimation, mesh alignment, and diffusion-based space completion guided by geometric priors and adaptive text prompts. In a preliminary within-subjects study, where 20 participants performed a collaborative VR affinity diagramming task in pairs, we compared SpaceBlender with a generic virtual environment and a state-of-the-art scene generation framework, evaluating its ability to create virtual spaces suitable for collaboration. Participants appreciated the enhanced familiarity and context provided by SpaceBlender but also noted complexities in the generative environments that could detract from task focus. Drawing on participant feedback, we propose directions for improving the pipeline and discuss the value and design of blended spaces for different scenarios.
https://arxiv.org/abs/2409.13926
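To make the depth-to-geometry step above concrete, here is a minimal NumPy sketch of lifting a single depth map into a triangle mesh (pinhole unprojection plus grid triangulation). The intrinsics and the toy depth ramp are placeholders; SpaceBlender's actual mesh alignment and diffusion-based completion are not reproduced here.

```python
import numpy as np

def depth_to_mesh(depth, fx, fy, cx, cy):
    """Lift a depth map (H, W) to vertices and triangle faces in camera space."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    # Pinhole unprojection: x = (u - cx) * z / fx, y = (v - cy) * z / fy
    z = depth
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    vertices = np.stack([x, y, z], axis=-1).reshape(-1, 3)

    # Two triangles per pixel quad, indexing the flattened (row-major) grid.
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1], idx[:-1, 1:]
    bl, br = idx[1:, :-1], idx[1:, 1:]
    faces = np.concatenate([
        np.stack([tl, bl, tr], axis=-1).reshape(-1, 3),
        np.stack([tr, bl, br], axis=-1).reshape(-1, 3),
    ], axis=0)
    return vertices, faces

# Toy usage: a 4x5 depth map with a synthetic ramp.
depth = np.linspace(1.0, 2.0, 20).reshape(4, 5)
verts, faces = depth_to_mesh(depth, fx=50.0, fy=50.0, cx=2.0, cy=1.5)
print(verts.shape, faces.shape)  # (20, 3) (24, 3)
```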
Text-to-scene generation, transforming textual descriptions into detailed scenes, typically relies on generating key scenarios along predetermined paths, constraining environmental diversity and limiting customization flexibility. To address these limitations, we propose a novel text-to-traffic scene framework that leverages a large language model to generate diverse traffic scenarios within the Carla simulator based on natural language descriptions. Users can define specific parameters such as weather conditions, vehicle types, and road signals, while our pipeline can autonomously select the starting point and scenario details, generating scenes from scratch without relying on predetermined locations or trajectories. Furthermore, our framework supports both critical and routine traffic scenarios, enhancing its applicability. Experimental results indicate that our approach promotes diverse agent planning and road selection, enhancing the training of autonomous agents in traffic environments. Notably, our methodology has achieved a 16% reduction in average collision rates. Our work is made publicly available at this https URL.
https://arxiv.org/abs/2409.09575
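As an illustration of how natural-language descriptions could be grounded into simulator parameters, here is a small, hypothetical sketch of an intermediate scenario specification and a validator for an LLM's JSON output. The schema fields, allowed weather presets, and the example response are assumptions made for illustration, not the paper's actual protocol or the CARLA API.

```python
import json
from dataclasses import dataclass, field

@dataclass
class TrafficScenario:
    """Hypothetical intermediate representation between an LLM and the simulator."""
    weather: str = "ClearNoon"
    num_vehicles: int = 10
    vehicle_types: list = field(default_factory=lambda: ["car"])
    traffic_light_state: str = "green"
    critical: bool = False  # routine vs. safety-critical scenario

ALLOWED_WEATHER = {"ClearNoon", "WetNoon", "HardRainNoon", "ClearSunset"}

def parse_scenario(raw_llm_output: str) -> TrafficScenario:
    """Validate the LLM's JSON before it reaches the simulator."""
    spec = json.loads(raw_llm_output)
    scenario = TrafficScenario(**spec)
    if scenario.weather not in ALLOWED_WEATHER:
        raise ValueError(f"unsupported weather preset: {scenario.weather}")
    if not 0 < scenario.num_vehicles <= 200:
        raise ValueError("num_vehicles out of range")
    return scenario

# Example LLM response for "a rainy rush hour with many trucks".
raw = '{"weather": "HardRainNoon", "num_vehicles": 60, "vehicle_types": ["truck", "car"], "critical": false}'
print(parse_scenario(raw))
```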
Automatic scene generation is an essential area of research with applications in robotics, recreation, visual representation, training and simulation, education, and more. This survey provides a comprehensive review of the current state of the art in automatic scene generation, focusing on techniques that leverage machine learning, deep learning, embedded systems, and natural language processing (NLP). We categorize the models into four main types: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Transformers, and Diffusion Models. Each category is explored in detail, discussing various sub-models and their contributions to the field. We also review the most commonly used datasets, such as COCO-Stuff, Visual Genome, and MS-COCO, which are critical for training and evaluating these models. Methodologies for scene generation are examined, including image-to-3D conversion, text-to-3D generation, UI/layout design, graph-based methods, and interactive scene generation. Evaluation metrics such as Fréchet Inception Distance (FID), Kullback-Leibler (KL) Divergence, Inception Score (IS), Intersection over Union (IoU), and Mean Average Precision (mAP) are discussed in the context of their use in assessing model performance. The survey identifies key challenges and limitations in the field, such as maintaining realism, handling complex scenes with multiple objects, and ensuring consistency in object relationships and spatial arrangements. By summarizing recent advances and pinpointing areas for improvement, this survey aims to provide a valuable resource for researchers and practitioners working on automatic scene generation.
https://arxiv.org/abs/2410.01816
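Two of the metrics listed above are easy to make concrete. Below is a short NumPy/SciPy sketch of Fréchet Inception Distance (computed on pre-extracted feature sets) and Intersection over Union for bounding boxes; the random Gaussian features stand in for real Inception activations, so the numbers are illustrative only.

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    """FID = ||mu_r - mu_f||^2 + Tr(S_r + S_f - 2 (S_r S_f)^{1/2})."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f).real  # drop tiny imaginary parts from numerical error
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2.0 * covmean))

def iou(box_a, box_b):
    """Boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

rng = np.random.default_rng(0)
print(fid(rng.normal(size=(500, 64)), rng.normal(0.1, 1.0, size=(500, 64))))
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 0.1428...
```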
We present LT3SD, a novel latent diffusion model for large-scale 3D scene generation. Recent advances in diffusion models have shown impressive results in 3D object generation, but are limited in spatial extent and quality when extended to 3D scenes. To generate complex and diverse 3D scene structures, we introduce a latent tree representation to effectively encode both lower-frequency geometry and higher-frequency detail in a coarse-to-fine hierarchy. We can then learn a generative diffusion process in this latent 3D scene space, modeling the latent components of a scene at each resolution level. To synthesize large-scale scenes with varying sizes, we train our diffusion model on scene patches and synthesize arbitrary-sized output 3D scenes through shared diffusion generation across multiple scene patches. Through extensive experiments, we demonstrate the efficacy and benefits of LT3SD for large-scale, high-quality unconditional 3D scene generation and for probabilistic completion for partial scene observations.
https://arxiv.org/abs/2409.08215
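The coarse-to-fine latent idea can be illustrated with a toy decomposition of a dense 3D grid into a low-frequency (downsampled) component plus a per-level detail residual, which reconstructs exactly by construction. This sketches the representation only, not LT3SD's learned encoder or diffusion process.

```python
import numpy as np

def downsample(vol):
    """2x average pooling of a (D, H, W) grid with even dimensions."""
    d, h, w = vol.shape
    return vol.reshape(d // 2, 2, h // 2, 2, w // 2, 2).mean(axis=(1, 3, 5))

def upsample(vol):
    """Nearest-neighbor 2x upsampling."""
    return vol.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)

def build_pyramid(vol, levels):
    """Return [(coarse, detail), ...] from fine to coarse; detail = vol - up(coarse)."""
    pyramid = []
    for _ in range(levels):
        coarse = downsample(vol)
        detail = vol - upsample(coarse)
        pyramid.append((coarse, detail))
        vol = coarse
    return pyramid

def reconstruct(pyramid):
    vol = pyramid[-1][0]                      # coarsest grid
    for _, detail in reversed(pyramid):
        vol = upsample(vol) + detail          # add back the detail of each finer level
    return vol

rng = np.random.default_rng(0)
vol = rng.normal(size=(16, 16, 16))
pyr = build_pyramid(vol, levels=3)
print(np.allclose(reconstruct(pyr), vol))     # True: decomposition is lossless
```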
In this work, we present a novel method for extensive multi-scale generative terrain modeling. At the core of our model is a cascade of superresolution diffusion models that can be combined to produce consistent images across multiple resolutions. Pairing this concept with a tiled generation method yields a scalable system that can generate thousands of square kilometers of realistic Earth surfaces at high resolution. We evaluate our method on a dataset collected from Bing Maps and show that it outperforms super-resolution baselines on the extreme super-resolution task of 1024x zoom. We also demonstrate its ability to create diverse and coherent scenes via an interactive gigapixel-scale generated map. Finally, we demonstrate how our system can be extended to enable novel content creation applications including controllable world generation and 3D scene generation.
https://arxiv.org/abs/2409.01491
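Tiled generation stands or falls with how overlapping tiles are blended back together. The sketch below stitches overlapping tiles with ramped blend weights; the tiles here are crops of a known image rather than diffusion outputs, so the stitching can be verified to be exact. Tile size, overlap, and stride are arbitrary choices for the example.

```python
import numpy as np

def blend_weight(tile_size, overlap):
    """Per-pixel weight ramping across the overlap band (kept > 0 so the
    normalization in stitch() stays well defined at image borders)."""
    ramp = np.ones(tile_size)
    ramp[:overlap] = np.linspace(0.05, 1.0, overlap)
    ramp[-overlap:] = np.linspace(1.0, 0.05, overlap)
    return np.outer(ramp, ramp)

def stitch(tiles, positions, out_shape, tile_size, overlap):
    """Weighted average of overlapping tiles placed at (row, col) positions."""
    acc = np.zeros(out_shape)
    wsum = np.zeros(out_shape)
    w = blend_weight(tile_size, overlap)
    for tile, (r, c) in zip(tiles, positions):
        acc[r:r + tile_size, c:c + tile_size] += w * tile
        wsum[r:r + tile_size, c:c + tile_size] += w
    return acc / np.maximum(wsum, 1e-12)      # guard against uncovered pixels

# Sanity check: tiles cropped from one image stitch back to that image exactly.
rng = np.random.default_rng(0)
img = rng.normal(size=(112, 112))
tile_size, overlap, stride = 48, 16, 32
positions = [(r, c) for r in range(0, 112 - tile_size + 1, stride)
                    for c in range(0, 112 - tile_size + 1, stride)]
tiles = [img[r:r + tile_size, c:c + tile_size] for r, c in positions]
print(np.abs(stitch(tiles, positions, img.shape, tile_size, overlap) - img).max())  # ~1e-16
```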
Designs and artworks are ubiquitous across various creative fields, requiring graphic design skills and dedicated software to create compositions that include many graphical elements, such as logos, icons, symbols, and art scenes, which are integral to visual storytelling. Automating the generation of such visual elements improves graphic designers' productivity, democratizes and innovates the creative industry, and helps generate more realistic synthetic data for related tasks. These illustration elements are mostly RGBA images with irregular shapes and cutouts, facilitating blending and scene composition. However, most image generation models are incapable of generating such images, and achieving this capability requires expensive computational resources, specific training recipes, or post-processing solutions. In this work, we propose a fully-automated approach for obtaining RGBA illustrations by modifying the inference-time behavior of a pre-trained Diffusion Transformer model, exploiting the prompt-guided controllability and visual quality offered by such models with no additional computational cost. We force the generation of entire subjects without sharp cropping, whose backgrounds are easily removed for seamless integration into design projects or artistic scenes. We show with a user study that, in most cases, users prefer our solution over generating and then matting an image, and we show that our generated illustrations yield good results when used as inputs for composite scene generation pipelines. We release the code at this https URL.
https://arxiv.org/abs/2408.14826
We propose a diffusion-based approach for Text-to-Image (T2I) generation with interactive 3D layout control. Layout control has been widely studied to alleviate the shortcomings of T2I diffusion models in understanding objects' placement and relationships from text descriptions. Nevertheless, existing approaches for layout control are limited to 2D layouts, require the user to provide a static layout beforehand, and fail to preserve generated images under layout changes. This makes these approaches unsuitable for applications that require 3D object-wise control and iterative refinements, e.g., interior design and complex scene generation. To this end, we leverage the recent advancements in depth-conditioned T2I models and propose a novel approach for interactive 3D layout control. We replace the traditional 2D boxes used in layout control with 3D boxes. Furthermore, we revamp the T2I task as a multi-stage generation process, where at each stage, the user can insert, change, and move an object in 3D while preserving objects from earlier stages. We achieve this through our proposed Dynamic Self-Attention (DSA) module and the consistent 3D object translation strategy. Experiments show that our approach can generate complicated scenes based on 3D layouts, boosting the object generation success rate over the standard depth-conditioned T2I methods by 2x. Moreover, it outperforms competing methods at preserving objects under layout changes. Project page: \url{this https URL}
https://arxiv.org/abs/2408.14819
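Conditioning on 3D boxes implies projecting each box into the conditioning camera. A minimal sketch of that step: project the eight corners of an axis-aligned 3D box through a pinhole model and reduce them to a 2D box plus a representative depth. The intrinsics and the example box are illustrative, not values from the paper.

```python
import numpy as np

def project_box(center, size, fx, fy, cx, cy):
    """Project an axis-aligned 3D box (camera coords, z forward) to a 2D box + depth."""
    half = np.asarray(size) / 2.0
    offsets = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    corners = np.asarray(center) + offsets * half                # (8, 3)
    z = corners[:, 2]
    assert np.all(z > 0), "box must lie in front of the camera"
    u = fx * corners[:, 0] / z + cx
    v = fy * corners[:, 1] / z + cy
    box_2d = (u.min(), v.min(), u.max(), v.max())                # (x1, y1, x2, y2)
    return box_2d, float(z.min())                                # nearest depth, e.g. for ordering

# A 1x1x2 m box, 4 m in front of a 512x512 camera with ~60 deg horizontal FOV.
box_2d, depth = project_box(center=(0.5, 0.0, 4.0), size=(1.0, 1.0, 2.0),
                            fx=443.4, fy=443.4, cx=256.0, cy=256.0)
print(box_2d, depth)
```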
Recent advances in text-to-image diffusion models have demonstrated impressive capabilities in image quality. However, complex scene generation remains relatively unexplored, and even the definition of `complex scene' itself remains unclear. In this paper, we address this gap by providing a precise definition of complex scenes and introducing a set of Complex Decomposition Criteria (CDC) based on this definition. Inspired by the artist's painting process, we propose a training-free diffusion framework called Complex Diffusion (CxD), which divides the process into three stages: composition, painting, and retouching. Our method leverages the powerful chain-of-thought capabilities of large language models (LLMs) to decompose complex prompts based on CDC and to manage composition and layout. We then develop an attention modulation method that guides simple prompts to specific regions to complete the complex scene painting. Finally, we inject the detailed output of the LLM into a retouching model to enhance the image details, thus implementing the retouching stage. Extensive experiments demonstrate that our method outperforms previous SOTA approaches, significantly improving the generation of high-quality, semantically consistent, and visually diverse images for complex scenes, even with intricate prompts.
https://arxiv.org/abs/2408.13858
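The attention-modulation stage can be pictured as biasing cross-attention so that the tokens of each decomposed sub-prompt interact mainly with their assigned image region. The toy NumPy sketch below applies such a region bias to raw attention logits; hooking this into an actual UNet/DiT attention layer is outside the scope of the snippet.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_masked_attention(q, k, region_mask, token_region_ids, big=30.0):
    """Cross-attention (pixel queries -> text-token keys) with a region bias.

    q: (P, d) pixel queries, k: (T, d) token keys,
    region_mask: (P,) region id per pixel, token_region_ids: (T,) region id per
    token (use -1 for global tokens that may attend everywhere).
    """
    logits = q @ k.T / np.sqrt(q.shape[1])                      # (P, T)
    outside = (region_mask[:, None] != token_region_ids[None, :]) & (token_region_ids[None, :] >= 0)
    logits = logits - big * outside                             # suppress pixel/token pairs that disagree
    return softmax(logits, axis=-1)

rng = np.random.default_rng(0)
P, T, d = 16, 4, 8                                              # 4x4 "image", 4 tokens
attn = region_masked_attention(
    rng.normal(size=(P, d)), rng.normal(size=(T, d)),
    region_mask=np.repeat([0, 0, 1, 1], 4),                     # top half region 0, bottom half region 1
    token_region_ids=np.array([-1, 0, 1, 1]))
print(attn.shape, attn.sum(axis=-1)[:3])                        # rows still sum to 1
```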
Text-driven 3D scene generation has seen significant advancements recently. However, most existing methods generate single-view images using generative models and then stitch them together in 3D space. This independent generation for each view often results in spatial inconsistency and implausibility in the 3D scenes. To address this challenge, we propose a novel text-driven 3D-consistent scene generation model: SceneDreamer360. Our proposed method leverages a text-driven panoramic image generation model as a prior for 3D scene generation and employs 3D Gaussian Splatting (3DGS) to ensure consistency across multi-view panoramic images. Specifically, SceneDreamer360 enhances the fine-tuned Panfusion generator with a three-stage panoramic enhancement, enabling the generation of high-resolution, detail-rich panoramic images. During the 3D scene construction, a novel point cloud fusion initialization method is used, producing higher-quality and spatially consistent point clouds. Our extensive experiments demonstrate that compared to other methods, SceneDreamer360 with its panoramic image generation and 3DGS can produce higher-quality, spatially consistent, and visually appealing 3D scenes from any text prompt. Our code is available at \url{this https URL}.
https://arxiv.org/abs/2408.13711
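The point-cloud initialization starts from a panorama and per-pixel depth. A minimal sketch of lifting an equirectangular panorama into a 3D point cloud: each pixel maps to a direction on the unit sphere, scaled by its depth. The constant synthetic depth is only there to make the final check obvious; this is not SceneDreamer360's fusion method itself.

```python
import numpy as np

def panorama_to_points(depth, colors=None):
    """Lift an equirectangular depth map (H, W) to a 3D point cloud.

    Pixel (v, u) maps to longitude in [-pi, pi) and latitude in [-pi/2, pi/2];
    the 3D point is depth times the unit direction for that pixel.
    """
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    lon = (us / w - 0.5) * 2.0 * np.pi
    lat = (0.5 - vs / h) * np.pi
    dirs = np.stack([np.cos(lat) * np.sin(lon),   # x (right)
                     np.sin(lat),                  # y (up)
                     np.cos(lat) * np.cos(lon)],   # z (forward)
                    axis=-1)
    points = (depth[..., None] * dirs).reshape(-1, 3)
    if colors is not None:
        return points, colors.reshape(-1, colors.shape[-1])
    return points

depth = np.full((256, 512), 3.0)                  # a 3 m sphere of points
pts = panorama_to_points(depth)
print(pts.shape, np.allclose(np.linalg.norm(pts, axis=1), 3.0))   # (131072, 3) True
```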
3D immersive scene generation is a challenging yet critical task in computer vision and graphics. A desired virtual 3D scene should 1) exhibit omnidirectional view consistency, and 2) allow for free exploration in complex scene hierarchies. Existing methods either rely on successive scene expansion via inpainting or employ a panorama representation to represent large-FOV scene environments. However, the generated scene suffers from semantic drift during expansion and is unable to handle occlusion among scene hierarchies. To tackle these challenges, we introduce LayerPano3D, a novel framework for full-view, explorable panoramic 3D scene generation from a single text prompt. Our key insight is to decompose a reference 2D panorama into multiple layers at different depth levels, where each layer reveals the unseen space from the reference views via a diffusion prior. LayerPano3D comprises multiple dedicated designs: 1) we introduce a novel text-guided anchor view synthesis pipeline for high-quality, consistent panorama generation. 2) We pioneer the Layered 3D Panorama as the underlying representation to manage complex scene hierarchies and lift it into 3D Gaussians to splat detailed 360-degree omnidirectional scenes with unconstrained viewing paths. Extensive experiments demonstrate that our framework generates state-of-the-art 3D panoramic scenes in terms of both full-view consistency and immersive exploratory experience. We believe that LayerPano3D holds promise for advancing 3D panoramic scene creation with numerous applications.
https://arxiv.org/abs/2408.13252
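Decomposing a panorama into depth layers can be sketched very simply: quantize the depth map into a few bands and keep a mask plus a color layer per band; what is hidden behind each layer is what the diffusion prior must later complete. The thresholds and random inputs below are placeholders, not LayerPano3D's learned layering.

```python
import numpy as np

def decompose_into_layers(rgb, depth, boundaries):
    """Split an image into depth layers.

    boundaries: increasing depth thresholds, e.g. [2.0, 6.0] builds three layers
    (near, mid, far). Returns a list of (mask, rgb_layer) pairs, near to far.
    """
    edges = [-np.inf, *boundaries, np.inf]
    layers = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (depth >= lo) & (depth < hi)
        layer = np.where(mask[..., None], rgb, 0.0)   # pixels outside the band are left empty
        layers.append((mask, layer))
    return layers

rng = np.random.default_rng(0)
rgb = rng.uniform(size=(64, 128, 3))
depth = rng.uniform(0.5, 10.0, size=(64, 128))
layers = decompose_into_layers(rgb, depth, boundaries=[2.0, 6.0])
print([m.mean().round(3) for m, _ in layers])          # fraction of pixels in each layer
```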
As Artificial Intelligence Generated Content (AIGC) advances, a variety of methods have been developed to generate text, images, videos, and 3D objects from single or multimodal inputs, contributing to efforts to emulate human-like cognitive content creation. However, generating realistic large-scale scenes from a single input presents a challenge due to the complexities involved in ensuring consistency across extrapolated views generated by models. Benefiting from recent video generation models and implicit neural representations, we propose Scene123, a 3D scene generation model that not only ensures realism and diversity through the video generation framework but also uses implicit neural fields combined with Masked Autoencoders (MAE) to effectively ensure the consistency of unseen areas across views. Specifically, we initially warp the input image (or an image generated from text) to simulate adjacent views, filling the invisible areas with the MAE model. However, these filled images usually fail to maintain view consistency, so we utilize the produced views to optimize a neural radiance field, enhancing geometric consistency. Moreover, to further enhance the details and texture fidelity of generated views, we employ a GAN-based loss against images derived from the input image through the video generation model. Extensive experiments demonstrate that our method can generate realistic and consistent scenes from a single prompt. Both qualitative and quantitative results indicate that our approach surpasses existing state-of-the-art methods. We show encouraging video examples at this https URL.
https://arxiv.org/abs/2408.05477
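The "warp the input image to simulate adjacent views" step can be made concrete with a simple forward warp: unproject each source pixel with its depth, shift the camera, reproject, and record which target pixels received nothing (the holes that MAE-style completion must fill). The intrinsics and camera shift are illustrative, and no z-buffering is done, for brevity.

```python
import numpy as np

def forward_warp(rgb, depth, K, t):
    """Warp an image to a camera translated by t (no rotation, for brevity).

    Returns the warped image and a hole mask (True where nothing landed).
    """
    h, w, _ = rgb.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    # Unproject to 3D in the source camera, then shift into the target camera.
    x = (us - cx) * depth / fx
    y = (vs - cy) * depth / fy
    p = np.stack([x, y, depth], axis=-1) - t
    # Reproject into the target image (nearest pixel, no z-buffering).
    u2 = np.round(fx * p[..., 0] / p[..., 2] + cx).astype(int)
    v2 = np.round(fy * p[..., 1] / p[..., 2] + cy).astype(int)
    valid = (p[..., 2] > 0) & (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)

    warped = np.zeros_like(rgb)
    hole = np.ones((h, w), dtype=bool)
    warped[v2[valid], u2[valid]] = rgb[vs[valid], us[valid]]
    hole[v2[valid], u2[valid]] = False
    return warped, hole

rng = np.random.default_rng(0)
rgb = rng.uniform(size=(120, 160, 3))
depth = rng.uniform(2.0, 5.0, size=(120, 160))
K = np.array([[200.0, 0, 80.0], [0, 200.0, 60.0], [0, 0, 1.0]])
warped, hole = forward_warp(rgb, depth, K, t=np.array([0.3, 0.0, 0.0]))
print(f"{hole.mean():.1%} of target pixels are holes to inpaint")
```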
The integration of large language models (LLMs) into robotics significantly enhances the capabilities of embodied agents in understanding and executing complex natural language instructions. However, the unmitigated deployment of LLM-based embodied systems in real-world environments may pose potential physical risks, such as property damage and personal injury. Existing security benchmarks for LLMs overlook risk awareness for LLM-based embodied agents. To address this gap, we propose RiskAwareBench, an automated framework designed to assess physical risk awareness in LLM-based embodied agents. RiskAwareBench consists of four modules: safety tips generation, risky scene generation, plan generation, and evaluation, enabling comprehensive risk assessment with minimal manual intervention. Utilizing this framework, we compile the PhysicalRisk dataset, encompassing diverse scenarios with associated safety tips, observations, and instructions. Extensive experiments reveal that most LLMs exhibit insufficient physical risk awareness, and baseline risk mitigation strategies yield limited improvement, which underscores the urgency and importance of improving risk awareness in LLM-based embodied agents.
https://arxiv.org/abs/2408.04449
Immersive scene generation, notably panorama creation, benefits significantly from the adaptation of large pre-trained text-to-image (T2I) models for multi-view image generation. Due to the high cost of acquiring multi-view images, tuning-free generation is preferred. However, existing methods are either limited to simple correspondences or require extensive fine-tuning to capture complex ones. We present PanoFree, a novel method for tuning-free multi-view image generation that supports an extensive array of correspondences. PanoFree sequentially generates multi-view images using iterative warping and inpainting, addressing the key issues of inconsistency and artifacts from error accumulation without the need for fine-tuning. It mitigates error accumulation by enhancing cross-view awareness and refines the warping and inpainting processes via cross-view guidance, risky area estimation and erasing, and symmetric bidirectional guided generation for loop closure, alongside guidance-based semantic and density control for scene structure preservation. In experiments on planar, 360°, and full spherical panoramas, PanoFree demonstrates significant error reduction, improves global consistency, and boosts image quality without extra fine-tuning. Compared to existing methods, PanoFree is up to 5x more efficient in time and 3x more efficient in GPU memory usage, and maintains superior diversity of results (2x better in our user study). PanoFree offers a viable alternative to costly fine-tuning or the use of additional pre-trained models. Project website at this https URL.
https://arxiv.org/abs/2408.02157
3D content creation has long been a complex and time-consuming process, often requiring specialized skills and resources. While recent advancements have allowed for text-guided 3D object and scene generation, they still fall short of providing sufficient control over the generation process, leading to a gap between the user's creative vision and the generated results. In this paper, we present iControl3D, a novel interactive system that empowers users to generate and render customizable 3D scenes with precise control. To this end, a 3D creator interface has been developed to provide users with fine-grained control over the creation process. Technically, we leverage 3D meshes as an intermediary proxy to iteratively merge individual 2D diffusion-generated images into a cohesive and unified 3D scene representation. To ensure seamless integration of 3D meshes, we propose to perform boundary-aware depth alignment before fusing the newly generated mesh with the existing one in 3D space. Additionally, to effectively manage depth discrepancies between remote content and foreground, we propose to model remote content separately with an environment map instead of 3D meshes. Finally, our neural rendering interface enables users to build a radiance field of their scene online and navigate the entire scene. Extensive experiments have been conducted to demonstrate the effectiveness of our system. The code will be made available at this https URL.
https://arxiv.org/abs/2408.01678
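Boundary-aware depth alignment amounts to fitting the new view's depth to the existing geometry where the two overlap. Under a simple affine-depth assumption, a minimal sketch is a least-squares fit of a scale and shift over the overlap pixels, applied to the whole new depth map; this illustrates the alignment idea, not iControl3D's exact procedure.

```python
import numpy as np

def align_depth(new_depth, ref_depth, overlap_mask):
    """Fit ref_depth ~ a * new_depth + b on the overlap (least squares) and
    return the aligned new depth map."""
    x = new_depth[overlap_mask].ravel()
    y = ref_depth[overlap_mask].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a * new_depth + b

# Synthetic check: the "new" depth is a scaled/shifted copy of the reference.
rng = np.random.default_rng(0)
ref = rng.uniform(1.0, 8.0, size=(64, 64))
new = 0.4 * ref + 1.3                               # same scene, different depth scale
overlap = np.zeros((64, 64), dtype=bool)
overlap[:, :8] = True                               # narrow boundary strip
aligned = align_depth(new, ref, overlap)
print(np.abs(aligned - ref).max())                  # ~0: scale and shift recovered
```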
Learning object-centric representations from unsupervised videos is challenging. Unlike most previous approaches that focus on decomposing 2D images, we present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning within a differentiable volume rendering framework. The key idea is to perform object-centric voxelization to capture the 3D nature of the scene, which infers per-object occupancy probabilities at individual spatial locations. These voxel features evolve through a canonical-space deformation function and are optimized in an inverse rendering pipeline with a compositional NeRF. Additionally, our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids. DynaVol-S significantly outperforms existing models in both novel view synthesis and unsupervised decomposition tasks for dynamic scenes. By jointly considering geometric structures and semantic features, it effectively addresses challenging real-world scenarios involving complex object interactions. Furthermore, once trained, the explicitly meaningful voxel features enable additional capabilities that 2D scene decomposition methods cannot achieve, such as novel scene generation through editing geometric shapes or manipulating the motion trajectories of objects.
https://arxiv.org/abs/2407.20908
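Object-centric voxelization keeps a per-object occupancy probability at every spatial location. The toy sketch below composes per-object density grids into normalized occupancy probabilities plus a composite density of the kind a volume renderer would consume; the Gaussian "objects" are synthetic stand-ins for learned voxel features, not DynaVol-S's optimized grids.

```python
import numpy as np

def gaussian_blob(grid, center, sigma):
    """Density of a soft spherical object on a (D, H, W, 3) coordinate grid."""
    d2 = np.sum((grid - np.asarray(center)) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def object_occupancy(densities, eps=1e-8):
    """densities: (K, D, H, W) per-object densities.

    Returns per-object occupancy probabilities (K, D, H, W) that sum to ~1 over
    objects at each voxel, and the composite density used for rendering.
    """
    total = densities.sum(axis=0)
    occupancy = densities / (total + eps)
    return occupancy, total

# Two blobs in a 32^3 volume.
axes = np.linspace(0.0, 1.0, 32)
grid = np.stack(np.meshgrid(axes, axes, axes, indexing="ij"), axis=-1)
densities = np.stack([gaussian_blob(grid, (0.3, 0.5, 0.5), 0.1),
                      gaussian_blob(grid, (0.7, 0.5, 0.5), 0.1)])
occ, total = object_occupancy(densities)
print(occ.shape, float(occ.sum(axis=0).max()))       # (2, 32, 32, 32) ~1.0
```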
Designing high-quality indoor 3D scenes is important in many practical applications, such as room planning or game development. Conventionally, this has been a time-consuming process which requires both artistic skill and familiarity with professional software, making it hardly accessible for layman users. However, recent advances in generative AI have established solid foundation for democratizing 3D design. In this paper, we propose a pioneering approach for text-based 3D room design. Given a prompt in natural language describing the object placement in the room, our method produces a high-quality 3D scene corresponding to it. With an additional text prompt the users can change the appearance of the entire scene or of individual objects in it. Built using in-context learning, CAD model retrieval and 3D-Gaussian-Splatting-based stylization, our turnkey pipeline produces state-of-the-art 3D scenes, while being easy to use even for novices. Our project page is available at this https URL.
https://arxiv.org/abs/2407.20727
Generating a realistic, large-scale 3D virtual city remains a complex challenge due to the involvement of numerous 3D assets, various city styles, and strict layout constraints. Existing approaches provide promising attempts at procedural content generation to create large-scale scenes using Blender agents. However, they face crucial issues such as difficulties in scaling up generation capability and achieving fine-grained control at the semantic layout level. To address these problems, we propose a novel multi-modal controllable procedural content generation method, named CityX, which enhances realistic, unbounded 3D city generation guided by multiple layout conditions, including OSM, semantic maps, and satellite images. Specifically, the proposed method contains a general protocol for integrating various PCG plugins and a multi-agent framework for transforming instructions into executable Blender actions. Through this effective framework, CityX shows the potential to build an innovative ecosystem for 3D scene generation by bridging the gap between the quality of generated assets and industrial requirements. Extensive experiments have demonstrated the effectiveness of our method in creating high-quality, diverse, and unbounded cities guided by multi-modal conditions. Our project page: this https URL.
https://arxiv.org/abs/2407.17572
3D scene generation is in high demand across various domains, including virtual reality, gaming, and the film industry. Owing to the powerful generative capabilities of text-to-image diffusion models that provide reliable priors, the creation of 3D scenes using only text prompts has become viable, thereby significantly advancing research in text-driven 3D scene generation. In order to obtain multi-view supervision from 2D diffusion models, prevailing methods typically employ the diffusion model to generate an initial local image, followed by iteratively outpainting the local image to gradually generate the scene. Nevertheless, these outpainting-based approaches are prone to producing globally inconsistent scenes with a low degree of completeness, restricting their broader applications. To tackle these problems, we introduce HoloDreamer, a framework that first generates a high-definition panorama as a holistic initialization of the full 3D scene and then leverages 3D Gaussian Splatting (3D-GS) to quickly reconstruct the 3D scene, thereby facilitating the creation of view-consistent and fully enclosed 3D scenes. Specifically, we propose Stylized Equirectangular Panorama Generation, a pipeline that combines multiple diffusion models to enable stylized and detailed equirectangular panorama generation from complex text prompts. Subsequently, Enhanced Two-Stage Panorama Reconstruction is introduced, conducting a two-stage optimization of the 3D-GS to inpaint the missing regions and enhance the integrity of the scene. Comprehensive experiments demonstrate that our method outperforms prior works in terms of overall visual consistency and harmony as well as reconstruction quality and rendering robustness when generating fully enclosed scenes.
https://arxiv.org/abs/2407.15187
Recent breakthroughs in text-to-image diffusion models have significantly advanced the generation of high-fidelity, photo-realistic images from textual descriptions. Yet, these models often struggle with interpreting spatial arrangements from text, hindering their ability to produce images with precise spatial configurations. To bridge this gap, layout-to-image generation has emerged as a promising direction. However, training-based approaches are limited by the need for extensively annotated datasets, leading to high data acquisition costs and a constrained conceptual scope. Conversely, training-free methods face challenges in accurately locating and generating semantically similar objects within complex compositions. This paper introduces a novel training-free approach designed to overcome adversarial semantic intersections during the diffusion conditioning phase. By refining intra-token loss with selective sampling and enhancing the diffusion process with attention redistribution, we propose two innovative constraints: 1) an inter-token constraint that resolves token conflicts to ensure accurate concept synthesis; and 2) a self-attention constraint that improves pixel-to-pixel relationships. Our evaluations confirm the effectiveness of leveraging layout information for guiding the diffusion process, generating content-rich images with enhanced fidelity and complexity. Code is available at this https URL.
https://arxiv.org/abs/2407.13609
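The inter-token constraint can be pictured as a loss on cross-attention maps: each token should concentrate its attention inside its own layout region and overlap as little as possible with other tokens. Below is a toy NumPy sketch of such a loss on synthetic maps; training-free methods of this kind typically backpropagate such a loss through the diffusion latents at each denoising step, which is omitted here, and the exact loss terms are illustrative rather than the paper's formulation.

```python
import numpy as np

def layout_losses(attn_maps, masks, eps=1e-8):
    """attn_maps, masks: (K, H, W); attention maps are normalized per token.

    Returns (inside_loss, intersection_loss): the first pushes each token's
    attention mass into its own region, the second penalizes attention mass
    that two different tokens place on the same pixels.
    """
    attn = attn_maps / (attn_maps.sum(axis=(1, 2), keepdims=True) + eps)
    inside = (attn * masks).sum(axis=(1, 2))              # mass inside own region, per token
    inside_loss = float((1.0 - inside).mean())
    k = attn.shape[0]
    inter = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            inter += float(np.minimum(attn[i], attn[j]).sum())
    intersection_loss = inter / max(k * (k - 1) / 2, 1)
    return inside_loss, intersection_loss

rng = np.random.default_rng(0)
h = w = 16
masks = np.zeros((2, h, w))
masks[0, :, :8] = 1.0                                     # token 0 owns the left half
masks[1, :, 8:] = 1.0                                     # token 1 owns the right half
good = masks + 0.01 * rng.uniform(size=(2, h, w))          # attention roughly matching the layout
bad = np.ones((2, h, w))                                  # attention spread everywhere
print(layout_losses(good, masks))                          # small losses
print(layout_losses(bad, masks))                           # larger losses
```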