We study the problem of collision-free humanoid traversal in cluttered indoor scenes, such as hurdling over objects scattered on the floor, crouching under low-hanging obstacles, or squeezing through narrow passages. To achieve this goal, the humanoid needs to map its perception of surrounding obstacles with diverse spatial layouts and geometries to the corresponding traversal skills. However, the lack of an effective representation that captures humanoid-obstacle relationships during collision avoidance makes directly learning such mappings difficult. We therefore propose the Humanoid Potential Field (HumanoidPF), which encodes these relationships as collision-free motion directions, significantly facilitating RL-based traversal skill learning. We also find that HumanoidPF exhibits a surprisingly negligible sim-to-real gap as a perceptual representation. To further enable generalizable traversal skills across diverse and challenging cluttered indoor scenes, we propose a hybrid scene generation method that incorporates crops of realistic 3D indoor scenes and procedurally synthesized obstacles. We successfully transfer our policy to the real world and develop a teleoperation system in which users can command the humanoid to traverse cluttered indoor scenes with a single click. Extensive experiments in both simulation and the real world validate the effectiveness of our method. Demos and code can be found on our website: this https URL.
https://arxiv.org/abs/2601.16035
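The core idea of encoding humanoid-obstacle relationships as collision-free motion directions can be pictured with a classic potential field. The sketch below is our own toy construction (point obstacles, hand-picked gains), not the paper's actual HumanoidPF formulation:

```python
import numpy as np

# Toy potential-field perceptual signal (our illustration, not the paper's
# exact HumanoidPF): obstacles repel, the goal attracts, and the encoded
# observation is the resulting collision-free motion direction.

def motion_direction(pos, goal, obstacles, radii, k_rep=1.0, k_att=1.0):
    """Return a unit vector along the combined attractive/repulsive gradient."""
    pos, goal = np.asarray(pos, float), np.asarray(goal, float)
    force = k_att * (goal - pos)                      # attractive term
    for c, r in zip(obstacles, radii):
        d_vec = pos - np.asarray(c, float)
        d = np.linalg.norm(d_vec)
        if d < 3 * r:                                 # repel only when close
            force += k_rep * (1.0 / d - 1.0 / (3 * r)) * d_vec / d**3
    return force / (np.linalg.norm(force) + 1e-9)

d = motion_direction(pos=[0.0, 0.0], goal=[2.0, 0.0],
                     obstacles=[[1.0, 0.1]], radii=[0.5])
print(d)  # steers toward the goal while deflecting away from the obstacle
```

An observation built this way lives entirely in the robot's local frame, which is one plausible reason such a representation could transfer well from simulation to the real world.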
Recent advancements in 3D object generation using diffusion models have achieved remarkable success, but generating realistic 3D urban scenes remains challenging. Existing methods relying solely on 3D diffusion models tend to suffer degraded appearance details, while those utilizing only 2D diffusion models typically compromise camera controllability. To overcome this limitation, we propose ScenDi, a method for urban scene generation that integrates both 3D and 2D diffusion models. We first train a 3D latent diffusion model to generate 3D Gaussians, enabling the rendering of images at a relatively low resolution. To enable controllable synthesis, this 3DGS generation process can optionally be conditioned on inputs such as 3D bounding boxes, road maps, or text prompts. Then, we train a 2D video diffusion model to enhance appearance details, conditioned on images rendered from the 3D Gaussians. By leveraging the coarse 3D scene as guidance for 2D video diffusion, ScenDi generates desired scenes based on input conditions and successfully adheres to accurate camera trajectories. Experiments on two challenging real-world datasets, Waymo and KITTI-360, demonstrate the effectiveness of our approach.
https://arxiv.org/abs/2601.15221
Natural walking enhances immersion in virtual environments (VEs), but physical space limitations and obstacles hinder exploration, especially in large virtual scenes. Redirected Walking (RDW) techniques mitigate this by subtly manipulating the virtual camera to guide users away from physical collisions within pre-defined VEs. However, RDW efficacy diminishes significantly when substantial geometric divergence exists between the physical and virtual environments, leading to unavoidable collisions. Existing scene generation methods primarily focus on object relationships or layout aesthetics, often neglecting the physical compatibility required for effective RDW. To address this, we introduce HCVR (High Compatibility Virtual Reality Environment Generation), a novel framework that generates virtual scenes inherently optimized for alignment-based RDW controllers. HCVR first employs ENI++, a novel boundary-sensitive metric that evaluates the incompatibility between physical and virtual spaces by comparing rotation-sensitive visibility polygons. Guided by the ENI++ compatibility map and user prompts, HCVR utilizes a Large Language Model (LLM) for context-aware 3D asset retrieval and initial layout generation. The framework then strategically adjusts object selection, scaling, and placement to maximize coverage of virtually incompatible regions, effectively guiding users towards RDW-feasible paths. User studies evaluating physical collisions and layout quality demonstrate HCVR's effectiveness: compared to LLM-based generation with RDW, HCVR-generated scenes result in 22.78 times fewer physical collisions and a 35.89% lower ENI++ score, while also receiving 12.5% higher user scores for layout design.
https://arxiv.org/abs/2601.14679
Most 3D scene generation methods are limited to generating object bounding box parameters, while newer diffusion methods also generate class labels and latent features. Using object size or latent features, they then retrieve objects from a predefined database. For complex scenes of varied, multi-categorical objects, diffusion-based latents cannot be effectively decoded by current autoencoders into correct point cloud objects that agree with the target classes. We introduce a Class-Partitioned Vector Quantized Variational Autoencoder (CPVQ-VAE) that is trained to effectively decode object latent features by employing a pioneering class-partitioned codebook in which codevectors are labeled by class. To address the problem of codebook collapse, we propose a class-aware running average update that reinitializes dead codevectors within each partition. During inference, object features and class labels, both generated by a Latent-space Flow Matching Model (LFMM) designed specifically for scene generation, are consumed by the CPVQ-VAE. The CPVQ-VAE's class-aware inverse look-up then maps generated latents to codebook entries that are decoded into class-specific point cloud shapes. Thereby, we achieve pure point cloud generation without relying on an external object database for retrieval. Extensive experiments reveal that our method reliably recovers plausible point cloud scenes, with up to 70.4% and 72.3% reductions in Chamfer and Point2Mesh errors on complex living room scenes.
https://arxiv.org/abs/2601.12391
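The class-partitioned codebook can be pictured as an ordinary VQ nearest-neighbor lookup restricted to the partition matching the generated class label. A minimal sketch under that reading; the codebook contents and shapes are invented for illustration:

```python
import numpy as np

# Sketch of a class-aware inverse lookup (our simplified reading of the
# CPVQ-VAE idea): each codevector carries a class label, and quantization of
# a generated latent searches only the partition of the generated class.

def class_aware_lookup(latent, codebook, code_labels, cls):
    """Return the index of the nearest codevector within the class partition."""
    idx = np.flatnonzero(code_labels == cls)          # restrict to one class
    dists = np.linalg.norm(codebook[idx] - latent, axis=1)
    return int(idx[np.argmin(dists)])

codebook = np.array([[0., 0.], [1., 1.], [5., 5.], [6., 6.]])
labels = np.array([0, 0, 1, 1])                       # two partitions
# A latent near [0, 0] but labeled class 1 snaps to class-1 codes only:
print(class_aware_lookup(np.array([0.2, 0.2]), codebook, labels, cls=1))  # -> 2
```

Restricting the search this way is what guarantees the decoded shape agrees with the generated class, even when the latent itself drifts toward another partition.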
Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene. In this paper, we introduce MetamerGen, a tool for generating scenes that are aligned with latent human scene representations. MetamerGen is a latent diffusion model that combines peripherally obtained scene gist information with information obtained from scene-viewing fixations to generate image metamers for what humans understand after viewing a scene. Generating images from both high- and low-resolution (i.e. "foveated") inputs constitutes a novel image-to-image synthesis problem, which we tackle by introducing a dual-stream representation of the foveated scenes consisting of DINOv2 tokens that fuse detailed features from fixated areas with peripherally degraded features capturing scene context. To evaluate the perceptual alignment of MetamerGen-generated images with latent human scene representations, we conducted a same-different behavioral experiment in which participants gave a "same" or "different" response for the generated and the original image. With that, we identify scene generations that are indeed metamers for the latent scene representations formed by the viewers. MetamerGen is thus a powerful tool for studying scene understanding. Our proof-of-concept analyses uncovered specific features at multiple levels of visual processing that contributed to human judgments. While MetamerGen can generate metamers even when conditioned on random fixations, we find that high-level semantic alignment most strongly predicts metamerism when the generated scenes are conditioned on viewers' own fixated regions.
https://arxiv.org/abs/2601.11675
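The dual-stream foveated input can be approximated by a toy foveation operator: full resolution inside a fixation window, degraded everywhere else. The box blur and hard window below are our simplifications, not the DINOv2 token pipeline the paper uses:

```python
import numpy as np

# Toy foveation sketch (blur kernel and hard mask are our simplifications):
# keep full resolution inside a fixation window and degrade the periphery,
# mimicking the high/low-resolution dual-stream input the abstract describes.

def foveate(img, fix_y, fix_x, radius=4, blur=5):
    """Return img with a sharp window at the fixation and a blurred surround."""
    pad = blur // 2
    padded = np.pad(img, pad, mode="edge")
    blurred = np.zeros_like(img, dtype=float)
    for dy in range(blur):                            # simple box blur
        for dx in range(blur):
            blurred += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    blurred /= blur * blur
    out = blurred.copy()
    y0, y1 = max(0, fix_y - radius), fix_y + radius + 1
    x0, x1 = max(0, fix_x - radius), fix_x + radius + 1
    out[y0:y1, x0:x1] = img[y0:y1, x0:x1]             # foveal stream: untouched
    return out

img = np.zeros((32, 32)); img[16, 16] = 1.0           # a single bright pixel
fov = foveate(img, 16, 16)
print(fov[16, 16])  # detail preserved at the fixation -> 1.0
```

Moving the fixation away from the bright pixel would leave only its blurred, low-resolution trace, which is exactly the information asymmetry the metamer generation has to work with.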
LiDAR scene synthesis is an emerging solution to the scarcity of 3D data for robotic tasks such as autonomous driving. Recent approaches employ diffusion or flow matching models to generate realistic scenes, but 3D data remains limited compared to RGB datasets with millions of samples. We introduce R3DPA, the first LiDAR scene generation method to unlock image-pretrained priors for LiDAR point clouds and to leverage self-supervised 3D representations for state-of-the-art results. Specifically, we (i) align intermediate features of our generative model with self-supervised 3D features, which substantially improves generation quality; (ii) transfer knowledge from large-scale image-pretrained generative models to LiDAR generation, mitigating limited LiDAR datasets; and (iii) enable point cloud control at inference for object inpainting and scene mixing with solely an unconditional model. On the KITTI-360 benchmark, R3DPA achieves state-of-the-art performance. Code and pretrained models are available at this https URL.
https://arxiv.org/abs/2601.07692
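Design (i), aligning intermediate generative features with frozen self-supervised features, is commonly implemented as a projection head trained with a cosine objective. A minimal sketch under that assumption; the head and loss shape are not taken from the paper:

```python
import numpy as np

# Minimal representation-alignment sketch (in the spirit of point (i); the
# linear projection head and cosine loss are our assumptions): project the
# generator's intermediate features and maximize cosine similarity with
# frozen self-supervised 3D features.

def alignment_loss(gen_feats, ssl_feats, proj):
    """Mean (1 - cosine similarity) between projected and target features."""
    z = gen_feats @ proj                              # projection head
    z /= np.linalg.norm(z, axis=-1, keepdims=True) + 1e-9
    t = ssl_feats / (np.linalg.norm(ssl_feats, axis=-1, keepdims=True) + 1e-9)
    return float(np.mean(1.0 - np.sum(z * t, axis=-1)))

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
target = feats[:, :4].copy()                          # targets spanned by feats
proj = np.zeros((16, 4)); proj[:4, :4] = np.eye(4)    # head that recovers them
print(alignment_loss(feats, target, proj))            # ~0: perfectly aligned
```

In training, this term would be added to the generative objective so that gradients pull the intermediate features toward the self-supervised embedding space.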
We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.
https://arxiv.org/abs/2601.04090
Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches that represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into the LLM. Furthermore, we design a dual-condition multi-modal generation model, in which the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub: this https URL.
https://arxiv.org/abs/2512.23180
This manuscript explores multimodal alignment, translation, fusion, and transference to enhance machine understanding of complex inputs. We organize the work into five chapters, each addressing unique challenges in multimodal machine learning. Chapter 3 introduces Spatial-Reasoning Bert for translating text-based spatial relations into 2D arrangements between clip-arts. This enables effective decoding of spatial language into visual representations, paving the way for automated scene generation aligned with human spatial understanding. Chapter 4 presents a method for translating medical texts into specific 3D locations within an anatomical atlas. We introduce a loss function leveraging spatial co-occurrences of medical terms to create interpretable mappings, significantly enhancing medical text navigability. Chapter 5 tackles translating structured text into canonical facts within knowledge graphs. We develop a benchmark for linking natural language to entities and predicates, addressing ambiguities in text extraction to provide clearer, actionable insights. Chapter 6 explores multimodal fusion methods for compositional action recognition. We propose a method fusing video frames and object detection representations, improving recognition robustness and accuracy. Chapter 7 investigates multimodal knowledge transference for egocentric action recognition. We demonstrate how multimodal knowledge distillation enables RGB-only models to mimic multimodal fusion-based capabilities, reducing computational requirements while maintaining performance. These contributions advance methodologies for spatial language understanding, medical text interpretation, knowledge graph enrichment, and action recognition, enhancing computational systems' ability to process complex, multimodal inputs across diverse applications.
https://arxiv.org/abs/2512.20501
Generalist robot learning remains constrained by data: large-scale, diverse, and high-quality interaction data are expensive to collect in the real world. While simulation has become a promising way for scaling up data collection, the related tasks, including simulation task design, task-aware scene generation, expert demonstration synthesis, and sim-to-real transfer, still demand substantial human effort. We present AnyTask, an automated framework that pairs massively parallel GPU simulation with foundation models to design diverse manipulation tasks and synthesize robot data. We introduce three AnyTask agents for generating expert demonstrations aiming to solve as many tasks as possible: 1) ViPR, a novel task and motion planning agent with VLM-in-the-loop Parallel Refinement; 2) ViPR-Eureka, a reinforcement learning agent with generated dense rewards and LLM-guided contact sampling; 3) ViPR-RL, a hybrid planning and learning approach that jointly produces high-quality demonstrations with only sparse rewards. We train behavior cloning policies on generated data, validate them in simulation, and deploy them directly on real robot hardware. The policies generalize to novel object poses, achieving 44% average success across a suite of real-world pick-and-place, drawer opening, contact-rich pushing, and long-horizon manipulation tasks. Our project website is at this https URL.
https://arxiv.org/abs/2512.17853
Recent advances in 3D scene generation produce visually appealing output, but current representations hinder artists' workflows, which require modifiable 3D textured mesh scenes for visual effects and game development. Despite significant advances, current textured mesh scene reconstruction methods are far from artist-ready, suffering from incorrect object decomposition, inaccurate spatial relationships, and missing backgrounds. We present 3D-RE-GEN, a compositional framework that reconstructs a single image into textured 3D objects and a background. We show that combining state-of-the-art models from specific domains achieves state-of-the-art scene reconstruction performance, addressing artists' requirements. Our reconstruction pipeline integrates models for asset detection, reconstruction, and placement, pushing certain models beyond their originally intended domains. Recovering occluded objects is treated as an image-editing task, using generative models to infer and reconstruct them with scene-level reasoning under consistent lighting and geometry. Unlike current methods, 3D-RE-GEN generates a comprehensive background that spatially constrains objects during optimization and provides a foundation for realistic lighting and simulation tasks in visual effects and games. To obtain physically realistic layouts, we employ a novel 4-DoF differentiable optimization that aligns reconstructed objects with the estimated ground plane. 3D-RE-GEN achieves state-of-the-art performance in single-image 3D scene reconstruction, producing coherent, modifiable scenes through compositional generation guided by precise camera recovery and spatial optimization.
https://arxiv.org/abs/2512.17459
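The 4-DoF ground alignment (translation plus yaw) can be sketched as a tiny optimization over a box's corners. Finite differences stand in here for the paper's differentiable formulation, and the loss is our own illustrative choice:

```python
import numpy as np

# Sketch of 4-DoF (x, y, z, yaw) ground alignment (finite-difference descent
# is our stand-in for the paper's differentiable optimizer): move a box so
# its lowest corner rests on the estimated ground plane z = 0.

def transform(corners, p):
    x, y, z, yaw = p
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]])  # yaw about z
    return corners @ R.T + np.array([x, y, z])

def ground_loss(corners, p):
    zmin = transform(corners, p)[:, 2].min()
    return zmin ** 2                                  # lowest corner on plane

def align(corners, p0, lr=0.2, steps=200, eps=1e-4):
    p = np.array(p0, float)
    for _ in range(steps):
        g = np.zeros(4)
        for i in range(4):                            # finite-difference grad
            dp = np.zeros(4); dp[i] = eps
            g[i] = (ground_loss(corners, p + dp)
                    - ground_loss(corners, p - dp)) / (2 * eps)
        p -= lr * g
    return p

box = np.array([[sx, sy, sz] for sx in (-1, 1)
                for sy in (-1, 1) for sz in (0, 2)], float)
p = align(box, p0=[0., 0., 0.7, 0.3])                 # starts floating at z=0.7
print(round(float(p[2]), 3))                          # z offset driven toward 0
```

A real pipeline would add penetration and contact terms so objects neither float above nor sink below the plane; the sketch only shows why four degrees of freedom suffice for upright placement.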
Synthetic 3D scenes are essential for developing Physical AI and generative models. Existing procedural generation methods often have low output throughput, creating a significant bottleneck in scaling up dataset creation. In this work, we introduce Sceniris, a highly efficient procedural scene generation framework for rapidly generating large-scale, collision-free scene variations. Sceniris also provides an optional robot reachability check, providing manipulation-feasible scenes for robot tasks. Sceniris is designed for maximum efficiency by addressing the primary performance limitations of the prior method, Scene Synthesizer. Leveraging batch sampling and faster collision checking in cuRobo, Sceniris achieves at least 234x speed-up over Scene Synthesizer. Sceniris also expands the object-wise spatial relationships available in prior work to support diverse scene requirements. Our code is available at this https URL.
https://arxiv.org/abs/2512.16896
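Batched sampling with vectorized collision checks is the main source of such speed-ups: instead of rejecting one candidate placement at a time, thousands are tested in one array operation. A toy 2D version, with circle obstacles standing in for cuRobo's collision checker:

```python
import numpy as np

# Sketch of batched collision-free placement sampling (2D circles stand in
# for real collision geometry; purely our toy): draw many candidate poses at
# once and keep those that clear all existing obstacles.

def sample_free(obstacles, radii, new_r, n, extent=10.0, seed=0):
    rng = np.random.default_rng(seed)
    cand = rng.uniform(0, extent, size=(n, 2))        # batch of candidate poses
    d = np.linalg.norm(cand[:, None, :] - obstacles[None], axis=-1)
    free = (d > radii[None] + new_r).all(axis=1)      # vectorized overlap test
    return cand[free]

obs = np.array([[5.0, 5.0]]); rad = np.array([2.0])
placements = sample_free(obs, rad, new_r=0.5, n=1000)
# every accepted placement clears the obstacle by the sum of both radii
print(np.all(np.linalg.norm(placements - obs, axis=1) > 2.5))  # True
```

The same pattern extends to GPU collision checkers: the batch dimension replaces the rejection-sampling loop, so throughput scales with how many candidates fit in one kernel launch.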
Feed-forward 3D Gaussian Splatting (3DGS) models enable real-time scene generation but are hindered by suboptimal pixel-aligned primitive placement, which relies on a dense, rigid grid and limits both quality and efficiency. We introduce a new feed-forward architecture that detects 3D Gaussian primitives at a sub-pixel level, replacing the pixel grid with an adaptive, "Off The Grid" distribution. Inspired by keypoint detection, our multi-resolution decoder learns to distribute primitives across image patches. This module is trained end-to-end with a 3D reconstruction backbone using self-supervised learning. Our resulting pose-free model generates photorealistic scenes in seconds, achieving state-of-the-art novel view synthesis for feed-forward models. It outperforms competitors while using far fewer primitives, demonstrating a more accurate and efficient allocation that captures fine details and reduces artifacts. Moreover, we observe that by learning to render 3D Gaussians, our 3D reconstruction backbone improves camera pose estimation, suggesting opportunities to train these foundational models without labels.
https://arxiv.org/abs/2512.15508
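Detecting primitives at a sub-pixel level, keypoint style, is often realized with a soft-argmax over a patch heatmap: the predicted location is the expectation of the pixel grid under a softmax distribution. A sketch under that assumption, not the paper's actual decoder:

```python
import numpy as np

# Keypoint-style sub-pixel localization sketch (soft-argmax over a patch
# heatmap; our illustration of placing primitives "off the grid", not the
# paper's multi-resolution decoder).

def soft_argmax(heatmap, temperature=0.1):
    """Differentiable sub-pixel location: softmax-weighted mean grid index."""
    h, w = heatmap.shape
    p = np.exp(heatmap.ravel() / temperature)
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return float(p @ ys.ravel()), float(p @ xs.ravel())

hm = np.zeros((5, 5))
hm[2, 2] = 1.0; hm[2, 3] = 1.0                        # peak between two cells
y, x = soft_argmax(hm)
print(round(y, 2), round(x, 2))                       # lands between columns 2 and 3
```

Because the output is a continuous expectation rather than an integer argmax, gradients flow through it, which is what lets such a module be trained end-to-end with a reconstruction backbone.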
Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene datasets, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene-level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator's transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are randomly composed objects. This demonstrates that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing the widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: this https URL
https://arxiv.org/abs/2512.13683
Generating controllable and interactive indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI training. Yet existing approaches either handle a narrow range of input modalities or rely on stochastic processes that hinder controllability. To overcome these limitations, we introduce RoomPilot, a unified framework that parses diverse multi-modal inputs--textual descriptions or CAD floor plans--into an Indoor Domain-Specific Language (IDSL) for indoor structured scene generation. The key insight is that a well-designed IDSL can act as a shared semantic representation, enabling coherent, high-quality scene synthesis from any single modality while maintaining interaction semantics. In contrast to conventional procedural methods that produce visually plausible but functionally inert layouts, RoomPilot leverages a curated dataset of interaction-annotated assets to synthesize environments exhibiting realistic object behaviors. Extensive experiments further validate its strong multi-modal understanding, fine-grained controllability in scene generation, and superior physical consistency and visual fidelity, marking a significant step toward general-purpose controllable 3D indoor scene generation.
https://arxiv.org/abs/2512.11234
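A domain-specific language for indoor scenes can be as simple as a structured schema that both a text parser and a CAD-plan parser emit before synthesis. The field names below are hypothetical, invented for illustration; the paper's IDSL is certainly richer:

```python
import json
from dataclasses import dataclass

# Hypothetical fragment of an indoor domain-specific language (field names
# are invented; the actual IDSL schema is not shown in the abstract): a
# shared structured representation that either modality could be parsed into.

@dataclass
class ObjectSpec:
    category: str
    position: tuple      # (x, y) in metres, room frame
    interactions: tuple  # e.g. ("openable",), preserving interaction semantics

@dataclass
class RoomSpec:
    size: tuple
    objects: list

def parse_idsl(src: str) -> RoomSpec:
    """Load an IDSL document from a JSON surface form."""
    doc = json.loads(src)
    objs = [ObjectSpec(o["category"], tuple(o["position"]),
                       tuple(o.get("interactions", ()))) for o in doc["objects"]]
    return RoomSpec(tuple(doc["size"]), objs)

room = parse_idsl('{"size": [4, 5], "objects": '
                  '[{"category": "door", "position": [0, 2],'
                  ' "interactions": ["openable"]}]}')
print(room.objects[0].category)  # door
```

The value of such an intermediate representation is exactly what the abstract claims: any input modality that can be parsed into it yields the same downstream synthesis path, with interaction annotations carried along explicitly.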
In this work, we propose a decoupled 3D scene generation framework called SceneMaker. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation and enhance it by leveraging image datasets and collected de-occlusion datasets covering much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. Besides, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets are released at this https URL.
https://arxiv.org/abs/2512.10957
Generating high-quality, textured 3D scenes from a single image remains a fundamental challenge in vision and graphics. Recent image-to-3D generators recover reasonable geometry from single views, but their object-centric training limits generalization to complex, large-scale scenes with faithful structure and texture. We present EvoScene, a self-evolving, training-free framework that progressively reconstructs complete 3D scenes from single images. The key idea is combining the complementary strengths of existing models: geometric reasoning from 3D generation models and visual knowledge from video generation models. Through three iterative stages--Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation--EvoScene alternates between 2D and 3D domains, gradually improving both structure and appearance. Experiments on diverse scenes demonstrate that EvoScene achieves superior geometric stability, view-consistent textures, and unseen-region completion compared to strong baselines, producing ready-to-use 3D meshes for practical applications.
https://arxiv.org/abs/2512.08905
Realistic and diverse multi-agent driving scenes are crucial for evaluating autonomous vehicles, but safety-critical events, which are essential for this task, are rare and underrepresented in driving datasets. Data-driven scene generation offers a low-cost alternative by synthesizing complex traffic behaviors from existing driving logs. However, existing models often lack controllability or yield samples that violate physical or social constraints, limiting their usability. We present OMEGA, an optimization-guided, training-free framework that enforces structural consistency and interaction awareness during diffusion-based sampling from a scene generation model. OMEGA re-anchors each reverse diffusion step via constrained optimization, steering the generation towards physically plausible and behaviorally coherent trajectories. Building on this framework, we formulate ego-attacker interactions as a game-theoretic optimization in the distribution space, approximating Nash equilibria to generate realistic, safety-critical adversarial scenarios. Experiments on nuPlan and Waymo show that OMEGA improves generation realism, consistency, and controllability, increasing the ratio of physically and behaviorally valid scenes from 32.35% to 72.27% for free exploration and from 11% to 80% for controllability-focused generation. Our approach can also generate 5× more near-collision frames with a time-to-collision under three seconds while maintaining overall scene realism.
https://arxiv.org/abs/2512.07661
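OMEGA's re-anchoring can be illustrated with a toy one-dimensional diffusion sampler in which every reverse step is followed by a projection onto a feasible set. Here a box constraint stands in for physical limits, and the denoising schedule is entirely illustrative; none of this is OMEGA's actual formulation:

```python
import numpy as np

def project(x, lo=-1.0, hi=1.0):
    # Constrained optimization reduced to its simplest case: Euclidean
    # projection onto a box, argmin_y ||y - x||^2 s.t. lo <= y <= hi,
    # whose closed-form solution is elementwise clipping.
    return np.clip(x, lo, hi)

def reverse_diffusion(x_T, n_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    x = x_T
    for t in range(n_steps, 0, -1):
        noise_scale = t / n_steps
        # Plain reverse step: shrink toward the data mean (0 here) plus noise.
        x = 0.9 * x + noise_scale * 0.1 * rng.standard_normal(x.shape)
        # Re-anchor: pull the iterate back into the feasible set before
        # the next denoising step, so constraint violations never compound.
        x = project(x)
    return x

x0 = reverse_diffusion(np.array([5.0, -7.0]))
```

The design point is that the constraint is enforced inside the sampling loop rather than as a post-hoc filter, so later denoising steps continue from an already-feasible iterate.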
Compositionality is critical for 3D object and scene generation, but existing part-aware 3D generation methods scale poorly as the number of components grows, since global attention costs are quadratic. In this work, we present MoCA, a compositional 3D generative model with two key designs: (1) importance-based component routing that selects the top-k relevant components for sparse global attention, and (2) unimportant-component compression that preserves the contextual priors of unselected components while reducing the computational complexity of global attention. With these designs, MoCA enables efficient, fine-grained compositional 3D asset creation with a scalable number of components. Extensive experiments show MoCA outperforms baselines on both compositional object and scene generation tasks. Project page: this https URL
https://arxiv.org/abs/2512.07628
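The routing-plus-compression idea from the MoCA abstract can be sketched in a few lines: score components, keep all tokens of the top-k for global attention, and mean-pool the rest down to one token each. The scoring and pooling choices below are illustrative assumptions, not MoCA's published design:

```python
import numpy as np

def route_components(tokens, importance, k):
    """Select top-k components for full attention; mean-pool the rest.

    tokens:     (n_components, n_tokens, d) per-component token features
    importance: (n_components,) importance scores (how these are computed
                is model-specific; arbitrary here for illustration)
    """
    order = np.argsort(importance)[::-1]
    selected, rest = order[:k], order[k:]
    # Selected components keep every token: sparse global attention
    # runs only over this reduced token set.
    kept = tokens[selected].reshape(-1, tokens.shape[-1])
    # Unselected components are compressed to a single token each,
    # preserving coarse contextual priors at negligible attention cost.
    compressed = tokens[rest].mean(axis=1)
    return np.concatenate([kept, compressed], axis=0)

rng = np.random.default_rng(0)
toks = rng.standard_normal((8, 16, 32))  # 8 components, 16 tokens each, dim 32
out = route_components(toks, rng.random(8), k=2)
```

With 8 components of 16 tokens each and k=2, the attention set shrinks from 128 tokens to 2 × 16 + 6 = 38, which is where the scalability claim comes from: the full-token count grows with k, not with the total number of components.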
We introduce MVRoom, a controllable novel view synthesis (NVS) pipeline for 3D indoor scenes that uses multi-view diffusion conditioned on a coarse 3D layout. MVRoom employs a two-stage design in which the 3D layout is used throughout to enforce multi-view consistency. The first stage introduces novel representations that effectively bridge the 3D layout and consistent image-based condition signals for multi-view generation. The second stage performs image-conditioned multi-view generation, incorporating a layout-aware epipolar attention mechanism to enhance multi-view consistency during the diffusion process. Additionally, we introduce an iterative framework that generates 3D scenes with varying numbers of objects and scene complexities by recursively performing multi-view generation (MVRoom), supporting text-to-scene generation. Experimental results demonstrate that our approach achieves high-fidelity and controllable 3D scene generation for NVS, outperforming state-of-the-art baseline methods both quantitatively and qualitatively. Ablation studies further validate the effectiveness of key components within our generation pipeline.
https://arxiv.org/abs/2512.04248
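The geometric core of epipolar attention, as in the MVRoom abstract, is restricting cross-view attention to pixels near the epipolar line. Below is the standard textbook construction of such a mask from a fundamental matrix; it illustrates the general mechanism only, not MVRoom's layout-aware variant:

```python
import numpy as np

def epipolar_mask(F, pts_a, pts_b, tau=2.0):
    """Boolean attention mask: pts_b[j] may attend to pts_a[i] only if
    pts_b[j] lies within tau pixels of the epipolar line F @ pts_a[i].

    F: (3, 3) fundamental matrix; pts_a: (N, 2), pts_b: (M, 2) pixel coords.
    Returns an (M, N) boolean mask.
    """
    ha = np.concatenate([pts_a, np.ones((len(pts_a), 1))], axis=1)  # (N, 3)
    hb = np.concatenate([pts_b, np.ones((len(pts_b), 1))], axis=1)  # (M, 3)
    lines = ha @ F.T                             # (N, 3) epipolar lines in view B
    # Point-to-line distance |l . x| / sqrt(l1^2 + l2^2)
    num = np.abs(hb @ lines.T)                   # (M, N)
    den = np.linalg.norm(lines[:, :2], axis=1)   # (N,)
    return num / den <= tau

# Rectified-stereo sanity check: F = [t]_x for translation along x gives
# horizontal epipolar lines, so valid matches must share a pixel row.
F = np.array([[0.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])
mask = epipolar_mask(F, np.array([[5.0, 3.0]]),
                     np.array([[10.0, 3.0], [10.0, 9.0]]), tau=1.0)
```

In a diffusion U-Net this mask would gate the cross-view attention logits, so each query pixel aggregates features only from geometrically plausible correspondences in the other views.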