We tackle a new problem: generating geometrically consistent multi-view scenes from a single freehand sketch. Freehand sketches are the most geometrically impoverished input one could offer a multi-view generator. They convey scene intent through abstract strokes while introducing spatial distortions that actively conflict with any consistent 3D interpretation. No prior method attempts this; existing multi-view approaches require photographs or text, while sketch-to-3D methods need multiple views or costly per-scene optimization. We address three compounding challenges (absent training data, the need for geometric reasoning from distorted 2D input, and cross-view consistency) through three mutually reinforcing contributions: (i) a curated dataset of $\sim$9k sketch-to-multiview samples, constructed via an automated generation and filtering pipeline; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; and (iii) a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions. Our framework synthesizes all views in a single denoising process without requiring reference images, iterative refinement, or per-scene optimization. Our approach significantly outperforms state-of-the-art two-stage baselines, improving realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23%, while providing up to a 3.7$\times$ inference speedup.
https://arxiv.org/abs/2604.14302
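A minimal PyTorch sketch of the kind of sparse correspondence supervision the abstract describes: SfM-verified point correspondences between two generated views are sampled and their feature (or color) values are pulled together. The function name, tensor layout, and the L1 penalty are illustrative assumptions, not the paper's actual CSL.

```python
# Hypothetical sketch of a sparse correspondence consistency loss (not the paper's exact CSL).
# Assumes SfM gives pixel coordinates of the same 3D point observed in two views.
import torch
import torch.nn.functional as F

def sparse_correspondence_loss(feats, corr):
    """
    feats: (V, C, H, W) per-view feature (or RGB) maps from the denoiser.
    corr:  list of (view_a, view_b, xy_a, xy_b) with xy in normalized [-1, 1] coords,
           xy_* shaped (K, 2). Penalizes mismatch at SfM-verified correspondences.
    """
    loss = feats.new_zeros(())
    total = 0
    for va, vb, xy_a, xy_b in corr:
        # grid_sample expects (N, H_out, W_out, 2); treat the K points as a 1 x K grid.
        fa = F.grid_sample(feats[va:va + 1], xy_a.view(1, 1, -1, 2), align_corners=True)
        fb = F.grid_sample(feats[vb:vb + 1], xy_b.view(1, 1, -1, 2), align_corners=True)
        loss = loss + (fa - fb).abs().mean()
        total += 1
    return loss / max(total, 1)

# Toy usage with random tensors standing in for two generated views.
feats = torch.randn(2, 3, 64, 64)
corr = [(0, 1, torch.rand(50, 2) * 2 - 1, torch.rand(50, 2) * 2 - 1)]
print(sparse_correspondence_loss(feats, corr))
```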
Generating high-fidelity 3D indoor scenes remains a significant challenge due to data scarcity and the complexity of modeling intricate spatial relations. Current methods often struggle to scale beyond the training distribution to dense scenes, or rely on LLMs/VLMs that lack precise spatial reasoning ability. Building on the observation that object placement relies mainly on local dependencies rather than information-redundant global distributions, we propose Pair2Scene, a novel procedural generation framework that integrates learned local rules with scene hierarchies and physics-based algorithms. These rules capture two types of inter-object relations: support relations that follow physical hierarchies, and functional relations that reflect semantic links. We model these rules with a network that estimates spatial position distributions of dependent objects conditioned on the position and geometry of anchor objects. Accordingly, we curate a dataset, 3D-Pairs, from existing scene data to train the model. During inference, our framework generates scenes by recursively applying the model within a hierarchical structure, leveraging collision-aware rejection sampling to align local rules into coherent global layouts. Extensive experiments demonstrate that our framework outperforms existing methods in generating complex environments that go beyond the training data while maintaining physical and semantic plausibility.
https://arxiv.org/abs/2604.11808
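The collision-aware rejection sampling step is easy to illustrate. The sketch below assumes a learned conditional proposal (stood in for by a hypothetical `propose` callable) and axis-aligned 2D footprints; it is a toy version of the idea, not Pair2Scene's implementation.

```python
# Illustrative collision-aware rejection sampling for pairwise object placement
# (a simplified sketch). Axis-aligned 2D boxes stand in for object footprints;
# `propose` stands in for the learned conditional position distribution p(child | anchor).
import random

def overlaps(a, b):
    """Axis-aligned footprint overlap test: boxes are (x_min, y_min, x_max, y_max)."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def place_child(anchor, size, placed, propose, max_tries=64):
    """Draw candidate centers near the anchor and keep the first collision-free one."""
    for _ in range(max_tries):
        cx, cy = propose(anchor)
        box = (cx - size[0] / 2, cy - size[1] / 2, cx + size[0] / 2, cy + size[1] / 2)
        if all(not overlaps(box, other) for other in placed):
            placed.append(box)
            return box
    return None  # rejection budget exhausted; caller may re-sample the anchor

# Toy usage: a fixed anchor box and a uniform proposal around it.
placed = [(0.0, 0.0, 1.0, 1.0)]  # anchor footprint (e.g., a table)
propose = lambda a: (random.uniform(-1.5, 2.5), random.uniform(-1.5, 2.5))
print(place_child(placed[0], (0.5, 0.5), placed, propose))
```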
3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of a scene-level 3D latent representation, but also to the fact that most scene-level 3D visual data exists as multi-view images or videos, which are naturally compatible with 2D diffusion architectures. Typically, these 2D-based approaches reduce 3D spatial extrapolation to 2D temporal extension, which introduces two fundamental issues: (i) representing 3D scenes via 2D views leads to significant representation redundancy, and (ii) a latent space rooted in 2D inherently limits the spatial consistency of the generated 3D scenes. In this paper, we propose, for the first time, to perform 3D scene generation directly within an implicit 3D latent space to address these limitations. First, we repurpose frozen 2D representation encoders to construct our 3D Representation Autoencoder (3DRAE), which grounds view-coupled 2D semantic representations into a view-decoupled 3D latent representation. This enables representing 3D scenes observed from arbitrary numbers of views--at any resolution and aspect ratio--with fixed complexity and rich semantics. Then we introduce a 3D Diffusion Transformer (3DDiT), which performs diffusion modeling in this 3D latent space, achieving remarkably efficient and spatially consistent 3D scene generation while supporting diverse conditioning configurations. Moreover, since our approach directly generates a 3D scene representation, it can be decoded to images and optional point maps along arbitrary camera trajectories without requiring a per-trajectory diffusion sampling pass, which is common in 2D-based approaches.
https://arxiv.org/abs/2604.11331
3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation requires tedious labor, while data-driven generation often lacks diversity. The emergence of large models introduces new possibilities for 3D scene synthesis. We present HOG-Layout, which enables text-driven hierarchical scene generation, optimization, and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves scene semantic consistency and plausibility through retrieval-augmented generation (RAG), incorporates an optimization module to enhance physical consistency, and adopts a hierarchical representation to enhance inference and optimization, achieving real-time editing. Experimental results demonstrate that HOG-Layout produces more reasonable environments than existing baselines, while supporting fast and intuitive scene editing.
https://arxiv.org/abs/2604.10772
The growing demand for Embodied AI and VR applications has highlighted the need for synthesizing high-quality 3D indoor scenes from sparse inputs. However, existing approaches struggle to infer massive amounts of missing geometry in large unseen areas while maintaining global consistency, often producing locally plausible but globally inconsistent reconstructions. We present Rein3D, a framework that reconstructs full 360-degree indoor environments by coupling explicit 3D Gaussian Splatting (3DGS) with temporally coherent priors from video diffusion models. Our approach follows a "restore-and-refine" paradigm: we employ a radial exploration strategy to render imperfect panoramic videos along trajectories starting from the origin, effectively uncovering occluded regions from a coarse 3DGS initialization. These sequences are restored by a panoramic video-to-video diffusion model and further enhanced via video super-resolution to synthesize high-fidelity geometry and textures. Finally, these refined videos serve as pseudo-ground truths to update the global 3D Gaussian field. To support this task, we construct PanoV2V-15K, a dataset of over 15K paired clean and degraded panoramic videos for diffusion-based scene restoration. Experiments demonstrate that Rein3D produces photorealistic and globally consistent 3D scenes and significantly improves long-range camera exploration compared with existing baselines.
https://arxiv.org/abs/2604.10578
Existing single-image 3D indoor scene generators often produce results that look visually plausible but fail to obey real-world physics, limiting their reliability in robotics, embodied AI, and design. To examine this gap, we introduce a unified Physics Evaluator that measures four main aspects: geometric priors, contact, stability, and deployability, which are further decomposed into nine sub-constraints, establishing the first benchmark to measure physical consistency. Based on this evaluator, our analysis shows that state-of-the-art methods remain largely physics-unaware. To overcome this limitation, we further propose a framework that integrates feedback from the Physics Evaluator into both training and inference, enhancing the physical plausibility of generated scenes. Specifically, we propose PhyMix, which is composed of two complementary components: (i) implicit alignment via Scene-GRPO, a critic-free group-relative policy optimization that leverages the Physics Evaluator as a preference signal and biases sampling towards physically feasible layouts, and (ii) explicit refinement via a plug-and-play Test-Time Optimizer (TTO) that uses differentiable evaluator signals to correct residual violations during generation. Overall, our method unifies evaluation, reward shaping, and inference-time correction, producing 3D indoor scenes that are visually faithful and physically plausible. Extensive synthetic evaluations confirm state-of-the-art performance in both visual fidelity and physical plausibility, and extensive qualitative examples in stylized and real-world images further showcase the robustness of the method. We will release codes and models upon publication.
https://arxiv.org/abs/2604.10125
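For readers unfamiliar with group-relative policy optimization, the core of a critic-free update like the Scene-GRPO described above can be sketched as follows: sample a group of layouts per prompt, score each with the physics evaluator, and normalize rewards within the group to obtain advantages. This is the generic GRPO-style objective only; the paper's exact loss (e.g., any clipping or KL terms) is not specified in the abstract.

```python
# Rough sketch of a critic-free, group-relative update in the spirit of GRPO
# (not the paper's Scene-GRPO code). Rewards come from a physics evaluator that
# scores each sampled layout; advantages are normalized within the group, so no
# value network is needed.
import torch

def group_relative_loss(log_probs, rewards, eps=1e-6):
    """
    log_probs: (G,) summed log-probability of each sampled layout under the policy.
    rewards:   (G,) physics-evaluator scores for the same G samples of one prompt.
    """
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Maximize expected advantage-weighted log-likelihood (minimize its negative).
    return -(advantages.detach() * log_probs).mean()

# Toy usage: 4 sampled layouts for one prompt.
log_probs = torch.randn(4, requires_grad=True)
rewards = torch.tensor([0.2, 0.9, 0.5, 0.1])
loss = group_relative_loss(log_probs, rewards)
loss.backward()
print(loss.item())
```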
Scaling Vision-Language-Action (VLA) models requires massive datasets that are both semantically coherent and physically feasible. However, existing scene generation methods often lack context-awareness, making it difficult to synthesize high-fidelity environments embedded with rich semantic information, frequently resulting in unreachable target positions that cause tasks to fail prematurely. We present V-CAGE (Vision-Closed-loop Agentic Generation Engine), an agentic framework for autonomous robotic data synthesis. Unlike traditional scripted pipelines, V-CAGE operates as an embodied agentic system, leveraging foundation models to bridge high-level semantic reasoning with low-level physical interaction. Specifically, we introduce Inpainting-Guided Scene Construction to systematically arrange context-aware layouts, ensuring that the generated scenes are both semantically structured and kinematically reachable. To ensure trajectory correctness, we integrate functional metadata with a Vision-Language Model based closed-loop verification mechanism, acting as a visual critic to rigorously filter out silent failures and sever the error propagation chain. Finally, to overcome the storage bottleneck of massive video datasets, we implement a perceptually-driven compression algorithm that achieves over 90\% filesize reduction without compromising downstream VLA training efficacy. By centralizing semantic layout planning and visual self-verification, V-CAGE automates the end-to-end pipeline, enabling the highly scalable synthesis of diverse, high-quality robotic manipulation datasets.
https://arxiv.org/abs/2604.09036
We present Genie Sim PanoRecon, a feed-forward Gaussian-splatting pipeline that delivers high-fidelity, low-cost 3D scenes for robotic manipulation simulation. The input panorama is decomposed into six non-overlapping cube-map faces, processed in parallel, and seamlessly reassembled. To guarantee geometric consistency across views, we devise a depth-aware fusion strategy coupled with a training-free depth-injection module that steers the monocular feed-forward network to generate coherent 3D Gaussians. The whole system reconstructs photo-realistic scenes in seconds and has been integrated into Genie Sim - an LLM-driven simulation platform for embodied synthetic data generation and evaluation - to provide scalable backgrounds for manipulation tasks. For code details, please refer to: this https URL.
https://arxiv.org/abs/2604.07105
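The cube-map decomposition mentioned above is a standard operation; a minimal NumPy version is sketched below (nearest-neighbor sampling and one face-orientation convention among several, not the Genie Sim PanoRecon code).

```python
# Minimal equirectangular-to-cubemap face extraction (generic sketch of the
# decomposition step described above).
import numpy as np

def cube_face(pano, face, size=256):
    """pano: (H, W, 3) equirectangular image; face in {front, right, back, left, up, down}."""
    u = np.linspace(-1, 1, size)
    a, b = np.meshgrid(u, -u)  # per-pixel coordinates on the face plane (b points up)
    dirs = {
        "front": ( a,  b,  np.ones_like(a)),
        "back":  (-a,  b, -np.ones_like(a)),
        "right": ( np.ones_like(a),  b, -a),
        "left":  (-np.ones_like(a),  b,  a),
        "up":    ( a,  np.ones_like(a), -b),
        "down":  ( a, -np.ones_like(a),  b),
    }[face]
    x, y, z = dirs
    lon = np.arctan2(x, z)                       # [-pi, pi]
    lat = np.arctan2(y, np.sqrt(x**2 + z**2))    # [-pi/2, pi/2]
    H, W = pano.shape[:2]
    col = ((lon / (2 * np.pi) + 0.5) * (W - 1)).astype(int)
    row = ((0.5 - lat / np.pi) * (H - 1)).astype(int)
    return pano[row, col]

pano = np.random.rand(512, 1024, 3)
faces = {f: cube_face(pano, f) for f in ["front", "right", "back", "left", "up", "down"]}
print(faces["front"].shape)
```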
Learning-based methods for 3D scene reconstruction and object completion require large datasets containing partial scans paired with complete ground-truth geometry. However, acquiring such datasets using real-world scanning systems is costly and time-consuming, particularly when accurate ground truth for occluded regions is required. In this work, we present a virtual scanning framework implemented in Unity for generating realistic synthetic 3D scan datasets. The proposed system simulates the behaviour of real-world scanners using configurable parameters such as scan resolution, measurement range, and distance-dependent noise. Instead of directly sampling mesh surfaces, the framework performs ray-based scanning from virtual viewpoints, enabling realistic modelling of sensor visibility and occlusion effects. In addition, panoramic images captured at the scanner location are used to assign colours to the resulting point clouds. To support scalable dataset creation, the scanner is integrated with a procedural indoor scene generation pipeline that automatically produces diverse room layouts and furniture arrangements. Using this system, we introduce the \textit{V-Scan} dataset, which contains synthetic indoor scans together with object-level partial point clouds, voxel-based occlusion grids, and complete ground-truth geometry. The resulting dataset provides valuable supervision for training and evaluating learning-based methods for scene reconstruction and object completion.
https://arxiv.org/abs/2604.07010
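The paper's scanner runs inside Unity; the toy Python sketch below only illustrates the two key ingredients it describes, ray-based visibility (here against a single wall plane) and distance-dependent Gaussian range noise. Parameter names and the exact noise model are assumptions.

```python
# Conceptual sketch of ray-based scanning with distance-dependent noise.
# Rays are intersected with one wall plane to keep the example self-contained.
import numpy as np

def scan_plane(origin, yaw_range, pitch_range, res, plane_z, max_range=10.0, noise_per_m=0.002):
    rng = np.random.default_rng(0)
    yaws = np.linspace(*yaw_range, res)
    pitches = np.linspace(*pitch_range, res)
    points = []
    for p in pitches:
        for y in yaws:
            d = np.array([np.cos(p) * np.sin(y), np.sin(p), np.cos(p) * np.cos(y)])
            if d[2] <= 1e-6:
                continue  # ray never reaches the wall at z = plane_z
            t = (plane_z - origin[2]) / d[2]  # ray-plane intersection distance
            if 0 < t <= max_range:
                t_noisy = t + rng.normal(0.0, noise_per_m * t)  # noise grows with range
                points.append(origin + t_noisy * d)
    return np.array(points)

cloud = scan_plane(origin=np.zeros(3), yaw_range=(-0.6, 0.6),
                   pitch_range=(-0.4, 0.4), res=64, plane_z=3.0)
print(cloud.shape)
```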
Scalable generation of outdoor driving scenes requires 3D representations that remain consistent across multiple viewpoints and scale to large areas. Existing solutions either rely on image or video generative models distilled to 3D space, harming the geometric coherence and restricting the rendering to training views, or are limited to small-scale 3D scene or object-centric generation. In this work, we propose a 3D generative framework based on $\Sigma$-Voxfield grid, a discrete representation where each occupied voxel stores a fixed number of colorized surface samples. To generate this representation, we train a semantic-conditioned diffusion model that operates on local voxel neighborhoods and uses 3D positional encodings to capture spatial structure. We scale to large scenes via progressive spatial outpainting over overlapping regions. Finally, we render the generated $\Sigma$-Voxfield grid with a deferred rendering module to obtain photorealistic images, enabling large-scale multiview-consistent 3D scene generation without per-scene optimization. Extensive experiments show that our approach can generate diverse large-scale urban outdoor scenes, renderable into photorealistic images with various sensor configurations and camera trajectories while maintaining moderate computation cost compared to existing approaches.
https://arxiv.org/abs/2604.06113
Compositional 3D scene generation from a single view requires the simultaneous recovery of scene layout and 3D assets. Existing approaches mainly fall into two categories: feed-forward generation methods and per-instance generation methods. The former directly predict 3D assets with explicit 6DoF poses through efficient network inference, but they generalize poorly to complex scenes. The latter improve generalization through a divide-and-conquer strategy, but suffer from time-consuming pose optimization. To bridge this gap, we introduce 3D-Fixer, a novel in-place completion paradigm. Specifically, 3D-Fixer extends 3D object generative priors to generate complete 3D assets conditioned on the partially visible point clouds at their original locations, cropped from the fragmented geometry produced by geometry estimation methods. Unlike prior works that require explicit pose alignment, 3D-Fixer uses the fragmented geometry as a spatial anchor to preserve layout fidelity. At its core, we propose a coarse-to-fine generation scheme to resolve boundary ambiguity under occlusion, supported by a dual-branch conditioning network and an Occlusion-Robust Feature Alignment (ORFA) strategy for stable training. Furthermore, to address the data scarcity bottleneck, we present ARSG-110K, the largest scene-level dataset to date, comprising over 110K diverse scenes and 3M annotated images with high-fidelity 3D ground truth. Extensive experiments show that 3D-Fixer achieves state-of-the-art geometric accuracy, significantly outperforming baselines such as MIDI and Gen3DSR, while maintaining the efficiency of the diffusion process. Code and data will be publicly available at this https URL.
https://arxiv.org/abs/2604.04406
3D indoor scene generation conditioned on short textual descriptions provides a promising avenue for interactive 3D environment construction without the need for labor-intensive layout specification. Despite recent progress in text-conditioned 3D scene generation, existing works suffer from poor physical plausibility and insufficient detail richness in such semantic condensation cases, largely due to their reliance on explicit semantic cues about compositional objects and their spatial relationships. This limitation highlights the need for enhanced 3D reasoning capabilities, particularly in terms of prior integration and spatial reasoning. Motivated by this, we propose SDesc3D, a short-text conditioned 3D indoor scene generation framework that leverages multi-view structural priors and regional functionality implications to enable 3D layout reasoning under sparse textual conditions. Specifically, we introduce a Multi-view scene prior augmentation that enriches underspecified textual inputs with aggregated multi-view structural knowledge, shifting from inaccessible semantic relation cues to multi-view relational prior aggregation. Building on this, we design a Functionality-aware layout grounding, employing regional functionality grounding for implicit spatial anchors and conducting hierarchical layout reasoning to enhance scene organization and semantic coherence. Finally, an Iterative reflection-rectification scheme is employed for progressive structural plausibility refinement. Extensive experiments show that our method outperforms existing approaches on short-text conditioned 3D indoor scene generation. Code will be publicly available.
https://arxiv.org/abs/2604.01972
We introduce LivingWorld, an interactive framework for generating 4D worlds with environmental dynamics from a single image. While recent advances in 3D scene generation enable large-scale environment creation, most approaches focus primarily on reconstructing static geometry, leaving scene-scale environmental dynamics such as clouds, water, or smoke largely unexplored. Modeling such dynamics is challenging because motion must remain coherent across an expanding scene while supporting low-latency user feedback. LivingWorld addresses this challenge by progressively constructing a globally coherent motion field as the scene expands. To maintain global consistency during expansion, we introduce a geometry-aware alignment module that resolves directional and scale ambiguities across views. We further represent motion using a compact hash-based motion field, enabling efficient querying and stable propagation of dynamics throughout the scene. This representation also supports bidirectional motion propagation during rendering, producing long and temporally coherent 4D sequences without relying on expensive video-based refinement. On a single RTX 5090 GPU, generating each new scene expansion step requires 9 seconds, followed by 3 seconds for motion alignment and motion field updates, enabling interactive 4D world generation with globally coherent environmental dynamics. Video demonstrations are available at this http URL.
https://arxiv.org/abs/2604.01641
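A compact hash-based motion field can be illustrated with a small spatial-hash table: positions are quantized to voxels, hashed, and per-cell motion vectors are accumulated and averaged. The class below is an assumed, simplified construction for illustration, not LivingWorld's representation.

```python
# Toy hash-grid motion field: 3D positions are quantized to voxels, hashed into a
# fixed-size table, and motion vectors are averaged per cell.
import numpy as np

class HashMotionField:
    def __init__(self, table_size=1 << 16, voxel=0.25):
        self.voxel = voxel
        self.sum = np.zeros((table_size, 3))
        self.count = np.zeros(table_size)
        self.size = table_size

    def _key(self, pos):
        ijk = np.floor(np.asarray(pos) / self.voxel).astype(np.int64)
        # Spatial hash with large primes (a common trick in hash-grid encodings).
        return int((ijk[0] * 73856093) ^ (ijk[1] * 19349663) ^ (ijk[2] * 83492791)) % self.size

    def update(self, pos, motion):
        k = self._key(pos)
        self.sum[k] += motion
        self.count[k] += 1

    def query(self, pos):
        k = self._key(pos)
        return self.sum[k] / self.count[k] if self.count[k] else np.zeros(3)

field = HashMotionField()
field.update([1.0, 0.2, 3.0], np.array([0.05, 0.0, 0.01]))
print(field.query([1.05, 0.22, 3.02]))  # same voxel -> returns the stored motion
```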
We present ReinDriveGen, a framework that enables full controllability over dynamic driving scenes, allowing users to freely edit actor trajectories to simulate safety-critical corner cases such as front-vehicle collisions, drifting cars, vehicles spinning out of control, pedestrians jaywalking, and cyclists cutting across lanes. Our approach constructs a dynamic 3D point cloud scene from multi-frame LiDAR data, introduces a vehicle completion module to reconstruct full 360° geometry from partial observations, and renders the edited scene into 2D condition images that guide a video diffusion model to synthesize realistic driving videos. Since such edited scenarios inevitably fall outside the training distribution, we further propose an RL-based post-training strategy with a pairwise preference model and a pairwise reward mechanism, enabling robust quality improvement under out-of-distribution conditions without ground-truth supervision. Extensive experiments demonstrate that ReinDriveGen outperforms existing approaches on edited driving scenarios and achieves state-of-the-art results on novel ego viewpoint synthesis.
https://arxiv.org/abs/2604.01129
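The pairwise preference model can be trained with the standard Bradley-Terry objective, shown below as a hedged sketch; the feature dimensionality, pooling, and reward-model architecture are placeholders rather than details from the paper.

```python
# Minimal Bradley-Terry-style pairwise preference objective, a standard way to train
# the kind of pairwise preference/reward model mentioned above (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

def pairwise_preference_loss(feat_preferred, feat_rejected):
    """Both inputs: (B, 128) pooled features of two candidate videos per prompt."""
    r_pos = reward_model(feat_preferred).squeeze(-1)
    r_neg = reward_model(feat_rejected).squeeze(-1)
    # Maximize the margin between the preferred and rejected sample's scores.
    return -F.logsigmoid(r_pos - r_neg).mean()

loss = pairwise_preference_loss(torch.randn(8, 128), torch.randn(8, 128))
loss.backward()
print(loss.item())
```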
In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitations of fixed-size latent spaces in object-centric models for representing wide scenes, we extend the latent space in the $x$ and $y$ directions. Then, by dividing the extended latent space into overlapping patches, we apply the object-centric 3D generative model to each patch and couple the patches at each time step. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene using a point cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We find that treating the incompleteness of the 3D structure as noise during 3D refinement enables 3D completion through what we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising, ensuring that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives for improved geometric structure and texture fidelity. We demonstrate that our method yields better results than prior methods, as evidenced by human preference and quantitative experiments.
https://arxiv.org/abs/2603.29387
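The under-noising idea is closely related to SDEdit-style partial noising: the incomplete latent is noised only up to an intermediate timestep and then denoised, so existing structure is preserved while missing regions are completed. The sketch below illustrates that mechanism with a placeholder `denoise_step`; it is not Extend3D's actual refinement loop.

```python
# Conceptual SDEdit-style "under-noising" sketch: instead of starting from pure noise,
# the incomplete latent is noised to an intermediate timestep and denoised from there.
import torch

def under_noise_refine(latent, denoise_step, alphas_cumprod, t_start):
    """latent: incomplete 3D latent; t_start: intermediate timestep (< num_steps)."""
    a_bar = alphas_cumprod[t_start]
    x = a_bar.sqrt() * latent + (1 - a_bar).sqrt() * torch.randn_like(latent)
    for t in range(t_start, -1, -1):
        x = denoise_step(x, t)  # one reverse-diffusion update at timestep t
    return x

# Toy usage with a no-op "denoiser" and a linear noise schedule.
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)
latent = torch.randn(1, 8, 16, 16, 16)            # stand-in for an extended 3D latent
refined = under_noise_refine(latent, lambda x, t: x, alphas_cumprod, t_start=400)
print(refined.shape)
```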
Unbounded 3D world generation is emerging as a foundational task for scene modeling in computer vision, graphics, and robotics. In this work, we present WorldFlow3D, a novel method capable of generating unbounded 3D worlds. Building upon a foundational property of flow matching - namely, defining a path of transport between two data distributions - we model 3D generation more generally as a problem of flowing through 3D data distributions, not limited to conditional denoising. We find that our latent-free flow approach generates causal and accurate 3D structure, which can serve as an intermediate distribution to guide the generation of more complex structure and high-quality texture - all while converging more rapidly than existing methods. We enable controllability over generated scenes through vectorized scene-layout conditions for geometric structure control and scene attributes for visual texture control. We confirm the effectiveness of WorldFlow3D on both real outdoor driving scenes and synthetic indoor scenes, validating cross-domain generalizability and high-quality generation on real data distributions. WorldFlow3D achieves favorable scene generation fidelity over competing approaches in all tested settings for unbounded scene generation. For more, see this https URL.
https://arxiv.org/abs/2603.29089
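As background for the transport-path view of flow matching the abstract builds on, the standard (rectified-flow-style) training objective is shown below; this is the generic formulation, not WorldFlow3D's 3D-specific instantiation.

```python
# Standard flow matching objective: regress a velocity field along the linear
# interpolation path between samples of two data distributions.
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t], dim=-1))

def flow_matching_loss(model, x0, x1):
    """x0: samples from the source distribution, x1: samples from the target distribution."""
    t = torch.rand(x0.shape[0], 1)
    x_t = (1 - t) * x0 + t * x1          # linear interpolation path between the two
    target_v = x1 - x0                   # constant velocity along that path
    return ((model(x_t, t) - target_v) ** 2).mean()

model = TinyVelocityNet()
loss = flow_matching_loss(model, torch.randn(64, 32), torch.randn(64, 32))
loss.backward()
print(loss.item())
```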
The synthesis of immersive 3D scenes from text is rapidly maturing, driven by novel video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches suffer from a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution. We present Stepper, a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion. Stepper leverages a novel multi-view 360° diffusion model that enables consistent, high-resolution expansion, coupled with a geometry reconstruction pipeline that enforces geometric coherence. Trained on a new large-scale, multi-view panorama dataset, Stepper achieves state-of-the-art fidelity and structural consistency, outperforming prior approaches, thereby setting a new standard for immersive scene generation.
https://arxiv.org/abs/2603.28980
Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website: this https URL
https://arxiv.org/abs/2603.28757
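The ambisonic rendering of a point source can be illustrated with first-order B-format encoding, where a mono signal is weighted by direction-dependent gains (ACN channel order with SN3D-like normalization assumed). This is a generic illustration; SonoWorld's renderer, source models, and conventions are not specified in the abstract.

```python
# First-order ambisonic (B-format) encoding of a mono point source.
import numpy as np

def encode_foa(mono, azimuth, elevation):
    """mono: (T,) signal; azimuth/elevation in radians; returns (4, T) W, Y, Z, X channels."""
    w = mono * 1.0                                   # omnidirectional component
    y = mono * np.sin(azimuth) * np.cos(elevation)   # left-right
    z = mono * np.sin(elevation)                     # up-down
    x = mono * np.cos(azimuth) * np.cos(elevation)   # front-back
    return np.stack([w, y, z, x])

sr = 16000
t = np.arange(sr) / sr
source = np.sin(2 * np.pi * 440 * t)                 # 1 s test tone
bformat = encode_foa(source, azimuth=np.pi / 4, elevation=0.1)
print(bformat.shape)
```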
Most recent advances in 3D generative modeling rely on diffusion or flow-matching formulations. We instead explore a fully autoregressive alternative and introduce GaussianGPT, a transformer-based model that directly generates 3D Gaussians via next-token prediction, thus facilitating full 3D scene generation. We first compress Gaussian primitives into a discrete latent grid using a sparse 3D convolutional autoencoder with vector quantization. The resulting tokens are serialized and modeled using a causal transformer with 3D rotary positional embedding, enabling sequential generation of spatial structure and appearance. Unlike diffusion-based methods that refine scenes holistically, our formulation constructs scenes step-by-step, naturally supporting completion, outpainting, controllable sampling via temperature, and flexible generation horizons. This formulation leverages the compositional inductive biases and scalability of autoregressive modeling while operating on explicit representations compatible with modern neural rendering pipelines, positioning autoregressive transformers as a complementary paradigm for controllable and context-aware 3D generation.
https://arxiv.org/abs/2603.26661
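One plausible way to realize a 3D rotary positional embedding is to split each token's feature dimension into three chunks and rotate each chunk by the token's x, y, or z grid coordinate, as sketched below. The exact construction used by GaussianGPT is not given in the abstract, so treat this as an assumed illustration.

```python
# Sketch of a 3D rotary positional embedding over serialized voxel tokens.
import torch

def rope_1d(x, pos, base=10000.0):
    """x: (..., d) with d even; pos: (...,) integer positions along one axis."""
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)     # (d/2,)
    angles = pos.unsqueeze(-1).float() * freqs                            # (..., d/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * angles.cos() - x2 * angles.sin(),
                        x1 * angles.sin() + x2 * angles.cos()], dim=-1).flatten(-2)

def rope_3d(x, coords):
    """x: (N, d) token features with d divisible by 6; coords: (N, 3) voxel indices."""
    chunks = x.chunk(3, dim=-1)
    return torch.cat([rope_1d(c, coords[:, axis]) for axis, c in enumerate(chunks)], dim=-1)

tokens = torch.randn(8, 48)                       # 8 serialized voxel tokens
coords = torch.randint(0, 16, (8, 3))             # their (x, y, z) grid locations
print(rope_3d(tokens, coords).shape)
```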
Despite remarkable progress in video generation, maintaining long-term scene consistency upon revisiting previously explored areas remains challenging. Existing solutions rely either on explicitly constructing 3D geometry, which suffers from error accumulation and scale ambiguity, or on naive camera Field-of-View (FoV) retrieval, which typically fails under complex occlusions. To overcome these limitations, we propose I3DM, a novel implicit 3D-aware memory mechanism for consistent video scene generation that bypasses explicit 3D reconstruction. At the core of our approach is a 3D-aware memory retrieval strategy, which leverages the intermediate features of a pre-trained Feed-Forward Novel View Synthesis (FF-NVS) model to score view relevance, enabling robust retrieval even in highly occluded scenarios. Furthermore, to fully utilize the retrieved historical frames, we introduce a 3D-aligned memory injection module. This module implicitly warps historical content to the target view and adaptively conditions the generation on reliable warping regions, leading to improved revisit consistency and accurate camera control. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, achieving superior revisit consistency, generation fidelity, and camera control precision.
https://arxiv.org/abs/2603.23413
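The 3D-aware retrieval step can be pictured as scoring stored frames against the target view in a shared feature space and keeping the top-k matches. The sketch below uses pooled features and cosine similarity as stand-ins for the FF-NVS intermediate features and relevance score described above; it is illustrative only.

```python
# Simplified memory-retrieval scoring: each historical frame is represented by a
# pooled feature, relevance to the target view is a cosine similarity, and the
# top-k frames are retrieved for injection.
import torch
import torch.nn.functional as F

def retrieve_memory(memory_feats, target_feat, k=4):
    """memory_feats: (M, D) pooled features per stored frame; target_feat: (D,)."""
    scores = F.cosine_similarity(memory_feats, target_feat.unsqueeze(0), dim=-1)
    top = torch.topk(scores, k=min(k, memory_feats.shape[0]))
    return top.indices, top.values

memory = F.normalize(torch.randn(100, 256), dim=-1)     # 100 historical frames
query = F.normalize(torch.randn(256), dim=-1)            # target-view descriptor
idx, sim = retrieve_memory(memory, query)
print(idx.tolist(), sim.tolist())
```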