Abstract
Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing approaches often rely on explicit geometric alignment between the perspective view and the equirectangular projection (ERP) space. However, this requires known camera metadata, which hinders application to in-the-wild data where such calibration is typically absent or noisy. We propose 360Anything, a geometry-free framework built upon pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information. Our approach achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior methods that use ground-truth camera information. We also trace the root cause of seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to enable seamless generation. Finally, we show competitive results on zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything's deep geometric understanding and broader utility in computer vision tasks. Additional results are available at this https URL.
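The abstract does not spell out the mechanics of Circular Latent Encoding, but the underlying idea it points to, replacing the VAE encoder's zero-padding with wrap-around padding along the longitude axis so convolutions see across the ERP seam, can be sketched in a few lines. The following is a minimal PyTorch illustration under that assumption; `circular_pad_width`, the padding scheme for latitude, and all shapes are hypothetical, not the paper's code.

```python
import torch
import torch.nn.functional as F

def circular_pad_width(x: torch.Tensor, pad: int) -> torch.Tensor:
    """Wrap-pad an ERP feature map along its width (longitude) axis.

    The left and right edges of an equirectangular image are physically
    adjacent, so wrap-padding lets convolutions see across the seam
    instead of seeing zeros.
    """
    return F.pad(x, (pad, pad, 0, 0), mode="circular")

# Illustrative usage: run a convolution with no built-in padding, supplying
# circular padding for longitude and replicate padding for latitude.
conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=0)
erp = torch.randn(1, 3, 256, 512)               # toy ERP image (H x 2H)
h = circular_pad_width(erp, pad=1)              # wrap around longitude
h = F.pad(h, (0, 0, 1, 1), mode="replicate")    # ordinary pad for latitude
out = conv(h)
print(out.shape)  # torch.Size([1, 64, 256, 512]) -- seam-aware output
```

Likewise, the geometry-free formulation can be sketched by assuming the standard diffusion-transformer recipe: flatten the perspective latent and the noisy panorama latent into token sequences and concatenate them, with no camera pose, FoV, or explicit ERP warping entering the model. `PatchEmbed` and every shape below are illustrative placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

B, C, d_model = 1, 16, 1024  # assumed latent channels and model width

class PatchEmbed(nn.Module):
    """Flatten a (B, C, H, W) latent into a (B, H*W, d) token sequence."""
    def __init__(self, c_in: int, d: int):
        super().__init__()
        self.proj = nn.Linear(c_in, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x.flatten(2).transpose(1, 2))

embed = PatchEmbed(C, d_model)
persp = torch.randn(B, C, 32, 32)   # VAE latent of the perspective input
pano = torch.randn(B, C, 32, 64)    # noisy latent of the ERP target

# Geometry-free conditioning: concatenate the two token sequences and let
# the transformer learn the mapping in a purely data-driven way.
tokens = torch.cat([embed(persp), embed(pano)], dim=1)
print(tokens.shape)  # torch.Size([1, 3072, 1024])
# `tokens` would then pass through stacked DiT blocks, with the diffusion
# loss applied only at the panorama token positions.
```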
URL
https://arxiv.org/abs/2601.16192