Abstract
We address the challenge of relighting a single image or video, a task that demands a precise understanding of scene intrinsics and high-quality synthesis of light transport. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate the data requirements but are prone to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation strengthens implicit scene understanding and enables realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. Trained on synthetic multi-illumination data and a large corpus of automatically labeled real-world videos, our model generalizes well across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency.
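To make the joint formulation concrete, below is a minimal sketch of a single-pass denoiser that predicts albedo and relit-video noise together, conditioned on the input video and a target-lighting embedding. This is not the paper's architecture: all module names, shapes, and the FiLM-style lighting injection are illustrative assumptions, and timestep conditioning is omitted for brevity.

```python
# Hypothetical sketch of a joint albedo + relighting denoiser (not the authors' code).
# Assumptions: video latents of shape (B, C, T, H, W); target lighting as a (B, light_dim)
# embedding; both branches share one backbone and are predicted in a single forward pass.
import torch
import torch.nn as nn


class JointRelightDenoiser(nn.Module):
    """One denoising pass that jointly predicts albedo and relit-video noise."""

    def __init__(self, latent_ch: int = 4, cond_ch: int = 4,
                 light_dim: int = 64, width: int = 128):
        super().__init__()
        # Input: noisy albedo latent + noisy relit latent + clean input-video latent,
        # concatenated along the channel axis.
        in_ch = 2 * latent_ch + cond_ch
        self.backbone = nn.Sequential(
            nn.Conv3d(in_ch, width, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(width, width, kernel_size=3, padding=1),
            nn.SiLU(),
        )
        # Target-lighting embedding injected as a per-channel shift (FiLM-style).
        self.light_proj = nn.Linear(light_dim, width)
        # Two heads: predicted noise for the albedo latent and for the relit latent.
        self.albedo_head = nn.Conv3d(width, latent_ch, kernel_size=3, padding=1)
        self.relit_head = nn.Conv3d(width, latent_ch, kernel_size=3, padding=1)

    def forward(self, noisy_albedo, noisy_relit, input_video, light_emb):
        # Latents: (B, C, T, H, W); lighting embedding: (B, light_dim).
        x = torch.cat([noisy_albedo, noisy_relit, input_video], dim=1)
        h = self.backbone(x)
        h = h + self.light_proj(light_emb)[:, :, None, None, None]
        return self.albedo_head(h), self.relit_head(h)


if __name__ == "__main__":
    B, C, T, H, W = 1, 4, 8, 32, 32
    model = JointRelightDenoiser()
    eps_albedo, eps_relit = model(
        torch.randn(B, C, T, H, W),   # noisy albedo latent
        torch.randn(B, C, T, H, W),   # noisy relit-video latent
        torch.randn(B, C, T, H, W),   # encoded input video (condition)
        torch.randn(B, 64),           # target-lighting embedding
    )
    print(eps_albedo.shape, eps_relit.shape)  # both (1, 4, 8, 32, 32)
```

The point of the sketch is the shared backbone: because the two heads are driven by the same features, supervising albedo and the relit output together forces the model to maintain a consistent implicit decomposition of the scene rather than chaining separate inverse- and forward-rendering stages.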
URL
https://arxiv.org/abs/2506.15673