Abstract
Image deocclusion (or amodal completion) aims to recover the invisible regions (i.e., shape and appearance) of occluded instances in images. Despite recent advances, the scarcity of high-quality data that balances diversity, plausibility, and fidelity remains a major obstacle. To address this challenge, we identify three critical elements: leveraging in-the-wild image data for diversity, incorporating human expertise for plausibility, and utilizing generative priors for fidelity. We propose SynergyAmodal, a novel framework for co-synthesizing in-the-wild amodal datasets with comprehensive shape and appearance annotations, which integrates these elements through a tripartite data-human-model collaboration. First, we design an occlusion-grounded self-supervised learning algorithm to harness the diversity of in-the-wild image data, fine-tuning an inpainting diffusion model into a partial completion diffusion model. Second, we establish a co-synthesis pipeline to iteratively filter, refine, select, and annotate the initial deocclusion results of the partial completion diffusion model, ensuring plausibility and fidelity through human expert guidance and prior model constraints. This pipeline generates a high-quality paired amodal dataset with extensive category and scale diversity, comprising approximately 16K pairs. Finally, we train a full completion diffusion model on the synthesized dataset, incorporating text prompts as conditioning signals. Extensive experiments demonstrate the effectiveness of our framework in achieving zero-shot generalization and textual controllability. Our code, dataset, and models will be made publicly available at this https URL.
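To make the occlusion-grounded self-supervised step concrete, the sketch below shows one plausible way to build training pairs without manual amodal labels: paste a synthetic occluder over a fully visible instance, then supervise the model to reconstruct the hidden pixels. This is a minimal illustration under assumed conventions (boolean H×W masks, uint8 H×W×3 images); the function name and occluder-placement heuristic are hypothetical, not the paper's implementation.

```python
import numpy as np

def make_partial_completion_pair(image, instance_mask, occluder_mask, rng=None):
    """Build a (input, target) pair for occlusion-grounded self-supervision.

    A fully visible instance is synthetically occluded; the composite is the
    model input and the original pixels under the occluder are the
    reconstruction target, so an inpainting diffusion model can be fine-tuned
    toward partial completion without manual amodal annotation.
    """
    rng = rng or np.random.default_rng()
    h, w = instance_mask.shape

    # Randomly shift the occluder mask so it overlaps the target instance.
    dy, dx = rng.integers(-h // 4, h // 4, size=2)
    shifted = np.roll(occluder_mask, (int(dy), int(dx)), axis=(0, 1))

    # Pixels of the instance that the occluder now hides.
    hidden = shifted & instance_mask

    # Composite input: fill the hidden region (here a flat gray placeholder;
    # real occluder pixels or textures could be pasted instead).
    occluded = image.copy()
    occluded[hidden] = 127

    # The model sees (occluded image, visible-region mask) and is supervised
    # to recover the original pixels inside `hidden`.
    visible = instance_mask & ~shifted
    return occluded, visible, hidden, image
```

Because both input and target come from the same in-the-wild photo, the pairs inherit its category and scale diversity for free, which is the property the abstract attributes to this step.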
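The final text-conditioned full completion model can be pictured with the standard diffusers inpainting interface, shown below as a stand-in. The checkpoint name, file names, and prompt are hypothetical placeholders; the released SynergyAmodal weights would replace the stock inpainting checkpoint once available.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Stand-in checkpoint; the paper's full completion model is not yet released.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("occluded_scene.png").convert("RGB")   # hypothetical input
mask = Image.open("amodal_region_mask.png").convert("L")  # region to complete

# The text prompt conditions the appearance of the completed (previously
# hidden) region, mirroring the textual controllability described above.
result = pipe(
    prompt="a complete brown leather sofa",
    image=image,
    mask_image=mask,
    num_inference_steps=50,
).images[0]
result.save("deoccluded.png")
```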
URL
https://arxiv.org/abs/2504.19506