Abstract
Object-centric learning aims to represent visual data with a set of object entities (a.k.a. slots), providing structured representations that enable systematic generalization. Leveraging advanced architectures like Transformers, recent approaches have made significant progress in unsupervised object discovery. In addition, slot-based representations hold great potential for generative modeling, such as controllable image generation and object manipulation in image editing. However, current slot-based methods often produce blurry images and distorted objects, exhibiting poor generative modeling capabilities. In this paper, we focus on improving slot-to-image decoding, a crucial aspect for high-quality visual generation. We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data. Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation across six datasets. Furthermore, our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks. Finally, we demonstrate the scalability of SlotDiffusion to unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated with self-supervised pre-trained image encoders.
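The core idea in the abstract, decoding a set of slot vectors into an image by conditioning a latent diffusion model on the slots, can be sketched as a denoising objective where latent tokens attend to slots. Below is a minimal numpy sketch under stated assumptions: the real SlotDiffusion uses a UNet denoiser with cross-attention conditioning inside a pretrained autoencoder's latent space, whereas here the "denoiser" is a single toy cross-attention readout, and all function and variable names (`denoising_loss`, `alpha_bar`, `Wq`, etc.) are hypothetical, not from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(tokens, slots, Wq, Wk, Wv):
    # tokens: (N, d) latent tokens; slots: (K, d) object slots.
    # Each latent token queries the K slots, so the prediction is
    # assembled from object-level information (the slot-conditioning idea).
    q, k, v = tokens @ Wq, slots @ Wk, slots @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, K)
    return attn @ v  # (N, d)

def denoising_loss(z0, slots, t, alpha_bar, rng, Wq, Wk, Wv):
    # Forward diffusion: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    # Toy slot-conditioned denoiser predicts the noise from z_t and the slots
    eps_hat = cross_attention(z_t, slots, Wq, Wk, Wv)
    # Standard epsilon-prediction MSE objective
    return np.mean((eps - eps_hat) ** 2)

# Tiny usage example with random data
rng = np.random.default_rng(0)
N, K, d, T = 16, 4, 8, 10          # latent tokens, slots, feature dim, timesteps
z0 = rng.standard_normal((N, d))    # clean latent (stand-in for VAE output)
slots = rng.standard_normal((K, d)) # object slots (stand-in for Slot Attention output)
Wq, Wk, Wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
alpha_bar = np.linspace(0.99, 0.01, T)  # toy noise schedule
loss = denoising_loss(z0, slots, 5, alpha_bar, rng, Wq, Wk, Wv)
```

Because every latent token can only read from the K slots, gradients of this loss flow back into the slot representations, which is what lets a diffusion decoder shape the slots during unsupervised training.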
URL
https://arxiv.org/abs/2305.11281