SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models

2023-05-18 19:56:20
Ziyi Wu, Jingyu Hu, Wuyue Lu, Igor Gilitschenski, Animesh Garg

Abstract

Object-centric learning aims to represent visual data with a set of object entities (a.k.a. slots), providing structured representations that enable systematic generalization. Leveraging advanced architectures like Transformers, recent approaches have made significant progress in unsupervised object discovery. In addition, slot-based representations hold great potential for generative modeling, such as controllable image generation and object manipulation in image editing. However, current slot-based methods often produce blurry images and distorted objects, exhibiting poor generative modeling capabilities. In this paper, we focus on improving slot-to-image decoding, a crucial aspect for high-quality visual generation. We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data. Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation across six datasets. Furthermore, our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks. Finally, we demonstrate the scalability of SlotDiffusion to unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated with self-supervised pre-trained image encoders.

URL

https://arxiv.org/abs/2305.11281

PDF

https://arxiv.org/pdf/2305.11281.pdf

