Grouped Discrete Representation for Object-Centric Learning

2024-11-04 17:25:10
Rongzhen Zhao, Vivienne Wang, Juho Kannala, Joni Pajarinen

Abstract

Object-Centric Learning (OCL) can discover objects in images or videos simply by reconstructing the input. For better object discovery, representative OCL methods reconstruct the input as its Variational Autoencoder (VAE) intermediate representation, which suppresses pixel noise and promotes object separability by discretizing continuous super-pixels with template features. However, treating features as indivisible units overlooks the attributes that compose them, which impedes model generalization; indexing features with scalar numbers loses attribute-level similarities and differences, which hinders model convergence. We propose Grouped Discrete Representation (GDR) for OCL. We decompose features into combinatorial attributes via organized channel grouping, and compose these attributes into a discrete representation via tuple indexes. Experiments show that GDR consistently improves both Transformer- and Diffusion-based OCL methods on various datasets. Visualizations show that GDR captures better object separability.
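The idea of channel grouping with tuple indexes can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `grouped_quantize` and `tuple_to_scalar`, the per-group codebooks, and the nearest-template matching by squared Euclidean distance are all assumptions made for illustration.

```python
import numpy as np


def grouped_quantize(features, codebooks):
    """Quantize each channel group against its own codebook (illustrative only).

    features:  array of shape (n, c), one row per super-pixel feature.
    codebooks: list of g arrays, each of shape (k, c // g); hypothetical
               per-group "attribute" templates.
    Returns tuple indexes of shape (n, g) and quantized features of shape (n, c).
    """
    g = len(codebooks)
    chunks = np.split(features, g, axis=1)  # g chunks, each (n, c // g)
    indexes, quantized = [], []
    for chunk, book in zip(chunks, codebooks):
        # Squared Euclidean distance from every chunk to every template.
        dist = ((chunk[:, None, :] - book[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)  # nearest template per feature (assumption)
        indexes.append(idx)
        quantized.append(book[idx])
    return np.stack(indexes, axis=1), np.concatenate(quantized, axis=1)


def tuple_to_scalar(indexes, k):
    """Flatten tuple indexes into single scalar indexes (base-k positional code)."""
    g = indexes.shape[1]
    weights = k ** np.arange(g - 1, -1, -1)
    return indexes @ weights


# Two groups with two templates each give 2 * 2 = 4 possible combinations.
books = [np.array([[0.0, 0.0], [1.0, 1.0]]),
         np.array([[0.0, 0.0], [1.0, 1.0]])]
feats = np.array([[0.1, 0.0, 0.9, 1.0],
                  [0.9, 1.0, 0.9, 1.0]])
tuples, quant = grouped_quantize(feats, books)
scalars = tuple_to_scalar(tuples, k=2)
```

In this toy case the two features receive the tuples (0, 1) and (1, 1): the shared second component records that they agree on the second attribute group. The flattened scalar codes 1 and 3 no longer expose that shared attribute, which is the kind of information the abstract says scalar indexing loses.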

URL

https://arxiv.org/abs/2411.02299

PDF

https://arxiv.org/pdf/2411.02299.pdf

