Abstract
Learning structured representations of the visual world in terms of objects promises to significantly improve the generalization abilities of current machine learning models. While recent efforts to this end have shown promising empirical progress, a theoretical account of when unsupervised object-centric representation learning is possible is still lacking. Consequently, understanding the reasons for the success of existing object-centric methods as well as designing new theoretically grounded methods remains challenging. In the present work, we analyze when object-centric representations can provably be learned without supervision. To this end, we first introduce two assumptions on the generative process for scenes comprised of several objects, which we call compositionality and irreducibility. Under this generative process, we prove that the ground-truth object representations can be identified by an invertible and compositional inference model, even in the presence of dependencies between objects. We empirically validate our results through experiments on synthetic data. Finally, we provide evidence that our theory holds predictive power for existing object-centric models by showing a close correspondence between models' compositionality and invertibility and their empirical identifiability.
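The core idea — a generative process where each object's latent renders a distinct part of the scene (compositionality), so that a compositional, invertible inference model can recover each object's latent separately — can be illustrated with a minimal toy sketch. All names, dimensions, and the linear per-object decoder below are hypothetical simplifications, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): K objects, each with a D-dim latent code,
# rendered onto its own disjoint patch of PATCH pixels.
K, D, PATCH = 3, 2, 4

# Fixed per-object decoder with full column rank, so latents are recoverable.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [1.0, -1.0]])

def generate_scene(zs):
    """Compositionality: each patch of the scene depends on one object only."""
    return np.concatenate([W @ z for z in zs])

def infer(x):
    """Compositional inference: invert each object's patch independently."""
    pinv = np.linalg.pinv(W)  # left inverse, since W has full column rank
    return [pinv @ x[k * PATCH:(k + 1) * PATCH] for k in range(K)]

zs = [rng.normal(size=D) for _ in range(K)]   # ground-truth object latents
x = generate_scene(zs)                        # observed scene
zs_hat = infer(x)                             # recovered latents
assert all(np.allclose(z, zh) for z, zh in zip(zs, zs_hat))
```

In this linear toy case, invertibility plus the disjoint-patch structure makes each object's latent exactly identifiable; the paper's contribution is proving an analogous identifiability result for far more general (nonlinear) generative processes, including dependencies between objects.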
URL
https://arxiv.org/abs/2305.14229