Abstract
The challenge of learning abstract concepts from images in an unsupervised fashion lies in the required integration of visual perception and generalizable relational reasoning. Moreover, the unsupervised nature of this task makes it necessary for human users to be able to understand a model's learned concepts and potentially revise faulty behaviour. To tackle both the generalizability and interpretability constraints of visual concept learning, we propose Pix2Code, a framework that extends program synthesis to visual relational reasoning by utilizing the abilities of both explicit, compositional symbolic and implicit neural representations. This is achieved by retrieving object representations from images and synthesizing relational concepts as lambda-calculus programs. We evaluate the diverse properties of Pix2Code on two challenging reasoning domains, Kandinsky Patterns and CURI, thereby testing its ability to identify compositional visual concepts that generalize to novel data and concept configurations. In stark contrast to neural approaches, we show that Pix2Code's representations remain human-interpretable and can easily be revised for improved performance.
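To make the idea of "relational concepts as lambda-calculus programs" concrete, here is a minimal, hypothetical sketch (not the paper's actual representation or API): objects extracted from an image are treated as symbolic attribute records, and a relational concept is expressed as a composition of small lambda terms that can be evaluated against a scene.

```python
# Hypothetical sketch of a relational concept over object-centric scene
# representations, written in a lambda-calculus style as composable
# Python lambdas. Attribute names and scenes below are illustrative only.

# A scene is a list of objects; each object is a dict of symbolic attributes.
scene_pos = [{"shape": "circle", "color": "red"},
             {"shape": "square", "color": "red"}]
scene_neg = [{"shape": "circle", "color": "red"},
             {"shape": "square", "color": "blue"}]

# Primitive: project an object's color attribute.
color = lambda obj: obj["color"]

# Relational concept "all objects share the same color", built by
# composing primitives, analogous to a synthesized program.
same_color = lambda scene: len({color(o) for o in scene}) == 1

print(same_color(scene_pos))  # True
print(same_color(scene_neg))  # False
```

Because such a program is an explicit symbolic expression rather than an opaque weight vector, a human can read it directly and, if the synthesized concept is wrong, revise it by swapping or recomposing primitives.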
URL
https://arxiv.org/abs/2402.08280