Abstract
Current vision systems typically assign fixed-length representations to images, regardless of their information content. This contrasts with human intelligence, and even large language models, which allocate varying representational capacity based on entropy, context, and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity, and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object/part discovery.
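The abstract suggests a rollout loop in which a shared encoder repeatedly refines the 2D patch tokens while appending fresh 1D latent tokens, growing capacity each iteration. The sketch below illustrates that control flow only; the class name, dimensions, the per-iteration increment of 32 tokens, and the use of a standard Transformer encoder layer are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a recurrent rollout that grows a 1D latent code.
# Assumptions (not from the paper): module names, dim=256, 196 patch
# tokens, 32 new latents per iteration, a vanilla Transformer layer.
import torch
import torch.nn as nn

class RecurrentTokenizerSketch(nn.Module):
    def __init__(self, dim=256, step=32, max_tokens=256):
        super().__init__()
        self.step, self.max_tokens = step, max_tokens
        # One shared encoder block, re-applied at every rollout iteration.
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True
        )
        # Learned queries from which fresh latent tokens are drawn.
        self.new_latents = nn.Parameter(torch.randn(max_tokens, dim) * 0.02)

    def forward(self, image_tokens, num_iters=4):
        """image_tokens: (B, N, dim) patch embeddings of the 2D image."""
        B, n, d = image_tokens.shape
        latents = image_tokens.new_zeros(B, 0, d)
        for it in range(num_iters):
            # Adaptively grow capacity: append `step` new latent tokens.
            lo, hi = it * self.step, min((it + 1) * self.step, self.max_tokens)
            fresh = self.new_latents[lo:hi].expand(B, -1, -1)
            latents = torch.cat([latents, fresh], dim=1)
            # Jointly process 2D tokens and 1D latents so the 2D tokens are
            # refined and the existing latents are updated in the same pass.
            joint = self.encoder(torch.cat([image_tokens, latents], dim=1))
            image_tokens, latents = joint[:, :n], joint[:, n:]
        return latents  # variable-length 1D code (here 32..256 tokens)

# Example: z = RecurrentTokenizerSketch()(torch.randn(2, 196, 256),
#                                          num_iters=3)  # -> (2, 96, 256)
```

Under these assumptions, stopping the rollout early yields a short (e.g. 32- or 64-token) code for simple images, while running all iterations yields the full 256-token code, matching the variable-length range described above.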
URL
https://arxiv.org/abs/2411.02393