Paper Reading AI Learner

Adaptive Length Image Tokenization via Recurrent Allocation

2024-11-04 18:58:01
Shivam Duggal, Phillip Isola, Antonio Torralba, William T. Freeman

Abstract

Current vision systems typically assign fixed-length representations to images, regardless of their information content. This contrasts with human intelligence (and even large language models), which allocates varying representational capacity based on entropy, context, and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity, and downstream task requirements. Recurrent token processing with increasing representational capacity at each iteration shows signs of token specialization, revealing potential for object/part discovery.
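The recurrent allocation loop described above can be sketched in a few lines. The toy code below is a hypothetical, untrained stand-in for the paper's method: the real encoder and decoder are learned networks, whereas here they are random linear maps, and the stopping threshold is an arbitrary placeholder. It only illustrates the control flow: start with a small 1D latent set, and on each rollout refine the 2D tokens, update the existing latents, and append a fresh block of latents until the image reconstructs well enough or the 256-token budget is reached.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the learned modules (hypothetical; the paper's
# encoder/decoder are trained with reconstruction and FID objectives).
D = 16                                     # latent token dimension (assumed)
W_enc = rng.standard_normal((D, D)) * 0.1  # placeholder "encoder" weights
W_dec = rng.standard_normal((D, D)) * 0.1  # placeholder "decoder" weights

def rollout_step(image_tokens, latent_tokens, n_new):
    """One recurrent rollout: refine the 2D tokens, update the existing
    1D latents, and grow capacity by appending n_new fresh latents."""
    image_tokens = image_tokens + image_tokens @ W_enc   # refine 2D tokens
    updated = latent_tokens + latent_tokens @ W_enc      # update old latents
    new = np.tile(image_tokens.mean(0), (n_new, 1))      # init new latents
    return image_tokens, np.vstack([updated, new])

def decode(latent_tokens, n_image_tokens):
    """Map 1D latents back to a 2D token grid (toy linear decoder)."""
    pooled = latent_tokens.mean(0) @ W_dec
    return np.tile(pooled, (n_image_tokens, 1))

image_tokens = rng.standard_normal((256, D))   # e.g. a 16x16 grid of 2D tokens
latents = np.empty((0, D))
budget, step, threshold = 256, 32, 0.5         # 32..256 tokens, 32 at a time

while latents.shape[0] < budget:
    image_tokens, latents = rollout_step(image_tokens, latents, step)
    recon = decode(latents, image_tokens.shape[0])
    err = np.mean((recon - image_tokens) ** 2)
    if err < threshold:                        # low-entropy image: stop early
        break

print(latents.shape[0])                        # adaptive count in [32, 256]
```

The per-image token count is whatever the loop settles on: simple, familiar images would trip the early-exit test with few latents, while complex ones consume the full budget.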

URL

https://arxiv.org/abs/2411.02393

PDF

https://arxiv.org/pdf/2411.02393.pdf
