Perceptual Group Tokenizer: Building Perception with Iterative Grouping

2023-11-30 07:00:14
Zhiwei Deng, Ting Chen, Yang Li

Abstract

The human visual recognition system shows an astonishing capability to compress visual information into a set of tokens containing rich representations without label supervision. One critical driving principle behind it is perceptual grouping. Although perceptual grouping was widely used in computer vision in the early 2010s, it remains a mystery whether it can be leveraged to derive a neural visual recognition backbone that generates equally powerful representations. In this paper, we propose the Perceptual Group Tokenizer, a model that relies entirely on grouping operations to extract visual features and perform self-supervised representation learning, where a series of grouping operations is used to iteratively hypothesize the context for pixels or superpixels to refine feature representations. We show that the proposed model achieves competitive performance compared to state-of-the-art vision architectures and inherits desirable properties, including adaptive computation without re-training and interpretability. Specifically, the Perceptual Group Tokenizer achieves 80.3% on the ImageNet-1K self-supervised learning benchmark with linear probe evaluation, marking new progress under this paradigm.
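
The abstract describes the grouping step only at a high level, so the following is a minimal sketch of what one "iteratively hypothesize the context" loop could look like, not the authors' implementation. It assumes a soft-clustering scheme: each input feature (standing in for a pixel or superpixel) is softly assigned to a small set of group tokens, and each token is then recomputed as the weighted mean of the features assigned to it. The function names, the dot-product similarity, the random initialisation, and the parameters num_groups and num_iters are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def iterative_grouping(features, num_groups, num_iters=3, seed=0):
    """Softly assign feature vectors to group tokens, then refine the tokens.

    features: list of equal-length lists of floats (one per pixel or patch).
    Returns (tokens, assignments); assignments[i] is a distribution over
    groups, computed in the last iteration before the final token update.
    Illustrative sketch only, not the paper's algorithm.
    """
    rng = random.Random(seed)
    dim = len(features[0])
    # Assumed initialisation: start each token from a randomly chosen input.
    tokens = [list(f) for f in rng.sample(features, num_groups)]
    scale = 1.0 / math.sqrt(dim)
    assignments = []
    for _ in range(num_iters):
        # Step 1: groups compete for each input (softmax over the group axis).
        assignments = [
            softmax([scale * sum(a * b for a, b in zip(f, t)) for t in tokens])
            for f in features
        ]
        # Step 2: each token becomes the assignment-weighted mean of the inputs.
        for g in range(num_groups):
            weight = sum(a[g] for a in assignments)
            tokens[g] = [
                sum(a[g] * f[d] for a, f in zip(assignments, features)) / weight
                for d in range(dim)
            ]
    return tokens, assignments


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    tokens, assignments = iterative_grouping(toy, num_groups=2)
    print(tokens)
    print(assignments)
```

In this sketch the number of group tokens is just an argument, so it can be changed at call time without touching anything else. That is the kind of knob the abstract's "adaptive computation without re-training" suggests, though the abstract does not say how the paper realises it.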

URL

https://arxiv.org/abs/2311.18296

PDF

https://arxiv.org/pdf/2311.18296.pdf

