Paper Reading AI Learner

Masked Image Modeling as a Framework for Self-Supervised Learning across Eye Movements

2024-04-12 15:15:39
Robin Weiler, Matthias Brucklacher, Cyriel M. A. Pennartz, Sander M. Bohté

Abstract

To make sense of their surroundings, intelligent systems must transform complex sensory inputs into structured codes that are reduced to task-relevant information such as object category. Biological agents achieve this in a largely autonomous manner, presumably via self-supervised learning. Whereas previous attempts to model the underlying mechanisms were largely discriminative in nature, there is ample evidence that the brain employs a generative model of the world. Here, we propose that eye movements, in combination with the focused nature of primate vision, constitute a generative, self-supervised task of predicting and revealing visual information. We construct a proof-of-principle model starting from the framework of masked image modeling (MIM), a common approach in deep representation learning. To do so, we analyze how core components of MIM, such as the masking technique and data augmentation, influence the formation of category-specific representations. This allows us not only to better understand the principles behind MIM, but also to reassemble a MIM more in line with the focused nature of biological perception. From a theoretical angle, we find that MIM disentangles neurons in latent space without explicit regularization, a property that has been suggested to structure visual representations in primates. Together with previous findings of invariance learning, this highlights an interesting connection between MIM and latent regularization approaches for self-supervised learning. The source code is available at this https URL
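The MIM objective described above can be illustrated with a minimal sketch: split an image into patches, hide a random subset from the encoder, and compute the reconstruction loss only on the hidden patches. This is a generic, hypothetical NumPy illustration of the standard MIM recipe, not the authors' implementation; the patch size, mask ratio, and the zero-output stand-in for a decoder are assumptions for demonstration.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W) image into non-overlapping p x p patches, each flattened."""
    H, W = img.shape
    return img.reshape(H // p, p, W // p, p).swapaxes(1, 2).reshape(-1, p * p)

def random_mask(n_patches, mask_ratio, rng):
    """Boolean mask over patches: True = patch is hidden from the encoder."""
    n_masked = int(round(n_patches * mask_ratio))
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.choice(n_patches, size=n_masked, replace=False)] = True
    return mask

def mim_loss(pred, target, mask):
    """MSE reconstruction loss computed only on the masked patches."""
    return float(((pred[mask] - target[mask]) ** 2).mean())

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))            # toy "image"
patches = patchify(img, 2)                   # 16 patches of 4 pixels each
mask = random_mask(len(patches), 0.75, rng)  # hide 75% of the patches
pred = np.zeros_like(patches)                # stand-in for a decoder's output
loss = mim_loss(pred, patches, mask)
```

Training a real MIM model would replace `pred` with the output of an encoder-decoder network and minimize this loss by gradient descent; restricting the loss to masked patches is what makes the task predictive rather than merely reconstructive.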

URL

https://arxiv.org/abs/2404.08526

PDF

https://arxiv.org/pdf/2404.08526.pdf
