
Bridging Vision and Language Spaces with Assignment Prediction

2024-04-15 10:04:15
Jungin Park, Jiyoung Lee, Kwanghoon Sohn

Abstract

This paper introduces VLAP, a novel approach that bridges pretrained vision models and large language models (LLMs) to make frozen LLMs understand the visual world. VLAP transforms the embedding space of pretrained vision models into the LLMs' word embedding space using a single linear layer for efficient and general-purpose visual and language understanding. Specifically, we harness well-established word embeddings to bridge the two modality embedding spaces. The visual and text representations are simultaneously assigned to a set of word embeddings within pretrained LLMs by formulating the assignment procedure as an optimal transport problem. We predict the assignment of one modality from the representation of the other, enforcing consistent assignments for paired multimodal data. This allows vision and language representations to contain the same information, grounding the frozen LLMs' word embedding space in visual data. Moreover, the robust semantic taxonomy of LLMs can be preserved with visual data, since LLMs interpret and reason about linguistic information from correlations between word embeddings. Experimental results show that VLAP achieves substantial improvements over previous linear transformation-based approaches across a range of vision-language tasks, including image captioning, visual question answering, and cross-modal retrieval. We also demonstrate that the learned visual representations hold the semantic taxonomy of LLMs, making visual semantic arithmetic possible.
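
To make the assignment-prediction idea concrete, the following is a minimal, hypothetical PyTorch sketch of such an objective, not the authors' released implementation. It assumes a SwAV-style swapped-prediction setup: frozen vision and text features (vis_feat, txt_feat), a frozen subset of the LLM's word-embedding matrix (word_emb) serving as the assignment targets, a single trainable linear layer (proj), and a Sinkhorn-Knopp solver for the optimal transport problem. All of these names and design choices are illustrative assumptions based on the abstract.

```python
# Hypothetical sketch of cross-modal assignment prediction (not the authors' code).
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, eps=0.05, iters=3):
    """Sinkhorn-Knopp: turn a (batch x vocab) score matrix into a soft assignment
    with approximately uniform marginals over the batch and the vocabulary."""
    q = torch.exp(scores / eps)
    q /= q.sum()
    B, K = q.shape
    for _ in range(iters):
        q /= q.sum(dim=0, keepdim=True)  # normalize columns (vocabulary marginal)
        q /= K
        q /= q.sum(dim=1, keepdim=True)  # normalize rows (batch marginal)
        q /= B
    return q * B  # each row sums to 1, so rows can serve as prediction targets

def assignment_prediction_loss(vis_feat, txt_feat, word_emb, proj, temp=0.1):
    """Swapped assignment prediction between modalities.

    vis_feat: (B, D_v) features from a frozen vision encoder
    txt_feat: (B, D_t) text features already in the LLM embedding space
    word_emb: (K, D_t) frozen LLM word embeddings acting as assignment targets
    proj:     a single trainable linear layer mapping D_v -> D_t
    """
    v = F.normalize(proj(vis_feat), dim=-1)
    t = F.normalize(txt_feat, dim=-1)
    w = F.normalize(word_emb, dim=-1)

    # Similarity of each sample to every word embedding.
    v_scores = v @ w.t()  # (B, K)
    t_scores = t @ w.t()  # (B, K)

    # Optimal-transport assignments (targets), computed without gradients.
    q_v = sinkhorn(v_scores)
    q_t = sinkhorn(t_scores)

    # Each modality predicts the other's assignment, enforcing consistency.
    loss_v = -(q_t * F.log_softmax(v_scores / temp, dim=-1)).sum(dim=-1).mean()
    loss_t = -(q_v * F.log_softmax(t_scores / temp, dim=-1)).sum(dim=-1).mean()
    return 0.5 * (loss_v + loss_t)
```

In this sketch only proj receives gradients, mirroring the claim that a single linear layer suffices while the vision encoder and the LLM stay frozen; the Sinkhorn targets are computed without gradients, as is common for optimal-transport-based self-labeling objectives.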


URL

https://arxiv.org/abs/2404.09632

PDF

https://arxiv.org/pdf/2404.09632.pdf

