Paper Reading AI Learner

Molecule-Space: Free Lunch in Unified Multimodal Space via Knowledge Fusion

2024-05-08 08:32:34
Zehan Wang, Ziang Zhang, Xize Cheng, Rongjie Huang, Luping Liu, Zhenhui Ye, Haifeng Huang, Yang Zhao, Tao Jin, Peng Gao, Zhou Zhao

Abstract

Unified multimodal representation spaces are the foundation of multimodal understanding and generation. However, billions of model parameters and the catastrophic forgetting problem make it challenging to further enhance pre-trained unified spaces. In this work, we propose Molecule-Space, an idea that treats multimodal representation spaces as "molecules" and augments a pre-trained unified space by integrating knowledge from extra expert spaces via "molecule space reactions". Specifically, we introduce two kinds of basic space reactions: 1) Space Displacement Reaction and 2) Space Combination Reaction. Based on these basic reactions, we design Complex Sequential & Parallel Reactions to effectively integrate multiple spaces simultaneously. Benefiting from the modularization concept, we further propose a coarse-to-fine customized inference strategy to flexibly adjust the enhanced unified space for different purposes. Experimentally, we fuse the audio-image-text space of ImageBind with image-text and audio-text expert spaces. The resulting space outperforms ImageBind on 5 downstream tasks across 9 datasets. Moreover, via customized inference, it even surpasses the image-text and audio-text expert spaces used.
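The abstract does not spell out how the "space reactions" are computed, but the core idea of fusing pre-trained embedding spaces without retraining can be illustrated with a generic sketch. The snippet below is an assumption-laden illustration, not the paper's actual method: it aligns a hypothetical "expert" space to a "unified" space with an orthogonal Procrustes map estimated from paired anchor embeddings, then blends the two views. All names, shapes, and the blending rule are placeholders chosen for illustration.

```python
# Illustrative sketch only (not Molecule-Space's actual reactions): fuse two
# pre-trained embedding spaces by aligning the expert space to the unified
# space with an orthogonal Procrustes map, then averaging the two views.
import numpy as np

def fit_orthogonal_map(expert_anchors: np.ndarray, unified_anchors: np.ndarray) -> np.ndarray:
    """Solve min_W ||expert_anchors @ W - unified_anchors||_F with W orthogonal."""
    # Closed-form orthogonal Procrustes solution via SVD of the cross-covariance.
    u, _, vt = np.linalg.svd(expert_anchors.T @ unified_anchors)
    return u @ vt

def combine_spaces(expert_emb: np.ndarray, unified_emb: np.ndarray,
                   w: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Map expert embeddings into the unified space and blend the two views."""
    mapped = expert_emb @ w
    fused = alpha * unified_emb + (1.0 - alpha) * mapped
    # Re-normalize so fused embeddings remain comparable under cosine similarity.
    return fused / np.linalg.norm(fused, axis=-1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 512  # hypothetical shared embedding dimension
    anchors_expert = rng.standard_normal((1000, d))   # e.g. expert image-text encoder outputs
    anchors_unified = rng.standard_normal((1000, d))  # e.g. unified-space outputs for the same items
    w = fit_orthogonal_map(anchors_expert, anchors_unified)
    fused = combine_spaces(anchors_expert[:5], anchors_unified[:5], w)
    print(fused.shape)  # (5, 512)
```

In practice the anchor pairs would come from running both encoders on the same data; the paper's displacement, combination, and sequential/parallel reactions presumably define more refined fusion rules than this simple linear blend.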

Abstract (translated)

Unified multimodal representation spaces are the foundation of multimodal understanding and generation. However, billions of model parameters and the catastrophic forgetting problem make it challenging to further enhance pre-trained unified spaces. In this work, we propose Molecule-Space, an idea that treats multimodal representation spaces as "molecules" and integrates knowledge from extra expert spaces into a pre-trained unified space via "molecule space reactions". Specifically, we introduce two kinds of basic space reactions: 1) Space Displacement Reaction and 2) Space Combination Reaction. Based on these basic reactions, we design Complex Sequential & Parallel Reactions to effectively integrate multiple spaces simultaneously. Building on the modularization concept, we further propose a coarse-to-fine customized inference strategy to flexibly adjust the enhanced unified space for different purposes. Experimentally, we fuse the audio-image-text space of ImageBind with image-text and audio-text expert spaces; the resulting space outperforms ImageBind on 5 downstream tasks across 9 datasets. Moreover, via customized inference, it even surpasses the image-text and audio-text expert spaces used.

URL

https://arxiv.org/abs/2405.04883

PDF

https://arxiv.org/pdf/2405.04883.pdf

