Paper Reading AI Learner

Molecule-Space: Free Lunch in Unified Multimodal Space via Knowledge Fusion

2024-05-08 08:32:34
Zehan Wang, Ziang Zhang, Xize Cheng, Rongjie Huang, Luping Liu, Zhenhui Ye, Haifeng Huang, Yang Zhao, Tao Jin, Peng Gao, Zhou Zhao

Abstract

Unified multimodal representation spaces are the foundation of multimodal understanding and generation. However, billions of model parameters and the catastrophic forgetting problem make it challenging to further enhance pre-trained unified spaces. In this work, we propose Molecule-Space, an idea that treats multimodal representation spaces as "molecules" and augments a pre-trained unified space by integrating knowledge from extra expert spaces via "molecule space reactions". Specifically, we introduce two kinds of basic space reactions: 1) Space Displacement Reaction and 2) Space Combination Reaction. Based on these basic reactions, we design Complex Sequential & Parallel Reactions to effectively integrate multiple spaces simultaneously. Benefiting from the modularization concept, we further propose a coarse-to-fine customized inference strategy to flexibly adjust the enhanced unified space for different purposes. Experimentally, we fuse the audio-image-text space of ImageBind with image-text and audio-text expert spaces. The resulting space outperforms ImageBind on 5 downstream tasks across 9 datasets. Moreover, via customized inference, it even surpasses the image-text and audio-text expert spaces used.
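The abstract does not spell out how the "space reactions" are computed, but the core idea of fusing pre-trained embedding spaces without retraining can be illustrated with a generic sketch. The snippet below is an assumption-laden illustration, not the paper's actual method: it aligns a hypothetical "expert" space to a "unified" space with an orthogonal Procrustes map estimated from paired anchor embeddings, then blends the two views. All names, shapes, and the blending rule are placeholders chosen for illustration.

```python
# Illustrative sketch only (not Molecule-Space's actual reactions): fuse two
# pre-trained embedding spaces by aligning the expert space to the unified
# space with an orthogonal Procrustes map, then averaging the two views.
import numpy as np

def fit_orthogonal_map(expert_anchors: np.ndarray, unified_anchors: np.ndarray) -> np.ndarray:
    """Solve min_W ||expert_anchors @ W - unified_anchors||_F with W orthogonal."""
    # Closed-form orthogonal Procrustes solution via SVD of the cross-covariance.
    u, _, vt = np.linalg.svd(expert_anchors.T @ unified_anchors)
    return u @ vt

def combine_spaces(expert_emb: np.ndarray, unified_emb: np.ndarray,
                   w: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Map expert embeddings into the unified space and blend the two views."""
    mapped = expert_emb @ w
    fused = alpha * unified_emb + (1.0 - alpha) * mapped
    # Re-normalize so fused embeddings remain comparable under cosine similarity.
    return fused / np.linalg.norm(fused, axis=-1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 512  # hypothetical shared embedding dimension
    anchors_expert = rng.standard_normal((1000, d))   # e.g. expert image-text encoder outputs
    anchors_unified = rng.standard_normal((1000, d))  # e.g. unified-space outputs for the same items
    w = fit_orthogonal_map(anchors_expert, anchors_unified)
    fused = combine_spaces(anchors_expert[:5], anchors_unified[:5], w)
    print(fused.shape)  # (5, 512)
```

In practice the anchor pairs would come from running both encoders on the same data; the paper's displacement, combination, and sequential/parallel reactions presumably define more refined fusion rules than this simple linear blend.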

Abstract (translated)

Unified multimodal representation spaces are the foundation of multimodal understanding and generation. However, billions of model parameters and the catastrophic forgetting problem make it challenging to further enhance pre-trained unified spaces. In this work, we propose Molecule-Space, an idea that treats multimodal representation spaces as "molecules" and integrates knowledge from extra expert spaces into a pre-trained unified space via "molecule space reactions". Specifically, we introduce two kinds of basic space reactions: 1) Space Displacement Reaction and 2) Space Combination Reaction. Based on these basic reactions, we design Complex Sequential & Parallel Reactions to effectively integrate multiple spaces simultaneously. Building on the modularization concept, we further propose a coarse-to-fine customized inference strategy to flexibly adjust the enhanced unified space for different purposes. Experimentally, we fuse the audio-image-text space of ImageBind with image-text and audio-text expert spaces; the resulting space outperforms ImageBind on 5 downstream tasks across 9 datasets. Moreover, via customized inference, it even surpasses the image-text and audio-text expert spaces used.

URL

https://arxiv.org/abs/2405.04883

PDF

https://arxiv.org/pdf/2405.04883.pdf

