Abstract
Traditional multimodal learners find unified representations for tasks like visual question answering, but rely heavily on paired datasets. However, an overlooked yet potentially powerful question is: can one leverage auxiliary unpaired multimodal data to directly enhance representation learning in a target modality? We introduce UML: Unpaired Multimodal Learner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. This design exploits the assumption that different modalities are projections of a shared underlying reality, allowing the model to benefit from cross-modal structure without requiring explicit pairs. Theoretically, under linear data-generating assumptions, we show that unpaired auxiliary data can yield representations strictly more informative about the data-generating process than unimodal training. Empirically, we show that using unpaired data from auxiliary modalities -- such as text, audio, or images -- consistently improves downstream performance across diverse unimodal targets such as images and audio. Our project page: this https URL
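The mechanism summarized above, a single shared set of parameters updated by alternating unpaired batches from different modalities, can be illustrated with a minimal sketch. This is not the paper's implementation: the feature dimensions, the per-modality classification objectives, and the strict even/odd alternation schedule are placeholder assumptions chosen only to make the idea concrete.

```python
# Minimal sketch of alternating unpaired-modality training with a shared trunk.
# All sizes and objectives below are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Modality-specific input projections map each modality into a common width.
image_proj = nn.Linear(512, 256)   # e.g. pooled image features -> shared width
text_proj = nn.Linear(300, 256)    # e.g. averaged word embeddings -> shared width

# Shared trunk: the parameters reused across modalities.
trunk = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

# Modality-specific heads for whatever per-modality objective is available.
image_head = nn.Linear(256, 10)    # e.g. 10-way image classification
text_head = nn.Linear(256, 4)      # e.g. 4-way topic classification

params = (list(image_proj.parameters()) + list(text_proj.parameters()) +
          list(trunk.parameters()) + list(image_head.parameters()) +
          list(text_head.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Unpaired data: image and text batches are drawn independently of each other.
def image_batch():
    return torch.randn(32, 512), torch.randint(0, 10, (32,))

def text_batch():
    return torch.randn(32, 300), torch.randint(0, 4, (32,))

for step in range(100):
    # Alternate modalities; both paths pass through the same shared trunk.
    if step % 2 == 0:
        x, y = image_batch()
        logits = image_head(trunk(image_proj(x)))
    else:
        x, y = text_batch()
        logits = text_head(trunk(text_proj(x)))
    loss = loss_fn(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In this sketch the gradients from both modalities flow through the same trunk even though no image is ever paired with a text sample, which is the sense in which the target modality can benefit from auxiliary unpaired data.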
URL
https://arxiv.org/abs/2510.08492