Abstract
The multifaceted nature of human perception and comprehension suggests that, when we think, our minds naturally take in any combination of senses, i.e., modalities, and form a coherent picture in our brain. For example, when we see a cattery and simultaneously hear a cat's purring, our brain can construct an image of a cat in the cattery. Intuitively, generative AI models should match this human versatility and be capable of generating images from any combination of modalities efficiently and collaboratively. This paper presents ImgAny, a novel end-to-end multi-modal generative model that mimics human reasoning to generate high-quality images. Our method is the first attempt to efficiently and flexibly take any combination of seven modalities, ranging from language and audio to vision modalities, including image, point cloud, thermal, depth, and event data. Our key idea is inspired by human-level cognitive processes: we integrate and harmonize multiple input modalities at both the entity and attribute levels, without modality-specific tuning. Accordingly, our method introduces two novel training-free branches: 1) the Entity Fusion Branch ensures coherence between inputs and outputs by extracting entity features from the multi-modal representations, powered by our specially constructed entity knowledge graph; 2) the Attribute Fusion Branch adeptly preserves and processes attributes, efficiently amalgamating distinct attributes from the diverse input modalities via our proposed attribute knowledge graph. Finally, the entity and attribute features are adaptively fused and fed as conditional inputs to a pre-trained Stable Diffusion model for image generation. Extensive experiments under diverse modality combinations demonstrate ImgAny's exceptional capability for visual content creation.
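To make the described pipeline concrete, below is a minimal, purely illustrative sketch of the two-branch design. All names (entity_fusion, attribute_fusion, adaptive_fuse), the pooling and gating choices, and the 768-dim shared embedding space are assumptions for illustration; the abstract does not specify the knowledge-graph mechanics or the fusion operators, so simple stand-ins are used.

```python
# Hypothetical sketch of ImgAny's two-branch fusion, NOT the paper's code.
import torch
import torch.nn.functional as F

def entity_fusion(modal_embeds: list) -> torch.Tensor:
    """Entity branch (stand-in): pool per-modality embeddings into one
    entity feature. The paper instead extracts entity features via an
    entity knowledge graph; mean pooling is only a placeholder."""
    return torch.stack(modal_embeds, dim=0).mean(dim=0)

def attribute_fusion(modal_embeds: list) -> torch.Tensor:
    """Attribute branch (stand-in): keep the most salient attribute
    dimensions across modalities via element-wise max, standing in for
    the paper's attribute knowledge graph."""
    return torch.stack(modal_embeds, dim=0).max(dim=0).values

def adaptive_fuse(entity: torch.Tensor, attribute: torch.Tensor) -> torch.Tensor:
    """Adaptively blend entity and attribute features; a cosine-similarity
    gate is used here purely for illustration."""
    gate = torch.sigmoid(F.cosine_similarity(entity, attribute, dim=-1))
    return gate.unsqueeze(-1) * entity + (1.0 - gate).unsqueeze(-1) * attribute

# Toy usage: three modalities (e.g., text, audio, depth) assumed to be
# already encoded into a shared 768-dim token space by frozen encoders.
embeds = [torch.randn(1, 77, 768) for _ in range(3)]
cond = adaptive_fuse(entity_fusion(embeds), attribute_fusion(embeds))
print(cond.shape)  # torch.Size([1, 77, 768])

# The fused feature would then condition a pre-trained Stable Diffusion
# model, e.g. via diffusers' StableDiffusionPipeline `prompt_embeds`
# argument (SD v1.x expects shape [batch, 77, 768]).
```

Both branches are training-free in the paper's framing, so the sketch deliberately contains no learnable parameters; only the frozen per-modality encoders and the pre-trained diffusion model carry weights.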
URL
https://arxiv.org/abs/2401.17664