CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

2024-01-22 08:16:48
Xianghu Yue, Xiaohai Tian, Malu Zhang, Zhizheng Wu, Haizhou Li

Abstract

There has been a long-standing quest for a unified audio-visual-text model to enable various multimodal understanding tasks, mimicking the listening, seeing and reading process of human beings. Humans tend to represent knowledge using two separate systems: one for representing verbal (textual) information and one for representing non-verbal (visual and auditory) information. These two systems can operate independently but can also interact with each other. Motivated by this understanding of human cognition, in this paper, we introduce CoAVT -- a novel cognition-inspired Correlated Audio-Visual-Text pre-training model to connect the three modalities. It contains a joint audio-visual encoder that learns to encode audio-visual synchronization information together with the audio and visual content for non-verbal information, and a text encoder to handle textual input for verbal information. To bridge the gap between modalities, CoAVT employs a query encoder, which contains a set of learnable query embeddings, to extract the audio-visual features most informative for the corresponding text. Additionally, to leverage the correspondences of audio and vision with language respectively, we also establish audio-text and visual-text bi-modal alignments on top of the foundational audiovisual-text tri-modal alignment to enhance multimodal representation learning. Finally, we jointly optimize the CoAVT model with three multimodal objectives: contrastive loss, matching loss and language modeling loss. Extensive experiments show that CoAVT learns strong multimodal correlations and generalizes to various downstream tasks. CoAVT establishes new state-of-the-art performance on the text-video retrieval task on AudioCaps for both zero-shot and fine-tuning settings, and on the audio-visual event classification and audio-visual retrieval tasks on AudioSet and VGGSound.
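The tri-modal alignment plus the two bi-modal alignments described above can be illustrated with a minimal NumPy sketch of the contrastive term. Everything here is an assumption for illustration: the symmetric InfoNCE formulation, the temperature value, and all variable names are not taken from the paper, and the matching and language-modeling losses are omitted.

```python
import numpy as np

def info_nce(x, y, temperature=0.07):
    """Symmetric InfoNCE contrastive loss between two batches of
    embeddings; matched pairs share the same row index."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    logits = x @ y.T / temperature
    idx = np.arange(len(x))

    def xent(l):
        # cross-entropy with the diagonal as the positive class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
B, D = 4, 16                      # batch size and embedding dim (illustrative)
av  = rng.normal(size=(B, D))     # joint audio-visual (query-encoder) output
aud = rng.normal(size=(B, D))     # audio-only embedding
vis = rng.normal(size=(B, D))     # visual-only embedding
txt = rng.normal(size=(B, D))     # text-encoder embedding

# tri-modal (audiovisual-text) alignment plus audio-text and
# visual-text bi-modal alignments, summed as one contrastive objective
loss_contrastive = info_nce(av, txt) + info_nce(aud, txt) + info_nce(vis, txt)
print(float(loss_contrastive))
```

In practice the three terms would be computed on encoder outputs and summed with the matching and language-modeling losses before backpropagation; random vectors are used here only so the sketch runs standalone.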

Abstract (translated)

For a long time, people have sought a unified audio-visual-text model to enable various multimodal understanding tasks, mimicking the human process of listening, seeing and reading. Humans tend to represent knowledge with two separate systems: one for verbal (textual) information and one for non-verbal (visual and auditory) information. These two systems can operate independently but can also interact with each other. Motivated by this understanding of human cognition, this paper introduces CoAVT, a novel cognition-inspired correlated audio-visual-text pre-training model that connects the three modalities. It contains a joint audio-visual encoder that learns to encode audio-visual synchronization information together with the audio and visual content for non-verbal information, and a text encoder that handles textual input for verbal information. To bridge the gap between modalities, CoAVT employs a query encoder containing a set of learnable query embeddings to extract the audio-visual features most informative for the corresponding text. In addition, to exploit the correspondences of audio and vision with language respectively, audio-text and visual-text bi-modal alignments are established on top of the foundational audiovisual-text tri-modal alignment to enhance multimodal representation learning. Finally, the CoAVT model is jointly optimized with three multimodal objectives: contrastive loss, matching loss and language modeling loss. Extensive experiments show that CoAVT learns strong multimodal correlations and generalizes to various downstream tasks. CoAVT achieves new state-of-the-art performance on the text-video retrieval task on AudioCaps in both zero-shot and fine-tuning settings, and on the audio-visual event classification and audio-visual retrieval tasks on AudioSet and VGGSound.

URL

https://arxiv.org/abs/2401.12264

PDF

https://arxiv.org/pdf/2401.12264.pdf
