Paper Reading AI Learner

EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning

2024-04-22 03:05:32
Mingjie Ma, Zhihuan Yu, Yichao Ma, Guohui Li

Abstract

Visual Commonsense Reasoning (VCR) is a cognitive task that challenges models to answer visual questions requiring human commonsense and to provide rationales explaining why the answers are correct. With the emergence of Large Language Models (LLMs), it is natural and imperative to explore their applicability to VCR. However, the VCR task demands more external knowledge to tackle its challenging questions, necessitating special designs to activate LLMs' commonsense reasoning abilities. Moreover, most existing Multimodal LLMs adopt an abstraction of the entire input image, which makes it difficult to comprehend VCR's unique co-reference tags between image regions and text, posing challenges for fine-grained alignment. To address these issues, we propose EventLens, which leverages Event-Aware Pretraining and Cross-modal Linking and EnhanceS VCR. First, by emulating the cognitive process of human reasoning, an Event-Aware Pretraining auxiliary task is introduced to better activate the LLM's global comprehension of intricate scenarios. Second, during fine-tuning, we further utilize reference tags to bridge RoI features with texts while preserving the semantics of both modalities. Finally, we use instruct-style prompts to narrow the gap between pretraining and fine-tuning, and task-specific adapters to better integrate the LLM's inherent knowledge with new commonsense. Experimental results demonstrate the effectiveness of the proposed auxiliary task and fine-grained linking strategy.
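To make the cross-modal linking idea concrete: VCR questions reference image regions inline with tags such as "[person1]". The sketch below is a hypothetical illustration (not the paper's implementation) of how such tags could be paired with their RoI feature vectors while ordinary text spans are left untouched, so both the textual label and the region feature survive into the model input. The tag format and feature representation are assumptions for illustration only.

```python
import re

def link_regions(text, roi_features):
    """Split text around [tagN] markers and pair each recognized tag
    with its RoI feature vector; plain text spans carry no feature.

    Hypothetical sketch of fine-grained region-text linking; the
    actual EventLens fusion is learned, not a simple lookup.
    """
    # Capturing group keeps the [tag] delimiters in the split output.
    parts = re.split(r"(\[\w+\])", text)
    linked = []
    for part in parts:
        if not part:
            continue
        tag = part.strip("[]")
        if part.startswith("[") and tag in roi_features:
            # Fused token: (surface label, region feature) -- both
            # modality semantics are preserved.
            linked.append((tag, roi_features[tag]))
        else:
            linked.append((part, None))
    return linked

rois = {"person1": [0.12, 0.80], "person2": [0.55, 0.31]}  # toy features
tokens = link_regions("Why is [person1] pointing at [person2]?", rois)
```

Here `tokens` interleaves plain-text spans with (tag, feature) pairs, giving a downstream model an explicit alignment between each co-reference tag and its image region.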


URL

https://arxiv.org/abs/2404.13847

PDF

https://arxiv.org/pdf/2404.13847.pdf

