Paper Reading AI Learner

EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning

2024-04-22 03:05:32
Mingjie Ma, Zhihuan Yu, Yichao Ma, Guohui Li

Abstract

Visual Commonsense Reasoning (VCR) is a cognitive task that challenges models to answer visual questions requiring human commonsense and to provide rationales explaining why the answers are correct. With the emergence of Large Language Models (LLMs), it is natural and imperative to explore their applicability to VCR. However, the VCR task demands more external knowledge to tackle its challenging questions, necessitating special designs to activate LLMs' commonsense reasoning abilities. Moreover, most existing Multimodal LLMs adopt an abstraction of the entire input image, which makes it difficult to comprehend VCR's unique co-reference tags between image regions and text, posing challenges for fine-grained alignment. To address these issues, we propose EventLens, which leverages Event-Aware Pretraining and Cross-modal Linking and EnhanceS VCR. First, by emulating the cognitive process of human reasoning, an Event-Aware Pretraining auxiliary task is introduced to better activate the LLM's global comprehension of intricate scenarios. Second, during fine-tuning, we further utilize reference tags to bridge RoI features with texts while preserving the semantics of both modalities. Finally, we use instruct-style prompts to narrow the gap between pretraining and fine-tuning, and task-specific adapters to better integrate the LLM's inherent knowledge with new commonsense. Experimental results demonstrate the effectiveness of the proposed auxiliary task and fine-grained linking strategy.
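To make the cross-modal linking idea concrete: VCR questions reference image regions inline with tags such as "[person1]". The sketch below is a hypothetical illustration (not the paper's implementation) of how such tags could be paired with their RoI feature vectors while ordinary text spans are left untouched, so both the textual label and the region feature survive into the model input. The tag format and feature representation are assumptions for illustration only.

```python
import re

def link_regions(text, roi_features):
    """Split text around [tagN] markers and pair each recognized tag
    with its RoI feature vector; plain text spans carry no feature.

    Hypothetical sketch of fine-grained region-text linking; the
    actual EventLens fusion is learned, not a simple lookup.
    """
    # Capturing group keeps the [tag] delimiters in the split output.
    parts = re.split(r"(\[\w+\])", text)
    linked = []
    for part in parts:
        if not part:
            continue
        tag = part.strip("[]")
        if part.startswith("[") and tag in roi_features:
            # Fused token: (surface label, region feature) -- both
            # modality semantics are preserved.
            linked.append((tag, roi_features[tag]))
        else:
            linked.append((part, None))
    return linked

rois = {"person1": [0.12, 0.80], "person2": [0.55, 0.31]}  # toy features
tokens = link_regions("Why is [person1] pointing at [person2]?", rois)
```

Here `tokens` interleaves plain-text spans with (tag, feature) pairs, giving a downstream model an explicit alignment between each co-reference tag and its image region.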


URL

https://arxiv.org/abs/2404.13847

PDF

https://arxiv.org/pdf/2404.13847.pdf

