MEVER: Multi-Modal and Explainable Claim Verification with Graph-based Evidence Retrieval

2026-02-10 17:44:57
Delvin Ce Zhang, Suhan Cui, Zhelin Chu, Xianren Zhang, Dongwon Lee

Abstract

Verifying the truthfulness of claims usually requires joint multi-modal reasoning over both textual and visual evidence, such as analyzing both a textual caption and a chart image to verify a claim. In addition, to make the reasoning process transparent, a textual explanation is necessary to justify the verification result. However, most claim verification work focuses on reasoning over textual evidence only or ignores explainability, resulting in inaccurate and unconvincing verification. To address this problem, we propose a novel model that jointly achieves evidence retrieval, multi-modal claim verification, and explanation generation. For evidence retrieval, we construct a two-layer multi-modal graph over claims and evidence, on which we design image-to-text and text-to-image reasoning for multi-modal retrieval. For claim verification, we propose token- and evidence-level fusion to integrate claim and evidence embeddings for multi-modal verification. For explanation generation, we introduce a multi-modal Fusion-in-Decoder for explainability. Finally, since almost all existing datasets cover only the general domain, we create a scientific dataset, AIChartClaim, in the AI domain to complement the claim verification community. Experiments demonstrate the strength of our model.
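The abstract's two fusion levels can be pictured concretely. Below is a minimal sketch, not the authors' code: it assumes token-level fusion means claim tokens cross-attend to each evidence item's tokens (text tokens or image patches), and evidence-level fusion means attention-weighted pooling over per-evidence summaries before a verdict head. All class names, dimensions, and the three-way label set are illustrative assumptions.

```python
# Hypothetical sketch of token- and evidence-level fusion for claim verification.
# Not the paper's implementation; dimensions and module choices are assumptions.
import torch
import torch.nn as nn


class TwoLevelFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Token-level fusion: claim tokens attend to one evidence item's tokens.
        self.token_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Evidence-level fusion: score each evidence summary against the claim.
        self.evidence_scorer = nn.Bilinear(dim, dim, 1)
        self.classifier = nn.Linear(dim, 3)  # e.g., supported / refuted / NEI

    def forward(self, claim_tokens: torch.Tensor, evidence_tokens: torch.Tensor):
        # claim_tokens:    (batch, claim_len, dim)
        # evidence_tokens: (batch, num_evidence, ev_len, dim), text or image patches
        b, n, l, d = evidence_tokens.shape
        summaries = []
        for i in range(n):
            # Token-level: claim queries attend over this evidence item's tokens.
            out, _ = self.token_attn(
                claim_tokens, evidence_tokens[:, i], evidence_tokens[:, i]
            )
            summaries.append(out.mean(dim=1))        # (batch, dim) per evidence
        summaries = torch.stack(summaries, dim=1)    # (batch, num_evidence, dim)
        claim_vec = claim_tokens.mean(dim=1)         # (batch, dim)
        # Evidence-level: softmax attention weights over evidence summaries.
        scores = self.evidence_scorer(
            claim_vec.unsqueeze(1).expand_as(summaries).contiguous(), summaries
        ).squeeze(-1)                                # (batch, num_evidence)
        weights = scores.softmax(dim=-1).unsqueeze(-1)
        pooled = (weights * summaries).sum(dim=1)    # (batch, dim)
        return self.classifier(pooled)               # (batch, 3) verdict logits


# Usage with random tensors standing in for encoder outputs:
# logits = TwoLevelFusion()(torch.randn(2, 12, 256), torch.randn(2, 5, 30, 256))
```

The same per-evidence summaries could plausibly feed the explanation stage: Fusion-in-Decoder concatenates independently encoded evidence passages at the decoder's cross-attention, which matches the abstract's multi-modal Fusion-in-Decoder description.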

URL

https://arxiv.org/abs/2602.10023

PDF

https://arxiv.org/pdf/2602.10023.pdf

