CICA: Content-Injected Contrastive Alignment for Zero-Shot Document Image Classification

2024-05-06 17:37:23
Sankalp Sinha, Muhammad Saif Ullah Khan, Talha Uddin Sheikh, Didier Stricker, Muhammad Zeshan Afzal

Abstract

Zero-shot learning has been extensively investigated in the broader field of visual recognition, attracting significant interest recently. However, the current work on zero-shot learning in document image classification remains scarce. The existing studies either focus exclusively on zero-shot inference, or their evaluation does not align with the established criteria of zero-shot evaluation in the visual recognition domain. We provide a comprehensive document image classification analysis in Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL) settings to address this gap. Our methodology and evaluation align with the established practices of this domain. Additionally, we propose zero-shot splits for the RVL-CDIP dataset. Furthermore, we introduce CICA (pronounced 'ki-ka'), a framework that enhances the zero-shot learning capabilities of CLIP. CICA consists of a novel 'content module' designed to leverage any generic document-related textual information. The discriminative features extracted by this module are aligned with CLIP's text and image features using a novel 'coupled-contrastive' loss. Our module improves CLIP's ZSL top-1 accuracy by 6.7% and GZSL harmonic mean by 24% on the RVL-CDIP dataset. Our module is lightweight and adds only 3.3% more parameters to CLIP. Our work sets the direction for future research in zero-shot document classification.
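
The abstract reports a GZSL harmonic mean without defining it. In the GZSL literature this usually means the harmonic mean of top-1 accuracy on seen classes and on unseen classes, which stays low unless both accuracies are high. A minimal Python sketch under that assumption follows; the function name and the sample accuracies are illustrative and are not taken from the paper.

```python
def gzsl_harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean of seen-class and unseen-class accuracy (both in [0, 1])."""
    total = seen_acc + unseen_acc
    if total == 0:
        # Both accuracies are zero; define the harmonic mean as zero.
        return 0.0
    return 2 * seen_acc * unseen_acc / total


# Hypothetical accuracies, NOT results from the paper.
balanced = gzsl_harmonic_mean(0.60, 0.60)  # equal performance on both sets
skewed = gzsl_harmonic_mean(0.90, 0.10)    # strong bias toward seen classes

print(f"balanced: {balanced:.2f}")
print(f"skewed:   {skewed:.2f}")
```

With these made-up numbers, a model biased toward seen classes scores far lower than a balanced one even though the arithmetic mean of the two accuracies is similar, which is why the harmonic mean is the usual headline metric in the generalized setting.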

URL

https://arxiv.org/abs/2405.03660

PDF

https://arxiv.org/pdf/2405.03660.pdf
