CICA: Content-Injected Contrastive Alignment for Zero-Shot Document Image Classification

2024-05-06 17:37:23
Sankalp Sinha, Muhammad Saif Ullah Khan, Talha Uddin Sheikh, Didier Stricker, Muhammad Zeshan Afzal


Zero-shot learning has been extensively investigated in the broader field of visual recognition and has recently attracted significant interest. However, work on zero-shot learning in document image classification remains scarce. Existing studies either focus exclusively on zero-shot inference, or use evaluation protocols that do not align with the established criteria for zero-shot evaluation in the visual recognition domain. To address this gap, we provide a comprehensive analysis of document image classification in the Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL) settings. Our methodology and evaluation follow the established practices of this domain. Additionally, we propose zero-shot splits for the RVL-CDIP dataset. Furthermore, we introduce CICA (pronounced 'ki-ka'), a framework that enhances the zero-shot learning capabilities of CLIP. CICA consists of a novel 'content module' designed to leverage any generic document-related textual information. The discriminative features extracted by this module are aligned with CLIP's text and image features using a novel 'coupled-contrastive' loss. Our module improves CLIP's ZSL top-1 accuracy by 6.7% and GZSL harmonic mean by 24% on the RVL-CDIP dataset. The module is lightweight, adding only 3.3% more parameters to CLIP. Our work sets the direction for future research in zero-shot document classification.
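The two metrics reported above can be illustrated with a minimal sketch. It assumes the common GZSL convention in which the harmonic mean combines accuracy on seen classes with accuracy on unseen classes, and it assumes accuracy is averaged per class; the abstract does not spell out either detail, so the function names and the per-class averaging below are illustrative, not the authors' evaluation code.

```python
def per_class_top1(y_true, y_pred, classes):
    """Average of per-class top-1 accuracies over `classes`.

    Assumption: accuracy is averaged per class (a common zero-shot
    convention); the abstract does not confirm this detail.
    """
    accuracies = []
    for c in classes:
        indices = [i for i, y in enumerate(y_true) if y == c]
        if not indices:
            continue  # skip classes with no test samples
        correct = sum(1 for i in indices if y_pred[i] == c)
        accuracies.append(correct / len(indices))
    return sum(accuracies) / len(accuracies) if accuracies else 0.0


def gzsl_harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean H = 2 * S * U / (S + U) of seen and unseen accuracy."""
    total = acc_seen + acc_unseen
    if total == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / total


if __name__ == "__main__":
    # Toy labels, purely illustrative (not RVL-CDIP data).
    y_true = ["letter", "letter", "memo"]
    y_pred = ["letter", "memo", "memo"]
    print(per_class_top1(y_true, y_pred, ["letter", "memo"]))
    print(gzsl_harmonic_mean(0.8, 0.2))
```

The harmonic mean penalizes imbalance: a model that scores well on seen classes but poorly on unseen ones gets a low value, which is why GZSL results are usually summarized this way rather than with a plain average.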




