Abstract
We introduce a formal information-theoretic framework for image captioning by regarding it as a representation learning task. Our framework defines three key objectives: task sufficiency, minimal redundancy, and human interpretability. Building upon this foundation, we propose a novel Pyramid of Captions (PoCa) method, which constructs caption pyramids by generating localized captions for zoomed-in image patches and integrating them with global caption information using large language models. This approach leverages the intuition that detailed examination of local patches can reduce the risk of errors and address inaccuracies in global captions, either by correcting hallucinations or by adding missing details. Based on our theoretical framework, we formalize this intuition and provide a formal proof of the effectiveness of PoCa under certain assumptions. Empirical tests with various image captioning models and large language models show that PoCa consistently yields more informative and semantically aligned captions while maintaining brevity and interpretability.
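The local-to-global merging described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how patches are chosen or how the language model is prompted, so the uniform grid split, the single pyramid level, and all names here (`grid_boxes`, `crop`, `pyramid_caption`, and the `captioner` and `merger` callables) are assumptions made for illustration. The image is represented as a nested list of pixel rows to keep the example dependency-free.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), hypothetical convention
Image = Sequence[Sequence[object]]  # rows of pixels; stands in for a real image type


def grid_boxes(width: int, height: int, rows: int = 2, cols: int = 2) -> List[Box]:
    """Split a width x height image into a rows x cols grid of patch boxes."""
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            top = r * height // rows
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


def crop(image: Image, box: Box) -> List[List[object]]:
    """Return the sub-image covered by box."""
    left, top, right, bottom = box
    return [list(row[left:right]) for row in image[top:bottom]]


def pyramid_caption(
    image: Image,
    captioner: Callable[[Image], str],
    merger: Callable[[str, List[str]], str],
    rows: int = 2,
    cols: int = 2,
) -> str:
    """Caption the whole image and each patch, then merge the captions.

    captioner: assumed to wrap any image captioning model.
    merger: assumed to wrap a language model call that fuses the global
            caption with the local ones (correcting or adding details).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    global_caption = captioner(image)
    local_captions = [
        captioner(crop(image, box)) for box in grid_boxes(width, height, rows, cols)
    ]
    return merger(global_caption, local_captions)
```

In practice `captioner` and `merger` would call real models; passing them in as plain callables keeps the sketch independent of any particular captioning model or language model API.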
URL
https://arxiv.org/abs/2405.00485