Abstract
Recognizing characters and predicting the speakers of dialogue are critical for comic processing tasks such as voice generation and translation. However, because characters vary across comic titles, supervised approaches such as training character classifiers, which require annotations specific to each title, are infeasible. This motivates us to propose a novel zero-shot approach that enables machines to identify characters and predict speaker names based solely on unannotated comic images. Despite their importance in real-world applications, these tasks have largely remained unexplored due to the challenges of story comprehension and multimodal integration. Recent large language models (LLMs) have shown great capability for text understanding and reasoning, while their application to multimodal content analysis remains an open problem. To address this problem, we propose an iterative multimodal framework, the first to employ multimodal information for both character identification and speaker prediction. Our experiments demonstrate the effectiveness of the proposed framework, establishing a robust baseline for these tasks. Furthermore, since our method requires no training data or annotations, it can be applied as-is to any comic series.
URL
https://arxiv.org/abs/2404.13993