Paper Reading AI Learner

Zero-Shot Character Identification and Speaker Prediction in Comics via Iterative Multimodal Fusion

2024-04-22 08:59:35
Yingxuan Li, Ryota Hinami, Kiyoharu Aizawa, Yusuke Matsui

Abstract

Recognizing characters and predicting the speakers of dialogue are critical for comic processing tasks such as voice generation and translation. However, because characters vary across comic titles, supervised learning approaches such as training character classifiers, which require annotations specific to each title, are infeasible. This motivates us to propose a novel zero-shot approach that allows machines to identify characters and predict speaker names based solely on unannotated comic images. Despite their importance in real-world applications, these tasks have largely remained unexplored due to the challenges of story comprehension and multimodal integration. Recent large language models (LLMs) have shown great capability in text understanding and reasoning, but their application to multimodal content analysis is still an open problem. To address this, we propose an iterative multimodal framework, the first to employ multimodal information for both the character identification and speaker prediction tasks. Our experiments demonstrate the effectiveness of the proposed framework, establishing a robust baseline for these tasks. Furthermore, since our method requires no training data or annotations, it can be applied as-is to any comic series.
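The abstract stays at a high level, but the core idea, alternating between image-based character identification and text-based, LLM-style speaker prediction so that each modality refines the other, can be sketched as a simple loop. The Python sketch below is purely illustrative: the function names (classify_characters, predict_speakers_with_llm), data structures, stub logic, and fixed iteration count are assumptions for illustration, not the authors' implementation.

```python
"""Minimal sketch of an iterative multimodal fusion loop (illustrative only).

All names and stub logic here are assumptions; the paper's actual pipeline
(detection models, LLM prompts, matching strategy) is not reproduced.
"""


def classify_characters(character_crops, name_hints):
    """Image side: assign a name to each detected character region.

    Stand-in for a zero-shot visual matcher; here it simply echoes the
    current hint for each crop, or "unknown" when no hint exists yet.
    """
    return [name_hints.get(i, "unknown") for i in range(len(character_crops))]


def predict_speakers_with_llm(dialogue_lines, character_labels):
    """Text side: given dialogue and current character labels, name the
    speaker of each line.

    Stand-in for an LLM reasoning step; here it naively attributes line i
    to the i-th labelled character.
    """
    if not character_labels:
        return ["unknown"] * len(dialogue_lines)
    return [character_labels[i % len(character_labels)]
            for i in range(len(dialogue_lines))]


def iterative_fusion(character_crops, dialogue_lines, num_iters=3):
    """Alternate between the two modalities so each refines the other."""
    name_hints, labels, speakers = {}, [], []
    for _ in range(num_iters):
        labels = classify_characters(character_crops, name_hints)
        speakers = predict_speakers_with_llm(dialogue_lines, labels)
        # Feed text-derived speaker names back as hints for the image side.
        name_hints = {i: name
                      for i, name in enumerate(speakers[:len(character_crops)])}
    return labels, speakers


if __name__ == "__main__":
    crops = ["crop_0", "crop_1"]          # detected character regions (stubs)
    lines = ["Hello!", "Hi, long time."]  # transcribed dialogue (stubs)
    print(iterative_fusion(crops, lines))
```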

URL

https://arxiv.org/abs/2404.13993

PDF

https://arxiv.org/pdf/2404.13993.pdf

