Abstract
Named entity recognition (NER) is a fundamental task in natural language processing that identifies entities in sentences and classifies them into predefined types. It plays a crucial role in many areas, including entity linking, question answering, and online product recommendation. Recent studies have shown that multilingual and multimodal datasets can each improve NER, through cross-lingual transfer learning and through implicit features shared across modalities, respectively. However, no existing dataset combines multilingualism and multimodality, which has hindered research on their combination, even though multimodal information can help NER in multiple languages simultaneously. In this paper, we address this more challenging task, multilingual and multimodal named entity recognition (MMNER), given its potential value and influence. Specifically, we construct a large-scale MMNER dataset covering four languages (English, French, German, and Spanish) and two modalities (text and image). To tackle the MMNER task on this dataset, we introduce a new model, 2M-NER, which aligns text and image representations using contrastive learning and integrates a multimodal collaboration module to model the interactions between the two modalities. Extensive experiments show that our model achieves the highest F1 score on multilingual and multimodal NER tasks compared with representative baselines. Additionally, in a challenging analysis, we find that sentence-level alignment substantially interferes with NER models, which indicates the higher difficulty of our dataset.
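The abstract says that 2M-NER aligns text and image representations with contrastive learning but does not state the exact objective. As an illustration only, the sketch below assumes a symmetric InfoNCE-style loss over a batch of paired embeddings, a common choice for text-image alignment. The function names, the temperature value of 0.07, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Assumed InfoNCE-style loss: the k-th text and k-th image form the
    positive pair; every other pairing in the batch is a negative."""
    text = [l2_normalize(v) for v in text_embs]
    image = [l2_normalize(v) for v in image_embs]
    n = len(text)

    # Cosine similarity of every text with every image, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(t, i)) / temperature for i in image]
        for t in text
    ]

    # Text-to-image direction: each row should peak on its diagonal entry.
    t2i = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Image-to-text direction: same computation on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    i2t = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return (t2i + i2t) / 2


# Toy usage: matched pairs should score a lower loss than swapped pairs.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(texts, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss(texts, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this kind pulls each sentence embedding toward its paired image embedding and pushes it away from the other images in the batch; how 2M-NER combines such a term with its NER objective is not described in the abstract.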
URL
https://arxiv.org/abs/2404.17122