2M-NER: Contrastive Learning for Multilingual and Multimodal NER with Language and Modal Fusion

2024-04-26 02:34:31
Dongsheng Wang, Xiaoqin Feng, Zeming Liu, Chuan Wang

Abstract

Named entity recognition (NER) is a fundamental task in natural language processing that involves identifying entities in sentences and classifying them into pre-defined types. It plays a crucial role in various research fields, including entity linking, question answering, and online product recommendation. Recent studies have shown that incorporating multilingual and multimodal datasets can enhance the effectiveness of NER. This is due to language transfer learning and the presence of shared implicit features across different modalities. However, the lack of a dataset that combines multilingualism and multimodality has hindered research on combining these two aspects, even though multimodality can help NER in multiple languages simultaneously. In this paper, we aim to address a more challenging task: multilingual and multimodal named entity recognition (MMNER), considering its potential value and influence. Specifically, we construct a large-scale MMNER dataset with four languages (English, French, German and Spanish) and two modalities (text and image). To tackle this challenging MMNER task on the dataset, we introduce a new model called 2M-NER, which aligns the text and image representations using contrastive learning and integrates a multimodal collaboration module to effectively depict the interactions between the two modalities. Extensive experimental results demonstrate that our model achieves the highest F1 score in multilingual and multimodal NER tasks compared with representative baselines. Additionally, in a challenging analysis, we discovered that sentence-level alignment substantially interferes with NER models, indicating the higher level of difficulty in our dataset.
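
The abstract says 2M-NER aligns text and image representations using contrastive learning, but it does not give the loss itself. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE-style loss of the kind commonly used for text-image alignment. The function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all assumptions made for this example, not details taken from the paper.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cross_entropy_diag(logits):
    # Mean cross-entropy where the correct "class" for row i is column i,
    # i.e. the i-th text is paired with the i-th image.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_alignment_loss(text_emb, image_emb, temperature=0.07):
    # Hypothetical helper: symmetric InfoNCE over a batch of paired embeddings.
    # temperature=0.07 is an assumed default, not a value from the paper.
    t = [l2_normalize(v) for v in text_emb]
    g = [l2_normalize(v) for v in image_emb]
    # logits[i][j] = scaled cosine similarity between text i and image j
    logits = [[sum(a * b for a, b in zip(ti, gj)) / temperature for gj in g]
              for ti in t]
    logits_t = [list(col) for col in zip(*logits)]  # image-to-text direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch: two text embeddings and two image embeddings (made-up values).
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_alignment_loss(texts, [[0.9, 0.1], [0.1, 0.9]])
shuffled = contrastive_alignment_loss(texts, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

In this toy batch the loss is lower when each text embedding points in roughly the same direction as its paired image embedding than when the pairs are swapped, which is the behaviour an alignment objective of this kind is meant to reward.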

URL

https://arxiv.org/abs/2404.17122

PDF

https://arxiv.org/pdf/2404.17122.pdf
