3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset

2024-04-29 04:01:30
Xinyu Ma, Xuebo Liu, Derek F. Wong, Jun Rao, Bei Li, Liang Ding, Lidia S. Chao, Dacheng Tao, Min Zhang

Abstract

Multimodal machine translation (MMT) is a challenging task that seeks to improve translation quality by incorporating visual information. However, recent studies have indicated that the visual information provided by existing MMT datasets is insufficient, causing models to disregard it and overestimate their capabilities. This issue presents a significant obstacle to the development of MMT research. This paper presents a novel solution to this issue by introducing 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese, each with corresponding images. Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets. We utilize a word sense disambiguation model to select ambiguous data from vision-and-language datasets, resulting in a more challenging dataset. We further benchmark several state-of-the-art MMT models on our proposed dataset. Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets. Our work provides a valuable resource for researchers in the field of multimodal learning and encourages further exploration in this area. The data, code and scripts are freely available at this https URL.
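As a concrete illustration of the selection step described in the abstract, the sketch below filters caption–translation pairs by lexical ambiguity. This is only a minimal sketch under stated assumptions, not the authors' pipeline: it substitutes WordNet polysemy counts (via NLTK) for the paper's word sense disambiguation model, and the `ambiguity_score` helper, `min_score` threshold, and data layout are hypothetical.

```python
# Minimal sketch of ambiguity-aware data selection.
# Assumption: WordNet polysemy counts stand in for the paper's word sense
# disambiguation model; the threshold and data layout are illustrative,
# not taken from the 3AM paper.
import nltk
from nltk.corpus import wordnet as wn

for resource in ("wordnet", "punkt", "averaged_perceptron_tagger"):
    nltk.download(resource, quiet=True)

# Map Penn Treebank tag prefixes to WordNet parts of speech.
POS_MAP = {"NN": wn.NOUN, "VB": wn.VERB, "JJ": wn.ADJ, "RB": wn.ADV}

def ambiguity_score(caption: str) -> int:
    """Count content words that have more than one WordNet sense."""
    tagged = nltk.pos_tag(nltk.word_tokenize(caption.lower()))
    score = 0
    for word, tag in tagged:
        wn_pos = POS_MAP.get(tag[:2])
        if wn_pos and len(wn.synsets(word, pos=wn_pos)) > 1:
            score += 1
    return score

def select_ambiguous(pairs, min_score=2):
    """Keep (English caption, Chinese translation, image path) triples
    whose English side contains at least min_score ambiguous words."""
    return [p for p in pairs if ambiguity_score(p[0]) >= min_score]

# Toy usage: 'bat', 'bank', and 'pitcher' are all polysemous in WordNet;
# this prints the triples whose captions clear the threshold.
corpus = [("A bat rests on the bank near the pitcher.", "...", "img1.jpg"),
          ("Sunset over the ocean.", "...", "img2.jpg")]
print(select_ambiguous(corpus))
```

A real pipeline would score senses in context rather than counting dictionary entries, but the filtering logic, keeping only pairs whose source side is lexically ambiguous, follows the idea the abstract describes.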

URL

https://arxiv.org/abs/2404.18413

PDF

https://arxiv.org/pdf/2404.18413.pdf
