Paper Reading AI Learner

Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation

2023-01-29 22:30:36
Zhiqi Huang, Puxuan Yu, James Allan

Abstract

Benefiting from transformer-based pre-trained language models, neural ranking models have made significant progress. More recently, the advent of multilingual pre-trained language models has greatly facilitated the design of neural cross-lingual retrieval models. However, due to unbalanced pre-training data across languages, multilingual language models already exhibit a performance gap between high- and low-resource languages on many downstream tasks, and cross-lingual retrieval models built on such pre-trained models inherit this language bias, leading to suboptimal results for low-resource languages. Moreover, unlike the English-to-English retrieval task, where large-scale training collections for document ranking such as MS MARCO are available, the lack of cross-lingual retrieval data for low-resource languages makes training cross-lingual retrieval models more challenging. In this work, we propose OPTICAL: Optimal Transport distillation for low-resource Cross-lingual information retrieval. To transfer a model from high- to low-resource languages, OPTICAL frames cross-lingual token alignment as an optimal transport problem and learns from a well-trained monolingual retrieval model. By separating cross-lingual knowledge from the knowledge of query-document matching, OPTICAL needs only bitext data for distillation training, which is more feasible for low-resource languages. Experimental results show that, with minimal training data, OPTICAL significantly outperforms strong baselines on low-resource languages, including neural machine translation.
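The core idea of casting token alignment as an optimal transport problem can be illustrated with entropy-regularized optimal transport (the Sinkhorn-Knopp algorithm), a standard way to compute a soft alignment between two sets of token embeddings. This is a minimal NumPy sketch, not the paper's actual implementation: the cosine cost, uniform marginals, and regularization value are illustrative assumptions.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=100):
    """Entropy-regularized optimal transport via Sinkhorn-Knopp.

    cost:    (n, m) cost matrix between source and target tokens.
    Returns: (n, m) transport plan with (approximately) uniform marginals,
             interpretable as a soft token-alignment matrix.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # uniform marginal over source tokens
    b = np.full(m, 1.0 / m)   # uniform marginal over target tokens
    K = np.exp(-cost / reg)   # Gibbs kernel of the cost matrix
    u = np.ones(n)
    for _ in range(n_iters):  # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Toy example: align 3 source-language token embeddings with 4
# target-language ones (random vectors stand in for model embeddings).
rng = np.random.default_rng(0)
src = rng.normal(size=(3, 8))
tgt = rng.normal(size=(4, 8))

# Cosine distance as the transport cost between token pairs.
src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
cost = 1.0 - src_n @ tgt_n.T

plan = sinkhorn(cost)  # soft alignment: high mass = likely translation pair
```

In a distillation setup like the one the abstract describes, such a plan could be used to transport token-level relevance signals from the monolingual teacher's language to the low-resource query language, requiring only bitext pairs rather than labeled retrieval data.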

URL

https://arxiv.org/abs/2301.12566

PDF

https://arxiv.org/pdf/2301.12566.pdf

