Paper Reading AI Learner

Prompting with Phonemes: Enhancing LLM Multilinguality for non-Latin Script Languages

2024-11-04 18:59:51
Hoang Nguyen, Khyati Mahajan, Vikas Yadav, Philip S. Yu, Masoud Hashemi, Rishabh Maheshwary

Abstract

Multilingual LLMs have achieved remarkable benchmark performance, but we find they continue to underperform on non-Latin script languages across contemporary LLM families. This discrepancy arises from the fact that LLMs are pretrained with orthographic scripts, which are dominated by Latin characters that obscure their shared phonology with non-Latin scripts. We propose leveraging phonemic transcriptions as complementary signals to induce script-invariant representations. Our study demonstrates that integrating phonemic signals improves performance across both non-Latin and Latin script languages, with a particularly significant impact on closing the performance gap between the two. Through detailed experiments, we show that phonemic and orthographic scripts retrieve distinct examples for in-context learning (ICL). This motivates our proposed Mixed-ICL retrieval strategy, where further aggregation leads to significant performance improvements for both Latin script languages (up to 12.6%) and non-Latin script languages (up to 15.1%) compared to randomized ICL retrieval.
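The mixed retrieval idea from the abstract can be pictured with a minimal sketch: retrieve one candidate list by similarity over the orthographic text, another over a phonemic transcription of the same text, and aggregate the two before building the prompt. The sketch below is not the paper's implementation; the character n-gram cosine similarity is a stand-in for whatever retriever is used, and `to_phonemes` is a hypothetical hook where a grapheme-to-phoneme tool would plug in.

```python
# Illustrative sketch of mixed orthographic + phonemic ICL retrieval.
# The similarity function and `to_phonemes` hook are assumptions for the
# sketch, not the paper's actual retriever or G2P pipeline.
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Bag of character n-grams used as a cheap text representation."""
    text = f" {text} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bags of n-grams."""
    dot = sum(v * b[k] for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def to_phonemes(text: str) -> str:
    """Hypothetical G2P hook: in practice a grapheme-to-phoneme tool would
    map orthographic text to a phonemic (e.g. IPA) transcription here."""
    return text  # identity placeholder so the sketch runs end to end


def top_k(query: str, pool: list[str], k: int, transform=lambda s: s) -> list[str]:
    """Rank pool examples by similarity to the query under a transform
    (identity for orthographic retrieval, G2P for phonemic retrieval)."""
    q = char_ngrams(transform(query))
    ranked = sorted(pool, key=lambda ex: cosine(q, char_ngrams(transform(ex))), reverse=True)
    return ranked[:k]


def mixed_icl_retrieve(query: str, pool: list[str], k: int = 4) -> list[str]:
    """Aggregate orthographic and phonemic retrievals, dropping duplicates,
    to form the in-context examples for the prompt."""
    ortho = top_k(query, pool, k)
    phono = top_k(query, pool, k, transform=to_phonemes)
    merged, seen = [], set()
    for ex in ortho + phono:
        if ex not in seen:
            seen.add(ex)
            merged.append(ex)
    return merged[:k]


if __name__ == "__main__":
    pool = ["नमस्ते, आप कैसे हैं?", "धन्यवाद", "शुभ रात्रि", "फिर मिलेंगे"]
    print(mixed_icl_retrieve("नमस्ते", pool, k=2))
```

Union-then-truncate is only one way to aggregate the two candidate lists; interleaving or similarity-weighted merging are natural alternatives under the same general scheme.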


URL

https://arxiv.org/abs/2411.02398

PDF

https://arxiv.org/pdf/2411.02398.pdf

