Abstract
Multilingual LLMs have achieved remarkable benchmark performance, but we find that they continue to underperform on non-Latin script languages across contemporary LLM families. This gap arises because LLMs are pretrained on orthographic text, which is dominated by Latin-script characters and obscures the phonology that Latin and non-Latin scripts share. We propose leveraging phonemic transcriptions as complementary signals to induce script-invariant representations. Our study demonstrates that integrating phonemic signals improves performance for both non-Latin and Latin script languages and, in particular, substantially narrows the performance gap between the two. Through detailed experiments, we show that phonemic and orthographic scripts retrieve distinct in-context learning (ICL) examples. This motivates our Mixed-ICL retrieval strategy, which aggregates examples from both views and yields significant performance improvements over randomized ICL retrieval for both Latin script languages (up to 12.6%) and non-Latin script languages (up to 15.1%).
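To make the Mixed-ICL idea concrete, below is a minimal sketch of retrieving ICL examples from both an orthographic and a phonemic view of the data and merging the two ranked lists. The character n-gram TF-IDF similarity, the toy `phonemize` stub, and the even split of the example budget are illustrative assumptions, not the paper's exact method; a real setup would use a grapheme-to-phoneme tool (e.g., Epitran) and the paper's retriever.

```python
# Illustrative sketch of mixed orthographic + phonemic ICL example retrieval.
# Assumptions (not from the paper): character n-gram TF-IDF as the similarity
# model, a toy `phonemize` stub, and interleaved merging of the two ranked lists.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def phonemize(text: str) -> str:
    """Stand-in for a real grapheme-to-phoneme tool; lowercases so the sketch runs."""
    return text.lower()


def top_k(query: str, pool: list[str], k: int) -> list[int]:
    """Indices of the k pool entries most similar to the query."""
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    mat = vec.fit_transform(pool + [query])
    sims = cosine_similarity(mat[len(pool)], mat[: len(pool)]).ravel()
    return sims.argsort()[::-1][:k].tolist()


def mixed_icl_retrieve(query: str, examples: list[str], budget: int = 4) -> list[str]:
    """Retrieve ICL examples under both script views and merge them."""
    ortho_ids = top_k(query, examples, budget)
    phono_ids = top_k(phonemize(query), [phonemize(e) for e in examples], budget)
    # Interleave the two ranked lists, skipping duplicates, until the budget is filled.
    merged, seen = [], set()
    for o, p in zip(ortho_ids, phono_ids):
        for idx in (o, p):
            if idx not in seen and len(merged) < budget:
                seen.add(idx)
                merged.append(idx)
    return [examples[i] for i in merged]


if __name__ == "__main__":
    pool = ["नमस्ते दुनिया", "namaste duniya", "hello world", "bonjour le monde"]
    print(mixed_icl_retrieve("namaste world", pool, budget=2))
```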
URL
https://arxiv.org/abs/2411.02398