Abstract
Large language models (LLMs) have shown superior capabilities in translating figurative language compared to neural machine translation (NMT) systems. However, the impact of different prompting methods and LLM-NMT combinations on idiom translation has yet to be thoroughly investigated. This paper introduces two parallel datasets of sentences containing idiomatic expressions, one for Persian$\rightarrow$English and one for English$\rightarrow$Persian translation, with Persian idioms sampled from our PersianIdioms resource, a collection of 2,200 idioms and their meanings. Using these datasets, we evaluate various open- and closed-source LLMs, NMT models, and their combinations. Translation quality is assessed through idiom translation accuracy and fluency. We also find that automatic evaluation methods such as LLM-as-a-judge, BLEU, and BERTScore are effective for comparing different aspects of model performance. Our experiments reveal that Claude-3.5-Sonnet delivers outstanding results in both translation directions. For English$\rightarrow$Persian, combining weaker LLMs with Google Translate improves results, while Persian$\rightarrow$English translation benefits from single prompts for simpler models and complex prompts for advanced ones.
URL
https://arxiv.org/abs/2412.09993