Abstract
Inspired by the increasing interest in leveraging large language models (LLMs) for translation, this paper evaluates the capabilities of LLMs, represented by ChatGPT, in comparison with mainstream neural machine translation (NMT) engines in translating Chinese diplomatic texts into English. Specifically, we examine the translation quality of ChatGPT and NMT engines as measured by four automated metrics and by human evaluation based on an error typology and six analytic rubrics. Our findings show that the automated metrics yield similar results for ChatGPT under different prompts and for the NMT systems, whereas human annotators tend to assign noticeably higher scores to ChatGPT when it is provided with an example or with contextual information about the translation task. Pairwise correlations between the automated metrics and the dimensions of human evaluation are weak and non-significant, suggesting a divergence between the two methods of translation quality assessment. These findings provide valuable insights into the potential of ChatGPT as a capable machine translator and into the influence of prompt engineering on its performance.
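
The sketch below illustrates the kind of evaluation pipeline the abstract describes: scoring candidate translations with automated MT metrics and correlating segment-level metric scores with human ratings. It is a minimal illustration, not the paper's code; BLEU, chrF, and TER (via the sacrebleu library) stand in for the paper's four metrics, SciPy's Spearman correlation stands in for its pairwise correlation analysis, and all data shown are hypothetical.

# A minimal sketch (not the paper's released code): score MT output with
# automated metrics, then correlate segment scores with human ratings.
# Assumes sacrebleu and SciPy are installed; BLEU/chrF/TER are stand-ins
# for the paper's four metrics, and the data below are hypothetical.
import sacrebleu
from scipy.stats import spearmanr

# Hypothetical system outputs, reference translations, and human scores
# (e.g., averaged analytic-rubric ratings on a 5-point scale).
hypotheses = [
    "Both sides agreed to strengthen bilateral cooperation.",
    "China firmly opposes any form of interference.",
    "The spokesperson reiterated the one-China principle.",
]
references = [
    "The two sides agreed to deepen bilateral cooperation.",
    "China firmly opposes interference in any form.",
    "The spokesperson reaffirmed the one-China principle.",
]
human_scores = [4.5, 3.8, 4.2]

# Corpus-level automated metrics (higher BLEU/chrF is better; lower TER is better).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
ter = sacrebleu.corpus_ter(hypotheses, [references])
print(f"BLEU={bleu.score:.1f}  chrF={chrf.score:.1f}  TER={ter.score:.1f}")

# Segment-level BLEU paired with human ratings for a rank correlation.
# A real study would use many segments; three points are illustration only.
seg_bleu = [sacrebleu.sentence_bleu(h, [r]).score
            for h, r in zip(hypotheses, references)]
rho, p = spearmanr(seg_bleu, human_scores)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")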
URL
https://arxiv.org/abs/2401.05176