Abstract
This study evaluates the machine translation (MT) quality of two state-of-the-art large language models (LLMs) against a traditional neural machine translation (NMT) system across four language pairs in the legal domain. It combines automatic evaluation metrics (AEMs) and human evaluation (HE) by professional translators to assess translation ranking, fluency and adequacy. The results indicate that while Google Translate generally outperforms LLMs in AEMs, human evaluators rate LLMs, especially GPT-4, comparably or slightly better in terms of producing contextually adequate and fluent translations. This discrepancy suggests LLMs' potential in handling specialized legal terminology and context, highlighting the importance of human evaluation methods in assessing MT quality. The study underscores the evolving capabilities of LLMs in specialized domains and calls for a reevaluation of traditional AEMs to better capture the nuances of LLM-generated translations.
URL
https://arxiv.org/abs/2402.07681