Abstract
Large language models (LLMs) have achieved impressive proficiency in basic arithmetic, rivaling human-level performance on standard numerical tasks. However, little attention has been given to how these models perform when numerical expressions deviate from the prevailing conventions present in their training corpora. In this work, we investigate numerical reasoning across a wide range of numeral scripts and formats. We show that LLM accuracy drops substantially when numerical inputs are rendered in underrepresented scripts or formats, despite the underlying mathematical task being identical. We further demonstrate that targeted prompting strategies, such as few-shot prompting and explicit numeral mapping, can greatly narrow this gap. Our findings highlight an overlooked challenge in multilingual numerical reasoning and provide actionable insights for working with LLMs to reliably interpret, manipulate, and generate numbers across diverse numeral scripts and formatting styles.
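The abstract does not specify how "explicit numeral mapping" is implemented; a minimal sketch of one plausible preprocessing step, converting any Unicode decimal digit (e.g. Eastern Arabic or Devanagari numerals) to its ASCII equivalent before the prompt reaches the model, might look like the following (the function name `normalize_numerals` is illustrative, not from the paper):

```python
import unicodedata

def normalize_numerals(text: str) -> str:
    """Replace each Unicode decimal digit with its ASCII 0-9 equivalent,
    leaving all other characters unchanged."""
    out = []
    for ch in text:
        if ch.isdigit():
            try:
                # unicodedata.digit returns the numeric value (0-9)
                # for any character that carries a decimal digit value.
                out.append(str(unicodedata.digit(ch)))
                continue
            except ValueError:
                pass  # digit-like character without a plain digit value
        out.append(ch)
    return "".join(out)

# Eastern Arabic "٤٢ + ١٠" becomes the ASCII expression "42 + 10"
print(normalize_numerals("٤٢ + ١٠"))
```

Such a mapping keeps the surrounding text intact while moving the numerals into the script the model has seen most often, which is one hedged interpretation of how the gap described above could be narrowed.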
URL
https://arxiv.org/abs/2601.15251