Abstract
Hallucinations in large language models remain a persistent challenge, particularly in multilingual and generative settings where factual consistency is difficult to maintain. While recent models show strong performance on English-centric benchmarks, their behavior across languages, tasks, and hallucination types is not yet well understood. In this work, we introduce Halluverse-M^3, a dataset designed to enable systematic analysis of hallucinations across multiple languages, multiple generation tasks, and multiple hallucination categories. Halluverse-M^3 covers four languages (English, Arabic, Hindi, and Turkish) and supports two generation tasks: question answering and dialogue summarization. The dataset explicitly distinguishes between entity-level, relation-level, and sentence-level hallucinations. Hallucinated outputs are constructed through a controlled editing process and validated by human annotators, ensuring clear alignment between original content and hallucinated generations. Using this dataset, we evaluate a diverse set of contemporary open-source and proprietary language models on fine-grained hallucination detection. Our results show that question answering is consistently easier than dialogue summarization, while sentence-level hallucinations remain challenging even for the strongest models. Performance is highest in English and degrades in lower-resource languages, with Hindi exhibiting the lowest detection accuracy. Overall, Halluverse-M^3 provides a realistic and challenging benchmark for studying hallucinations in multilingual, multi-task settings. We release the dataset to support future research on hallucination detection and mitigation\footnote{this https URL}.
URL
https://arxiv.org/abs/2602.06920