Abstract
Large language models (LLMs) have become a pivotal component of contemporary natural language processing and are increasingly applied across a wide range of industries. However, these large-scale probabilistic models cannot yet guarantee the quality required for professional content generation: they often produce hallucinated text, which undermines their practical utility in professional settings. To assess the real-world reliability of LLMs in text generation, numerous benchmarks for evaluating hallucination have been proposed. Owing to cost and time constraints, however, these benchmarks typically rely on constrained generation techniques, such as directed hallucination induction or deliberately perturbing authentic text to produce hallucinations. Such approaches do not reflect the unrestricted text generation demanded by real-world applications. Moreover, a well-established Chinese-language dataset dedicated to evaluating hallucination in text generation is still lacking. We therefore develop an Unconstrained Hallucination Generation Evaluation (UHGEval) benchmark, built from outputs that LLMs produce with minimal restrictions. We also establish a comprehensive benchmark evaluation framework to help subsequent researchers conduct scalable and reproducible experiments. Finally, we carry out extensive experiments evaluating prominent Chinese LLMs and the GPT series models, yielding detailed insights into their performance on the hallucination challenge.
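To make the constrained-versus-unconstrained distinction concrete, below is a minimal, hypothetical sketch in Python. It is not the UHGEval implementation: the function names, prompts, and the `call_llm` stub are illustrative assumptions, standing in for whatever LLM client and annotation pipeline a benchmark actually uses.

```python
# Hypothetical sketch (not the UHGEval codebase) contrasting the two
# evaluation styles described in the abstract. `call_llm` is a placeholder
# for any chat/completions client; prompts and names are illustrative only.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call."""
    return "<model continuation>"


def constrained_example(reference: str) -> str:
    """Constrained setting: hallucinations are induced by construction,
    e.g. by instructing the model to alter a detail of authentic text."""
    prompt = (
        "Rewrite the following sentence, but change one factual detail:\n"
        f"{reference}"
    )
    return call_llm(prompt)


def unconstrained_example(news_beginning: str) -> str:
    """Unconstrained setting: the model freely continues real text with
    minimal restrictions; its output is then checked against references."""
    prompt = f"Continue the following news article:\n{news_beginning}"
    return call_llm(prompt)


if __name__ == "__main__":
    ref = "The company reported revenue of 4.2 billion yuan in Q3."
    print(constrained_example(ref))
    print(unconstrained_example("On Tuesday, the ministry announced that ..."))
```

The design point is that the unconstrained setting observes hallucinations as they naturally arise in free generation, rather than manufacturing them, at the cost of more expensive downstream verification.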
URL
https://arxiv.org/abs/2311.15296