Abstract
This survey presents a comprehensive analysis of the phenomenon of hallucination in multimodal large language models (MLLMs), also known as Large Vision-Language Models (LVLMs), which have demonstrated remarkable abilities and rapid progress on multimodal tasks. Despite these promising developments, MLLMs often generate outputs that are inconsistent with the visual content, a challenge known as hallucination, which poses substantial obstacles to their practical deployment and raises concerns about their reliability in real-world applications. This problem has attracted increasing attention, prompting efforts to detect and mitigate such inaccuracies. We review recent advances in identifying, evaluating, and mitigating these hallucinations, offering a detailed overview of the underlying causes, evaluation benchmarks, metrics, and mitigation strategies developed to address this issue. We also analyze current challenges and limitations, formulating open questions that delineate potential pathways for future research. By providing a granular classification and landscape of hallucination causes, evaluation benchmarks, and mitigation methods, this survey aims to deepen the understanding of hallucinations in MLLMs and to inspire further advances in the field. Through this thorough review, we contribute to the ongoing dialogue on enhancing the robustness and reliability of MLLMs, providing useful insights and resources for researchers and practitioners alike. Resources are available at: this https URL.
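As a concrete illustration of the kind of evaluation metric such surveys cover (the abstract itself does not define one), the minimal sketch below computes CHAIR-style object-hallucination scores in the spirit of Rohrbach et al. (2018): the fraction of object mentions in generated captions that are absent from the image's ground-truth annotations (CHAIR_i), and the fraction of captions containing at least one such hallucinated object (CHAIR_s). The function name and data layout here are illustrative assumptions, not code from the surveyed works.

```python
# Hedged sketch of a CHAIR-style object-hallucination metric.
# Assumed (hypothetical) data layout: per-caption lists of mentioned objects
# and per-image sets of ground-truth objects, keyed by a shared id.
from typing import Dict, List, Set


def chair_scores(
    mentioned_objects: Dict[str, List[str]],
    ground_truth_objects: Dict[str, Set[str]],
) -> Dict[str, float]:
    """Return instance-level (CHAIR_i) and sentence-level (CHAIR_s) hallucination rates."""
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0

    for cap_id, objects in mentioned_objects.items():
        gt = ground_truth_objects.get(cap_id, set())
        hallucinated = [obj for obj in objects if obj not in gt]
        total_mentions += len(objects)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1

    n_captions = len(mentioned_objects)
    return {
        # Fraction of mentioned object instances not present in the image.
        "CHAIR_i": hallucinated_mentions / total_mentions if total_mentions else 0.0,
        # Fraction of captions with at least one hallucinated object.
        "CHAIR_s": captions_with_hallucination / n_captions if n_captions else 0.0,
    }


if __name__ == "__main__":
    preds = {"img1": ["dog", "frisbee", "car"], "img2": ["person", "bench"]}
    gts = {"img1": {"dog", "frisbee"}, "img2": {"person", "bench"}}
    print(chair_scores(preds, gts))  # {'CHAIR_i': 0.2, 'CHAIR_s': 0.5}
```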
URL
https://arxiv.org/abs/2404.18930