Abstract
Omnidirectional images (ODIs), with their 360° field of view, provide unparalleled spatial awareness for immersive applications like augmented reality and embodied AI. However, the capability of existing multi-modal large language models (MLLMs) to comprehend and reason about such panoramic scenes remains underexplored. This paper addresses this gap by introducing OmniVQA, the first dataset and benchmark for omnidirectional visual question answering. Our evaluation of state-of-the-art MLLMs reveals significant limitations in handling omnidirectional visual question answering, highlighting persistent challenges in object localization, feature extraction, and hallucination suppression within panoramic contexts. These results underscore the disconnect between current MLLM capabilities and the demands of omnidirectional visual understanding, calling for dedicated architectural or training innovations tailored to 360° imagery. Building on the OmniVQA dataset and benchmark, we further introduce 360-R1, a rule-based reinforcement learning method built on Qwen2.5-VL-Instruct. Concretely, we modify group relative policy optimization (GRPO) by proposing three novel reward functions: (1) a reasoning process similarity reward, (2) an answer semantic accuracy reward, and (3) a structured format compliance reward. Extensive experiments on OmniVQA demonstrate the superiority of the proposed method in omnidirectional space (+6% improvement).
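The abstract only names the three reward terms without defining them. Below is a minimal, hypothetical Python sketch of how such a composite reward could be scored per completion before the GRPO update; the <think>/<answer> output template, the string-similarity and exact-match scorers, and the equal weights are all assumptions for illustration, not the paper's actual definitions.

```python
# Hypothetical sketch of a 360-R1-style composite reward (assumptions noted inline).
import re
from difflib import SequenceMatcher


def format_reward(completion: str) -> float:
    """Structured format compliance: 1.0 if the completion follows an assumed
    <think>...</think><answer>...</answer> template, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0


def reasoning_similarity_reward(completion: str, reference_reasoning: str) -> float:
    """Reasoning process similarity: approximated here by a character-level
    ratio between the model's <think> span and a reference rationale."""
    match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return SequenceMatcher(None, match.group(1).strip(), reference_reasoning.strip()).ratio()


def answer_accuracy_reward(completion: str, reference_answer: str) -> float:
    """Answer semantic accuracy: stand-in exact-match check; the paper's version
    presumably scores semantic agreement rather than literal string equality."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == reference_answer.strip().lower() else 0.0


def total_reward(completion: str, reference_reasoning: str, reference_answer: str,
                 weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three reward terms, one scalar per sampled completion,
    which a GRPO-style optimizer would then normalize within each group."""
    w_sim, w_acc, w_fmt = weights
    return (w_sim * reasoning_similarity_reward(completion, reference_reasoning)
            + w_acc * answer_accuracy_reward(completion, reference_answer)
            + w_fmt * format_reward(completion))


if __name__ == "__main__":
    sample = ("<think>The stop sign appears behind the camera, near the crosswalk.</think>\n"
              "<answer>behind the camera</answer>")
    print(total_reward(sample,
                       "The sign is behind the camera near the crosswalk.",
                       "behind the camera"))
```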
URL
https://arxiv.org/abs/2505.14197