Paper Reading AI Learner

Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method

2025-05-20 10:55:26
Xinshen Zhang, Zhen Ye, Xu Zheng

Abstract

Omnidirectional images (ODIs), with their 360° field of view, provide unparalleled spatial awareness for immersive applications such as augmented reality and embodied AI. However, the capability of existing multi-modal large language models (MLLMs) to comprehend and reason about such panoramic scenes remains underexplored. This paper addresses the gap by introducing OmniVQA, the first dataset and benchmark for omnidirectional visual question answering. Our evaluation of state-of-the-art MLLMs reveals significant limitations in omnidirectional visual question answering, highlighting persistent challenges in object localization, feature extraction, and hallucination suppression within panoramic contexts. These results underscore the disconnect between current MLLM capabilities and the demands of omnidirectional visual understanding, calling for dedicated architectural or training innovations tailored to 360° imagery. Building on the OmniVQA dataset and benchmark, we further introduce 360-R1, a rule-based reinforcement learning method built on Qwen2.5-VL-Instruct. Concretely, we modify group relative policy optimization (GRPO) with three novel reward functions: (1) a reasoning-process similarity reward, (2) an answer semantic-accuracy reward, and (3) a structured-format compliance reward. Extensive experiments on OmniVQA demonstrate the superiority of our method in omnidirectional space (+6% improvement).
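The abstract names three rule-based reward terms but does not publish their exact formulas. The sketch below is a hypothetical illustration of how such a combined GRPO-style reward could be scored: the tag pattern, the string-similarity placeholders (a real system would likely use embedding similarity), and all function names and weights are assumptions, not the paper's implementation.

```python
import re
from difflib import SequenceMatcher

# Hypothetical sketch of the three reward terms; the paper's actual
# reward functions and weights are not specified in the abstract.

def format_reward(output: str) -> float:
    """Structured-format compliance: 1.0 if the output wraps its
    reasoning and answer in <think>/<answer> tags, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, output.strip(), re.DOTALL) else 0.0

def answer_reward(output: str, reference: str) -> float:
    """Answer semantic accuracy; SequenceMatcher stands in for a
    semantic-similarity model here."""
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    if not m:
        return 0.0
    return SequenceMatcher(None, m.group(1).strip().lower(),
                           reference.strip().lower()).ratio()

def reasoning_reward(output: str, reference_trace: str) -> float:
    """Reasoning-process similarity against a reference reasoning trace."""
    m = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    if not m:
        return 0.0
    return SequenceMatcher(None, m.group(1).strip().lower(),
                           reference_trace.strip().lower()).ratio()

def total_reward(output: str, reference_answer: str, reference_trace: str,
                 weights=(1.0, 1.0, 0.5)) -> float:
    """Weighted sum of the three terms; weights are illustrative."""
    w_ans, w_fmt, w_reason = weights
    return (w_ans * answer_reward(output, reference_answer)
            + w_fmt * format_reward(output)
            + w_reason * reasoning_reward(output, reference_trace))

sample = ("<think>The statue is left of the fountain.</think>"
          "<answer>left of the fountain</answer>")
print(total_reward(sample, "left of the fountain",
                   "The statue is left of the fountain."))  # → 2.5
```

In GRPO, such a scalar reward would be computed per sampled completion and normalized within each group to form the advantage; the sketch covers only the scoring step.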

URL

https://arxiv.org/abs/2505.14197

PDF

https://arxiv.org/pdf/2505.14197.pdf

