EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA

2025-10-07 18:37:32
Firoj Alam, Ali Ezzat Shahroor, Md. Arid Hasan, Zien Sheikh Ali, Hunzalah Hassan Bhatti, Mohamed Bayan Kmainasi, Shammur Absar Chowdhury, Basel Mousi, Fahim Dalvi, Nadir Durrani, Natasa Milic-Frayling

Abstract

Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. To bridge this gap, we introduce Everyday Multimodal and Multilingual QA (EverydayMMQA), a framework for creating large-scale, culturally grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed OASIS, a multimodal dataset integrating speech, images, and text. With roughly 0.92M images and 14.8M QA pairs, of which 3.7M are spoken questions, OASIS enables four distinct input combinations: speech-only, text-only, speech+image, and text+image. Focused on English and Arabic varieties across 18 countries, the dataset content is curated to reflect diverse, real-world situations. OASIS tests models on tasks beyond object recognition that involve pragmatic, commonsense, and culturally aware reasoning. We benchmarked four closed-source models, three open-source models, and one fine-tuned model. Together, EverydayMMQA and OASIS provide a benchmark and training dataset for building multimodal LLMs that handle a comprehensive set of everyday tasks within cultural contexts. The framework and dataset will be made publicly available to the community.
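The abstract describes the dataset rather than an API, so the sketch below is purely illustrative: a minimal, hypothetical Python schema for one OASIS-style QA record, plus a helper that enumerates the four input combinations named above (speech-only, text-only, speech+image, text+image). All field and function names here are assumptions, not the authors' released format.

from dataclasses import dataclass
from typing import Optional

@dataclass
class OasisRecord:
    # Hypothetical schema for a single OASIS QA pair; field names are assumed.
    question_text: str                # question in English or an Arabic variety
    answer: str                       # reference answer
    language: str                     # e.g. "en" or an Arabic-variety code
    country: str                      # one of the 18 covered countries
    image_path: Optional[str] = None  # set for image-grounded questions
    audio_path: Optional[str] = None  # set for the 3.7M spoken questions

def input_combinations(rec: OasisRecord) -> list[dict]:
    # Enumerate the four evaluation settings the abstract describes.
    combos = [{"question": rec.question_text}]        # text-only
    if rec.audio_path:
        combos.append({"audio": rec.audio_path})      # speech-only
    if rec.image_path:
        combos.append({"question": rec.question_text,
                       "image": rec.image_path})      # text+image
        if rec.audio_path:
            combos.append({"audio": rec.audio_path,
                           "image": rec.image_path})  # speech+image
    return combos

Under this toy representation, the same QA pair can serve several modality conditions at once, which is one plausible way a corpus of 14.8M pairs yields all four input settings without duplicating annotations.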

URL

https://arxiv.org/abs/2510.06371

PDF

https://arxiv.org/pdf/2510.06371.pdf

