Abstract
Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. To bridge this gap, we introduce Everyday Multimodal and Multilingual QA (EverydayMMQA), a framework for creating large-scale, culturally grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed OASIS, a multimodal dataset integrating speech, images, and text. With approximately 0.92M images and 14.8M QA pairs, of which 3.7M are spoken questions, OASIS enables four unique input combinations: speech-only, text-only, speech+image, and text+image. Focused on English and Arabic varieties across 18 countries, the dataset is curated to reflect diverse, real-world situations. OASIS tests models on tasks beyond object recognition that require pragmatic, commonsense, and culturally aware reasoning. We benchmarked four closed-source models, three open-source models, and one fine-tuned model. Together, EverydayMMQA and OASIS provide a benchmark and training dataset for building multimodal LLMs that handle a comprehensive set of everyday tasks within cultural contexts. The framework and dataset will be made publicly available to the community.
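To make the four input combinations concrete, the sketch below shows one plausible way to represent an OASIS-style example and derive its input configuration. This is a minimal illustration only: the class and field names (SVQAExample, question_text, question_audio, image, answer, country) are assumptions for exposition and are not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record layout for one SVQA example; field names are
# illustrative assumptions, not the published OASIS schema.
@dataclass
class SVQAExample:
    question_text: Optional[str]   # written question (English or an Arabic variety)
    question_audio: Optional[str]  # path to the spoken question audio, if any
    image: Optional[str]           # path to the associated image, if any
    answer: str                    # reference answer
    country: str                   # one of the 18 covered countries


def input_combination(ex: SVQAExample) -> str:
    """Map an example to one of the four input combinations named in the abstract."""
    has_speech = ex.question_audio is not None
    has_image = ex.image is not None
    if has_speech and has_image:
        return "speech+image"
    if has_speech:
        return "speech-only"
    if has_image:
        return "text+image"
    return "text-only"


# Example usage with a made-up record:
ex = SVQAExample(question_text=None, question_audio="q_001.wav",
                 image="img_001.jpg", answer="kunafa", country="Palestine")
print(input_combination(ex))  # -> "speech+image"
```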
URL
https://arxiv.org/abs/2510.06371