Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

Abstract

Weird, unusual, and uncanny images pique the curiosity of observers because they challenge common sense. For example, an image released during the 2022 World Cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconventional images, but can AI models do the same? We introduce WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset consists of purposefully commonsense-defying images created by designers using publicly available image generation tools like Midjourney. We consider several tasks posed over the dataset. In addition to image captioning, cross-modal matching, and visual question answering, we introduce a difficult explanation generation task, where models must identify and explain why a given image is unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!. We hope our dataset will inspire the development of AI models with stronger visual commonsense reasoning abilities. Data, models, and code are available at the project website.

URL

https://arxiv.org/abs/2303.07274

PDF

https://arxiv.org/pdf/2303.07274.pdf