Abstract
Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with the fine-grained detail that would allow models to learn richer associations. To fill this gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated, and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and qualitative analyses, we demonstrate that DOCCI serves as an effective training resource for image-to-text generation: a PaLI 5B model finetuned on DOCCI shows equal or superior results compared to highly performant larger models like LLaVA-1.5 7B and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for text-to-image generation, highlighting the limitations of current text-to-image models in capturing long descriptions and fine details.
URL
https://arxiv.org/abs/2404.19753