Abstract
Image generators are rapidly gaining popularity and have changed how digital content is created. With the latest AI technology, the public generates millions of high-quality images, which continually motivates the research community to push the limits of generative models toward more complex and realistic images. This paper focuses on Cross-Domain Image Retrieval (CDIR), which can serve as an additional tool for inspecting collections of generated images by measuring the similarity between images in a dataset. An ideal retrieval system should generalize to unseen complex images from multiple domains (e.g., photos, drawings, and paintings). To address this goal, we propose a novel caption-matching approach that leverages multimodal language-vision architectures pre-trained on large datasets. The method is tested on the DomainNet and Office-Home datasets and consistently achieves state-of-the-art performance over the latest cross-domain image retrieval approaches in the literature. To verify its effectiveness on AI-generated images, the method was also tested on a database composed of samples collected from Midjourney, a widely used generative platform for content creation.
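The caption-matching idea described above can be illustrated with a minimal sketch. It assumes captions have already been produced for each image by a pre-trained vision-language model; retrieval then ranks database images by the similarity of their captions to the query image's caption. This toy version uses bag-of-words cosine similarity in place of the paper's pre-trained language-vision embeddings, and the image IDs and captions are hypothetical:

```python
from collections import Counter
from math import sqrt

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two captions."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_caption: str, db: list[tuple[str, str]]) -> list[str]:
    """db: (image_id, caption) pairs; returns IDs ranked by caption similarity."""
    ranked = sorted(db, key=lambda item: bow_cosine(query_caption, item[1]),
                    reverse=True)
    return [image_id for image_id, _ in ranked]

# Hypothetical cross-domain database: captions abstract away the visual domain,
# so a sketch of a dog can match a photo of a dog.
db = [("photo_dog", "a dog running on grass"),
      ("photo_car", "a red car parked on a street"),
      ("painting_cat", "a painting of a cat on a sofa")]
print(retrieve("a sketch of a dog on grass", db)[0])  # photo_dog ranks first
```

Because captions describe content rather than rendering style, matching in caption space is one way a retrieval system can bridge domains such as photos, drawings, and paintings.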
URL
https://arxiv.org/abs/2403.15152