Paper Reading AI Learner

A Multimodal Approach for Cross-Domain Image Retrieval

2024-03-22 12:08:16
Lucas Iijima, Tania Stathaki

Abstract

Image generators are rapidly gaining popularity and have changed how digital content is created. With the latest AI technology, millions of high-quality images are being generated by the public, constantly motivating the research community to push the limits of generative models toward more complex and realistic images. This paper focuses on Cross-Domain Image Retrieval (CDIR), which can serve as an additional tool for inspecting collections of generated images by determining the level of similarity between images in a dataset. An ideal retrieval system should generalize to unseen, complex images from multiple domains (e.g., photos, drawings, and paintings). To address this goal, we propose a novel caption-matching approach that leverages multimodal language-vision architectures pre-trained on large datasets. The method is tested on the DomainNet and Office-Home datasets and consistently achieves state-of-the-art performance over the latest cross-domain image retrieval approaches in the literature. To verify its effectiveness on AI-generated images, the method was also tested on a database composed of samples collected from Midjourney, a widely used generative platform for content creation.
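The abstract does not detail the caption-matching pipeline, but the core idea — describe each image with text and rank database images by how well their captions match the query's — can be sketched in a toy form. The sketch below is an illustration only, not the paper's implementation: it uses a bag-of-words embedding as a stand-in for a pretrained language-vision text encoder, and the captions are assumed to come from an image-captioning model applied offline to each image.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words embedding; a real system would use the text
    # encoder of a pretrained language-vision model (assumption).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_caption, database_captions, k=2):
    """Rank database images by caption similarity to the query caption."""
    q = embed(query_caption)
    scored = [(cosine(q, embed(c)), i) for i, c in enumerate(database_captions)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Hypothetical captions, one per database image (indices 0-2).
db = [
    "an oil painting of mountains at sunset",
    "a sketch of a cat by a window",
    "a photo of a dog playing with a ball",
]

# A drawing-domain query still retrieves the matching photo first,
# because matching happens in text space, not pixel space.
print(retrieve("a drawing of a dog", db, k=2))  # → [2, 1]
```

Matching in caption space is what gives the approach its cross-domain character: a sketch and a photo of the same object produce similar descriptions even though their pixel statistics differ.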


URL

https://arxiv.org/abs/2403.15152

PDF

https://arxiv.org/pdf/2403.15152.pdf
