Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model

2024-05-16 17:59:21
Zheng Gu, Shiyuan Yang, Jing Liao, Jing Huo, Yang Gao

Abstract

Visual In-Context Learning (ICL) has emerged as a promising research area due to its capability to accomplish various tasks with limited example pairs through analogical reasoning. However, training-based visual ICL has limitations in its ability to generalize to unseen tasks and requires the collection of a diverse task dataset. On the other hand, existing methods in the inference-based visual ICL category rely solely on textual prompts, which fail to capture fine-grained contextual information from the given examples, and the conversion from images to text prompts can be time-consuming. To address these challenges, we propose Analogist, a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques using a text-to-image diffusion model pretrained for image inpainting. For visual prompting, we propose a self-attention cloning (SAC) method to guide the fine-grained structural-level analogy between image examples. For textual prompting, we leverage GPT-4V's visual reasoning capability to efficiently generate text prompts and introduce a cross-attention masking (CAM) operation to enhance the accuracy of semantic-level analogy guided by text prompts. Our method is out-of-the-box and does not require fine-tuning or optimization. It is also generic and flexible, enabling a wide range of visual tasks to be performed in an in-context manner. Extensive experiments demonstrate the superiority of our method over existing approaches, both qualitatively and quantitatively.
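As a rough mental model of the two attention operations named in the abstract (not the authors' code), SAC can be pictured as copying the self-attention block relating example A to A' onto the block relating B to the inpainted B', and CAM as zeroing the text prompt's cross-attention outside the B' quadrant. Below is a minimal PyTorch sketch under assumed shapes; the 2x2 [A, A'; B, B'] grid layout, the equal token count per quadrant, and all function names (quadrant_indices, clone_self_attention, mask_cross_attention) are our assumptions for illustration:

    import torch

    def quadrant_indices(h: int, w: int):
        """Flattened token indices for the quadrants of an h x w latent
        grid laid out as [A, A'; B, B'] (hypothetical layout)."""
        ids = torch.arange(h * w).reshape(h, w)
        A  = ids[: h // 2, : w // 2].reshape(-1)
        Ap = ids[: h // 2, w // 2 :].reshape(-1)
        B  = ids[h // 2 :, : w // 2].reshape(-1)
        Bp = ids[h // 2 :, w // 2 :].reshape(-1)
        return A, Ap, B, Bp

    def clone_self_attention(attn, A, Ap, B, Bp):
        """SAC, sketched: overwrite the B->B' self-attention block with
        the A->A' block, so B' relates to B the way A' relates to A.
        attn: (heads, tokens, tokens) attention weights."""
        attn = attn.clone()
        attn[:, B[:, None], Bp[None, :]] = attn[:, A[:, None], Ap[None, :]]
        return attn

    def mask_cross_attention(attn, Bp):
        """CAM, sketched: keep the text prompt's cross-attention only on
        the B' quadrant being inpainted, zeroing it elsewhere.
        attn: (heads, image_tokens, text_tokens)."""
        keep = torch.zeros(attn.shape[1], dtype=torch.bool)
        keep[Bp] = True
        return attn * keep[None, :, None]

    # toy shapes: a 16x16 latent grid -> 256 image tokens, 77 text tokens
    A, Ap, B, Bp = quadrant_indices(16, 16)
    self_attn  = torch.softmax(torch.randn(8, 256, 256), dim=-1)
    cross_attn = torch.softmax(torch.randn(8, 256, 77), dim=-1)
    self_attn  = clone_self_attention(self_attn, A, Ap, B, Bp)
    cross_attn = mask_cross_attention(cross_attn, Bp)

In an actual pipeline, manipulations like these would be hooked into the inpainting diffusion model's attention layers at each denoising step; the sketch only shows the tensor-level idea.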

URL

https://arxiv.org/abs/2405.10316

PDF

https://arxiv.org/pdf/2405.10316.pdf

