Abstract
Visual In-Context Learning (ICL) has emerged as a promising research area due to its capability to accomplish various tasks with limited example pairs through analogical reasoning. However, training-based visual ICL has limited ability to generalize to unseen tasks and requires the collection of a diverse task dataset. On the other hand, existing inference-based visual ICL methods rely solely on textual prompts, which fail to capture fine-grained contextual information from the given examples, and converting images into text prompts can be time-consuming. To address these challenges, we propose Analogist, a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques, using a text-to-image diffusion model pretrained for image inpainting. For visual prompting, we propose a self-attention cloning (SAC) method to guide the fine-grained structural-level analogy between image examples. For textual prompting, we leverage GPT-4V's visual reasoning capability to efficiently generate text prompts and introduce a cross-attention masking (CAM) operation to enhance the accuracy of the semantic-level analogy guided by text prompts. Our method works out of the box and requires no fine-tuning or optimization. It is also generic and flexible, enabling a wide range of visual tasks to be performed in an in-context manner. Extensive experiments demonstrate the superiority of our method over existing approaches, both qualitatively and quantitatively.
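The two attention-level operations named in the abstract can be pictured concretely. Below is a minimal, hypothetical sketch (not the authors' implementation) of how self-attention cloning (SAC) and cross-attention masking (CAM) might be applied inside a diffusion UNet's attention layers, assuming the example pair (A, A') and the query/target (B, B') are composed onto a single canvas processed by the inpainting model so that the self-attention map contains cross-image sub-blocks; the quadrant index tensors idx_A, idx_Ap, idx_B, idx_Bp are assumed helpers, not part of the paper.

```python
import torch

def self_attention_cloning(attn, idx_A, idx_Ap, idx_B, idx_Bp):
    # attn: [batch*heads, tokens, tokens] softmaxed self-attention map over the
    # whole canvas; idx_*: equal-length LongTensors indexing each quadrant's tokens.
    attn = attn.clone()
    src = attn[:, idx_A][:, :, idx_Ap]            # structural analogy A -> A'
    attn[:, idx_B.unsqueeze(-1), idx_Bp] = src    # clone it as guidance for B -> B'
    return attn / attn.sum(dim=-1, keepdim=True)  # re-normalize rows

def cross_attention_masking(attn, idx_Bp):
    # attn: [batch*heads, image_tokens, text_tokens] cross-attention map.
    # Confine the GPT-4V-generated prompt's influence to the target quadrant B'.
    mask = torch.zeros(attn.shape[1], 1, device=attn.device)
    mask[idx_Bp] = 1.0
    return attn * mask
```

In a diffusers-style pipeline, such functions would typically be hooked in through custom attention processors during the inpainting denoising steps; the specific layers, timesteps, and canvas layout Analogist uses are detailed in the paper itself.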
URL
https://arxiv.org/abs/2405.10316