Abstract
Recent advancements in image inpainting, particularly through diffusion modeling, have yielded promising outcomes. However, when tested on completing images around given foreground objects, current methods that inpaint an image in an end-to-end manner encounter challenges such as "over-imagination", inconsistency between foreground and background, and limited diversity. In response, we introduce Anywhere, a pioneering multi-agent framework designed to address these issues. Anywhere utilizes a sophisticated pipeline comprising various agents, including a visual language model (VLM), a large language model (LLM), and image generation models. The framework consists of three principal components: a prompt generation module, an image generation module, and an outcome analyzer. The prompt generation module conducts a semantic analysis of the input foreground image, leveraging the VLM to predict relevant language descriptions and the LLM to recommend optimal language prompts. In the image generation module, we employ a text-guided canny-to-image generation model to create a template image from the edge map of the foreground image and the language prompts, and an image refiner to produce the final outcome by blending the input foreground with the template image. The outcome analyzer employs the VLM to evaluate image content rationality, aesthetic score, and foreground-background relevance, triggering prompt and image regeneration as needed. Extensive experiments demonstrate that our Anywhere framework excels in foreground-conditioned image inpainting, mitigating "over-imagination", resolving foreground-background discrepancies, and enhancing diversity. It successfully elevates foreground-conditioned image inpainting to produce more reliable and diverse results.
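The control flow described in the abstract (prompt generation, image generation, then analysis with regeneration on failure) can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: every agent below (`vlm_describe`, `llm_suggest_prompt`, `canny_to_image`, `refine`, `analyze`) is a hypothetical stub standing in for the corresponding model, and the acceptance thresholds are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Analysis:
    """Stand-in for the outcome analyzer's three checks named in the abstract."""
    rational: bool       # is the image content plausible?
    aesthetic: float     # aesthetic score in [0, 1] (assumed scale)
    relevance: float     # foreground-background relevance in [0, 1] (assumed scale)

    def acceptable(self, threshold: float = 0.5) -> bool:
        # Threshold is an illustrative assumption, not from the paper.
        return self.rational and self.aesthetic >= threshold and self.relevance >= threshold

def vlm_describe(foreground: str) -> str:
    # Stub for the VLM's semantic description of the foreground image.
    return f"description of {foreground}"

def llm_suggest_prompt(description: str, attempt: int) -> str:
    # Stub for the LLM recommending a background prompt; varying the prompt
    # per attempt models the diversity gained from regeneration.
    return f"prompt#{attempt} for {description}"

def canny_to_image(edge_map: str, prompt: str) -> str:
    # Stub for the text-guided canny-to-image model producing a template image.
    return f"template({edge_map}, {prompt})"

def refine(foreground: str, template: str) -> str:
    # Stub for the image refiner blending the foreground into the template.
    return f"blend({foreground}, {template})"

def analyze(image: str, attempt: int) -> Analysis:
    # Stub analyzer: pretends the first attempt fails the relevance check,
    # so the loop exercises the regeneration path.
    return Analysis(rational=True, aesthetic=0.8,
                    relevance=0.3 if attempt == 0 else 0.9)

def anywhere(foreground: str, max_attempts: int = 3) -> str:
    """Prompt generation -> image generation -> analysis, regenerating on rejection."""
    edge_map = f"canny({foreground})"
    description = vlm_describe(foreground)
    result = ""
    for attempt in range(max_attempts):
        prompt = llm_suggest_prompt(description, attempt)
        template = canny_to_image(edge_map, prompt)
        result = refine(foreground, template)
        if analyze(result, attempt).acceptable():
            return result
    return result  # after max_attempts, return the last candidate
```

The key design point mirrored here is that rejection by the analyzer re-enters the loop at prompt generation, so both the prompt and the image are regenerated rather than only the image.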
URL
https://arxiv.org/abs/2404.18598