Abstract
Diffusion models have shown excellent performance in text-to-image generation. Nevertheless, existing methods often hit performance bottlenecks when handling complex prompts that involve multiple objects, attributes, and relations. We therefore propose Multi-agent Collaboration-based Compositional Diffusion (MCCD), a method for text-to-image generation of complex scenes. Specifically, we design a multi-agent collaboration-based scene parsing module that builds an agent system of multiple agents with distinct tasks, leveraging MLLMs to effectively extract the various scene elements. In addition, a hierarchical compositional diffusion process uses Gaussian masking and filtering to refine bounding-box regions and enhances objects through region enhancement, yielding accurate, high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that MCCD significantly improves the performance of baseline models in a training-free manner, providing a substantial advantage in complex scene generation.
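The abstract does not give the exact formulation of the Gaussian mask used to refine bounding-box regions. A minimal sketch of one plausible variant, a 2D Gaussian centered on a bounding box whose spread follows the box dimensions, is below; the function name, argument layout, and the `sigma_scale` parameter are all assumptions, not the paper's actual implementation.

```python
import numpy as np

def gaussian_box_mask(height, width, box, sigma_scale=0.5):
    """Soft spatial mask for a bounding box (x0, y0, x1, y1).

    Weight peaks at the box center and decays with a Gaussian whose
    standard deviation is proportional to the box size, so attention
    (or latent blending) concentrates on the region without a hard edge.
    `sigma_scale` is a hypothetical knob controlling that decay.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1e-6)
    sy = max((y1 - y0) * sigma_scale, 1e-6)
    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.exp(-0.5 * (((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2))
    return mask / mask.max()  # normalize so the peak value is 1.0

# Example: a 64x64 mask concentrated on the central 32x32 box.
mask = gaussian_box_mask(64, 64, (16, 16, 48, 48))
```

Such a soft mask can then be multiplied into cross-attention maps or latents for the corresponding object prompt, which is one common way training-free methods steer generation toward a region.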
URL
https://arxiv.org/abs/2505.02648