Abstract
Contemporary diffusion models show remarkable capability in text-to-image generation, yet remain limited to restricted resolutions (e.g., 1,024 × 1,024). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe for tuning-free image upscaling that pivots on global-regional priors derived from the given global prompt and regional prompts estimated via a Multimodal LLM. Technically, the low-frequency component of the low-resolution image is recognized as a global structure prior to encourage global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen cross-attention between the global prompt and each region during regional denoising, yielding a regional attention prior that alleviates the object repetition issue. The estimated regional prompts, which contain rich descriptive details, further act as a regional semantic prior to fuel the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that our C-Upscale manages to generate ultra-high-resolution images (e.g., 4,096 × 4,096 and 8,192 × 8,192) with higher visual fidelity and more creative regional details.
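To make the global structure prior concrete, below is a minimal sketch of extracting the low-frequency component of a low-resolution image with a centered FFT low-pass filter. The abstract does not specify C-Upscale's exact filtering scheme, so the `cutoff` parameter and the circular mask here are illustrative assumptions.

```python
import torch

def low_frequency_prior(lr_image: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """Extract the low-frequency component of a low-resolution image via a
    centered FFT low-pass filter (sketch; the paper's exact filter is not
    given in the abstract).

    lr_image: (B, C, H, W) tensor in pixel or latent space.
    cutoff:   fraction of the normalized frequency radius to keep
              (hypothetical default).
    """
    # Move to the frequency domain and center the DC component.
    fft = torch.fft.fftshift(torch.fft.fft2(lr_image), dim=(-2, -1))
    _, _, h, w = lr_image.shape
    yy, xx = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h), torch.linspace(-1.0, 1.0, w), indexing="ij"
    )
    # Circular low-pass mask: keep frequencies within `cutoff` of the center.
    mask = ((xx**2 + yy**2).sqrt() <= cutoff).to(fft.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(fft * mask, dim=(-2, -1))).real
    return low
```

In the spirit of the method, this low-frequency map would then condition the high-resolution denoising process so the upscaled output stays globally consistent with the input image.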
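The regional attention control step can likewise be sketched as masked cross-attention: global-prompt tokens deemed irrelevant to the current region are suppressed before the softmax, so repeated objects are not re-synthesized in every tile. The masking formulation and the `token_mask` input below are assumptions; the abstract only states that cross-attention between the global prompt and each region is screened.

```python
import torch

def region_masked_cross_attention(q, k, v, token_mask):
    """Cross-attention where a region only attends to a screened subset of
    global-prompt tokens (sketch of the regional attention control idea).

    q: (B, Nq, D) image-patch queries for one region.
    k, v: (B, Nt, D) text keys/values for the global prompt.
    token_mask: (B, Nt) bool; True keeps a token for this region. How the
        mask is derived (e.g., from MLLM regional prompts) is an assumption.
    """
    d = q.shape[-1]
    logits = (q @ k.transpose(-2, -1)) / d**0.5  # (B, Nq, Nt) scaled scores
    # Suppress screened-out tokens; keep at least one token (e.g., BOS)
    # unmasked per row to avoid an all-(-inf) softmax.
    logits = logits.masked_fill(~token_mask[:, None, :], float("-inf"))
    attn = logits.softmax(dim=-1)
    return attn @ v  # (B, Nq, D) region features attending only to kept tokens
```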
URL
https://arxiv.org/abs/2505.16976