BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

Abstract

Generating higher-resolution human-centric scenes with fine detail and control remains a challenge for existing text-to-image diffusion models. The challenge stems from limited training image size, limited text encoder capacity (a token limit), and the inherent difficulty of generating complex scenes involving multiple humans. Current methods address only the training size limit, and they often yield human-centric scenes with severe artifacts. We propose BeyondScene, a novel framework that overcomes these limitations and generates exquisite higher-resolution (over 8K) human-centric scenes with exceptional text-image correspondence and naturalness using existing pretrained diffusion models. BeyondScene takes a staged, hierarchical approach. It first generates a detailed base image, focusing on the elements crucial to instance creation for multiple humans and on detailed descriptions that exceed the diffusion model's token limit. It then seamlessly converts the base image into a higher-resolution output that exceeds the training image size and incorporates text-aware and instance-aware details, via our novel instance-aware hierarchical enlargement process, which consists of our proposed high-frequency injected forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing methods in correspondence with detailed text descriptions and in naturalness, paving the way for advanced applications in higher-resolution human-centric scene creation beyond the capacity of pretrained diffusion models, without costly retraining. Project page: this https URL.

URL

https://arxiv.org/abs/2404.04544

PDF

https://arxiv.org/pdf/2404.04544.pdf
