Abstract
Diffusion-based garment synthesis has primarily focused on the design phase of the fashion domain, while the garment production process remains largely underexplored. To bridge this gap, we introduce a new task: Flat Sketch to Realistic Garment Image (FS2RG), which generates realistic garment images by integrating flat sketches and textual guidance. FS2RG presents two key challenges: 1) fabric characteristics are guided solely by textual prompts, providing insufficient visual supervision for diffusion-based models and limiting their ability to capture fine-grained fabric details; 2) flat sketches and textual guidance may provide conflicting information, requiring the model to selectively preserve or modify garment attributes while maintaining structural coherence. To tackle this task, we propose HiGarment, a novel framework comprising two core components: i) a multi-modal semantic enhancement mechanism that strengthens fabric representation across textual and visual modalities, and ii) a harmonized cross-attention mechanism that dynamically balances information from flat sketches and text prompts, allowing controllable synthesis by generating either sketch-aligned (image-biased) or text-guided (text-biased) outputs. Furthermore, we collect Multi-modal Detailed Garment, the largest open-source dataset for garment generation. Experimental results and user studies demonstrate the effectiveness of HiGarment in garment synthesis. The code and dataset will be released.
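The abstract does not specify how the harmonized cross-attention balances the two conditioning signals, so the following is only a minimal PyTorch sketch of one plausible reading: the denoiser's queries attend to text tokens and sketch tokens separately, and a tunable coefficient blends the two attention outputs. The class name, projections, and the scalar `alpha` blending rule are all assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a "harmonized cross-attention" layer: one query
# stream, two conditioning streams (text and flat-sketch tokens), blended
# by a coefficient alpha. Names and the blending rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HarmonizedCrossAttention(nn.Module):
    def __init__(self, dim: int, ctx_dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        # Separate key/value projections for the two conditioning modalities.
        self.to_kv_text = nn.Linear(ctx_dim, dim * 2, bias=False)
        self.to_kv_sketch = nn.Linear(ctx_dim, dim * 2, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def _attend(self, q: torch.Tensor, kv: torch.Tensor) -> torch.Tensor:
        # Standard multi-head scaled dot-product attention over one modality.
        k, v = kv.chunk(2, dim=-1)
        b, n, _ = q.shape
        h = self.heads
        q, k, v = (t.reshape(t.shape[0], t.shape[1], h, -1).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, n, -1)

    def forward(self, x, text_tokens, sketch_tokens, alpha: float = 0.5):
        # alpha -> 1.0 biases generation toward the sketch (image-biased,
        # structure-preserving); alpha -> 0.0 biases it toward the text
        # prompt (text-biased, attribute-editing).
        q = self.to_q(x)
        text_out = self._attend(q, self.to_kv_text(text_tokens))
        sketch_out = self._attend(q, self.to_kv_sketch(sketch_tokens))
        return self.to_out(alpha * sketch_out + (1.0 - alpha) * text_out)
```

Under this reading, the image-biased vs. text-biased behavior described in the abstract corresponds to sweeping `alpha` at inference time; the paper may instead learn the balance per layer or per token.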
URL
https://arxiv.org/abs/2505.23186