Paper Reading AI Learner

Weak-to-Strong Diffusion with Reflection

2025-02-01 16:00:08
Lichen Bai, Masashi Sugiyama, Zeke Xie

Abstract

The goal of diffusion generative models is to align the learned distribution with the real data distribution through gradient score matching. However, inherent limitations in training data quality, modeling strategies, and architectural design lead to an inevitable gap between generated outputs and real data. To reduce this gap, we propose Weak-to-Strong Diffusion (W2SD), a novel framework that uses the estimated difference between existing weak and strong models (i.e., the weak-to-strong difference) to approximate the gap between an ideal model and a strong model. By employing a reflective operation that alternates between denoising and inversion with the weak-to-strong difference, we show theoretically that W2SD steers latent variables along sampling trajectories toward regions of the real data distribution. W2SD is highly flexible and broadly applicable, enabling diverse improvements through the strategic selection of weak-to-strong model pairs (e.g., DreamShaper vs. SD1.5, good experts vs. bad experts in MoE). Extensive experiments demonstrate that W2SD significantly improves human preference, aesthetic quality, and prompt adherence, achieving SOTA performance across various modalities (e.g., image, video), architectures (e.g., UNet-based, DiT-based, MoE), and benchmarks. For example, Juggernaut-XL with W2SD achieves an HPSv2 winning rate of up to 90% over the original results. Moreover, the performance gains of W2SD markedly outweigh its additional computational overhead, and the cumulative improvements from different weak-to-strong differences further solidify its practical utility and deployability.
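The reflective operation the abstract describes can be illustrated with a minimal 1-D toy sketch: denoise one step with the strong model, then invert (re-noise) that step with the weak model, so the net update moves the latent along the weak-to-strong difference. All names here (`strong_score`, `weak_score`, `w2sd_reflection_step`) and the linear toy scores are assumptions for illustration, not the paper's actual implementation.

```python
# Toy sketch of one W2SD reflection step (assumed, not the paper's code).
# The two score functions stand in for a strong and a weak diffusion model;
# each pulls the latent toward its own mode (1.0 vs. 0.5).

def strong_score(x: float) -> float:
    # assumed: strong model's score estimate at latent x
    return -(x - 1.0)

def weak_score(x: float) -> float:
    # assumed: weak model's score estimate at latent x
    return -(x - 0.5)

def w2sd_reflection_step(x: float, step: float = 0.1) -> float:
    """Denoise with the strong model, then invert with the weak model."""
    x_denoised = x + step * strong_score(x)            # denoising step
    x_reflected = x_denoised - step * weak_score(x_denoised)  # inversion step
    return x_reflected

# Repeated reflection steps push the latent in the weak-to-strong direction.
x = 0.0
for _ in range(10):
    x = w2sd_reflection_step(x)
```

Because the inversion subtracts the weak model's update, the composite step amounts to the strong model's denoising plus an extra push along the difference of the two score estimates, which is the mechanism the framework exploits.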

URL

https://arxiv.org/abs/2502.00473

PDF

https://arxiv.org/pdf/2502.00473.pdf
