Paper Reading AI Learner

Canonical Latent Representations in Conditional Diffusion Models

2025-06-11 17:28:52
Yitao Xu, Tong Zhang, Ehsan Pajouheshgar, Sabine S\"usstrunk

Abstract

Conditional diffusion models (CDMs) have shown impressive performance across a range of generative tasks. Their ability to model the full data distribution has opened new avenues for analysis-by-synthesis in downstream discriminative learning. However, this same modeling capacity causes CDMs to entangle the class-defining features with irrelevant context, posing challenges to extracting robust and interpretable representations. To address this, we identify Canonical LAtent Representations (CLAReps), latent codes whose internal CDM features preserve essential categorical information while discarding non-discriminative signals. When decoded, CLAReps produce representative samples for each class, offering an interpretable and compact summary of the core class semantics with minimal irrelevant details. Exploiting CLAReps, we develop a novel diffusion-based feature-distillation paradigm, CaDistill. While the student has full access to the training set, the CDM as teacher transfers core class knowledge only via CLAReps, which amount to merely 10% of the training data in size. After training, the student achieves strong adversarial robustness and generalization ability, focusing more on the class signals instead of spurious background cues. Our findings suggest that CDMs can serve not just as image generators but also as compact, interpretable teachers that can drive robust representation learning.
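The abstract does not spell out the CaDistill objective, but a feature-distillation setup of this kind can be sketched as a standard supervised loss on the full training set plus a feature-matching term that pulls student features toward the CDM teacher's features at the CLARep samples. The function below is a minimal NumPy sketch under those assumptions; the name `cadistill_loss`, the MSE matching term, and the `alpha` weighting are illustrative, not the paper's actual formulation.

```python
import numpy as np

def cadistill_loss(student_logits, labels, student_feats, teacher_feats, alpha=0.5):
    """Hypothetical sketch of a CaDistill-style objective:
    cross-entropy on the training batch plus a feature-matching
    term on the CLARep subset (weighting and distance are assumptions)."""
    # Numerically stable log-softmax for the cross-entropy term.
    shifted = student_logits - student_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()
    # Feature distillation: match student features to the CDM teacher's
    # internal features extracted at CLARep latent codes (MSE stand-in).
    distill = np.mean((student_feats - teacher_feats) ** 2)
    return ce + alpha * distill
```

Because the teacher term is computed only on the CLARep samples (roughly 10% of the data), the distillation signal stays compact while the cross-entropy term still sees the full training set.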


URL

https://arxiv.org/abs/2506.09955

PDF

https://arxiv.org/pdf/2506.09955.pdf

