Learned representation-guided diffusion models for large-image generation

2023-12-12 14:45:45
Alexandros Graikos, Srikar Yellapragada, Minh-Quan Le, Saarthak Kapse, Prateek Prasanna, Joel Saltz, Dimitris Samaras

Abstract

To synthesize high-fidelity samples, diffusion models typically require auxiliary data to guide the generation process. However, it is impractical to procure the painstaking patch-level annotation effort required in specialized domains like histopathology and satellite imagery; it is often performed by domain experts and involves hundreds of millions of patches. Modern-day self-supervised learning (SSL) representations encode rich semantic and visual information. In this paper, we posit that such representations are expressive enough to act as proxies to fine-grained human labels. We introduce a novel approach that trains diffusion models conditioned on embeddings from SSL. Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images. In addition, we construct larger images by assembling spatially consistent patches inferred from SSL embeddings, preserving long-range dependencies. Augmenting real data by generating variations of real images improves downstream classifier accuracy for patch-level and larger, image-scale classification tasks. Our models are effective even on datasets not encountered during training, demonstrating their robustness and generalizability. Generating images from learned embeddings is agnostic to the source of the embeddings. The SSL embeddings used to generate a large image can either be extracted from a reference image, or sampled from an auxiliary model conditioned on any related modality (e.g. class labels, text, genomic data). As proof of concept, we introduce the text-to-large image synthesis paradigm where we successfully synthesize large pathology and satellite images out of text descriptions.
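
The large-image construction described above (one patch per SSL embedding, assembled into a bigger canvas) can be illustrated with a minimal sketch. This is not the authors' implementation: `fake_patch_generator` is a hypothetical stand-in for an embedding-conditioned diffusion model, and averaging the overlapping regions is an assumed, simplified way of blending neighbouring patches. The patch size, overlap, and embedding dimension are arbitrary illustrative values.

```python
import numpy as np


def fake_patch_generator(embedding, patch_size, rng):
    """Hypothetical stand-in for a diffusion model conditioned on one SSL embedding.

    A real model would run iterative denoising; here the patch brightness simply
    depends on the embedding mean, plus a little noise.
    """
    base = float(np.tanh(embedding.mean()))
    noise = 0.05 * rng.standard_normal((patch_size, patch_size, 3))
    return np.clip(0.5 + 0.5 * base + noise, 0.0, 1.0)


def assemble_large_image(embedding_grid, patch_size=64, overlap=16, seed=0):
    """Build a large image from a (rows, cols, dim) grid of embeddings.

    Each grid cell yields one patch; neighbouring patches overlap by `overlap`
    pixels and the overlapping pixels are averaged (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    rows, cols, _ = embedding_grid.shape
    stride = patch_size - overlap
    height = stride * (rows - 1) + patch_size
    width = stride * (cols - 1) + patch_size

    canvas = np.zeros((height, width, 3))
    weight = np.zeros((height, width, 1))  # how many patches cover each pixel

    for r in range(rows):
        for c in range(cols):
            patch = fake_patch_generator(embedding_grid[r, c], patch_size, rng)
            y, x = r * stride, c * stride
            canvas[y:y + patch_size, x:x + patch_size] += patch
            weight[y:y + patch_size, x:x + patch_size] += 1.0

    return canvas / weight  # every pixel is covered at least once


if __name__ == "__main__":
    grid = np.zeros((3, 4, 8))  # 3x4 grid of 8-dimensional placeholder embeddings
    image = assemble_large_image(grid)
    print(image.shape)
```

The embedding grid here is filled with zeros for brevity; per the abstract, it could instead be extracted from a reference image or sampled from an auxiliary model conditioned on another modality such as text.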

Abstract (translated)

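To synthesize high-fidelity samples, diffusion models typically require auxiliary data to guide the generation process. However, in highly specialized domains such as histopathology and satellite imagery, obtaining patch-level annotations is impractical: the work is usually done by domain experts and involves hundreds of millions of patches. Modern self-supervised learning (SSL) representations encode rich semantic and visual information. In this paper, we argue that such representations are expressive enough to stand in for fine-grained human labels. We introduce a new approach that trains diffusion models conditioned on SSL embeddings. Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images. In addition, by assembling spatially consistent patches inferred from SSL embeddings into larger images, we preserve long-range dependencies. Augmenting real data with generated variations of real images improves downstream classifier accuracy on patch-level and larger, image-scale classification tasks. Our models are also effective on datasets not encountered during training, demonstrating their robustness and generalizability. Generating images from learned embeddings is agnostic to the source of the embeddings. The SSL embeddings used to generate a large image can be extracted from a reference image or sampled from an auxiliary model conditioned on any related modality (e.g. class labels, text, genomic data). As a proof of concept, we introduce the text-to-large-image synthesis paradigm, in which we successfully synthesize large pathology and satellite images from text descriptions.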

URL

https://arxiv.org/abs/2312.07330

PDF

https://arxiv.org/pdf/2312.07330.pdf

