Paper Reading AI Learner

Semantics-Aware Generative Latent Data Augmentation for Learning in Low-Resource Domains

2026-02-09 18:46:04
Jaesung Bae, Minje Kim

Abstract

Despite strong performance in data-rich regimes, deep learning often underperforms in the data-scarce settings common in practice. While foundation models (FMs) trained on massive datasets generalize well by extracting general-purpose features, they can still suffer from scarce labeled data during downstream fine-tuning. To address this, we propose GeLDA, a semantics-aware generative latent data augmentation framework that leverages conditional diffusion models to synthesize samples in an FM-induced latent space. Because this space is low-dimensional and concentrates task-relevant information compared to the input space, GeLDA enables efficient, high-quality data generation. GeLDA conditions generation on auxiliary feature vectors that capture semantic relationships among classes or subdomains, facilitating data augmentation in low-resource domains. We validate GeLDA on two large-scale recognition tasks: (a) in zero-shot language-specific speech emotion recognition, GeLDA improves the Whisper-large baseline's unweighted average recall by 6.13%; and (b) in long-tailed image classification, it achieves 74.7% tail-class accuracy on ImageNet-LT, setting a new state-of-the-art result.
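The abstract describes synthesizing latents with a conditional diffusion model in an FM-induced space. The minimal sketch below, using only NumPy, illustrates the data flow under illustrative assumptions: the FM latents, the auxiliary class-conditioning vector, the noise schedule, and the toy linear "denoiser" are all stand-ins (the paper's actual model, dimensions, and training procedure are not specified here). It shows the standard DDPM-style forward noising step and a conditional ancestral-sampling loop that would produce synthetic latents for augmentation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: FM-induced latent dim, auxiliary condition dim,
# and number of diffusion steps. All values are illustrative.
latent_dim, cond_dim, T = 16, 4, 50

# Linear noise schedule (standard DDPM-style betas/alphas).
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(z0, t, eps):
    """Forward diffusion: noise a clean latent z0 up to step t."""
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def denoiser(z_t, t, cond, W):
    """Toy conditional noise predictor: one linear map over
    [z_t, cond, t/T]. A real system would use a trained network
    conditioned on semantic auxiliary features."""
    x = np.concatenate([z_t, cond, [t / T]])
    return W @ x

# Untrained stand-in weights, shape (latent_dim, latent_dim + cond_dim + 1).
W = 0.01 * rng.standard_normal((latent_dim, latent_dim + cond_dim + 1))

def sample(cond, W):
    """Conditional ancestral sampling: start from Gaussian noise and
    iterate reverse steps, conditioning every step on `cond`."""
    z = rng.standard_normal(latent_dim)
    for t in range(T - 1, -1, -1):
        eps_hat = denoiser(z, t, cond, W)
        z = (z - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:  # no noise added at the final step
            z += np.sqrt(betas[t]) * rng.standard_normal(latent_dim)
    return z  # synthetic latent for the downstream fine-tuning set

# Forward-process demo: noise a clean latent to an intermediate step.
z0 = rng.standard_normal(latent_dim)
z_mid = q_sample(z0, 10, rng.standard_normal(latent_dim))

# Generate one synthetic latent for a (hypothetical) tail class.
cond = np.eye(cond_dim)[1]  # one-hot stand-in for a semantic embedding
z_new = sample(cond, W)
print(z_new.shape)
```

With a trained denoiser, such sampled latents would be mixed with real latents to rebalance scarce classes or subdomains before fine-tuning the downstream head.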


URL

https://arxiv.org/abs/2602.02841

PDF

https://arxiv.org/pdf/2602.02841.pdf

