Abstract
Despite strong performance in data-rich regimes, deep learning often underperforms in the data-scarce settings common in practice. While foundation models (FMs) trained on massive datasets demonstrate strong generalization by extracting general-purpose features, their performance can still degrade when labeled data is scarce during downstream fine-tuning. To address this, we propose GeLDA, a semantics-aware generative latent data augmentation framework that leverages conditional diffusion models to synthesize samples in an FM-induced latent space. Because this latent space is low-dimensional and, compared to the input space, concentrates task-relevant information, GeLDA enables efficient, high-quality data generation. GeLDA conditions generation on auxiliary feature vectors that capture semantic relationships among classes or subdomains, facilitating data augmentation in low-resource domains. We validate GeLDA on two large-scale recognition tasks: (a) in zero-shot language-specific speech emotion recognition, GeLDA improves the Whisper-large baseline's unweighted average recall by 6.13%; and (b) in long-tailed image classification, it achieves 74.7% tail-class accuracy on ImageNet-LT, setting a new state-of-the-art result.
URL
https://arxiv.org/abs/2602.02841