Abstract
Self-supervised learning has become a powerful approach for learning meaningful representations from unlabeled data across many domains, reducing the reliance on large labeled datasets. Inspired by BERT's success in capturing deep bidirectional context in natural language processing, similar frameworks have been adapted to other modalities such as audio, with models like BEATs extending the bidirectional training paradigm to audio signals using vector quantization (VQ). However, these frameworks face challenges, notably their dependence on a single codebook for quantization, which may not capture the complex, multifaceted nature of signals. In addition, codebook utilization is inefficient, leaving code vectors underused. To address these limitations, we introduce BRIDLE (Bidirectional Residual Quantization Interleaved Discrete Learning Encoder), a self-supervised encoder pretraining framework that incorporates residual quantization (RQ) into the bidirectional training process and generalizes to pretraining on audio, image, and video. By using multiple hierarchical codebooks, RQ enables fine-grained discretization in the latent space, enhancing representation quality. BRIDLE trains the encoder and the tokenizer with an interleaved procedure. We evaluate BRIDLE on audio understanding tasks using classification benchmarks, achieving state-of-the-art results, and demonstrate competitive performance on image classification and video classification tasks, with consistent improvements over traditional VQ methods in downstream performance.
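To make the idea of multiple hierarchical codebooks concrete, the following is a minimal sketch of generic residual quantization, not BRIDLE's actual tokenizer. The function names (`nearest`, `residual_quantize`) and the tiny hand-picked codebooks are assumptions for illustration only; in practice the codebooks would be learned, and the abstract does not specify their number or size. Each level quantizes whatever the previous levels failed to capture, so one latent vector becomes a short sequence of discrete indices.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    def dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def residual_quantize(vec, codebooks):
    """Quantize vec with a stack of codebooks; each level encodes the remaining residual."""
    residual = list(vec)
    approx = [0.0] * len(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        code = codebook[idx]
        indices.append(idx)
        # Accumulate the reconstruction and pass the leftover error to the next level.
        approx = [a + c for a, c in zip(approx, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, approx


# Hypothetical toy codebooks (hand-picked, not learned): a coarse level and a finer one.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, -0.5]],
]

x = [1.4, 0.6]
indices, x_hat = residual_quantize(x, codebooks)
print(indices, x_hat)
```

With a single codebook, `x` would be represented by one index and the coarse code `[1.0, 1.0]`; the second level refines that approximation by encoding the residual, which is the sense in which stacked codebooks give a finer discretization than a single one.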
URL
https://arxiv.org/abs/2502.02118