BRIDLE: Generalized Self-supervised Learning with Quantization

2025-02-04 08:54:06
Hoang M. Nguyen, Satya N. Shukla, Qiang Zhang, Hanchao Yu, Sreya D. Roy, Taipeng Tian, Lingjiong Zhu, Yuchen Liu

Abstract

Self-supervised learning is a powerful approach for learning meaningful representations from unlabeled data across various domains, reducing the reliance on large labeled datasets. Inspired by BERT's success in capturing deep bidirectional context in natural language processing, similar frameworks have been adapted to other modalities such as audio, with models like BEATs extending the bidirectional training paradigm to audio signals using vector quantization (VQ). However, these frameworks face challenges, notably their dependence on a single codebook for quantization, which may not capture the complex, multifaceted nature of signals. In addition, inefficient codebook utilization leaves code vectors underused. To address these limitations, we introduce BRIDLE (Bidirectional Residual Quantization Interleaved Discrete Learning Encoder), a self-supervised encoder pretraining framework that incorporates residual quantization (RQ) into the bidirectional training process and generalizes to pretraining on audio, image, and video. Using multiple hierarchical codebooks, RQ enables fine-grained discretization in the latent space, enhancing representation quality. BRIDLE trains the encoder and the tokenizer in an interleaved procedure. We evaluate BRIDLE on audio understanding tasks using classification benchmarks, achieving state-of-the-art results, and demonstrate competitive performance on image classification and video classification tasks, showing consistent improvements over traditional VQ methods in downstream performance.
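
As a rough illustration of the residual quantization idea mentioned in the abstract, the sketch below quantizes vectors with a stack of codebooks, where each level encodes what the previous levels left over. It is a minimal NumPy sketch under stated assumptions, not BRIDLE's implementation: the function name `residual_quantize`, the random (untrained) codebooks, and all sizes are hypothetical choices made for illustration only.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Quantize each row of x with one codebook per residual level.

    x:         array of shape (n, dim)
    codebooks: list of arrays, each of shape (num_codes, dim)
    Returns (codes, quantized): codes has shape (n, levels) and
    quantized is the sum of the selected code vectors, shape (n, dim).
    """
    residual = x.astype(np.float64).copy()
    quantized = np.zeros_like(residual)
    codes = []
    for codebook in codebooks:
        # Squared Euclidean distance from every residual to every code vector.
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)   # nearest code per row at this level
        chosen = codebook[idx]
        quantized += chosen          # accumulate the reconstruction
        residual -= chosen           # pass the remainder to the next level
        codes.append(idx)
    return np.stack(codes, axis=1), quantized

# Hypothetical setup: 8 vectors of dimension 4, three random codebooks of
# 16 codes each (a real system would learn these codebooks).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
codebooks = [rng.normal(size=(16, 4)) * (0.5 ** level) for level in range(3)]
codes, quantized = residual_quantize(x, codebooks)
```

Each input is then represented by one index per level rather than a single index, which is the sense in which multiple codebooks give a finer discretization than a single VQ codebook.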

URL

https://arxiv.org/abs/2502.02118

PDF

https://arxiv.org/pdf/2502.02118.pdf

