Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs

2024-11-04 16:46:53
Alexandros Haliassos, Rodrigo Mira, Honglie Chen, Zoe Landgraf, Stavros Petridis, Maja Pantic

Abstract

Research in auditory, visual, and audiovisual speech recognition (ASR, VSR, and AVSR, respectively) has traditionally been conducted independently. Even recent self-supervised studies addressing two or all three tasks simultaneously tend to yield separate models, leading to disjoint inference pipelines with increased memory requirements and redundancies. This paper proposes unified training strategies for these systems. We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance, overcoming typical optimisation challenges when training from scratch. Moreover, we introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples, addressing shortcomings in related self-supervised methods. Finally, we develop a self-supervised pre-training method within our framework, proving its effectiveness alongside our semi-supervised approach. Despite using a single model for all tasks, our unified approach achieves state-of-the-art performance compared to recent methods on LRS3 and LRS2 for ASR, VSR, and AVSR, as well as on the newly released WildVSR dataset. Code and models are available at this https URL.
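The abstract compresses two mechanisms: a single set of weights that serves auditory, visual, and audiovisual inputs, and a greedy pseudo-labelling loop over unlabelled clips. The sketch below illustrates both mechanics in PyTorch under stated assumptions; it is not the authors' released implementation. The class name `UnifiedSpeechRecognizer`, the sum-fusion of time-aligned modality features, the CTC-style greedy decoding, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of the unified-model idea: one model for ASR, VSR, and AVSR.
# Hypothetical names and shapes; not the paper's released code.
import random

import torch
import torch.nn as nn


class UnifiedSpeechRecognizer(nn.Module):
    """One set of weights for all three tasks: absent modalities are omitted."""

    def __init__(self, audio_dim=80, video_dim=512, d_model=256, vocab_size=1000):
        super().__init__()
        # Modality-specific front-ends projecting into a shared space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.classifier = nn.Linear(d_model, vocab_size)  # frame-wise (CTC-style) logits

    def forward(self, audio=None, video=None):
        # Fuse whichever modalities are present (frames assumed time-aligned).
        assert audio is not None or video is not None, "need at least one modality"
        feats = 0.0
        if audio is not None:
            feats = feats + self.audio_proj(audio)
        if video is not None:
            feats = feats + self.video_proj(video)
        return self.classifier(self.encoder(feats))


@torch.no_grad()
def greedy_pseudo_labels(model, audio=None, video=None, blank_id=0):
    """Greedy (argmax) decoding of unlabelled clips into pseudo-transcripts."""
    logits = model(audio=audio, video=video)   # (batch, time, vocab)
    ids = logits.argmax(dim=-1)                # greedy per-frame choice
    labels = []
    for seq in ids:
        out, prev = [], blank_id
        for t in seq.tolist():
            if t != blank_id and t != prev:    # CTC-style: collapse repeats, drop blanks
                out.append(t)
            prev = t
        labels.append(out)
    return labels


if __name__ == "__main__":
    model = UnifiedSpeechRecognizer()
    audio = torch.randn(2, 50, 80)   # (batch, frames, mel features)
    video = torch.randn(2, 50, 512)  # (batch, frames, lip-crop features)

    # Joint training: each batch is randomly served as ASR, VSR, or AVSR,
    # so the same weights see all three input conditions.
    task = random.choice(["asr", "vsr", "avsr"])
    logits = model(audio=audio if task in ("asr", "avsr") else None,
                   video=video if task in ("vsr", "avsr") else None)
    print(task, logits.shape, len(greedy_pseudo_labels(model, audio, video)))
```

In the actual system the front-ends are far stronger (e.g. convolutional stems over waveforms and lip crops), and the pseudo-labelling step feeds the greedy hypotheses back as training targets for the unlabelled data; the sketch only shows the control flow that lets one model cover ASR (audio only), VSR (video only), and AVSR (both).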

URL

https://arxiv.org/abs/2411.02256

PDF

https://arxiv.org/pdf/2411.02256.pdf

