Abstract
Research in auditory, visual, and audiovisual speech recognition (ASR, VSR, and AVSR, respectively) has traditionally been conducted independently. Even recent self-supervised studies addressing two or all three tasks simultaneously tend to yield separate models, leading to disjoint inference pipelines with increased memory requirements and redundancies. This paper proposes unified training strategies for these systems. We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance, overcoming typical optimisation challenges when training from scratch. Moreover, we introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples, addressing shortcomings in related self-supervised methods. Finally, we develop a self-supervised pre-training method within our framework, proving its effectiveness alongside our semi-supervised approach. Despite using a single model for all tasks, our unified approach achieves state-of-the-art performance compared to recent methods on LRS3 and LRS2 for ASR, VSR, and AVSR, as well as on the newly released WildVSR dataset. Code and models are available at this https URL.
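To make the abstract's central idea concrete, the following is a minimal PyTorch-style sketch of unified training: a single model serves ASR, VSR, and AVSR by randomly dropping one input modality per step. This is an illustration under assumed interfaces (the `model(audio=..., video=..., targets=...)` call returning a scalar loss is hypothetical), not the authors' released implementation.

```python
import random

def unified_training_step(model, batch, optimizer):
    """One step of unified training: the same parameters are optimised
    for ASR (audio only), VSR (video only), and AVSR (both streams)."""
    audio, video, targets = batch  # padded feature tensors + token targets
    task = random.choice(["asr", "vsr", "avsr"])
    if task == "asr":
        video = None   # drop the visual stream -> behaves as an ASR model
    elif task == "vsr":
        audio = None   # drop the auditory stream -> behaves as a VSR model
    # "avsr" keeps both modalities
    loss = model(audio=audio, video=video, targets=targets)  # assumed to return a scalar loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return task, loss.item()
```

The greedy pseudo-labelling idea can be pictured in the same spirit: decode unlabelled clips with the current model and reuse the hypotheses as targets in the next training round. Again a sketch; `decode_greedy` is an assumed method, and the paper's actual filtering and scheduling details are not reproduced here.

```python
import torch

@torch.no_grad()
def generate_pseudo_labels(model, unlabelled_loader):
    """Greedily decode unlabelled audio-visual clips and keep the
    hypotheses as pseudo-targets for subsequent training rounds."""
    model.eval()
    pseudo_labelled = []
    for audio, video in unlabelled_loader:
        hypothesis = model.decode_greedy(audio=audio, video=video)  # assumed greedy decoder
        pseudo_labelled.append((audio, video, hypothesis))
    return pseudo_labelled
```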
URL
https://arxiv.org/abs/2411.02256