Abstract
In recent years, significant progress has been made in automatic lip reading, but these methods require large-scale datasets that do not exist for many low-resource languages. In this paper, we present a new multipurpose audio-visual dataset for Persian. The dataset consists of almost 220 hours of video from 1760 speakers. Beyond lip reading, the dataset is also suitable for automatic speech recognition, audio-visual speech recognition, and speaker recognition, and it is the first large-scale lip reading dataset in Persian. We provide a baseline method for each of these tasks. In addition, we propose a technique, applicable to other languages as well, for detecting visemes (the visual equivalents of phonemes) in Persian. The visemes obtained by this technique improve the accuracy of the lip reading task by 7% (relative) compared to previously proposed visemes.
URL
https://arxiv.org/abs/2301.10180