JTubeSpeech: corpus of Japanese speech collected from YouTube for speech recognition and speaker verification

2021-12-17 05:09:44

Shinnosuke Takamichi, Ludwig Kürzinger, Takaaki Saeki, Sayaka Shiota, Shinji Watanabe

arXiv_SD

Abstract

In this paper, we construct a new Japanese speech corpus called "JTubeSpeech." Although recent end-to-end learning requires large speech corpora, open-source corpora of this scale have not yet been established for languages other than English. We describe the construction of a corpus from YouTube videos and subtitles for speech recognition and speaker verification. Our method can automatically filter the videos and subtitles with almost no language-dependent processing. We consistently employ Connectionist Temporal Classification (CTC)-based techniques for automatic speech recognition (ASR) and a speaker variation-based method for automatic speaker verification (ASV). We build 1) a large-scale Japanese ASR benchmark with more than 1,300 hours of data and 2) 900 hours of data for Japanese ASV.

URL

https://arxiv.org/abs/2112.09323

PDF

https://arxiv.org/pdf/2112.09323.pdf