NeMo Toolbox for Speech Dataset Construction

2021-04-11 01:57:55

Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg

arXiv_CL

Abstract

In this paper, we introduce a new toolbox for constructing speech datasets from long audio recordings and raw reference texts. We develop tools for each step of the speech dataset construction pipeline, including data preprocessing, audio-text alignment, data post-processing, and filtering. The proposed pipeline also supports human-in-the-loop review to address text-audio mismatch issues and remove samples that do not satisfy the quality requirements. We demonstrate the toolbox's efficiency by building the Russian LibriSpeech corpus (RuLS) from LibriVox audiobooks. The toolbox is open-sourced in the NeMo framework. The RuLS corpus is released on OpenSLR.
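
The filtering step mentioned in the abstract can be pictured with a minimal Python sketch. It is not the NeMo toolbox's API: the `Segment` structure, the `filter_segments` helper, and all threshold values are hypothetical, and the sketch assumes each aligned segment carries a confidence score where higher means a better audio-text match. Segments that fail the checks are set aside for manual review rather than dropped, in the spirit of the human-in-the-loop step.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One aligned audio/text pair (hypothetical structure, not from the paper)."""
    text: str
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    score: float  # alignment confidence; higher is assumed to be better

    @property
    def duration(self) -> float:
        return self.end - self.start


def filter_segments(segments, min_score=-2.0, min_dur=1.0, max_dur=20.0):
    """Split segments into (kept, review) lists.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    kept, review = [], []
    for seg in segments:
        ok = seg.score >= min_score and min_dur <= seg.duration <= max_dur
        (kept if ok else review).append(seg)
    return kept, review


if __name__ == "__main__":
    demo = [
        Segment("first sentence", 0.0, 5.0, -1.0),   # passes both checks
        Segment("too short", 5.0, 5.5, -0.5),        # fails the duration check
        Segment("poorly aligned", 6.0, 12.0, -3.0),  # fails the score check
    ]
    kept, review = filter_segments(demo)
    print(len(kept), "kept;", len(review), "sent to manual review")
```

Routing rejected segments to a review list keeps the decision reversible: a reviewer can correct the reference text and return a segment to the dataset instead of losing the audio.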

URL

https://arxiv.org/abs/2104.04896

PDF

https://arxiv.org/pdf/2104.04896.pdf