Abstract
We present a pipeline and tools for building a large multilingual data set of conversational speech that covers 66 languages and varieties from 32 phyla, intended for the diversity-aware study of conversational speech in naturalistic settings. We describe the compilation and format of this largely open-access resource, which is based on language documentation projects and platforms, and we release an open-source tool, `convo-parse', to help build resources of this type. We conclude by outlining two applications that show how massively multilingual data sets can inform interactional linguistics and speech recognition technology, and thus help broaden the empirical foundations of the language sciences and technologies of the future.
URL
https://arxiv.org/abs/2203.03399