Mining Word Boundaries in Speech as Naturally Annotated Word Segmentation Data

2022-10-31 08:02:21

Lei Zhang, Shilin Zhou, Chen Gong, Zhenghua Li, Zhefeng Wang, Baoxing Huai, Min Zhang

arXiv_CL

Abstract

Chinese word segmentation (CWS) models achieve very high performance when training data is sufficient and in-domain. However, performance drops drastically in cross-domain and low-resource scenarios because of data sparseness. Since constructing large-scale manually annotated data is time-consuming and labor-intensive, we propose, for the first time, to mine word boundary information from pauses in speech, which efficiently yields large-scale naturally annotated CWS data. We present a simple yet effective complete-then-train method that uses these natural annotations from speech to train CWS models. Extensive experiments demonstrate that leveraging the naturally annotated data extracted from speech significantly improves CWS performance in cross-domain and low-resource scenarios.
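
The idea of reading word boundaries off pauses can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes that each character already carries start and end times from some speech-to-text alignment step, and the names `mine_boundaries`, `to_partial_labels`, and the `min_pause` threshold of 0.2 seconds are hypothetical choices made for illustration. A silence between two adjacent characters that lasts at least `min_pause` is treated as a word boundary; every other gap stays unlabeled, so the result is a partial annotation.

```python
from typing import List, Optional, Tuple

# (character, start_sec, end_sec): hypothetical output of a character-level
# speech-to-text alignment step, which the abstract does not specify.
AlignedChar = Tuple[str, float, float]


def mine_boundaries(chars: List[AlignedChar], min_pause: float = 0.2) -> List[int]:
    """Return each index i where a pause of at least min_pause follows chars[i]."""
    boundaries = []
    for i in range(len(chars) - 1):
        # Silence between the end of this character and the start of the next.
        gap = chars[i + 1][1] - chars[i][2]
        if gap >= min_pause:
            boundaries.append(i)
    return boundaries


def to_partial_labels(n_chars: int, boundaries: List[int]) -> List[Optional[int]]:
    """Label each gap between characters: 1 = boundary observed, None = unknown."""
    known = set(boundaries)
    return [1 if i in known else None for i in range(n_chars - 1)]


# Illustrative timings only, not data from the paper.
aligned = [
    ("我", 0.00, 0.18),
    ("们", 0.18, 0.35),
    ("去", 0.70, 0.88),
    ("北", 0.90, 1.05),
    ("京", 1.05, 1.25),
]

boundaries = mine_boundaries(aligned)
labels = to_partial_labels(len(aligned), boundaries)
```

Only the gap after "们" exceeds the threshold here, so a single boundary is observed and the remaining gaps stay unknown. Partial labels of this kind are presumably what the complete-then-train method fills in before training, but the abstract does not describe that step, so it is left out of the sketch.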

URL

https://arxiv.org/abs/2210.17122

PDF

https://arxiv.org/pdf/2210.17122.pdf