A Part-of-Speech Tagger for Yiddish: First Steps in Tagging the Yiddish Book Center Corpus

2022-04-03 22:53:36

Seth Kulick, Neville Ryant, Beatrice Santorini, Joel Wallenberg

arXiv_CL

Abstract

We describe the construction and evaluation of a part-of-speech tagger for Yiddish (the first one, to the best of our knowledge). This is the first step in a larger project of automatically assigning part-of-speech tags and syntactic structure to Yiddish text for purposes of linguistic research. We combine two resources for the current work: an 80K-word subset of the Penn Parsed Corpus of Historical Yiddish (PPCHY) (Santorini, 2021) and 650 million words of OCR'd Yiddish text from the Yiddish Book Center (YBC). We compute word embeddings on the YBC corpus, and these embeddings are used with a tagger model trained and evaluated on the PPCHY. Yiddish orthography in the YBC corpus has many spelling inconsistencies, and we present some evidence that even simple non-contextualized embeddings can capture the relationships among spelling variants without the need to first "standardize" the corpus. We evaluate tagger performance on a 10-fold cross-validation split, with and without the embeddings, and show that the embeddings improve tagger performance. However, a great deal of work remains to be done, and we conclude by discussing some next steps, including the need for additional annotated training and test data.
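
The 10-fold cross-validation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the PPCHY subset was partitioned, so everything below is an assumption made for illustration: the hypothetical helper `k_fold_splits`, the shuffling by a fixed seed, the splitting at the sentence level, and the toy sentence list.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) lists for k-fold cross-validation.

    Hypothetical helper: each item lands in exactly one test fold,
    and the remaining k-1 folds form the training set for that round.
    """
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    # Deal indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = [items[i] for i in folds[held_out]]
        train = [
            items[i]
            for j, fold in enumerate(folds)
            if j != held_out
            for i in fold
        ]
        yield train, test


# Toy stand-in for annotated sentences (not real PPCHY data).
sentences = [f"sent_{n}" for n in range(25)]
splits = list(k_fold_splits(sentences, k=10))
```

In a setup like this, a tagger would be trained and scored once per split, with and without the embeddings, and the per-fold accuracies averaged.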

URL

https://arxiv.org/abs/2204.01175

PDF

https://arxiv.org/pdf/2204.01175.pdf