Exploiting Cross-Dialectal Gold Syntax for Low-Resource Historical Languages: Towards a Generic Parser for Pre-Modern Slavic

2020-11-12 16:17:59

Nilo Pedrazzini (University of Oxford)

arXiv_CL

Abstract

This paper explores the possibility of improving the performance of specialized parsers for pre-modern Slavic by training them on data from different related varieties. Because of their linguistic heterogeneity, pre-modern Slavic varieties are treated as low-resource historical languages, whereby cross-dialectal treebank data may be exploited to overcome data scarcity and attempt the training of a variety-agnostic parser. Previous experiments on early Slavic dependency parsing are discussed, particularly with regard to their ability to tackle different orthographic, regional and stylistic features. A generic pre-modern Slavic parser and two specialized parsers, one for East Slavic and one for South Slavic, are trained using jPTDP (Nguyen & Verspoor 2018), a neural network model for joint part-of-speech (POS) tagging and dependency parsing which had shown promising results on a number of Universal Dependencies (UD) treebanks, including Old Church Slavonic (OCS). With these experiments, a new state of the art is obtained for both OCS (83.79% unlabelled attachment score (UAS) and 78.43% labelled attachment score (LAS)) and Old East Slavic (OES) (85.7% UAS and 80.16% LAS).
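
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how UAS and LAS are commonly computed over a set of tokens. It is not the evaluation code used in the paper: the function name `attachment_scores`, the token representation, and the toy sentence are assumptions made for illustration, and official evaluation scripts may treat punctuation or multiword tokens differently.

```python
from typing import List, Tuple

# Hypothetical token representation: (head_index, dependency_label).
# A head index of 0 stands for the artificial root.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned gold and predicted tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")

    # UAS: the predicted head matches the gold head.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head and the dependency label match.
    labelled_hits = sum(g == p for g, p in zip(gold, pred))

    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * labelled_hits / n


# Toy four-token sentence (invented data, not taken from any treebank).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "amod")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}%, LAS = {las:.2f}%")
```

In the toy example, three of four heads are correct and two of four tokens have both the correct head and label, so LAS can never exceed UAS.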

URL

https://arxiv.org/abs/2011.06467

PDF

https://arxiv.org/pdf/2011.06467.pdf