Abstract
Syntactic parsing remains a critical tool for relation extraction and information extraction, especially for resource-scarce languages, where LLMs are lacking. Yet in morphologically rich languages (MRLs), where parsers need to identify multiple lexical units in each token, existing systems suffer from high latency and setup complexity. Some use a pipeline to peel away the layers: first segmentation, then morphological tagging, and then syntactic parsing; however, errors in earlier layers then propagate forward. Others use a joint architecture to evaluate all permutations at once; while this improves accuracy, it is notoriously slow. In contrast, and taking Hebrew as a test case, we present a new "flipped pipeline": decisions are made directly on the whole-token units by expert classifiers, each one dedicated to one specific task. The classifiers are independent of one another, and only at the end do we synthesize their predictions. This blazingly fast approach sets a new SOTA in Hebrew POS tagging and dependency parsing, while also reaching near-SOTA performance on other Hebrew NLP tasks. Because our architecture does not rely on any language-specific resources, it can serve as a model for developing similar parsers for other MRLs.
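The "flipped pipeline" idea can be illustrated with a minimal sketch. It is not the paper's implementation: the three experts below are hypothetical rule-based stand-ins for trained classifiers, and the names `flipped_pipeline`, `prefix_len_expert`, `pos_expert`, and `head_expert`, the transliterated tokens, and the label set are all assumptions made for illustration. The only property it aims to show is the one stated in the abstract: each expert sees only the whole-token sequence, never another expert's output, and the predictions are merged in a single final step.

```python
from typing import Callable, Dict, List

# An "expert" maps the whole-token sequence to one prediction per token.
# These are toy stand-ins for trained classifiers (assumption, not the paper's models).
Expert = Callable[[List[str]], List[object]]


def prefix_len_expert(tokens: List[str]) -> List[object]:
    # Toy rule: treat a leading "w" as a one-letter prefix on longer tokens.
    return [1 if tok.startswith("w") and len(tok) > 1 else 0 for tok in tokens]


def pos_expert(tokens: List[str]) -> List[object]:
    # Toy lexicon lookup on whole tokens; unknown tokens get "X".
    lexicon = {"wbayit": "CONJ+NOUN", "gadol": "ADJ"}
    return [lexicon.get(tok, "X") for tok in tokens]


def head_expert(tokens: List[str]) -> List[object]:
    # Toy rule: the first token is the root (-1); all others attach to it.
    return [-1 if i == 0 else 0 for i in range(len(tokens))]


def flipped_pipeline(tokens: List[str], experts: Dict[str, Expert]) -> List[dict]:
    # Step 1: every expert runs on the raw whole tokens, independently of the others.
    predictions = {name: expert(tokens) for name, expert in experts.items()}

    # Sanity check: each expert must return exactly one prediction per token.
    for name, labels in predictions.items():
        if len(labels) != len(tokens):
            raise ValueError(f"expert {name!r} returned {len(labels)} labels "
                             f"for {len(tokens)} tokens")

    # Step 2: only now are the independent predictions synthesized per token.
    return [
        {"token": tok, **{name: predictions[name][i] for name in experts}}
        for i, tok in enumerate(tokens)
    ]


EXPERTS: Dict[str, Expert] = {
    "prefix_len": prefix_len_expert,
    "pos": pos_expert,
    "head": head_expert,
}

analysis = flipped_pipeline(["wbayit", "gadol"], EXPERTS)
```

Because no expert consumes another's output, a mistake in one of them (say, the prefix rule) cannot propagate into the others, and the experts could in principle be run in parallel; this is the contrast with a segmentation-first pipeline that the abstract draws.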
URL
https://arxiv.org/abs/2403.06970