MRL Parsing Without Tears: The Case of Hebrew

2024-03-11 17:54:33
Shaltiel Shmidman, Avi Shmidman, Moshe Koppel, Reut Tsarfaty

Abstract

Syntactic parsing remains a critical tool for relation extraction and information extraction, especially in resource-scarce languages where LLMs are lacking. Yet in morphologically rich languages (MRLs), where parsers need to identify multiple lexical units in each token, existing systems suffer in latency and setup complexity. Some use a pipeline to peel away the layers: first segmentation, then morphology tagging, and then syntax parsing; however, errors in earlier layers are then propagated forward. Others use a joint architecture to evaluate all permutations at once; while this improves accuracy, it is notoriously slow. In contrast, and taking Hebrew as a test case, we present a new "flipped pipeline": decisions are made directly on the whole-token units by expert classifiers, each one dedicated to one specific task. The classifiers are independent of one another, and only at the end do we synthesize their predictions. This blazingly fast approach sets a new SOTA in Hebrew POS tagging and dependency parsing, while also reaching near-SOTA performance on other Hebrew NLP tasks. Because our architecture does not rely on any language-specific resources, it can serve as a model to develop similar parsers for other MRLs.
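The "flipped pipeline" described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the classifiers, their label sets, or the synthesis procedure. The three "experts" below are hypothetical lookup-table stand-ins for trained models, the transliterated tokens are toy data invented for illustration, and the names `prefix_expert`, `pos_expert`, `head_expert`, and `synthesize` are assumptions. The only points taken from the abstract are that each expert decides directly on whole tokens, that the experts do not consume one another's output, and that their predictions are combined only at the end.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Toy sentence in rough transliteration (illustrative only):
# "hyld hlk lbyt" ~ "the-boy walked to-the-house".
TOKENS = ["hyld", "hlk", "lbyt"]


def prefix_expert(tokens: List[str]) -> List[int]:
    """Hypothetical expert: number of leading prefix letters per whole token."""
    table = {"hyld": 1, "lbyt": 1}  # stand-in for a trained classifier
    return [table.get(tok, 0) for tok in tokens]


def pos_expert(tokens: List[str]) -> List[str]:
    """Hypothetical expert: POS tag of the main lexical unit of each token."""
    table = {"hyld": "NOUN", "hlk": "VERB", "lbyt": "NOUN"}
    return [table.get(tok, "X") for tok in tokens]


def head_expert(tokens: List[str]) -> List[int]:
    """Hypothetical expert: index of each token's syntactic head (-1 = root)."""
    table = {"hyld": 1, "hlk": -1, "lbyt": 1}
    return [table.get(tok, -1) for tok in tokens]


@dataclass
class TokenAnalysis:
    token: str
    prefixes: List[str]
    host: str
    pos: str
    head: int


# Each expert sees only the raw whole tokens, never another expert's output.
EXPERTS: Dict[str, Callable[[List[str]], list]] = {
    "prefix_len": prefix_expert,
    "pos": pos_expert,
    "head": head_expert,
}


def synthesize(tokens: List[str]) -> List[TokenAnalysis]:
    """Run every expert independently, then merge predictions per token."""
    preds = {name: expert(tokens) for name, expert in EXPERTS.items()}
    analyses = []
    for i, tok in enumerate(tokens):
        n = preds["prefix_len"][i]
        analyses.append(
            TokenAnalysis(
                token=tok,
                prefixes=list(tok[:n]),  # one-letter prefixes in this toy setup
                host=tok[n:],
                pos=preds["pos"][i],
                head=preds["head"][i],
            )
        )
    return analyses


if __name__ == "__main__":
    for analysis in synthesize(TOKENS):
        print(analysis)
```

Because the experts in this sketch share no state, they could be evaluated in any order or in parallel, and segmentation is only materialized in the final merge step rather than being a prerequisite for tagging or parsing.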

URL

https://arxiv.org/abs/2403.06970

PDF

https://arxiv.org/pdf/2403.06970.pdf

