Sentence Identification with BOS and EOS Label Combinations

2023-01-31 01:03:07
Takuma Udagawa, Hiroshi Kanayama, Issei Yoshida

Abstract

The sentence is a fundamental unit in many NLP applications. Sentence segmentation is widely used as the first preprocessing task, where an input text is split into consecutive sentences, with each end of sentence (EOS) treated as a boundary. This task formulation relies on the strong assumption that the input text consists only of sentences, or what we call sentential units (SUs). However, real-world texts often contain non-sentential units (NSUs) such as metadata, sentence fragments, and nonlinguistic markers, which are unreasonable or undesirable to treat as part of an SU. To tackle this issue, we formulate a novel task of sentence identification, where the goal is to identify SUs while excluding NSUs in a given text. To conduct sentence identification, we propose a simple yet effective method that combines beginning of sentence (BOS) and EOS labels to determine the most probable SUs and NSUs based on dynamic programming. To evaluate this task, we design an automatic, language-independent procedure to convert the Universal Dependencies corpora into sentence identification benchmarks. Finally, our experiments on the sentence identification task demonstrate that our proposed method generally outperforms sentence segmentation baselines that use only EOS labels.
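
The abstract does not spell out how the BOS and EOS labels are combined, so the following is only a minimal sketch of one way such a dynamic program could look, not the authors' implementation. It assumes per-token BOS and EOS probabilities are already available from some tagger, and it assumes a simple independence-based log-likelihood score: a sentential unit spanning tokens i..j is rewarded for a BOS at i and an EOS at j and for the absence of BOS/EOS labels in between, while tokens outside any SU are scored as having neither label. The function name `identify_sentences` and the `eps` parameter are hypothetical.

```python
import math

def identify_sentences(p_bos, p_eos, eps=1e-9):
    """Return inclusive (start, end) token spans chosen as sentential units.

    p_bos[k] / p_eos[k] are assumed per-token probabilities that token k
    begins / ends a sentence. Tokens not covered by any span are NSUs.
    The scoring scheme is an illustrative assumption, not the paper's.
    """
    n = len(p_bos)
    assert len(p_eos) == n

    def clip(p):
        # Keep probabilities away from 0 and 1 so the logs stay finite.
        return min(max(p, eps), 1.0 - eps)

    log_b = [math.log(clip(p)) for p in p_bos]        # token is a BOS
    log_nb = [math.log(1.0 - clip(p)) for p in p_bos]  # token is not a BOS
    log_e = [math.log(clip(p)) for p in p_eos]        # token is an EOS
    log_ne = [math.log(1.0 - clip(p)) for p in p_eos]  # token is not an EOS

    # Prefix sums give O(1) span scores.
    cum_nb = [0.0] * (n + 1)
    cum_ne = [0.0] * (n + 1)
    for k in range(n):
        cum_nb[k + 1] = cum_nb[k] + log_nb[k]
        cum_ne[k + 1] = cum_ne[k] + log_ne[k]

    def span_score(i, j):
        # BOS at i, EOS at j, no BOS in i+1..j, no EOS in i..j-1.
        return (log_b[i] + log_e[j]
                + (cum_nb[j + 1] - cum_nb[i + 1])
                + (cum_ne[j] - cum_ne[i]))

    # best[t]: best total score for the first t tokens.
    best = [0.0] + [-math.inf] * n
    back = [None] * (n + 1)  # start index of the SU ending at t-1, if any
    for t in range(1, n + 1):
        # Option 1: token t-1 lies outside every SU (part of an NSU).
        best[t] = best[t - 1] + log_nb[t - 1] + log_ne[t - 1]
        back[t] = None
        # Option 2: some SU covers tokens i..t-1.
        for i in range(t):
            score = best[i] + span_score(i, t - 1)
            if score > best[t]:
                best[t] = score
                back[t] = i

    # Walk the back-pointers to recover the chosen spans.
    spans = []
    t = n
    while t > 0:
        if back[t] is None:
            t -= 1
        else:
            spans.append((back[t], t - 1))
            t = back[t]
    spans.reverse()
    return spans


# Toy input: "ID: 42 Hello world . Bye ." with made-up probabilities.
p_bos = [0.1, 0.05, 0.9, 0.05, 0.05, 0.9, 0.05]
p_eos = [0.05, 0.1, 0.05, 0.05, 0.95, 0.05, 0.9]
print(identify_sentences(p_bos, p_eos))
```

On this toy input the sketch selects tokens 2..4 and 5..6 as SUs and leaves the leading "ID: 42" tokens uncovered, which is the behavior an EOS-only segmenter cannot express because it must assign every token to some sentence. The quadratic inner loop is kept for clarity; a maximum span length would bound it in practice.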

URL

https://arxiv.org/abs/2301.13352

PDF

https://arxiv.org/pdf/2301.13352.pdf

