Paper Reading AI Learner

Complex Mathematical Symbol Definition Structures: A Dataset and Model for Coordination Resolution in Definition Extraction

2023-05-24 02:53:48
Anna Martin-Boyle, Andrew Head, Kyle Lo, Risham Sidhu, Marti A. Hearst, Dongyeop Kang

Abstract

Mathematical symbol definition extraction is important for improving scholarly reading interfaces and scholarly information extraction (IE). However, the task poses several challenges: math symbols are difficult to process as they are not composed of natural language morphemes; and scholarly papers often contain sentences that require resolving complex coordinate structures. We present SymDef, an English language dataset of 5,927 sentences from full-text scientific papers where each sentence is annotated with all mathematical symbols linked with their corresponding definitions. This dataset focuses specifically on complex coordination structures such as "respectively" constructions, which often contain overlapping definition spans. We also introduce a new definition extraction method that masks mathematical symbols, creates a copy of each sentence for each symbol, specifies a target symbol, and predicts its corresponding definition spans using slot filling. Our experiments show that our definition extraction model significantly outperforms RoBERTa and other strong IE baseline systems by 10.9 points with a macro F1 score of 84.82. With our dataset and model, we can detect complex definitions in scholarly documents to make scientific writing more readable.

Abstract (translated)

数学符号定义提取对于改善学术阅读接口和提高学术信息提取(IE)任务非常重要。然而,该任务提出了多项挑战:数学符号不是由自然语言词码组成的,因此处理它们非常困难;学术文章通常包含需要解决复杂的坐标结构的句子。我们提出了SymDef一个英语语言数据集,由5,927个句子从完整论文中收集,每个句子都被注释了与它们的对应数学符号及其定义。这个数据集专门关注复杂的坐标结构,如“分别”构造,其中常常包含重叠的定义跨度。我们还介绍了一种新的定义提取方法,它掩盖了数学符号,为每个符号创建每个句子的副本,指定一个目标符号,并使用空插槽预测其相应的定义跨度。我们的实验表明,我们的定义提取模型显著优于RoBERTa和其他强大的IE基线系统,而且宏观F1得分为84.82。通过我们的数据和模型,我们可以在学术文档中识别复杂的定义,使科学写作更易于阅读。

URL

https://arxiv.org/abs/2305.14660

PDF

https://arxiv.org/pdf/2305.14660.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot