Abstract
Mathematical symbol definition extraction is important for improving scholarly reading interfaces and scholarly information extraction (IE). However, the task poses several challenges: math symbols are difficult to process as they are not composed of natural language morphemes; and scholarly papers often contain sentences that require resolving complex coordinate structures. We present SymDef, an English language dataset of 5,927 sentences from full-text scientific papers where each sentence is annotated with all mathematical symbols linked with their corresponding definitions. This dataset focuses specifically on complex coordination structures such as "respectively" constructions, which often contain overlapping definition spans. We also introduce a new definition extraction method that masks mathematical symbols, creates a copy of each sentence for each symbol, specifies a target symbol, and predicts its corresponding definition spans using slot filling. Our experiments show that our definition extraction model significantly outperforms RoBERTa and other strong IE baseline systems by 10.9 points with a macro F1 score of 84.82. With our dataset and model, we can detect complex definitions in scholarly documents to make scientific writing more readable.
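The preprocessing step described in the abstract (mask the symbols, copy the sentence once per symbol, mark one target per copy) can be illustrated with a minimal sketch. This is not the authors' implementation: the placeholder strings `SYMBOL` and `TARGET`, the function name `make_target_copies`, and the example sentence are all assumptions made for illustration, and the slot-filling model that would consume these copies is not shown.

```python
from typing import List, Tuple

# Hypothetical placeholder strings; the abstract does not say which mask tokens are used.
MASK = "SYMBOL"    # stands in for every symbol that is not the current target
TARGET = "TARGET"  # stands in for the symbol whose definition is being sought


def make_target_copies(
    tokens: List[str], symbol_positions: List[int]
) -> List[Tuple[str, List[str]]]:
    """Return one masked copy of the sentence per symbol.

    Each copy pairs the original target symbol with a token list in which
    the target position holds TARGET and all other symbol positions hold MASK.
    """
    copies = []
    for target in symbol_positions:
        masked = list(tokens)  # copy so the input sentence is left untouched
        for pos in symbol_positions:
            masked[pos] = TARGET if pos == target else MASK
        copies.append((tokens[target], masked))
    return copies


# Illustrative "respectively" sentence (invented for this sketch, not from SymDef).
tokens = "Let x and y denote the width and height , respectively .".split()
copies = make_target_copies(tokens, symbol_positions=[1, 3])

for symbol, masked in copies:
    print(symbol, "->", " ".join(masked))
```

Under these assumptions, a downstream tagger would receive each copy separately and label the definition span for the single marked target, which is one way to handle the overlapping spans that "respectively" constructions produce.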
URL
https://arxiv.org/abs/2305.14660