Abstract
Since December 2019, information sources such as newspapers have produced large volumes of unstructured text in many languages about the coronavirus outbreak. Analyzing this text is time-consuming unless it is first represented in a structured format. An information extraction pipeline built on two essential tasks, named entity tagging and relation extraction, can be applied to these texts to produce such a representation. This study proposes a data annotation pipeline that generates training data from coronavirus news articles, covering both generic and domain-specific entities. Named entity recognition models are trained on the annotated corpus and then evaluated on test sentences manually annotated by domain experts. The code base and a demonstration are available online.
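The abstract does not describe how the annotation pipeline assigns labels, so the following is only a minimal sketch of one common approach: dictionary-based (gazetteer) labeling that emits BIO tags, followed by span-level F1 scoring against gold annotations. The gazetteer entries, the label names (`ORG`, `LOC`, `DISEASE`, `SYMPTOM`), and the function names `annotate`, `spans`, and `f1_score` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical gazetteer: lowercased token tuples mapped to entity labels.
# A real pipeline would load far larger generic and domain-specific lists.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("world", "health", "organization"): "ORG",
    ("wuhan",): "LOC",
    ("covid-19",): "DISEASE",
    ("fever",): "SYMPTOM",
}

MAX_SPAN = max(len(key) for key in GAZETTEER)


def annotate(tokens: List[str]) -> List[str]:
    """Assign BIO tags by greedy longest-match lookup in the gazetteer."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-word names win.
        for span in range(min(MAX_SPAN, len(tokens) - i), 0, -1):
            label = GAZETTEER.get(tuple(lowered[i:i + span]))
            if label is not None:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + span):
                    tags[j] = f"I-{label}"
                i += span
                matched = True
                break
        if not matched:
            i += 1
    return tags


def spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end, label) entity spans from a BIO tag sequence."""
    found: Set[Tuple[int, int, str]] = set()
    start, label = 0, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label is not None and tag[2:] == label
        if label is not None and not continues:
            found.add((start, i, label))
            label = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return found


def f1_score(gold: List[str], pred: List[str]) -> float:
    """Span-level F1: a prediction counts only if boundaries and label match."""
    g, p = spans(gold), spans(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence = "The World Health Organization confirmed COVID-19 cases in Wuhan"
    tokens = sentence.split()
    for token, tag in zip(tokens, annotate(tokens)):
        print(f"{token}\t{tag}")
```

Greedy longest-match is chosen here so that a multi-word name is tagged as one entity rather than as several shorter ones. Tags produced this way are noisy, which is why evaluation against sentences labeled by domain experts, as the abstract describes, is the meaningful check.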
URL
https://arxiv.org/abs/2404.13439