Abstract
Despite the need for financial data on company activities in developing countries for development research and economic analysis, such data does not exist. In this project, we develop and evaluate two Natural Language Processing (NLP) based techniques to address this issue. We first curate a custom dataset specific to the domain of financial text data on developing countries and then explore multiple approaches for information extraction. The first is a text-to-text approach using the transformer-based T5 model, with the goal of performing named entity recognition (NER) and relation extraction simultaneously. We find that this model is able to learn the custom structured text output corresponding to the entities and their relations; our best T5 model achieves an accuracy of 92.44\%, a precision of 68.25\% and a recall of 54.20\% on the combined task. The second is a sequential approach in which NER is followed by relation extraction. For NER, we run pre-trained and fine-tuned models using SpaCy, and we develop a custom relation extraction model that applies heuristics to the output of SpaCy's dependency parser to determine entity relationships \cite{spacy}. We obtain an accuracy of 84.72\%, a precision of 6.06\% and a recall of 5.57\% on this sequential task.
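To make the reported precision and recall figures concrete, the sketch below scores extracted relations against gold annotations. The abstract does not state how the metrics were computed, so exact matching over (head, relation, tail) triples is an assumption here, and the function name, relation labels and example data are all hypothetical.

```python
def precision_recall(predicted, gold):
    """Score predicted relation triples against gold triples.

    Assumption: a prediction counts as correct only if the whole
    (head, relation, tail) triple matches a gold triple exactly.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold annotations for one sentence.
gold = [
    ("Acme Corp", "invests_in", "Kenya"),
    ("Acme Corp", "revenue", "USD 5 million"),
]

# Hypothetical model output: one correct triple, one with a wrong value.
predicted = [
    ("Acme Corp", "invests_in", "Kenya"),
    ("Acme Corp", "revenue", "USD 50 million"),
]

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under exact matching, a single wrong entity span invalidates the whole triple, which may be one reason precision and recall can sit far below accuracy on a combined extraction task.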
URL
https://arxiv.org/abs/2403.09077