Industry-wide nuclear power plant operating experience is a critical source of raw data for performing parameter estimations in reliability and risk models. Much operating experience information pertains to failure events and is stored as reports containing unstructured data, such as narratives. Event reports are essential for understanding how failures are initiated and propagated, including the numerous causal relations involved. Causal relation extraction using deep learning represents a significant frontier in the field of natural language processing (NLP), and is crucial since it enables the interpretation of intricate narratives and connections contained within vast amounts of written information. This paper proposes a hybrid framework for causality detection and extraction from nuclear licensee event reports (LERs). The main contributions are: (1) we compiled an LER corpus with 20,129 text samples for causality analysis, (2) developed an interactive tool for labeling cause-effect pairs, (3) built a deep-learning-based approach for causal relation detection, and (4) developed a knowledge-based cause-effect extraction approach.
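The abstract does not spell out the knowledge-based extraction step. As a rough illustration only, the Python sketch below shows what a cue-phrase-driven cause-effect extractor can look like; the cue patterns, function name, and example sentence are assumptions for illustration, not the authors' actual rules.

```python
import re

# Illustrative causal cue patterns; a real knowledge-based extractor would use a
# curated cue lexicon and syntactic constraints rather than three regexes.
CUE_PATTERNS = [
    re.compile(r"(?P<effect>.+?)\s+(?:was|were)\s+caused by\s+(?P<cause>.+)", re.I),
    re.compile(r"(?P<cause>.+?)\s+resulted in\s+(?P<effect>.+)", re.I),
    re.compile(r"due to\s+(?P<cause>.+?),\s*(?P<effect>.+)", re.I),
]

def extract_cause_effect(sentence: str):
    """Return (cause, effect) pairs found by any cue pattern."""
    pairs = []
    for pattern in CUE_PATTERNS:
        match = pattern.search(sentence)
        if match:
            pairs.append((match.group("cause").strip(" ."), match.group("effect").strip(" .")))
    return pairs

# Hypothetical LER-style sentence, not taken from the corpus.
print(extract_cause_effect("The reactor trip was caused by a loss of offsite power."))
# [('a loss of offsite power', 'The reactor trip')]
```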
https://arxiv.org/abs/2404.05656
This paper describes our participation in the Shared Task on Software Mentions Disambiguation (SOMD), with a focus on improving relation extraction in scholarly texts through Generative Language Models (GLMs) using single-choice question-answering. The methodology prioritises the use of in-context learning capabilities of GLMs to extract software-related entities and their descriptive attributes, such as distributive information. Our approach uses Retrieval-Augmented Generation (RAG) techniques and GLMs for Named Entity Recognition (NER) and Attributive NER to identify relationships between extracted software entities, providing a structured solution for analysing software citations in academic literature. The paper provides a detailed description of our approach, demonstrating how using GLMs in a single-choice QA paradigm can greatly enhance IE methodologies. Our participation in the SOMD shared task highlights the importance of precise software citation practices and showcases our system's ability to overcome the challenges of disambiguating and extracting relationships between software mentions. This sets the groundwork for future research and development in this field.
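As a rough sketch of the single-choice QA framing (not the shared-task system itself), the snippet below casts a relation decision between a software mention and an attribute as a single-choice question for a generative model. The relation inventory, prompt wording, and example sentence are assumptions.

```python
RELATION_CHOICES = ["version_of", "developer_of", "url_of", "citation_of"]  # illustrative inventory

def single_choice_prompt(sentence: str, software: str, attribute: str, choices=RELATION_CHOICES) -> str:
    """Frame relation classification between two mentions as a single-choice question."""
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return (
        f"Sentence: {sentence}\n"
        f"Question: Which relation holds between '{attribute}' and '{software}'?\n"
        f"Answer with exactly one letter.\n{options}"
    )

prompt = single_choice_prompt(
    "We analysed the data with SPSS 26 (IBM).", software="SPSS", attribute="26"
)
print(prompt)
# The prompt (optionally prefixed with retrieved in-context examples, as in RAG)
# is sent to the generative model, and the returned letter is mapped back to a relation.
```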
https://arxiv.org/abs/2404.05587
In acupuncture therapy, the accurate location of acupoints is essential for its effectiveness. The advanced language understanding capabilities of large language models (LLMs) like Generative Pre-trained Transformers (GPT) present a significant opportunity for extracting relations related to acupoint locations from textual knowledge sources. This study aims to compare the performance of GPT with traditional deep learning models (Long Short-Term Memory (LSTM) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT)) in extracting acupoint-related location relations, and to assess the impact of pretraining and fine-tuning on GPT's performance. We utilized the World Health Organization Standard Acupuncture Point Locations in the Western Pacific Region (WHO Standard) as our corpus, which consists of descriptions of 361 acupoints. Five types of relations between acupoints ('direction_of', 'distance_of', 'part_of', 'near_acupoint', and 'located_near') were annotated (n = 3,174). Five models were compared: BioBERT, LSTM, pre-trained GPT-3.5, fine-tuned GPT-3.5, and pre-trained GPT-4. Performance metrics included micro-average exact match precision, recall, and F1 scores. Our results demonstrate that fine-tuned GPT-3.5 consistently outperformed the other models in F1 scores across all relation types, achieving the highest overall micro-average F1 score of 0.92. This study underscores the effectiveness of LLMs like GPT in extracting relations related to acupoint locations, with implications for accurately modeling acupuncture knowledge and promoting standard implementation in acupuncture training and practice. The findings also contribute to advancing informatics applications in traditional and complementary medicine, showcasing the potential of LLMs in natural language processing.
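The reported metric is micro-averaged exact-match precision/recall/F1 over annotated relations; a minimal sketch of that computation follows. Only the relation names come from the abstract; the toy gold/predicted triples are invented for illustration.

```python
def micro_prf(gold: list[set], pred: list[set]):
    """Micro-averaged exact-match precision/recall/F1 over per-document triple sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy triples in (head, relation, tail) form; the acupoint names are placeholders.
gold = [{("LU7", "direction_of", "radial"), ("LU7", "distance_of", "1.5 B-cun"), ("LU7", "part_of", "forearm")}]
pred = [{("LU7", "direction_of", "radial"), ("LU7", "distance_of", "1.5 B-cun"), ("LU7", "near_acupoint", "LU8")}]
print(micro_prf(gold, pred))  # (0.666..., 0.666..., 0.666...)
```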
https://arxiv.org/abs/2404.05415
Transforming a sentence into a two-dimensional (2D) representation (e.g., table filling) can unfold a semantic plane, where each element of the plane is a word-pair representation of the sentence that may encode a possible relation between two named entities. The 2D representation is effective in resolving overlapped relation instances. However, in related work the representation is transformed directly from the raw input, making little use of prior knowledge, which is important for supporting the relation extraction task. In this paper, we propose a two-dimensional feature engineering method for the 2D sentence representation in relation extraction. Our proposed method is evaluated on three public datasets (ACE05 Chinese, ACE05 English, and SanWen) and achieves state-of-the-art performance. The results indicate that two-dimensional feature engineering can take advantage of a two-dimensional sentence representation and make full use of the prior knowledge of traditional feature engineering. Our code is publicly available at this https URL
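As a rough illustration of a 2D word-pair table enriched with a prior-knowledge feature (here a part-of-speech pair channel, an assumed stand-in for the paper's engineered features), consider the following sketch; the sentence, tags, and dimensions are invented.

```python
import numpy as np

sentence = ["Alice", "founded", "Acme", "in", "Berlin"]
# Toy prior knowledge of the kind classic feature engineering provides (illustrative only).
pos_tags = ["PROPN", "VERB", "PROPN", "ADP", "PROPN"]
pos_vocab = {p: i for i, p in enumerate(sorted(set(pos_tags)))}

n = len(sentence)
d_model, d_prior = 8, len(pos_vocab) * 2
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(n, d_model))          # stand-in for learned token embeddings

# Cell (i, j) represents the word pair (sentence[i], sentence[j]).
table = np.zeros((n, n, 2 * d_model + d_prior))
for i in range(n):
    for j in range(n):
        prior = np.zeros(d_prior)
        prior[pos_vocab[pos_tags[i]]] = 1.0                     # POS of the first word
        prior[len(pos_vocab) + pos_vocab[pos_tags[j]]] = 1.0    # POS of the second word
        table[i, j] = np.concatenate([word_emb[i], word_emb[j], prior])

print(table.shape)  # (5, 5, 22) -> one feature vector per word pair
```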
https://arxiv.org/abs/2404.04959
We introduce a meta dataset for few-shot relation extraction, which includes two datasets derived from existing supervised relation extraction datasets NYT29 (Takanobu et al., 2019; Nayak and Ng, 2020) and WIKIDATA (Sorokin and Gurevych, 2017) as well as a few-shot form of the TACRED dataset (Sabo et al., 2021). Importantly, all these few-shot datasets were generated under realistic assumptions such as: the test relations are different from any relations a model might have seen before, limited training data, and a preponderance of candidate relation mentions that do not correspond to any of the relations of interest. Using this large resource, we conduct a comprehensive evaluation of six recent few-shot relation extraction methods, and observe that no method comes out as a clear winner. Further, the overall performance on this task is low, indicating substantial need for future research. We release all versions of the data, i.e., both supervised and few-shot, for future research.
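A minimal sketch of sampling an N-way K-shot episode under the realistic assumptions the abstract describes, including none-of-the-above (NOTA) query candidates drawn from relations outside the episode; the toy corpus, ratios, and field names are assumptions.

```python
import random

def sample_episode(instances, n_way=5, k_shot=1, n_query=10, nota_ratio=0.5, seed=0):
    """Sample an N-way K-shot episode with none-of-the-above (NOTA) query candidates."""
    rng = random.Random(seed)
    by_rel = {}
    for inst in instances:
        by_rel.setdefault(inst["relation"], []).append(inst)
    episode_rels = rng.sample(sorted(by_rel), n_way)
    support = {r: rng.sample(by_rel[r], k_shot) for r in episode_rels}
    # Queries mix held-out instances of the episode relations with NOTA candidates
    # drawn from relations outside the episode.
    positives = [i for r in episode_rels for i in by_rel[r] if i not in support[r]]
    negatives = [i for r in by_rel if r not in episode_rels for i in by_rel[r]]
    n_nota = int(n_query * nota_ratio)
    queries = rng.sample(positives, n_query - n_nota) + rng.sample(negatives, n_nota)
    rng.shuffle(queries)
    return support, queries

# Toy corpus; real instances would carry the sentence and the marked entity pair.
relations = ["born_in", "works_for", "founded", "located_in", "spouse", "ceo_of", "part_of"]
toy = [{"text": f"sentence {i} ({rel})", "relation": rel} for rel in relations for i in range(20)]
support, queries = sample_episode(toy)
print(len(support), len(queries))  # 5 10
```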
https://arxiv.org/abs/2404.04445
Modern Large Language Models (LLMs) have showcased remarkable prowess in various tasks necessitating sophisticated cognitive behaviors. Nevertheless, a paradoxical performance discrepancy is observed, where these models underperform in seemingly elementary tasks like relation extraction and event extraction due to two issues in conventional evaluation: (1) the imprecision of existing evaluation metrics, which struggle to effectively gauge semantic consistency between model outputs and ground truth, and (2) the inherent incompleteness of evaluation benchmarks, primarily due to restrictive human annotation schemas, resulting in underestimated LLM performance. Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score. This method innovatively utilizes LLMs, fine-tuned on subjective question correction data, to refine the matching between model outputs and golden labels. Additionally, by incorporating a Natural Language Inference (NLI) model, SQC-Score enriches the golden labels, addressing benchmark incompleteness by acknowledging correct yet previously omitted answers. Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics. Utilizing SQC-Score, we conduct a comprehensive evaluation of state-of-the-art LLMs and provide insights for future research on information extraction. The dataset and associated code can be accessed at this https URL.
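The released SQC-Score relies on a fine-tuned LLM judge and an NLI model; the skeleton below only outlines that scoring flow with trivial stand-in predicates, so both stub functions, the score definitions, and the example are assumptions rather than the actual method.

```python
def semantically_matches(pred: str, gold: str) -> bool:
    """Placeholder for the fine-tuned LLM judge described in the abstract."""
    return pred.lower() == gold.lower()  # stub: exact match stands in for the LLM

def nli_entails(context: str, answer: str) -> bool:
    """Placeholder for the NLI model that admits correct-but-unannotated answers."""
    return answer.lower() in context.lower()  # stub: containment stands in for entailment

def sqc_style_score(context, gold_answers, pred_answers):
    # Enrich the gold set with predictions the NLI stub judges as supported by the context.
    enriched = list(gold_answers) + [p for p in pred_answers
                                     if nli_entails(context, p)
                                     and not any(semantically_matches(p, g) for g in gold_answers)]
    matched = sum(any(semantically_matches(p, g) for g in enriched) for p in pred_answers)
    precision = matched / len(pred_answers) if pred_answers else 0.0
    recall = sum(any(semantically_matches(g, p) for p in pred_answers) for g in gold_answers) / len(gold_answers)
    return precision, recall

context = "Marie Curie was born in Warsaw and later moved to Paris."
print(sqc_style_score(context, ["Warsaw"], ["Warsaw", "Paris"]))  # (1.0, 1.0)
```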
https://arxiv.org/abs/2404.03532
Recent works in relation extraction (RE) have achieved promising benchmark accuracy; however, our adversarial attack experiments show that these works excessively rely on entities, making their generalization capability questionable. To address this issue, we propose an adversarial training method specifically designed for RE. Our approach introduces both sequence- and token-level perturbations to the sample and uses a separate perturbation vocabulary to improve the search for entity and context perturbations. Furthermore, we introduce a probabilistic strategy for leaving clean tokens in the context during adversarial training. This strategy enables a larger attack budget for entities and coaxes the model to leverage relational patterns embedded in the context. Extensive experiments show that compared to various adversarial training methods, our method significantly improves both the accuracy and robustness of the model. Additionally, experiments on different data availability settings highlight the effectiveness of our method in low-resource scenarios. We also perform in-depth analyses of our proposed method and provide further hints. We will release our code at this https URL.
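A minimal sketch of the probabilistic clean-token idea: entity tokens remain eligible for perturbation while each context token is left clean with some probability. The probability value, function name, and example are assumptions, and the perturbation step itself is not shown.

```python
import random

def choose_perturbation_targets(tokens, entity_mask, p_clean_context=0.7, seed=0):
    """Pick token positions to perturb: entities always eligible, context kept clean with prob p."""
    rng = random.Random(seed)
    targets = []
    for idx, is_entity in enumerate(entity_mask):
        if is_entity:
            targets.append(idx)                    # larger attack budget on entity tokens
        elif rng.random() > p_clean_context:
            targets.append(idx)                    # only occasionally perturb context tokens
    return targets

tokens = ["Steve", "Jobs", "co-founded", "Apple", "in", "1976"]
entity_mask = [1, 1, 0, 1, 0, 0]
print(choose_perturbation_targets(tokens, entity_mask))
```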
https://arxiv.org/abs/2404.02931
Scene Graph Generation (SGG) is the challenging task of detecting objects and predicting relationships between them. Since DETR was developed, one-stage SGG models based on one-stage object detectors have been actively studied. However, complex modeling is used to predict the relationships between objects, and the inherent relationships between object queries learned in the multi-head self-attention of the object detector have been neglected. We propose a lightweight one-stage SGG model that extracts the relation graph from the various relationships learned in the multi-head self-attention layers of the DETR decoder. By fully utilizing these self-attention by-products, the relation graph can be extracted effectively with a shallow relation extraction head. Considering the dependency of the relation extraction task on the object detection task, we propose a novel relation smoothing technique that adjusts the relation label adaptively according to the quality of the detected objects. Through relation smoothing, the model is trained on a continuous curriculum that focuses on the object detection task at the beginning of training and shifts to multi-task learning as object detection performance gradually improves. Furthermore, we propose a connectivity prediction task, which predicts whether a relation exists between object pairs, as an auxiliary task for relation extraction. We demonstrate the effectiveness and efficiency of our method on the Visual Genome and Open Image V6 datasets. Our code is publicly available at this https URL.
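The abstract describes adjusting relation labels according to the quality of the detected objects. The arithmetic sketch below scales the relation target by a subject/object quality product and shifts the removed mass to a background class; the exact formula is an assumption for illustration, not the paper's definition.

```python
def smooth_relation_label(one_hot, subj_quality, obj_quality, background_index=0):
    """Down-weight a relation label when its subject/object detections are poor.

    subj_quality/obj_quality could be IoU-style scores in [0, 1]; the removed mass
    is shifted to the background (no-relation) class so the target still sums to 1.
    """
    quality = subj_quality * obj_quality
    smoothed = [quality * p for p in one_hot]
    smoothed[background_index] += 1.0 - quality
    return smoothed

# Early in training, detections are poor, so the relation target is mostly background.
print(smooth_relation_label([0, 0, 1, 0], subj_quality=0.3, obj_quality=0.4))
# roughly [0.88, 0.0, 0.12, 0.0]
```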
https://arxiv.org/abs/2404.02072
Information extraction (IE) is a fundamental area in natural language processing where prompting large language models (LLMs), even with in-context examples, cannot defeat small LMs tuned on very small IE datasets. We observe that IE tasks, such as named entity recognition and relation extraction, all focus on extracting important information, which can be formalized as a label-to-span matching. In this paper, we propose a novel framework, MetaIE, to build a small LM as a meta-model by learning to extract "important information", i.e., the meta-understanding of IE, so that this meta-model can be adapted to all kinds of IE tasks effectively and efficiently. Specifically, MetaIE obtains the small LM via symbolic distillation from an LLM following the label-to-span scheme. We construct the distillation dataset by sampling sentences from language model pre-training datasets (e.g., OpenWebText in our implementation) and prompting an LLM to identify the typed spans of "important information". We evaluate the meta-model under the few-shot adaptation setting. Extensive results on 13 datasets from 6 IE tasks confirm that MetaIE can offer a better starting point for few-shot tuning on IE datasets and outperforms other meta-models from (1) vanilla language model pre-training, (2) multi-IE-task pre-training with human annotations, and (3) single-IE-task symbolic distillation from an LLM. Moreover, we provide comprehensive analyses of MetaIE, such as the size of the distillation dataset, the meta-model architecture, and the size of the meta-model.
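A minimal sketch of turning LLM-identified "important information" spans into label-to-span training pairs for a small sequence model; the hard-coded spans stand in for an LLM's output, and the tag scheme, sentence, and function name are illustrative assumptions.

```python
def to_label_to_span_examples(sentence: str, typed_spans: list[tuple[str, str]]):
    """Convert (label, span) pairs into one BIO-tagged training example per label.

    typed_spans stands in for what an LLM would return when prompted to mark
    "important information"; here it is hard-coded for illustration.
    """
    tokens = sentence.split()
    examples = []
    for label, span in typed_spans:
        tags = ["O"] * len(tokens)
        span_tokens = span.split()
        for start in range(len(tokens) - len(span_tokens) + 1):
            if tokens[start:start + len(span_tokens)] == span_tokens:
                tags[start] = "B"
                for k in range(1, len(span_tokens)):
                    tags[start + k] = "I"
                break
        examples.append({"prompt_label": label, "tokens": tokens, "tags": tags})
    return examples

sent = "Barack Obama was born in Hawaii"
spans = [("person", "Barack Obama"), ("location", "Hawaii")]
for ex in to_label_to_span_examples(sent, spans):
    print(ex["prompt_label"], ex["tags"])
# person ['B', 'I', 'O', 'O', 'O', 'O']
# location ['O', 'O', 'O', 'O', 'O', 'B']
```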
https://arxiv.org/abs/2404.00457
Relation extraction is essential for extracting and understanding biographical information in the context of digital humanities and related subjects. There is a growing interest in the community to build datasets capable of training machine learning models to extract relationships. However, annotating such datasets can be expensive and time-consuming, in addition to being limited to English. This paper applies guided distant supervision to create a large biographical relationship extraction dataset for German. Our dataset, composed of more than 80,000 instances for nine relationship types, is the largest biographical German relationship extraction dataset. We also create a manually annotated dataset with 2000 instances to evaluate the models and release it together with the dataset compiled using guided distant supervision. We train several state-of-the-art machine learning models on the automatically created dataset and release them as well. Furthermore, we experiment with multilingual and cross-lingual experiments that could benefit many low-resource languages.
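As a reminder of what the distant-supervision core looks like (the "guided" filtering the paper adds is not shown), the sketch below labels any sentence that mentions both entities of a knowledge-base fact with that fact's relation; the toy facts and German sentences are invented.

```python
def distant_supervision(sentences, kb_facts):
    """Label a sentence with relation r if it mentions both entities of a KB fact (s, r, o)."""
    dataset = []
    for sent in sentences:
        for subj, rel, obj in kb_facts:
            if subj in sent and obj in sent:
                dataset.append({"text": sent, "subj": subj, "obj": obj, "relation": rel})
    return dataset

kb_facts = [("Johann Sebastian Bach", "birthplace", "Eisenach")]
sentences = [
    "Johann Sebastian Bach wurde 1685 in Eisenach geboren.",
    "Johann Sebastian Bach starb 1750 in Leipzig.",
]
print(distant_supervision(sentences, kb_facts))
# Only the first sentence is aligned, since it mentions both Bach and Eisenach.
```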
https://arxiv.org/abs/2403.17143
Relation extraction is a critical task in the field of natural language processing with numerous real-world applications. Existing research primarily focuses on monolingual relation extraction or cross-lingual enhancement for relation extraction. Yet, there remains a significant gap in understanding relation extraction in the mix-lingual (or code-switching) scenario, where individuals intermix contents from different languages within sentences, generating mix-lingual content. Due to the lack of a dedicated dataset, the effectiveness of existing relation extraction models in such a scenario is largely unexplored. To address this issue, we introduce MixRE, a novel task of relation extraction in the mix-lingual scenario, and construct the human-annotated dataset MixRED to support it. In addition to constructing the MixRED dataset, we evaluate both state-of-the-art supervised models and large language models (LLMs) on MixRED, revealing their respective advantages and limitations in the mix-lingual scenario. Furthermore, we delve into factors influencing model performance within the MixRE task and uncover promising directions for enhancing the performance of both supervised models and LLMs on this novel task.
https://arxiv.org/abs/2403.15696
The process of cyber mapping gives insights into relationships among financial entities and service providers. Centered around the outsourcing practices of companies within fund prospectuses in Germany, we introduce a dataset specifically designed for named entity recognition and relation extraction tasks. The labeling process on 948 sentences was carried out by three experts, yielding 5,969 annotations for four entity types (Outsourcing, Company, Location and Software) and 4,102 relation annotations (Outsourcing-Company, Company-Location). State-of-the-art deep learning models were trained to recognize entities and extract relations, showing promising initial results. An anonymized version of the dataset, along with guidelines and the code used for model training, is publicly available at this https URL.
https://arxiv.org/abs/2403.15322
Event temporal relation (TempRel) is a primary subject of the event relation extraction task. However, the inherent ambiguity of TempRel increases the difficulty of the task. With the rise of prompt engineering, it is important to design effective prompt templates and verbalizers to extract relevant knowledge. The traditional manually designed templates struggle to extract precise temporal knowledge. This paper introduces a novel retrieval-augmented TempRel extraction approach, leveraging knowledge retrieved from large language models (LLMs) to enhance prompt templates and verbalizers. Our method capitalizes on the diverse capabilities of various LLMs to generate a wide array of ideas for template and verbalizer design. Our proposed method fully exploits the potential of LLMs for generation tasks and contributes more knowledge to our design. Empirical evaluations across three widely recognized datasets demonstrate the efficacy of our method in improving the performance of event temporal relation extraction tasks.
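A minimal sketch of what a prompt template and verbalizer for TempRel classification look like; the template wording and label words below are illustrative placeholders, not the LLM-retrieved designs the paper proposes.

```python
# A cloze-style prompt template for event temporal relation classification.
TEMPLATE = "In the passage, the event '{e1}' happened [MASK] the event '{e2}'. Passage: {text}"

# The verbalizer maps label words that could fill [MASK] back to TempRel classes.
VERBALIZER = {
    "before": "BEFORE",
    "after": "AFTER",
    "while": "OVERLAP",
    "simultaneously": "SIMULTANEOUS",
}

def build_prompt(text: str, e1: str, e2: str) -> str:
    return TEMPLATE.format(e1=e1, e2=e2, text=text)

prompt = build_prompt("The alarm sounded and the crew evacuated the building.", "sounded", "evacuated")
print(prompt)
# A masked language model's score for each word in VERBALIZER would then decide the relation.
```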
https://arxiv.org/abs/2403.15273
Natural Language Processing (NLP) plays a pivotal role in the realm of Digital Humanities (DH) and serves as the cornerstone for advancing the structural analysis of historical and cultural heritage texts. This is particularly true for the domains of named entity recognition (NER) and relation extraction (RE). In our commitment to expediting research on ancient history and culture, we present the "Chinese Historical Information Extraction Corpus" (CHisIEC). CHisIEC is a meticulously curated dataset designed to develop and evaluate NER and RE tasks, offering a resource to facilitate research in the field. Covering data from 13 dynasties spanning over 1,830 years, CHisIEC epitomizes the extensive temporal range and text heterogeneity inherent in Chinese historical documents. The dataset encompasses four distinct entity types and twelve relation types, resulting in a carefully labeled dataset comprising 14,194 entities and 8,609 relations. To establish the robustness and versatility of our dataset, we have undertaken comprehensive experimentation involving models of various sizes and paradigms. Additionally, we have evaluated the capabilities of Large Language Models (LLMs) in the context of tasks related to ancient Chinese history. The dataset and code are available at this https URL.
https://arxiv.org/abs/2403.15088
Large Language Models (LLMs) have demonstrated exceptional abilities in comprehending and generating text, motivating numerous researchers to utilize them for Information Extraction (IE) purposes, including Relation Extraction (RE). Nonetheless, most existing methods are predominantly designed for Sentence-level Relation Extraction (SentRE) tasks, which typically encompass a restricted set of relations and triplet facts within a single sentence. Furthermore, certain approaches resort to treating relations as candidate choices integrated into prompt templates, leading to inefficient processing and suboptimal performance when tackling Document-Level Relation Extraction (DocRE) tasks, which entail handling multiple relations and triplet facts distributed across a given document and pose distinct challenges. To overcome these limitations, we introduce AutoRE, an end-to-end DocRE model that adopts a novel RE extraction paradigm named RHF (Relation-Head-Facts). Unlike existing approaches, AutoRE does not rely on the assumption of known relation options, making it more reflective of real-world scenarios. Additionally, we have developed an easily extensible RE framework using a Parameter-Efficient Fine-Tuning (PEFT) algorithm (QLoRA). Our experiments on the RE-DocRED dataset showcase AutoRE's performance, achieving state-of-the-art results and surpassing TAG by 10.03% and 9.03% on the dev and test sets, respectively.
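The abstract says the framework uses PEFT with QLoRA; below is a minimal QLoRA setup sketch with Hugging Face transformers/peft. The base checkpoint name and all hyperparameters are assumptions for illustration, not the AutoRE release configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # placeholder base model, not necessarily the one used by AutoRE

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: 4-bit quantized base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")

lora = LoraConfig(                          # illustrative adapter hyperparameters
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()          # only the LoRA adapters are trainable
```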
https://arxiv.org/abs/2403.14888
Events describe the state changes of entities. In a document, multiple events are connected by various relations (e.g., Coreference, Temporal, Causal, and Subevent). Therefore, obtaining the connections between events through Event-Event Relation Extraction (ERE) is critical for understanding natural language. There are two main problems in current ERE work: (a) only embeddings of the event triggers are used for event feature representation, ignoring event arguments (e.g., time, place, person, etc.) and their structure within the event; (b) the interconnection between relations (e.g., temporal and causal relations usually interact with each other) is ignored. To solve the above problems, this paper proposes a joint multiple-ERE framework called GraphERE based on Graph-enhanced Event Embeddings. First, we enrich the event embeddings with event argument and structure features by using static AMR graphs and IE graphs; then, to jointly extract multiple event relations, we use a Node Transformer and construct Task-specific Dynamic Event Graphs for each type of relation. Finally, we use a multi-task learning strategy to train the whole framework. Experimental results on the latest MAVEN-ERE dataset validate that GraphERE significantly outperforms existing methods. Further analyses indicate the effectiveness of the graph-enhanced event embeddings and the joint extraction strategy.
https://arxiv.org/abs/2403.12523
Biomedical event extraction is an information extraction task to obtain events from biomedical text, whose targets include the type, the trigger, and the respective arguments involved in an event. Traditional biomedical event extraction usually adopts a pipelined approach, which comprises trigger identification, argument role recognition, and finally event construction, either using specific rules or by machine learning. In this paper, we propose an n-ary relation extraction method based on the BERT pre-training model to construct Binding events, in order to capture the semantic information about an event's context and its participants. The experimental results show that our method achieves promising results on the GE11 and GE13 corpora of the BioNLP shared task, with F1 scores of 63.14% and 59.40%, respectively. This demonstrates that, by significantly improving the performance on Binding events, the overall performance of the pipelined event extraction approach matches or even exceeds that of current joint learning methods.
https://arxiv.org/abs/2403.12386
Despite the need for financial data on company activities in developing countries for development research and economic analysis, such data does not exist. In this project, we develop and evaluate two Natural Language Processing (NLP) based techniques to address this issue. First, we curate a custom dataset specific to the domain of financial text data on developing countries and explore multiple approaches for information extraction. We then explore a text-to-text approach with the transformer-based T5 model with the goal of undertaking simultaneous NER and relation extraction. We find that this model is able to learn the custom text-structure output data corresponding to the entities and their relations, resulting in an accuracy of 92.44%, a precision of 68.25% and a recall of 54.20% from our best T5 model on the combined task. Secondly, we explore an approach with sequential NER and relation extraction. For the NER, we run pre-trained and fine-tuned models using SpaCy, and we develop a custom relation extraction model using SpaCy's Dependency Parser output and some heuristics to determine entity relationships \cite{spacy}. We obtain an accuracy of 84.72%, a precision of 6.06% and a recall of 5.57% on this sequential task.
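As a rough illustration of a dependency-parse heuristic of the kind described (not the authors' actual rules), the sketch below links two named entities when both are governed by the same verb in spaCy's parse; the heuristic, example sentence, and pipeline choice are assumptions.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # requires the small English pipeline to be installed

def governing_verb(ent):
    """Walk up the dependency tree from the entity's root token to its nearest verb, if any."""
    tok = ent.root
    while tok.head is not tok:  # the sentence root is its own head
        if tok.head.pos_ in ("VERB", "AUX"):
            return tok.head
        tok = tok.head
    return tok if tok.pos_ in ("VERB", "AUX") else None

def heuristic_relations(text: str):
    """Relate two named entities whenever they are governed by the same verb."""
    doc = nlp(text)
    relations = []
    ents = list(doc.ents)
    for i, e1 in enumerate(ents):
        for e2 in ents[i + 1:]:
            v1, v2 = governing_verb(e1), governing_verb(e2)
            if v1 is not None and v1 is v2:
                relations.append((e1.text, v1.lemma_, e2.text))
    return relations

# Output depends on the statistical NER/parser, e.g. ('The World Bank', 'approve', 'Kenya').
print(heuristic_relations("The World Bank approved a $500 million loan to Kenya in 2020."))
```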
https://arxiv.org/abs/2403.09077
Natural language processing (NLP) practitioners are leveraging large language models (LLMs) to create structured datasets from semi-structured and unstructured data sources such as patents, papers, and theses, without having domain-specific knowledge. At the same time, ecological experts are searching for a variety of means to preserve biodiversity. To contribute to these efforts, we focused on endangered species and, through in-context learning, distilled knowledge from GPT-4. In effect, we created datasets for both named entity recognition (NER) and relation extraction (RE) via a two-stage process: 1) we generated synthetic data from GPT-4 for four classes of endangered species, and 2) humans verified the factual accuracy of the synthetic data, resulting in gold data. Our novel dataset contains a total of 3.6K sentences, evenly divided between 1.8K NER and 1.8K RE sentences. The constructed dataset was then used to fine-tune both general and domain-specific BERT variants, completing the knowledge distillation process from GPT-4 to BERT, because GPT-4 is resource-intensive. Experiments show that our knowledge transfer approach is effective at creating a NER model suitable for detecting endangered species from texts.
https://arxiv.org/abs/2403.15430
Syntactic parsing remains a critical tool for relation extraction and information extraction, especially in resource-scarce languages where LLMs are lacking. Yet in morphologically rich languages (MRLs), where parsers need to identify multiple lexical units in each token, existing systems suffer in latency and setup complexity. Some use a pipeline to peel away the layers: first segmentation, then morphology tagging, and then syntax parsing; however, errors in earlier layers are then propagated forward. Others use a joint architecture to evaluate all permutations at once; while this improves accuracy, it is notoriously slow. In contrast, and taking Hebrew as a test case, we present a new "flipped pipeline": decisions are made directly on the whole-token units by expert classifiers, each one dedicated to one specific task. The classifiers are independent of one another, and only at the end do we synthesize their predictions. This blazingly fast approach sets a new SOTA in Hebrew POS tagging and dependency parsing, while also reaching near-SOTA performance on other Hebrew NLP tasks. Because our architecture does not rely on any language-specific resources, it can serve as a model to develop similar parsers for other MRLs.
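A minimal sketch of the "flipped pipeline" idea: independent per-task classifiers operate directly on whole tokens, with no error propagation between them, and their outputs are synthesized only at the end. The classifiers here are trivial stubs and the tokens are invented; a real system would use trained models for each expert.

```python
# Each expert sees the same whole-token sequence and predicts independently.
def pos_expert(tokens):
    return ["NOUN" for _ in tokens]          # stub classifier; a real expert is a trained model

def segmentation_expert(tokens):
    return [[t] for t in tokens]             # stub: each token kept as a single lexical unit

def dependency_expert(tokens):
    return [0 for _ in range(len(tokens))]   # stub: every token attaches to the first token

def flipped_pipeline(tokens):
    """Run all experts in parallel (no error propagation), then synthesize one analysis."""
    pos = pos_expert(tokens)
    segments = segmentation_expert(tokens)
    heads = dependency_expert(tokens)
    return [
        {"token": t, "segments": s, "pos": p, "head": h}
        for t, s, p, h in zip(tokens, segments, pos, heads)
    ]

print(flipped_pipeline(["hakelev", "natsach", "bakfar"]))  # toy Hebrew-like tokens
```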
https://arxiv.org/abs/2403.06970