Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases, but it cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert-curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with an average F1 score of 86.30% for NER, 86.66% for RE of chemical conversion pairs, and 83.79% for RE of chemical conversion pairs and their linked enzymes. We combine the best-performing methods after fine-tuning with EnzChemRED to create an end-to-end pipeline for knowledge extraction from text and apply it to abstracts at PubMed scale to create a draft map of enzyme functions in the literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at this https URL.
https://arxiv.org/abs/2404.14209
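To make the NER side of such a dataset concrete, entity annotations with character offsets can be converted to token-level BIO tags for fine-tuning a pre-trained language model. This is a minimal, hypothetical sketch; the entity labels, example sentence, and whitespace tokenization are illustrative and not EnzChemRED's actual schema:

```python
def to_bio(tokens, spans):
    """Convert character-offset entity spans to one BIO tag per token.

    tokens: list of (text, start_offset) pairs; spans: (start, end, label).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        inside = False
        for i, (tok, off) in enumerate(tokens):
            if off >= start and off + len(tok) <= end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags

# Toy enzyme sentence, whitespace-tokenized with character offsets.
sent = "Hexokinase phosphorylates glucose"
tokens, off = [], 0
for tok in sent.split():
    tokens.append((tok, off))
    off += len(tok) + 1
spans = [(0, 10, "Protein"), (26, 33, "Chemical")]
print(to_bio(tokens, spans))  # ['B-Protein', 'O', 'B-Chemical']
```

The resulting tag sequence is the standard input format for token-classification fine-tuning.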
Information resources such as newspapers have produced unstructured text data in various languages related to the corona outbreak since December 2019. Analyzing such unstructured text is time-consuming unless it is represented in a structured format, so converting it into a structured format is crucial. An information extraction pipeline with the essential tasks of named entity tagging and relation extraction can be applied to these texts to accomplish this goal. This study proposes a data annotation pipeline to generate training data from corona news articles, covering both generic and domain-specific entities. Named entity recognition models are trained on this annotated corpus and then evaluated on test sentences manually annotated by domain experts. The code base and demonstration are available at this https URL.
https://arxiv.org/abs/2404.13439
Information Extraction (IE) is a transformative process that converts unstructured text data into a structured format by employing entity and relation extraction (RE) methodologies. The identification of the relation between a pair of entities plays a crucial role within this framework. Despite the existence of various techniques for relation extraction, their efficacy heavily relies on access to labeled data and substantial computational resources. To address these challenges, Large Language Models (LLMs) emerge as promising solutions; however, they might return hallucinated responses due to their own training data. To overcome these limitations, this work proposes Retrieval-Augmented Generation-based Relation Extraction (RAG4RE), offering a pathway to enhance the performance of relation extraction tasks. We evaluate the effectiveness of our RAG4RE approach utilizing different LLMs on established benchmarks, namely the TACRED, TACREV, Re-TACRED, and SemEval RE datasets. In particular, we leverage prominent LLMs including Flan T5, Llama2, and Mistral in our investigation. The results of our study demonstrate that our RAG4RE approach surpasses the performance of traditional RE approaches based solely on LLMs, which is particularly evident on the TACRED dataset and its variations. Furthermore, our approach exhibits remarkable performance compared to previous RE methodologies across both the TACRED and TACREV datasets, underscoring its efficacy and potential for advancing RE tasks in natural language processing.
https://arxiv.org/abs/2404.13397
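The retrieval step at the heart of a RAG-style RE system can be sketched in a few lines: find the most similar labeled training sentence and prepend it to the query prompt so the LLM's answer is grounded in a relevant example. The bag-of-words cosine similarity and the prompt template below are simplifying assumptions standing in for the paper's actual retriever and prompts:

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity (a toy stand-in for a dense retriever)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_prompt(query, train_set):
    """Ground the LLM with the most similar labeled training sentence."""
    best = max(train_set, key=lambda ex: cosine(query, ex["sentence"]))
    return (
        f'Example: "{best["sentence"]}" -> relation: {best["relation"]}\n'
        f'Sentence: "{query}" -> relation:'
    )
```

The returned string would then be sent to the LLM; swapping the toy similarity for embedding-based retrieval does not change the overall shape of the pipeline.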
Extracting structured information from unstructured text is critical for many downstream NLP applications and is traditionally achieved by closed information extraction (cIE). However, existing approaches for cIE suffer from two limitations: (i) they are often pipelines, which makes them prone to error propagation, and/or (ii) they are restricted to the sentence level, which prevents them from capturing long-range dependencies and results in expensive inference time. We address these limitations by proposing REXEL, a highly efficient and accurate model for the joint task of document-level cIE (DocIE). REXEL performs mention detection, entity typing, entity disambiguation, coreference resolution and document-level relation classification in a single forward pass to yield facts fully linked to a reference knowledge graph. It is on average 11 times faster than competitive existing approaches in a similar setting and performs competitively both when optimised for any of the individual subtasks and for a variety of combinations of different joint tasks, surpassing the baselines by an average of more than 6 F1 points. The combination of speed and accuracy makes REXEL an accurate, cost-efficient system for extracting structured information at web scale. We also release an extension of the DocRED dataset to enable benchmarking of future work on DocIE, which is available at this https URL.
https://arxiv.org/abs/2404.12788
Joint entity and relation extraction plays a pivotal role in various applications, notably in the construction of knowledge graphs. Despite recent progress, existing approaches often fall short in two key aspects: richness of representation and coherence in output structure. These models often rely on handcrafted heuristics for computing entity and relation representations, potentially leading to loss of crucial information. Furthermore, they disregard task and/or dataset-specific constraints, resulting in output structures that lack coherence. In our work, we introduce EnriCo, which mitigates these shortcomings. Firstly, to foster rich and expressive representations, our model leverages attention mechanisms that allow both entities and relations to dynamically determine the pertinent information required for accurate extraction. Secondly, we introduce a series of decoding algorithms designed to infer the highest-scoring solutions while adhering to task and dataset-specific constraints, thus promoting structured and coherent outputs. Our model demonstrates competitive performance compared to baselines when evaluated on joint IE datasets.
https://arxiv.org/abs/2404.12493
Information extraction (IE) is an important task in Natural Language Processing (NLP), involving the extraction of named entities and their relationships from unstructured text. In this paper, we propose a novel approach to this task by formulating it as graph structure learning (GSL). By formulating IE as GSL, we enhance the model's ability to dynamically refine and optimize the graph structure during the extraction process. This formulation allows for better interaction and structure-informed decisions for entity and relation prediction, in contrast to previous models that have separate or untied predictions for these tasks. When compared against state-of-the-art baselines on joint entity and relation extraction benchmarks, our model, GraphER, achieves competitive results.
https://arxiv.org/abs/2404.12491
Multi-modal relation extraction (MMRE) is a challenging task that aims to identify relations between entities in text by leveraging image information. Existing methods are limited in that they neglect that multiple entity pairs in one sentence share very similar contextual information (i.e., the same text and image), which increases the difficulty of the MMRE task. To address this limitation, we propose the Variational Multi-Modal Hypergraph Attention Network (VM-HAN) for multi-modal relation extraction. Specifically, we first construct a multi-modal hypergraph for each sentence with its corresponding image, to establish different high-order intra-/inter-modal correlations for different entity pairs in each sentence. We further design the Variational Hypergraph Attention Networks (V-HAN) to obtain representational diversity among different entity pairs using Gaussian distributions and to learn a better hypergraph structure via variational attention. VM-HAN achieves state-of-the-art performance on the multi-modal relation extraction task, outperforming existing methods in terms of accuracy and efficiency.
https://arxiv.org/abs/2404.12006
Document Understanding is an evolving field in Natural Language Processing (NLP). In particular, visual and spatial features are essential in addition to the raw text itself, and hence several multimodal models have been developed in the field of Visual Document Understanding (VDU). However, while research mainly focuses on Key Information Extraction (KIE), Relation Extraction (RE) between identified entities remains under-studied. For instance, RE is crucial for regrouping entities or obtaining a comprehensive hierarchy of the data in a document. In this paper, we present a model that, initialized from LayoutLMv3, can match or outperform the current state-of-the-art results in RE applied to Visually-Rich Documents (VRD) on the FUNSD and CORD datasets, without any specific pre-training and with fewer parameters. We also report an extensive ablation study performed on FUNSD, highlighting the great impact of certain features and modeling choices on performance.
https://arxiv.org/abs/2404.10848
Business Process Modeling projects often require formal process models as a central component. The high costs associated with the creation of such formal process models have motivated many different fields of research aimed at automated generation of process models from readily available data. These include process mining on event logs and generating business process models from natural language texts. Research in the latter field regularly faces the problem of limited data availability, hindering both evaluation and development of new techniques, especially learning-based ones. To overcome this data scarcity issue, in this paper we investigate the application of data augmentation for natural language text data. Data augmentation methods are well established in machine learning for creating new, synthetic data without human assistance. We find that many of these methods are applicable to the task of business process information extraction, improving the accuracy of extraction. Our study shows that data augmentation is an important component in enabling machine learning methods for the task of business process model generation from natural language text, where rule-based systems currently remain state of the art. Simple data augmentation techniques improved the $F_1$ score of mention extraction by 2.9 percentage points, and the $F_1$ of relation extraction by 4.5 percentage points. To better understand how data augmentation alters human-annotated texts, we analyze the resulting text, visualizing and discussing the properties of augmented textual data. We make all code and experiment results publicly available.
https://arxiv.org/abs/2404.07501
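To make the idea concrete, here is a minimal sketch of one simple augmentation operator of the kind evaluated in such studies: synonym replacement that protects annotated mention tokens, so that mention and relation labels remain valid on the augmented sentence. The synonym table and token indices are toy assumptions, not the paper's actual operators:

```python
import random

# Toy synonym table; a real setup would use WordNet or embedding neighbors.
SYNONYMS = {
    "sends": ["transmits", "forwards"],
    "checks": ["verifies", "reviews"],
}

def augment(tokens, protected, p=0.3, seed=0):
    """Replace non-protected tokens with a synonym with probability p.

    protected: indices of annotated mention tokens that must stay intact.
    """
    rng = random.Random(seed)
    out = []
    for i, tok in enumerate(tokens):
        if i not in protected and tok in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[tok]))
        else:
            out.append(tok)
    return out
```

Because mention spans are never rewritten, each augmented sentence can reuse the original annotations unchanged, which is what makes such operators cheap to apply at scale.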
Industry-wide nuclear power plant operating experience is a critical source of raw data for performing parameter estimations in reliability and risk models. Much operating experience information pertains to failure events and is stored as reports containing unstructured data, such as narratives. Event reports are essential for understanding how failures are initiated and propagated, including the numerous causal relations involved. Causal relation extraction using deep learning represents a significant frontier in the field of natural language processing (NLP), and is crucial since it enables the interpretation of intricate narratives and connections contained within vast amounts of written information. This paper proposes a hybrid framework for causality detection and extraction from nuclear licensee event reports (LERs). The main contributions include: (1) we compiled an LER corpus with 20,129 text samples for causality analysis; (2) developed an interactive tool for labeling cause-effect pairs; (3) built a deep-learning-based approach for causal relation detection; and (4) developed a knowledge-based cause-effect extraction approach.
https://arxiv.org/abs/2404.05656
This paper describes our participation in the Shared Task on Software Mentions Disambiguation (SOMD), with a focus on improving relation extraction in scholarly texts through Generative Language Models (GLMs) using single-choice question-answering. The methodology prioritises the use of in-context learning capabilities of GLMs to extract software-related entities and their descriptive attributes, such as distributive information. Our approach uses Retrieval-Augmented Generation (RAG) techniques and GLMs for Named Entity Recognition (NER) and Attributive NER to identify relationships between extracted software entities, providing a structured solution for analysing software citations in academic literature. The paper provides a detailed description of our approach, demonstrating how using GLMs in a single-choice QA paradigm can greatly enhance IE methodologies. Our participation in the SOMD shared task highlights the importance of precise software citation practices and showcases our system's ability to overcome the challenges of disambiguating and extracting relationships between software mentions. This sets the groundwork for future research and development in this field.
https://arxiv.org/abs/2404.05587
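The single-choice QA framing can be illustrated with a small prompt builder: the GLM is shown a sentence, an entity pair, and a fixed set of candidate relations, and must answer with exactly one option letter. The template below is a hypothetical sketch, not the system's actual prompt:

```python
def single_choice_prompt(sentence, entity_pair, candidate_relations):
    """Frame relation extraction as single-choice QA over candidate relations."""
    opts = "\n".join(
        f"({chr(65 + i)}) {rel}" for i, rel in enumerate(candidate_relations)
    )
    return (
        f"Sentence: {sentence}\n"
        f"What is the relation between {entity_pair[0]} and {entity_pair[1]}?\n"
        f"{opts}\n"
        f"Answer with a single letter:"
    )

print(single_choice_prompt(
    "We ran the analysis with SciPy version 1.11.",
    ("SciPy", "1.11"),
    ["version_of", "citation_of", "no_relation"],
))
```

Constraining the answer space to option letters makes the LLM's output trivially parseable, which is the main appeal of this paradigm for extraction tasks.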
In acupuncture therapy, the accurate location of acupoints is essential for its effectiveness. The advanced language understanding capabilities of large language models (LLMs) like Generative Pre-trained Transformers (GPT) present a significant opportunity for extracting relations related to acupoint locations from textual knowledge sources. This study aims to compare the performance of GPT with traditional deep learning models (Long Short-Term Memory (LSTM) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT)) in extracting acupoint-related location relations and assess the impact of pretraining and fine-tuning on GPT's performance. We utilized the World Health Organization Standard Acupuncture Point Locations in the Western Pacific Region (WHO Standard) as our corpus, which consists of descriptions of 361 acupoints. Five types of relations ('direction_of,' 'distance_of,' 'part_of,' 'near_acupoint,' and 'located_near') (n= 3,174) between acupoints were annotated. Five models were compared: BioBERT, LSTM, pre-trained GPT-3.5, and fine-tuned GPT-3.5, as well as pre-trained GPT-4. Performance metrics included micro-average exact match precision, recall, and F1 scores. Our results demonstrate that fine-tuned GPT-3.5 consistently outperformed other models in F1 scores across all relation types. Overall, it achieved the highest micro-average F1 score of 0.92. This study underscores the effectiveness of LLMs like GPT in extracting relations related to acupoint locations, with implications for accurately modeling acupuncture knowledge and promoting standard implementation in acupuncture training and practice. The findings also contribute to advancing informatics applications in traditional and complementary medicine, showcasing the potential of LLMs in natural language processing.
https://arxiv.org/abs/2404.05415
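The reported metric, micro-averaged exact-match precision, recall, and F1 over extracted relation triples, can be computed as follows. This is a generic sketch; the (head, relation, tail) triple format is illustrative:

```python
def micro_prf(gold, pred):
    """Micro-averaged exact-match precision, recall, and F1 over triples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # exact matches only
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Micro-averaging pools true positives across all relation types before computing the scores, so frequent relation types weigh more heavily than rare ones.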
Transforming a sentence into a two-dimensional (2D) representation (e.g., table filling) unfolds a semantic plane in which each element is a word-pair representation of the sentence that may denote a possible relation between two named entities. The 2D representation is effective in resolving overlapped relation instances. However, in related works the representation is transformed directly from the raw input, making weak use of prior knowledge, which is important for supporting the relation extraction task. In this paper, we propose a two-dimensional feature engineering method for the 2D sentence representation in relation extraction. Our proposed method is evaluated on three public datasets (ACE05 Chinese, ACE05 English, and SanWen) and achieves state-of-the-art performance. The results indicate that two-dimensional feature engineering can take advantage of a two-dimensional sentence representation and make full use of the prior knowledge of traditional feature engineering. Our code is publicly available at this https URL.
https://arxiv.org/abs/2404.04959
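The table-filling view can be sketched directly: a sentence of n tokens becomes an n x n label table whose cell (i, j) carries the relation whose head is token i and tail is token j. This is a simplified sketch; real table-filling schemes also encode entity boundaries and types in the table:

```python
def word_pair_table(tokens, relations):
    """Build an n x n label table for table-filling relation extraction.

    relations: list of (head_index, tail_index, label) triples.
    Cells without a relation hold the null label 'O'.
    """
    n = len(tokens)
    table = [["O"] * n for _ in range(n)]
    for head, tail, label in relations:
        table[head][tail] = label
    return table
```

Because two overlapping relation instances occupy different cells, the table resolves overlaps that a flat tagging scheme cannot, which is the property the abstract highlights.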
We introduce a meta dataset for few-shot relation extraction, which includes two datasets derived from existing supervised relation extraction datasets NYT29 (Takanobu et al., 2019; Nayak and Ng, 2020) and WIKIDATA (Sorokin and Gurevych, 2017) as well as a few-shot form of the TACRED dataset (Sabo et al., 2021). Importantly, all these few-shot datasets were generated under realistic assumptions such as: the test relations are different from any relations a model might have seen before, limited training data, and a preponderance of candidate relation mentions that do not correspond to any of the relations of interest. Using this large resource, we conduct a comprehensive evaluation of six recent few-shot relation extraction methods, and observe that no method comes out as a clear winner. Further, the overall performance on this task is low, indicating substantial need for future research. We release all versions of the data, i.e., both supervised and few-shot, for future research.
https://arxiv.org/abs/2404.04445
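Episode construction under these realistic assumptions can be sketched as follows: the support set holds K labeled examples per test relation, while the query pool keeps every remaining candidate, including the many mentions that match none of the target relations. This is an illustrative sketch of the sampling logic, not the dataset's actual generation code:

```python
import random

def sample_episode(instances, target_relations, k, seed=0):
    """Sample a K-shot support set per target relation.

    Everything else, including candidates labeled with none of the target
    relations (e.g. 'no_relation'), goes to the query pool, mirroring the
    preponderance of irrelevant candidate mentions in realistic few-shot RE.
    """
    rng = random.Random(seed)
    support, query = {}, []
    for rel in target_relations:
        pool = [x for x in instances if x["relation"] == rel]
        picked = rng.sample(pool, k)
        support[rel] = picked
        query.extend(x for x in pool if x not in picked)
    query.extend(x for x in instances if x["relation"] not in target_relations)
    return support, query
```

Keeping the off-target candidates in the query pool is what forces evaluated models to abstain, rather than always predict one of the K target relations.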
Modern Large Language Models (LLMs) have showcased remarkable prowess in various tasks necessitating sophisticated cognitive behaviors. Nevertheless, a paradoxical performance discrepancy is observed, where these models underperform in seemingly elementary tasks like relation extraction and event extraction due to two issues in conventional evaluation: (1) the imprecision of existing evaluation metrics, which struggle to effectively gauge semantic consistency between model outputs and ground truth, and (2) the inherent incompleteness of evaluation benchmarks, primarily due to restrictive human annotation schemas, resulting in underestimated LLM performance. Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score. This method innovatively utilizes LLMs, fine-tuned on subjective question correction data, to refine the matching between model outputs and gold labels. Additionally, by incorporating a Natural Language Inference (NLI) model, SQC-Score enriches the gold labels, addressing benchmark incompleteness by acknowledging correct yet previously omitted answers. Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics. Utilizing SQC-Score, we conduct a comprehensive evaluation of state-of-the-art LLMs and provide insights for future research on information extraction. The dataset and associated code can be accessed at this https URL.
https://arxiv.org/abs/2404.03532
Recent works in relation extraction (RE) have achieved promising benchmark accuracy; however, our adversarial attack experiments show that these works excessively rely on entities, making their generalization capability questionable. To address this issue, we propose an adversarial training method specifically designed for RE. Our approach introduces both sequence- and token-level perturbations to the sample and uses a separate perturbation vocabulary to improve the search for entity and context perturbations. Furthermore, we introduce a probabilistic strategy for leaving clean tokens in the context during adversarial training. This strategy enables a larger attack budget for entities and coaxes the model to leverage relational patterns embedded in the context. Extensive experiments show that compared to various adversarial training methods, our method significantly improves both the accuracy and robustness of the model. Additionally, experiments on different data availability settings highlight the effectiveness of our method in low-resource scenarios. We also perform in-depth analyses of our proposed method and offer further insights. We will release our code at this https URL.
https://arxiv.org/abs/2404.02931
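The clean-token strategy can be sketched as follows: entity tokens are always perturbed (the larger attack budget), while each context token is left clean with a fixed probability, so relational patterns in the context remain available to the model. The single-token perturbation vocabulary below is a placeholder for the method's separate perturbation vocabulary, and the whole function is an illustrative sketch rather than the paper's training procedure:

```python
import random

def perturb(tokens, entity_idx, keep_clean_p=0.7, seed=0):
    """Token-level perturbation with a probabilistic clean-context strategy.

    Entity tokens are always replaced (full attack budget); each context
    token stays clean with probability keep_clean_p.
    """
    rng = random.Random(seed)
    vocab = ["[PERT]"]  # stand-in for a learned perturbation vocabulary
    out = []
    for i, tok in enumerate(tokens):
        if i in entity_idx:
            out.append(rng.choice(vocab))      # always attack entities
        elif rng.random() < keep_clean_p:
            out.append(tok)                    # leave context token clean
        else:
            out.append(rng.choice(vocab))
    return out
```

With `keep_clean_p` high, most context survives each adversarial step, which is exactly what pushes the model to rely on context rather than memorized entities.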
Scene Graph Generation (SGG) is a challenging task of detecting objects and predicting relationships between them. Since DETR was developed, one-stage SGG models based on one-stage object detectors have been actively studied. However, complex modeling is used to predict the relationships between objects, and the inherent relationships between object queries learned in the multi-head self-attention of the object detector have been neglected. We propose a lightweight one-stage SGG model that extracts the relation graph from the various relationships learned in the multi-head self-attention layers of the DETR decoder. By fully utilizing these self-attention by-products, the relation graph can be extracted effectively with a shallow relation extraction head. Considering the dependency of the relation extraction task on the object detection task, we propose a novel relation smoothing technique that adjusts the relation label adaptively according to the quality of the detected objects. Through relation smoothing, the model is trained according to a continuous curriculum that focuses on the object detection task at the beginning of training and moves to multi-task learning as object detection performance gradually improves. Furthermore, we propose a connectivity prediction task that predicts whether a relation exists between object pairs as an auxiliary task of relation extraction. We demonstrate the effectiveness and efficiency of our method on the Visual Genome and Open Image V6 datasets. Our code is publicly available at this https URL.
https://arxiv.org/abs/2404.02072
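One plausible reading of relation smoothing in code: the one-hot relation target is scaled by the quality (e.g., IoU) of the detected subject and object boxes, with the remaining probability mass moved to the background class, so poorly detected pairs supervise the relation head only weakly while the detector is still learning. This is an illustrative sketch under that assumption, not the paper's exact formulation:

```python
def smooth_relation_label(label_onehot, subj_iou, obj_iou):
    """Adaptively soften a one-hot relation target by detection quality.

    label_onehot: distribution over relation classes; index 0 is the
    background/no-relation class. The pair's quality is the weaker of the
    two box IoUs.
    """
    q = min(subj_iou, obj_iou)
    smoothed = [v * q for v in label_onehot]
    smoothed[0] += 1.0 - q  # shift the remaining mass to background
    return smoothed
```

Early in training (low IoUs), targets collapse toward background and the loss is dominated by detection; as detection improves, the relation signal strengthens, yielding the continuous curriculum described above.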
Information extraction (IE) is a fundamental area in natural language processing where prompting large language models (LLMs), even with in-context examples, cannot defeat small LMs tuned on very small IE datasets. We observe that IE tasks, such as named entity recognition and relation extraction, all focus on extracting important information, which can be formalized as a label-to-span matching. In this paper, we propose a novel framework MetaIE to build a small LM as a meta-model by learning to extract "important information", i.e., the meta-understanding of IE, so that this meta-model can be adapted to all kinds of IE tasks effectively and efficiently. Specifically, MetaIE obtains the small LM via a symbolic distillation from an LLM following the label-to-span scheme. We construct the distillation dataset by sampling sentences from language model pre-training datasets (e.g., OpenWebText in our implementation) and prompting an LLM to identify the typed spans of "important information". We evaluate the meta-model under the few-shot adaptation setting. Extensive results on 13 datasets from 6 IE tasks confirm that MetaIE can offer a better starting point for few-shot tuning on IE datasets and outperform other meta-models from (1) vanilla language model pre-training, (2) multi-IE-task pre-training with human annotations, and (3) single-IE-task symbolic distillation from an LLM. Moreover, we provide comprehensive analyses of MetaIE, such as the size of the distillation dataset, the meta-model architecture, and the size of the meta-model.
https://arxiv.org/abs/2404.00457
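The label-to-span scheme underlying MetaIE can be sketched as a post-processing step: each (label, text) pair produced by the LLM is mapped back to a character span of the source sentence, and outputs that do not occur verbatim in the sentence are dropped. A simplified sketch (the labels and sentence are illustrative):

```python
def label_to_span(sentence, extractions):
    """Map (label, text) pairs to character spans of the source sentence.

    Uses the first verbatim occurrence of each text; pairs whose text is
    not found in the sentence are discarded.
    """
    out = []
    for label, text in extractions:
        start = sentence.find(text)
        if start != -1:
            out.append((label, start, start + len(text)))
    return out
```

Anchoring every extraction to a span of the input is what lets one small model serve NER, RE, and other IE tasks with a single output format.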
Relation extraction is essential for extracting and understanding biographical information in the context of digital humanities and related subjects. There is a growing interest in the community to build datasets capable of training machine learning models to extract relationships. However, annotating such datasets can be expensive and time-consuming, in addition to being limited to English. This paper applies guided distant supervision to create a large biographical relationship extraction dataset for German. Our dataset, composed of more than 80,000 instances for nine relationship types, is the largest biographical German relationship extraction dataset. We also create a manually annotated dataset with 2000 instances to evaluate the models and release it together with the dataset compiled using guided distant supervision. We train several state-of-the-art machine learning models on the automatically created dataset and release them as well. Furthermore, we experiment with multilingual and cross-lingual experiments that could benefit many low-resource languages.
https://arxiv.org/abs/2403.17143
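Guided distant supervision can be sketched as ordinary distant supervision plus a lexical guidance filter: a sentence is labeled with a knowledge-base relation only if it contains both entities of a KB pair and one of the relation's trigger words. The entity pairs and trigger lists below are toy assumptions, not the paper's actual resources:

```python
def distant_label(sentences, kb_pairs, trigger_words):
    """Distantly label sentences, guided by per-relation trigger words.

    kb_pairs: (entity1, entity2, relation) triples from a knowledge base.
    A sentence receives a label only if it mentions both entities AND a
    trigger word for that relation (the 'guided' filter that cuts noise).
    """
    labeled = []
    for sent in sentences:
        low = sent.lower()
        for e1, e2, rel in kb_pairs:
            if (e1.lower() in low and e2.lower() in low
                    and any(t in low for t in trigger_words.get(rel, []))):
                labeled.append((sent, e1, e2, rel))
    return labeled
```

Without the trigger filter, every co-occurrence of the pair would be labeled, which is the classic noise problem of plain distant supervision.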
Relation extraction is a critical task in the field of natural language processing with numerous real-world applications. Existing research primarily focuses on monolingual relation extraction or cross-lingual enhancement for relation extraction. Yet, there remains a significant gap in understanding relation extraction in the mix-lingual (or code-switching) scenario, where individuals intermix contents from different languages within sentences, generating mix-lingual content. Due to the lack of a dedicated dataset, the effectiveness of existing relation extraction models in such a scenario is largely unexplored. To address this issue, we introduce a novel task of considering relation extraction in the mix-lingual scenario called MixRE and constructing the human-annotated dataset MixRED to support this task. In addition to constructing the MixRED dataset, we evaluate both state-of-the-art supervised models and large language models (LLMs) on MixRED, revealing their respective advantages and limitations in the mix-lingual scenario. Furthermore, we delve into factors influencing model performance within the MixRE task and uncover promising directions for enhancing the performance of both supervised models and LLMs in this novel task.
https://arxiv.org/abs/2403.15696