Despite the need for financial data on company activities in developing countries for development research and economic analysis, such data largely do not exist. In this project, we develop and evaluate two Natural Language Processing (NLP) based techniques to address this gap. First, we curate a custom dataset of financial text data on developing countries and explore multiple approaches to information extraction. We then explore a text-to-text approach with the transformer-based T5 model, with the goal of performing simultaneous NER and relation extraction. We find that this model is able to learn the custom structured text output encoding the entities and their relations: our best T5 model achieves an accuracy of 92.44\%, a precision of 68.25\%, and a recall of 54.20\% on the combined task. Second, we explore an approach with sequential NER and relation extraction. For the NER step, we run pre-trained and fine-tuned models using SpaCy, and we develop a custom relation extraction model that uses SpaCy's dependency parser output and a set of heuristics to determine entity relationships \cite{spacy}. We obtain an accuracy of 84.72\%, a precision of 6.06\%, and a recall of 5.57\% on this sequential task.
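The dependency-parser heuristic step of the sequential pipeline can be sketched in miniature. The parse below is hand-coded, and the rule (linking two entities through their lowest common verb ancestor) is an illustrative assumption, not the authors' exact heuristic; in practice the heads and POS tags would be read off a SpaCy parse.

```python
def path_to_root(heads, i):
    """Token indices from i up to the root (a root's head is itself)."""
    path = [i]
    while heads[path[-1]] != path[-1]:
        path.append(heads[path[-1]])
    return path

def connecting_predicate(tokens, heads, pos, e1, e2):
    """Return the first shared ancestor of e1 and e2 that is a verb, else None."""
    ancestors_e2 = set(path_to_root(heads, e2))
    for i in path_to_root(heads, e1):
        if i in ancestors_e2 and pos[i] == "VERB":
            return tokens[i]
    return None

# "Acme invested 5M in the Lagos plant" -- parse hand-coded for illustration.
tokens = ["Acme", "invested", "5M", "in", "the", "Lagos", "plant"]
heads  = [1, 1, 1, 1, 6, 6, 3]   # each token's head index
pos    = ["PROPN", "VERB", "NUM", "ADP", "DET", "PROPN", "NOUN"]
print(connecting_predicate(tokens, heads, pos, 0, 6))  # -> invested
```

With SpaCy, `heads` and `pos` would come from `token.head.i` and `token.pos_` on a parsed `Doc`.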
https://arxiv.org/abs/2403.09077
Syntactic parsing remains a critical tool for relation extraction and information extraction, especially in resource-scarce languages where LLMs are lacking. Yet in morphologically rich languages (MRLs), where parsers need to identify multiple lexical units in each token, existing systems suffer in latency and setup complexity. Some use a pipeline to peel away the layers: first segmentation, then morphology tagging, and then syntax parsing; however, errors in earlier layers are then propagated forward. Others use a joint architecture to evaluate all permutations at once; while this improves accuracy, it is notoriously slow. In contrast, and taking Hebrew as a test case, we present a new "flipped pipeline": decisions are made directly on the whole-token units by expert classifiers, each one dedicated to one specific task. The classifiers are independent of one another, and only at the end do we synthesize their predictions. This blazingly fast approach sets a new SOTA in Hebrew POS tagging and dependency parsing, while also reaching near-SOTA performance on other Hebrew NLP tasks. Because our architecture does not rely on any language-specific resources, it can serve as a model to develop similar parsers for other MRLs.
https://arxiv.org/abs/2403.06970
Because protein-protein interactions (PPIs) are crucial to understanding living systems, harvesting these data is essential to probe disease development and discern gene/protein functions and biological processes. Some curated datasets contain PPI data derived from the literature and other sources (e.g., IntAct, BioGrid, DIP, and HPRD). However, they are far from exhaustive, and their maintenance is a labor-intensive process. On the other hand, machine learning methods for automating PPI knowledge extraction from the scientific literature have been limited by a shortage of appropriate annotated data. This work presents a unified, multi-source PPI corpus with vetted interaction definitions augmented by binary interaction type labels, together with a Transformer-based deep learning method that exploits entities' relational context information to build relation representations and improve relation classification performance. The model's performance is evaluated on four widely studied biomedical relation extraction datasets, as well as this work's target PPI datasets, to assess the effectiveness of the representation for relation extraction across varied data. Results show the model outperforms prior state-of-the-art models. The code and data are available at: this https URL
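A common way to give a Transformer "relational context", and one plausible reading of the representation used here (the paper's exact scheme may differ), is to wrap each entity mention in marker tokens before encoding, then pool the marker positions as the relation representation:

```python
def insert_markers(tokens, head_span, tail_span):
    """Wrap [head_start, head_end) and [tail_start, tail_end) in marker tokens."""
    (hs, he), (ts, te) = head_span, tail_span
    out = []
    for i, tok in enumerate(tokens):
        if i == hs:
            out.append("[E1]")
        if i == ts:
            out.append("[E2]")
        out.append(tok)
        if i == he - 1:
            out.append("[/E1]")
        if i == te - 1:
            out.append("[/E2]")
    return out

tokens = "BRCA1 interacts with BARD1".split()
print(" ".join(insert_markers(tokens, (0, 1), (3, 4))))
# -> [E1] BRCA1 [/E1] interacts with [E2] BARD1 [/E2]
```

The marker tokens are added to the tokenizer's vocabulary, and the encoder states at the `[E1]`/`[E2]` positions feed the relation classifier.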
https://arxiv.org/abs/2403.05602
The field of biomedical research has witnessed a significant increase in the accumulation of vast amounts of textual data from various sources such as the scientific literature, electronic health records, clinical trial reports, and social media. However, manually processing and analyzing these extensive and complex resources is time-consuming and inefficient. To address this challenge, biomedical text mining, also known as biomedical natural language processing, has garnered great attention. Community challenge evaluation competitions have played an important role in promoting technology innovation and interdisciplinary collaboration in biomedical text mining research. These challenges provide platforms for researchers to develop state-of-the-art solutions for data mining and information processing in biomedical research. In this article, we review recent advances in community challenges specific to Chinese biomedical text mining. First, we collect information on these evaluation tasks, such as data sources and task types. Second, we conduct a systematic summary and comparative analysis covering named entity recognition, entity normalization, attribute extraction, relation extraction, event extraction, text classification, text similarity, knowledge graph construction, question answering, text generation, and large language model evaluation. Then, we summarize the potential clinical applications of these community challenge tasks from a translational informatics perspective. Finally, we discuss the contributions and limitations of these community challenges, while highlighting future directions in the era of large language models.
https://arxiv.org/abs/2403.04261
Continual Relation Extraction (CRE) aims to incrementally learn relation knowledge from a non-stationary stream of data. Since the introduction of new relational tasks can overshadow previously learned information, catastrophic forgetting becomes a significant challenge in this domain. Current replay-based training paradigms prioritize all data uniformly and train on memory samples over multiple rounds, which results in overfitting to old tasks and a pronounced bias towards new tasks because of the imbalance of the replay set. To handle this problem, we introduce the DecouPled CRE (DP-CRE) framework, which decouples the process of prior-information preservation from new-knowledge acquisition. This framework examines alterations in the embedding space as new relation classes emerge, distinctly managing the preservation and acquisition of knowledge. Extensive experiments show that DP-CRE significantly outperforms other CRE baselines across two datasets.
https://arxiv.org/abs/2403.02718
Document-level Relation Extraction (DocRE) aims to identify relation labels between entities within a single document. It requires handling multiple sentences and reasoning over them. State-of-the-art DocRE methods use a graph structure to connect entities across the document and capture dependency syntax information. However, this is insufficient to fully exploit the rich syntax information in the document. In this work, we propose to fuse constituency and dependency syntax into DocRE. Our method uses constituency syntax to aggregate whole-sentence information and to select the instructive sentences for the target entity pairs. It exploits dependency syntax in a graph structure enhanced with constituency information, and chooses the path between entity pairs based on the dependency graph. Experimental results on datasets from various domains demonstrate the effectiveness of the proposed method. The code is publicly available at this url.
https://arxiv.org/abs/2403.01886
Objectives: Our objective is to create an end-to-end system called AutoRD, which automates the extraction of information about rare diseases from clinical text. We conduct various tests to evaluate AutoRD's performance and highlight its strengths and limitations in this paper. Materials and Methods: Our system, AutoRD, is a software pipeline involving data preprocessing, entity extraction, relation extraction, entity calibration, and knowledge graph construction. We implement it using large language models and medical knowledge graphs developed from open-source medical ontologies. We quantitatively evaluate the system on entity extraction, relation extraction, and knowledge graph construction. Results: AutoRD achieves an overall F1 score of 47.3%, a 14.4% improvement over the base LLM. In detail, AutoRD achieves an overall entity extraction F1 score of 56.1% (rare_disease: 83.5%, disease: 35.8%, symptom_and_sign: 46.1%, anaphor: 67.5%) and an overall relation extraction F1 score of 38.6% (produces: 34.7%, increases_risk_of: 12.4%, is_a: 37.4%, is_acronym: 44.1%, is_synonym: 16.3%, anaphora: 57.5%). Our qualitative experiment also demonstrates commendable performance in constructing the knowledge graph. Discussion: AutoRD demonstrates the potential of LLM applications in rare disease detection. The improvement is attributed to several design choices, including the integration of ontology-enhanced LLMs. Conclusion: AutoRD is an automated end-to-end system for extracting rare disease information from text to build knowledge graphs. It uses ontology-enhanced LLMs for a robust medical knowledge base. Its superior performance is validated by experimental evaluations, demonstrating the potential of LLMs in healthcare.
https://arxiv.org/abs/2403.00953
The goal of few-shot relation extraction is to predict relations between named entities in a sentence when only a few labeled instances are available for training. Existing few-shot relation extraction methods focus on uni-modal information such as text only, which reduces performance when there is no clear context between the named entities described in the text. We propose a multi-modal few-shot relation extraction model (MFS-HVE) that leverages both textual and visual semantic information to jointly learn a multi-modal representation. MFS-HVE includes semantic feature extractors and multi-modal fusion components. The semantic feature extractors extract both textual and visual features; the visual features include global image features and local object features within the image. The multi-modal fusion unit integrates information from the various modalities using image-guided attention, object-guided attention, and hybrid feature attention to fully capture the semantic interaction between visual regions of images and relevant texts. Extensive experiments conducted on two public datasets demonstrate that semantic visual information significantly improves the performance of few-shot relation prediction.
https://arxiv.org/abs/2403.00724
This paper investigates the use of large language models (LLMs) for extracting sample lists of polymer nanocomposites (PNCs) from full-length materials science research papers. The challenge lies in the complex nature of PNC samples, which have numerous attributes scattered throughout the text. The complexity of annotating detailed information on PNCs limits the availability of data, making conventional document-level relation extraction techniques impractical due to the challenge in creating comprehensive named entity span annotations. To address this, we introduce a new benchmark and an evaluation technique for this task and explore different prompting strategies in a zero-shot manner. We also incorporate self-consistency to improve the performance. Our findings show that even advanced LLMs struggle to extract all of the samples from an article. Finally, we analyze the errors encountered in this process, categorizing them into three main challenges, and discuss potential strategies for future research to overcome them.
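The self-consistency step mentioned above reduces to sampling several completions and keeping the majority answer. A minimal sketch with a stubbed sampler; a real run would sample an LLM at temperature > 0, and the answer strings here are illustrative:

```python
from collections import Counter

def self_consistent(sample_fn, n=5):
    """Draw n answers and return the most common one."""
    votes = Counter(sample_fn() for _ in range(n))
    return votes.most_common(1)[0][0]

# Stub sampler: three of five "LLM samples" agree on the same sample list.
answers = iter(["PNC-1;PNC-2", "PNC-1;PNC-2", "PNC-1", "PNC-1;PNC-2", "PNC-2"])
print(self_consistent(lambda: next(answers)))  # -> PNC-1;PNC-2
```

Majority voting over exact strings assumes the outputs are normalized first; in practice extracted sample lists would need canonicalization before comparison.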
https://arxiv.org/abs/2403.00260
In recent years, large language models have achieved state-of-the-art performance across various NLP tasks. However, investigations have shown that these models tend to rely on shortcut features, leading to inaccurate predictions and causing the models to be unreliable at generalizing to out-of-distribution (OOD) samples. For instance, in the context of relation extraction (RE), we would expect a model to identify the same relation independently of the entities involved in it. Consider the sentence "Leonardo da Vinci painted the Mona Lisa" expressing the created(Leonardo_da_Vinci, Mona_Lisa) relation. If we substitute "Leonardo da Vinci" with "Barack Obama", the sentence still expresses the created relation, and a robust model should detect the same relation in both cases. In this work, we describe several semantically motivated strategies to generate adversarial examples by replacing entity mentions, and investigate how state-of-the-art RE models perform under pressure. Our analyses show that the performance of these models deteriorates significantly on the modified datasets (avg. -48.5% in F1), indicating that these models rely to a great extent on shortcuts, such as the surface forms (or patterns therein) of entities, without making full use of the information present in the sentences.
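The substitution strategy is easy to reproduce: replace one entity mention with another plausible entity of the same type and keep the gold relation label. A minimal sketch; the replacement entity is illustrative, not drawn from the paper's inventories:

```python
def substitute_entity(sentence, mention, replacement):
    """Swap one entity mention; the expressed relation should not change."""
    assert mention in sentence, f"{mention!r} not found"
    return sentence.replace(mention, replacement, 1)

sent = "Leonardo da Vinci painted the Mona Lisa"
adv = substitute_entity(sent, "Leonardo da Vinci", "Barack Obama")
print(adv)  # -> Barack Obama painted the Mona Lisa
# The gold label stays created(<head>, Mona_Lisa); a robust RE model should
# predict "created" for both the original and the adversarial sentence.
```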
https://arxiv.org/abs/2402.19076
This paper proposes a novel named entity recognition (NER) technique specifically tailored to open-source software systems. Our approach aims to address the scarcity of annotated software data by employing a comprehensive two-step distantly supervised annotation process. This process strategically leverages language heuristics, unique lookup tables, external knowledge sources, and an active learning approach. By harnessing these techniques, we not only enhance model performance but also effectively mitigate the limitations associated with cost and the scarcity of expert annotators. Notably, our framework outperforms state-of-the-art LLMs by a substantial margin. We also show the effectiveness of NER in the downstream task of relation extraction.
https://arxiv.org/abs/2402.16159
Relational triple extraction is a fundamental task in the field of information extraction, and a promising framework based on table filling has recently gained attention as a potential baseline for entity relation extraction. However, inherent shortcomings such as redundant information and incomplete triple recognition remain problematic. To address these challenges, we propose an Implicit Perspective for relational triple Extraction based on Diffusion model (IPED), an innovative approach for extracting relational triples. Our classifier-free solution adopts an implicit strategy using block coverage to complete the tables, avoiding the limitations of explicit tagging methods. Additionally, we introduce a generative model structure, the block-denoising diffusion model, to collaborate with our implicit perspective and effectively circumvent redundant information disruptions. Experimental results on two popular datasets demonstrate that IPED achieves state-of-the-art performance while gaining superior inference speed and low computational complexity. To support future research, we have made our source code publicly available online.
https://arxiv.org/abs/2403.00808
Continual Few-shot Relation Extraction (CFRE) is a practical problem that requires the model to continually learn novel relations from only a few labeled training examples while avoiding forgetting old ones. The primary challenges are catastrophic forgetting and overfitting. This paper harnesses prompt learning to explore the implicit capabilities of pre-trained language models to address these two challenges, thereby making language models better continual few-shot relation extractors. Specifically, we propose a Contrastive Prompt Learning framework, which designs prompt representations to acquire more generalized knowledge that can be easily adapted to both old and new categories, and uses margin-based contrastive learning to focus more on hard samples, thereby alleviating catastrophic forgetting and overfitting. To further remedy overfitting in low-resource scenarios, we introduce an effective memory augmentation strategy that employs well-crafted prompts to guide ChatGPT in generating diverse samples. Extensive experiments demonstrate that our method outperforms state-of-the-art methods by a large margin and significantly mitigates catastrophic forgetting and overfitting in low-resource scenarios.
https://arxiv.org/abs/2402.15713
Standard English and Malaysian English exhibit notable differences, posing challenges for natural language processing (NLP) tasks on Malaysian English. Unfortunately, most existing datasets are based on standard English and are therefore inadequate for improving NLP tasks on Malaysian English. An experiment using state-of-the-art Named Entity Recognition (NER) solutions on Malaysian English news articles highlights that they cannot handle morphosyntactic variations in Malaysian English. To the best of our knowledge, there is no annotated dataset available to improve the models. To address these issues, we constructed a Malaysian English News (MEN) dataset, which contains 200 news articles manually annotated with entities and relations. We then fine-tuned the spaCy NER tool and validated that a dataset tailor-made for Malaysian English can significantly improve NER performance on Malaysian English. This paper presents our effort in data acquisition, the annotation methodology, and a thorough analysis of the annotated dataset. To validate the quality of the annotation, inter-annotator agreement was measured, followed by adjudication of disagreements by a subject-matter expert. Upon completion of these tasks, we developed a dataset with 6,061 entities and 3,268 relation instances. Finally, we discuss the spaCy fine-tuning setup and analyze the NER performance. This unique dataset will contribute significantly to the advancement of NLP research on Malaysian English, allowing researchers to accelerate their progress, particularly in NER and relation extraction. The dataset and annotation guidelines have been published on GitHub.
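spaCy fine-tuning starts from offset-annotated examples of the form `(text, {"entities": [(start, end, label)]})`. A minimal sketch of that format with a sanity check on the character offsets; the sentence and labels are illustrative, not drawn from the MEN dataset:

```python
TRAIN_DATA = [
    ("Proton launched the X50 in Shah Alam",
     {"entities": [(0, 6, "ORG"), (27, 36, "LOC")]}),
]

def check_offsets(example):
    """Each (start, end) span must slice to a non-empty, unpadded mention."""
    text, ann = example
    for start, end, label in ann["entities"]:
        span = text[start:end]
        assert span and span == span.strip(), (span, label)
    return [text[s:e] for s, e, _ in ann["entities"]]

print(check_offsets(TRAIN_DATA[0]))  # -> ['Proton', 'Shah Alam']
```

In spaCy v3, such tuples are converted to `Example` objects (via `Example.from_dict`) and serialized to a `DocBin` for the `spacy train` config-driven pipeline.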
https://arxiv.org/abs/2402.14521
Recently, large language models (LLMs) have been successful in relation extraction (RE) tasks, especially in few-shot learning. Long-tailed data is an important problem in RE, yet little attention has so far been paid to it by LLM-based approaches. Therefore, in this paper, we propose SLCoLM, a model collaboration framework, to mitigate the long-tail data problem. In our framework, we use a ``\textit{Training-Guide-Predict}'' strategy to combine the strengths of pre-trained language models (PLMs) and LLMs, where a task-specific PLM acts as a tutor, transfers task knowledge to the LLM, and guides the LLM in performing RE tasks. Our experiments on an RE dataset rich in relation types show that the approach facilitates RE for long-tail relation types.
https://arxiv.org/abs/2402.14373
Relation extraction is an efficient way of mining the extraordinary wealth of human knowledge on the Web. Existing methods rely on domain-specific training data or produce noisy outputs. We focus here on extracting targeted relations from semi-structured web pages given only a short description of the relation. We present GraphScholarBERT, an open-domain information extraction method based on a joint graph and language model structure. GraphScholarBERT can generalize to previously unseen domains without additional data or training and produces only clean extraction results matched to the search keyword. Experiments show that GraphScholarBERT can improve extraction F1 scores by as much as 34.8\% compared to previous work in a zero-shot domain and zero-shot website setting.
https://arxiv.org/abs/2402.14129
Cutting-edge techniques developed in the general NLP domain are often subsequently applied to the high-value, data-rich biomedical domain. The past few years have seen generative language models (LMs), instruction finetuning, and few-shot learning become foci of NLP research. As such, generative LMs pretrained on biomedical corpora have proliferated, and biomedical instruction finetuning has been attempted as well, all with the hope that domain specificity improves performance on downstream tasks. Given the nontrivial effort in training such models, we investigate what, if any, benefits they have in the key biomedical NLP task of relation extraction. Specifically, we address two questions: (1) Do LMs trained on biomedical corpora outperform those trained on general-domain corpora? (2) Do models instruction-finetuned on biomedical datasets outperform those finetuned on assorted datasets or those simply pretrained? We tackle these questions using existing LMs, testing across four datasets. In a surprising result, general-domain models typically outperformed biomedical-domain models. However, biomedical instruction finetuning improved performance to a similar degree as general instruction finetuning, despite using orders of magnitude fewer instructions. Our findings suggest it may be more fruitful to focus research effort on larger-scale biomedical instruction finetuning of general LMs than on building domain-specific biomedical LMs.
https://arxiv.org/abs/2402.13470
Large language models (LLMs) have demonstrated impressive abilities in generating unstructured natural language according to instructions. However, their performance can be inconsistent when tasked with producing text that adheres to specific structured formats, which is crucial in applications like named entity recognition (NER) or relation extraction (RE). To address this issue, this paper introduces an efficient method, G&O, to enhance their structured text generation capabilities. It breaks the generation into a two-step pipeline: initially, LLMs generate answers in natural language as intermediate responses. Subsequently, LLMs are asked to organize the output into the desired structure, using the intermediate responses as context. G&O effectively separates the generation of content from the structuring process, reducing the pressure of completing two orthogonal tasks simultaneously. Tested on zero-shot NER and RE, the results indicate a significant improvement in LLM performance with minimal additional efforts. This straightforward and adaptable prompting technique can also be combined with other strategies, like self-consistency, to further elevate LLM capabilities in various structured text generation tasks.
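G&O's two-step pipeline in miniature: first ask for a free-form answer, then ask the model to reorganize it into the target structure with the intermediate text as context. The `fake_llm` stub and the prompt wording below stand in for a real chat-completion call and are assumptions, not G&O's actual templates:

```python
def generate_then_organize(llm, text):
    """Step 1: free-form answer. Step 2: reorganize it into the target format."""
    draft = llm(f"List the person entities mentioned in: {text}")
    return llm(
        f"Given the text: {text}\n"
        f"And the draft answer: {draft}\n"
        "Reformat the answer as a JSON list of strings."
    )

# Stub LLM keyed on which step is being run.
def fake_llm(prompt):
    if prompt.startswith("List"):
        return "The people mentioned are Marie Curie and Niels Bohr."
    return '["Marie Curie", "Niels Bohr"]'

print(generate_then_organize(fake_llm, "Curie met Bohr in 1927."))
# -> ["Marie Curie", "Niels Bohr"]
```

The point of the split is that each call does one thing: the first prompt never mentions formatting, and the second never asks the model to find new entities.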
https://arxiv.org/abs/2402.13364
In this study, we investigate the potential of GPT-4 and its advanced iteration, GPT-4 Turbo, in autonomously developing a detailed entity type taxonomy. Our objective is to construct a comprehensive taxonomy, starting from a broad classification of entity types - including objects, time, locations, organizations, events, actions, and subjects - similar to existing manually curated taxonomies. This classification is then progressively refined through iterative prompting techniques, leveraging GPT-4's internal knowledge base. The result is an extensive taxonomy comprising over 5000 nuanced entity types, which demonstrates remarkable quality upon subjective evaluation. We employed a straightforward yet effective prompting strategy, enabling the taxonomy to be dynamically expanded. The practical applications of this detailed taxonomy are diverse and significant. It facilitates the creation of new, more intricate branches through pattern-based combinations and notably enhances information extraction tasks, such as relation extraction and event argument extraction. Our methodology not only introduces an innovative approach to taxonomy creation but also opens new avenues for applying such taxonomies in various computational linguistics and AI-related fields.
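The iterative refinement loop reduces to prompting for subtypes of each node, appending the new nodes, and repeating. A sketch with a stubbed subtype oracle in place of GPT-4; the node names are illustrative:

```python
def expand_taxonomy(root, subtypes_of, depth=2):
    """Breadth-first expansion: each round asks the oracle for subtypes."""
    taxonomy = {root: []}
    frontier = [root]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            children = subtypes_of(node)
            taxonomy[node] = children
            for child in children:
                taxonomy.setdefault(child, [])
            next_frontier.extend(children)
        frontier = next_frontier
    return taxonomy

# Stub oracle; the real system would prompt GPT-4 for "subtypes of <node>".
stub = {"event": ["sports event", "election"], "sports event": ["marathon"]}
tax = expand_taxonomy("event", lambda n: stub.get(n, []))
print(sorted(tax))  # -> ['election', 'event', 'marathon', 'sports event']
```

A production version would also deduplicate near-synonymous subtypes returned across branches before adding them to the frontier.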
https://arxiv.org/abs/2402.12557
Recently, there has been increasing interest in exploring the capabilities of advanced large language models (LLMs) in the field of information extraction (IE), specifically focusing on tasks related to named entity recognition (NER) and relation extraction (RE). Although researchers are exploring the use of few-shot information extraction through in-context learning with LLMs, they tend to focus only on using correct or positive examples for demonstration, neglecting the potential value of incorporating incorrect or negative examples into the learning process. In this paper, we present c-ICL, a novel few-shot technique that leverages both correct and incorrect sample constructions to create in-context learning demonstrations. This approach enhances the ability of LLMs to extract entities and relations by utilizing prompts that incorporate not only the positive samples but also the reasoning behind them, allowing potential interface errors to be identified and corrected. Specifically, our proposed method taps into the inherent contextual information and the valuable information in hard negative samples and in the nearest positive neighbors of the test input, and then applies the in-context learning demonstrations based on LLMs. Our experiments on various datasets indicate that c-ICL outperforms previous few-shot in-context learning methods, delivering substantial performance gains across a broad spectrum of related tasks. These improvements are noteworthy, showcasing the versatility of our approach across diverse scenarios.
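The core of a c-ICL-style prompt is interleaving a correct demonstration with an incorrect one annotated with the reason it is wrong. The template below is illustrative, not the paper's exact format:

```python
def build_prompt(test_sentence, positive, negative):
    """One correct and one incorrect demonstration, with the reasoning."""
    pos_sent, pos_rel = positive
    neg_sent, neg_rel, why = negative
    return (
        f'Correct: "{pos_sent}" -> {pos_rel}\n'
        f'Incorrect: "{neg_sent}" -> {neg_rel} (wrong because {why})\n'
        f'Now extract the relation: "{test_sentence}" ->'
    )

prompt = build_prompt(
    "Bell founded AT&T.",
    ("Jobs founded Apple.", "founded(Jobs, Apple)"),
    ("Jobs founded Apple.", "works_for(Jobs, Apple)",
     "the sentence asserts founding, not employment"),
)
print(prompt)
```

In the paper's setting, the negative demonstrations are drawn from hard negatives and the positives from nearest neighbors of the test input, rather than fixed by hand as here.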
https://arxiv.org/abs/2402.11254