The oxygen reduction reaction (ORR) catalyst plays a critical role in enhancing fuel cell efficiency, making it a key focus in material science research. However, extracting structured information about ORR catalysts from vast scientific literature remains a significant challenge due to the complexity and diversity of textual data. In this study, we propose a named entity recognition (NER) and relation extraction (RE) approach using DyGIE++ with multiple pre-trained BERT variants, including MatSciBERT and PubMedBERT, to extract ORR catalyst-related information from the scientific literature, which is compiled into a fuel cell corpus for materials informatics (FC-CoMIcs). A comprehensive dataset was constructed manually by identifying 12 critical entities and two relationship types between pairs of the entities. Our methodology involves data annotation, integration, and fine-tuning of transformer-based models to enhance information extraction accuracy. We assess the impact of different BERT variants on extraction performance and investigate the effects of annotation consistency. Experimental evaluations demonstrate that the fine-tuned PubMedBERT model achieves the highest NER F1-score of 82.19% and the MatSciBERT model attains the best RE F1-score of 66.10%. Furthermore, the comparison with human annotators highlights the reliability of fine-tuned models for ORR catalyst extraction, demonstrating their potential for scalable and automated literature analysis. The results indicate that domain-specific BERT models outperform general scientific models like BlueBERT for ORR catalyst extraction.
氧还原反应(ORR)催化剂在提高燃料电池效率方面起着关键作用,因此成为了材料科学研究中的重点。然而,从大量的科学文献中提取关于ORR催化剂的结构化信息仍然是一个重大挑战,这主要是由于文本数据的复杂性和多样性所致。在此研究中,我们提出了一种使用DyGIE++和多种预训练BERT变体(包括MatSciBERT和PubMedBERT)进行命名实体识别(NER)与关系抽取(RE),以从科学文献中提取ORR催化剂相关信息的方法,并将这些信息整合到一个燃料电池材料信息语料库(FC-CoMIcs)中。我们手动构建了一个全面的数据集,该数据集中包含了12个关键实体和两个实体对之间的关系类型。我们的方法包括数据标注、集成以及基于转换器模型的微调,以提高信息提取精度。我们评估了不同BERT变体对提取性能的影响,并研究了注释一致性的影响。实验结果表明,经过微调后的PubMedBERT模型在NER方面取得了最高的F1值82.19%,而MatSciBERT模型则在RE方面达到了最佳的F1值66.10%。此外,与人工标注者的比较突显了这些细调模型用于提取ORR催化剂信息的可靠性,并展示了它们进行大规模自动化文献分析的巨大潜力。研究结果表明,在ORR催化剂提取方面,特定领域的BERT模型优于如BlueBERT等通用科学模型。
https://arxiv.org/abs/2507.07499
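The NER and RE F1-scores reported in the abstract above can be illustrated with a minimal sketch of span-level scoring. It assumes exact-match, micro-averaged scoring over `(start, end, label)` spans; the abstract does not state the averaging scheme, and the label names (`CATALYST`, `SUPPORT`, `METRIC`) are hypothetical placeholders, not the 12 entity types of FC-CoMIcs.

```python
def micro_f1(gold, pred):
    """Micro-averaged precision, recall and F1 over exact-match spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # spans predicted with the right boundaries and label
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical spans: (start_token, end_token, label)
gold_spans = [(0, 2, "CATALYST"), (5, 6, "SUPPORT"), (9, 10, "METRIC")]
pred_spans = [(0, 2, "CATALYST"), (5, 6, "METRIC")]  # second span has a wrong label

precision, recall, f1 = micro_f1(gold_spans, pred_spans)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Under exact matching, a span with correct boundaries but a wrong label counts as both a false positive and a false negative, which is why annotation consistency has a direct effect on the measured score.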
The growing demand for efficient knowledge graph (KG) enrichment leveraging external corpora has intensified interest in relation extraction (RE), particularly under low-supervision settings. To address the need for adaptable and noise-resilient RE solutions that integrate seamlessly with pre-trained large language models (PLMs), we introduce SCoRE, a modular and cost-effective sentence-level RE system. SCoRE enables easy PLM switching, requires no finetuning, and adapts smoothly to diverse corpora and KGs. By combining supervised contrastive learning with a Bayesian k-Nearest Neighbors (kNN) classifier for multi-label classification, it delivers robust performance despite the noisy annotations of distantly supervised corpora. To improve RE evaluation, we propose two novel metrics: Correlation Structure Distance (CSD), measuring the alignment between learned relational patterns and KG structures, and Precision at R (P@R), assessing utility as a recommender system. We also release Wiki20d, a benchmark dataset replicating real-world RE conditions where only KG-derived annotations are available. Experiments on five benchmarks show that SCoRE matches or surpasses state-of-the-art methods while significantly reducing energy consumption. Further analyses reveal that increasing model complexity, as seen in prior work, degrades performance, highlighting the advantages of SCoRE's minimal design. Combining efficiency, modularity, and scalability, SCoRE stands as an optimal choice for real-world RE applications.
对知识图谱(KG)高效扩充的需求日益增长,尤其是通过利用外部语料库进行关系抽取(RE),这种需求在低监督设置下尤为显著。为了解决适应性强且抗噪能力强的RE解决方案的需求,并使其能够与预训练的大规模语言模型(PLMs)无缝集成,我们提出了SCoRE系统——这是一个模块化且成本效益高的句子级RE系统。SCoRE支持轻松更换PLM,无需微调即可平滑地适应各种语料库和KG。 通过将监督对比学习与贝叶斯k-Nearest Neighbors(kNN)分类器结合用于多标签分类,它在远距离监督语料库的噪声注释下仍能提供稳健性能。为了改进RE评估,我们提出了两个新的指标:Correlation Structure Distance (CSD),用以衡量所学的关系模式与KG结构之间的对齐程度;以及Precision at R (P@R),用于评估推荐系统的效用。 此外,我们发布了Wiki20d,这是一个基准数据集,在此数据集中仅使用从KG衍生出的注释来模拟现实世界的RE条件。在五个基准上的实验显示,SCoRE与现有方法相比性能相匹配甚至超越,并且显著减少了能耗。进一步分析发现,增加模型复杂度(如先前研究中所见)会降低性能,这突显了SCoRE最小化设计的优势。 结合效率、模块性和可扩展性,SCoRE是现实世界RE应用的理想选择。
https://arxiv.org/abs/2507.06895
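The classification step described for SCoRE above (a kNN classifier over sentence embeddings for multi-label prediction) can be sketched roughly as follows. This is not the authors' implementation: it swaps the Bayesian kNN for a plain vote-fraction kNN, uses hand-written 2-d vectors in place of contrastively trained PLM embeddings, and the relation names, `k`, and `threshold` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_multilabel(query, train, k=3, threshold=0.5):
    """Return every label carried by at least `threshold` of the k nearest neighbours.

    train: list of (embedding, set_of_labels) pairs.
    """
    neighbours = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    votes = {}
    for _, labels in neighbours:
        for label in labels:
            votes[label] = votes.get(label, 0) + 1
    return {label for label, n in votes.items() if n / len(neighbours) >= threshold}


# Toy training set: embeddings of sentences with distantly supervised labels
train = [
    ([1.0, 0.0], {"born_in"}),
    ([0.9, 0.1], {"born_in", "lives_in"}),
    ([0.0, 1.0], {"works_for"}),
    ([0.1, 0.9], {"works_for"}),
]
predicted = knn_multilabel([1.0, 0.05], train, k=3)
print(predicted)
```

Because the classifier is non-parametric, swapping the encoder only requires re-embedding the training sentences, which is consistent with the abstract's claim of easy PLM switching.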
Large, high-quality annotated corpora remain scarce in document-level entity and relation extraction in zero-shot or few-shot settings. In this paper, we present a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning for document-level entity and relation extraction. In contrast to existing approaches that rely on manually annotated demonstrations or direct zero-shot inference, our method combines synthetic data generation with retrieval-based in-context learning, using a reasoning-optimized language model. This allows us to build a high-quality demonstration database without manual annotation and to dynamically retrieve relevant examples at inference time. Based on our approach we produce a synthetic dataset of over $5k$ Wikipedia abstracts with approximately $59k$ entities and $30k$ relation triples. Finally, we evaluate in-context learning performance on the DocIE shared task, extracting entities and relations from long documents in a zero-shot setting. We find that in-context joint entity and relation extraction at document-level remains a challenging task, even for state-of-the-art large language models.
在文档级别的实体和关系抽取中,特别是在零样本或小样本设置下,大型高质量的标注语料库仍然非常稀缺。本文提出了一种基于大语言模型(LLM)的全自动数据生成及上下文学习管道,用于合成数据生成以及文档级实体与关系抽取任务中的零样本推理。 相较于现有方法依赖于手动注释示例或直接进行零样本推理,我们的方法结合了合成数据生成和检索增强型上下文学习技术,并使用经过优化以提高推理能力的语言模型。这样,我们能够在不需人工标注的情况下构建高质量的示范数据库,并在推断时动态地检索相关实例。 基于这一方法,我们生成了一个包含超过5000篇维基百科摘要的合成数据集,其中约有59,000个实体和30,000条关系三元组。最后,我们在DocIE共享任务上评估了上下文学习在零样本设置下从长文档中抽取实体和关系的表现。 我们发现,即使对于最先进的大型语言模型而言,在文档级别进行零样本联合实体与关系抽取仍然是一个具有挑战性的任务。
https://arxiv.org/abs/2507.05997
Relation extraction is a crucial task in natural language processing, with broad applications in knowledge graph construction and literary analysis. However, the complex context and implicit expressions in novel texts pose significant challenges for automatic character relationship extraction. This study focuses on relation extraction in the novel domain and proposes a method based on Large Language Models (LLMs). By incorporating relationship dimension separation, dialogue data construction, and contextual learning strategies, the proposed method enhances extraction performance. Leveraging dialogue structure information, it improves the model's ability to understand implicit relationships and demonstrates strong adaptability in complex contexts. Additionally, we construct a high-quality Chinese novel relation extraction dataset to address the lack of labeled resources and support future research. Experimental results show that our method outperforms traditional baselines across multiple evaluation metrics and successfully facilitates the automated construction of character relationship networks in novels.
关系抽取是自然语言处理中的一个关键任务,在知识图谱构建和文学分析等领域有着广泛的应用。然而,小说文本中复杂的语境和隐含表达给自动人物关系提取带来了巨大挑战。本研究专注于小说领域的关系抽取,并提出了一种基于大规模语言模型(LLMs)的方法。通过引入关系维度分离、对话数据构造以及上下文学习策略,该方法提升了抽取性能。利用对话结构信息,增强了模型理解隐含关系的能力,并展示了在复杂语境中的强大适应性。此外,我们构建了一个高质量的中文小说关系提取数据集,以应对标注资源不足的问题并支持未来的研究工作。实验结果表明,我们的方法在多个评估指标上均优于传统的基线方法,并成功实现了自动构建小说中人物关系网络的目标。
https://arxiv.org/abs/2507.04852
Recent advances in natural language processing (NLP) have been driven by pretrained language models like BERT, RoBERTa, T5, and GPT. These models excel at understanding complex texts, but biomedical literature, with its domain-specific terminology, poses challenges that models like Word2Vec and bidirectional long short-term memory (Bi-LSTM) can't fully address. GPT and T5, despite capturing context, fall short in tasks needing bidirectional understanding, unlike BERT. Addressing this, we propose MedicalBERT, a pretrained BERT model trained on a large biomedical dataset and equipped with domain-specific vocabulary that enhances the comprehension of biomedical terminology. The MedicalBERT model is further optimized and fine-tuned to address diverse tasks, including named entity recognition, relation extraction, question answering, sentence similarity, and document classification. Performance metrics such as the F1-score, accuracy, and Pearson correlation are employed to showcase the efficiency of our model in comparison to other BERT-based models such as BioBERT, SciBERT, and ClinicalBERT. MedicalBERT outperforms these models on most of the benchmarks, and surpasses the general-purpose BERT model by 5.67% on average across all the tasks evaluated. This work also underscores the potential of leveraging pretrained BERT models for medical NLP tasks, demonstrating the effectiveness of transfer learning techniques in capturing domain-specific information.
近期,自然语言处理(NLP)领域的进展主要归功于像BERT、RoBERTa、T5和GPT这样的预训练语言模型。这些模型在理解复杂文本方面表现出色,但生物医学文献因其领域特定的术语而提出了挑战,这是Word2Vec和双向长短时记忆网络(Bi-LSTM)等模型难以完全解决的问题。虽然GPT和T5能够捕捉上下文信息,但在需要双向理解的任务中不如BERT表现优秀。为此,我们提出了MedicalBERT,这是一种基于大量生物医学数据集训练的预训练BERT模型,并配备了领域特定词汇表以增强对生物医学术语的理解能力。MedicalBERT经过进一步优化和微调,可用于命名实体识别、关系抽取、问答、句子相似度和文档分类等多样化任务。我们采用F1分数、准确率以及Pearson相关系数等性能指标,展示我们的模型相较于其他基于BERT的模型(如BioBERT、SciBERT和ClinicalBERT)的表现。在大多数基准测试上,MedicalBERT均超越了这些模型,并且在所有评估任务上平均比通用型BERT模型高出5.67%。这项工作还强调了利用预训练BERT模型处理医疗NLP任务的潜力,证明了迁移学习技术在捕捉领域特定信息方面的有效性。
https://arxiv.org/abs/2507.08013
In recent years, with the emergence of the COVID-19 pandemic, numerous publications on the disease have appeared. Given this massive volume of publications, an efficient retrieval system is necessary to provide researchers with useful information when an unexpected pandemic such as COVID-19 strikes. In this work, we present a method that helps one such retrieval system, the Covrelex-SE system, return higher-quality search results. We exploit the power of large language models (LLMs) to extract hidden relationships from unlabeled publications that cannot be found by the parsing tools the system currently uses. This in turn gives the system more useful information during the retrieval process.
近年来,随着COVID-19大流行病的出现,与该疾病相关的大量出版物相继问世。鉴于这些出版物的数量庞大,一个高效的检索系统对于在类似突如其来的疫情(如COVID-19)发生时为研究人员提供有用信息是必不可少的。在这项工作中,我们提出了一种方法来帮助检索系统——Covrelex-SE系统,以提供更高质量的搜索结果。我们利用大型语言模型(LLMs)的力量,从当前系统使用的解析工具无法发现的未标注出版物中提取隐藏的关系。通过这种方式,可以为系统的检索过程提供更多有用的信息。
https://arxiv.org/abs/2506.18311
This paper introduces a novel method for closed information extraction. The method employs a discriminative approach that incorporates type and entity-specific information to improve relation extraction accuracy, particularly benefiting long-tail relations. Notably, this method demonstrates superior performance compared to state-of-the-art end-to-end generative models. This is especially evident for the problem of large-scale closed information extraction where we are confronted with millions of entities and hundreds of relations. Furthermore, we emphasize the efficiency aspect by leveraging smaller models. In particular, the integration of type-information proves instrumental in achieving performance levels on par with or surpassing those of a larger generative model. This advancement holds promise for more accurate and efficient information extraction techniques.
这篇论文介绍了一种新颖的封闭信息抽取方法。该方法采用判别式方法,结合类型和实体特定的信息来提高关系抽取的准确性,尤其是对于长尾关系有显著改善。值得注意的是,这种方法在性能上优于最先进的端到端生成模型,特别是在大规模封闭信息抽取问题中表现尤为突出,这时我们面临着数百万个实体和数百种关系的挑战。此外,我们还通过使用更小的模型来强调效率。特别是,类型信息的整合对于达到甚至超过更大规模生成模型的性能至关重要。这一进步为更加准确和高效的信息抽取技术带来了希望。
https://arxiv.org/abs/2506.16348
We examine the impact of incorporating knowledge graph information on the performance of relation extraction models across a range of datasets. Our hypothesis is that the positions of entities within a knowledge graph provide important insights for relation extraction tasks. We conduct experiments on multiple datasets, each varying in the number of relations, training examples, and underlying knowledge graphs. Our results demonstrate that integrating knowledge graph information significantly enhances performance, especially when dealing with an imbalance in the number of training examples for each relation. We evaluate the contribution of knowledge graph-based features by combining established relation extraction methods with graph-aware Neural Bellman-Ford networks. These features are tested in both supervised and zero-shot settings, demonstrating consistent performance improvements across various datasets.
我们研究了在各种数据集上将知识图信息纳入关系抽取模型对性能的影响。我们的假设是,实体在知识图中的位置为关系抽取任务提供了重要线索。我们在多个数据集上进行了实验,这些数据集在关系数量、训练示例和底层知识图方面各不相同。结果表明,在处理每个关系的训练样本不平衡时,整合知识图信息显著提高了模型性能。 我们通过将现有的关系提取方法与基于图的认知贝尔曼-福特(Neural Bellman-Ford)网络相结合,评估了知识图特征的贡献,并在有监督和零样本学习设置中测试了这些特征的有效性。结果显示,在各种数据集中均能实现一致的性能改进。
https://arxiv.org/abs/2506.16343
This article addresses domain knowledge gaps in general large language models for historical text analysis in the context of computational humanities and AIGC technology. We propose the Graph RAG framework, combining chain-of-thought prompting, self-instruction generation, and process supervision to create a character relationship dataset for the First Four Histories with minimal manual annotation. This dataset supports automated historical knowledge extraction, reducing labor costs. In the graph-augmented generation phase, we introduce a collaborative mechanism between knowledge graphs and retrieval-augmented generation, improving the alignment of general models with historical knowledge. Experiments show that the domain-specific model Xunzi-Qwen1.5-14B, with Simplified Chinese input and chain-of-thought prompting, achieves optimal performance in relation extraction (F1 = 0.68). The DeepSeek model integrated with GraphRAG improves F1 by 11 points (from 0.08 to 0.19) on the open-domain C-CLUE relation extraction dataset, surpassing the F1 value of Xunzi-Qwen1.5-14B (0.12), effectively alleviating hallucination, and improving interpretability. This framework offers a low-resource solution for classical text knowledge extraction, advancing historical knowledge services and humanities research.
这篇文章探讨了在计算人文和AIGC技术背景下,通用大型语言模型进行历史文本分析时存在的领域知识缺口。我们提出了Graph RAG框架,结合思维链提示、自指令生成和过程监督来创建“前四史”人物关系数据集,并且仅需少量的手动标注。这个数据集支持自动化的历史知识提取,从而降低了人工成本。在图增强的生成阶段,我们引入了知识图谱与检索增强生成之间的协作机制,以改善通用模型与历史知识的对齐。 实验结果显示,在使用简体中文输入和思维链提示的情况下,特定领域的模型Xunzi-Qwen1.5-14B在关系抽取任务上取得了最优性能(F1 = 0.68)。DeepSeek模型与GraphRAG集成后,在开放领域C-CLUE关系抽取数据集上的F1值提升了0.11(从0.08到0.19),超过了Xunzi-Qwen1.5-14B的F1值(0.12),有效减轻了幻觉现象,并提高了可解释性。 此框架为古典文本知识提取提供了一种低资源解决方案,促进了历史知识服务和人文研究的发展。
https://arxiv.org/abs/2506.15241
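The F1 change reported for DeepSeek with GraphRAG above (0.08 to 0.19) is easier to read when absolute and relative gains are separated. The short sketch below only redoes that arithmetic on the two figures quoted in the abstract; the helper name `f1_gain` is illustrative.

```python
def f1_gain(before, after):
    """Return (absolute gain in F1, relative gain in percent)."""
    absolute = after - before
    relative_pct = absolute / before * 100
    return round(absolute, 2), round(relative_pct, 1)


abs_gain, rel_gain_pct = f1_gain(0.08, 0.19)
print(abs_gain, rel_gain_pct)
```

The gain is 0.11 in absolute F1 (11 percentage points), which corresponds to a relative improvement of well over 100% because the baseline is so low.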
Accurately understanding temporal relations between events is a critical building block of diverse tasks, such as temporal reading comprehension (TRC) and temporal relation extraction (TRE). For example, in TRC we need to understand the temporal semantic differences between the following two questions, which are lexically near-identical: "What finished right before the decision?" and "What finished right after the decision?". To discern the two questions, existing solutions have relied on answer overlaps as a proxy label to contrast similar and dissimilar questions. However, we claim that answer overlap can lead to unreliable results, due to spurious overlaps of two dissimilar questions with coincidentally identical answers. To address the issue, we propose a novel approach that elicits proper reasoning behaviors through a module for predicting time spans of events. We introduce the Timeline Reasoning Network (TRN), which operates in a two-step inductive reasoning process: in the first step, the model answers each question using semantic and syntactic information. The next step chains multiple questions on the same event to predict a timeline, which is then used to ground the answers. Results on TORQUE and TB-dense, for the TRC and TRE tasks respectively, demonstrate that TRN outperforms previous methods by effectively resolving the spurious overlaps using the predicted timeline.
准确理解事件之间的时间关系是多种任务的关键组成部分,例如时间阅读理解(TRC)和时间关系抽取(TRE)。例如,在TRC中,我们需要理解以下两个词面上几乎相同的问句之间的时间语义差异:“什么在决定之前结束?”或者“什么在决定之后结束?”为了区分这两个问题,现有的解决方案依赖于答案重叠作为类似和不相似问题的代理标签。然而,我们主张由于两个不相似的问题因巧合而具有相同答案导致的答案重叠可能会产生不可靠的结果。 为了解决这个问题,我们提出了一种新方法,通过预测事件的时间跨度模块来激发合理的推理行为。我们介绍了时间线推理网络(TRN),该网络在两步归纳推理过程中操作:第一步中,模型使用语义和句法信息初始回答每个问题;接下来的一步将多个关于同一事件的问题串联起来以预测时间线,然后利用这一时间线确定答案。TORQUE和TB-dense、TRC及TRE任务上的结果表明,TRN通过使用预测的时间线有效解决了虚假重叠问题,并且在性能上超越了先前的方法。
https://arxiv.org/abs/2506.14213
Retrieval-Augmented Generation (RAG) enhances language models by incorporating external knowledge at inference time. However, graph-based RAG systems often suffer from structural overhead and imprecise retrieval: they require costly pipelines for entity linking and relation extraction, yet frequently return subgraphs filled with loosely related or tangential content. This stems from a fundamental flaw: semantic similarity does not imply semantic relevance. We introduce SlimRAG, a lightweight framework for retrieval without graphs. SlimRAG replaces structure-heavy components with a simple yet effective entity-aware mechanism. At indexing time, it constructs a compact entity-to-chunk table based on semantic embeddings. At query time, it identifies salient entities, retrieves and scores associated chunks, and assembles a concise, contextually relevant input, without graph traversal or edge construction. To quantify retrieval efficiency, we propose Relative Index Token Utilization (RITU), a metric measuring the compactness of retrieved content. Experiments across multiple QA benchmarks show that SlimRAG outperforms strong flat and graph-based baselines in accuracy while reducing index size and RITU (e.g., 16.31 vs. 56+), highlighting the value of structure-free, entity-centric context selection. The code will be released soon.
检索增强生成(RAG)通过在推理时融合外部知识来提升语言模型的能力。然而,基于图的RAG系统往往存在结构开销大和检索不精确的问题:它们需要昂贵的实体链接和关系抽取管道,但常常返回包含松散相关或离题内容的子图。这一问题的根本原因在于语义相似性并不意味着语义相关性。我们引入了SlimRAG,这是一个无需图结构的轻量级检索框架。SlimRAG用一个简单而有效的实体感知机制替代了结构复杂的组件。在索引阶段,它基于语义嵌入构造一个紧凑的实体到文本块表。在查询阶段,它识别关键实体、检索并评分相关文本块,并组装出简洁且上下文相关的输入,无需图遍历或边构建。为了量化检索效率,我们提出了相对索引令牌利用率(RITU),这一指标衡量检索内容的紧凑性。跨多个问答基准的实验表明,SlimRAG在准确率上超越了强大的扁平式和基于图的基线,同时减少了索引大小及RITU值(例如16.31对比56+),突显了无结构且以实体为中心的上下文选择方法的价值。代码即将发布。
https://arxiv.org/abs/2506.17288
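The indexing and query steps described for SlimRAG above can be sketched as a toy entity-to-chunk table. This is only an illustration under strong assumptions: entity detection uses case-insensitive substring matching rather than the semantic embeddings the abstract mentions, chunks are scored by how many query entities they mention, and the function names, entity list, and example chunks are all hypothetical.

```python
from collections import defaultdict


def build_entity_table(chunks, entities):
    """Indexing: map each entity to the ids of the chunks that mention it."""
    table = defaultdict(set)
    for chunk_id, text in enumerate(chunks):
        lowered = text.lower()
        for entity in entities:
            if entity.lower() in lowered:
                table[entity].add(chunk_id)
    return table


def retrieve(query, table, chunks, top_k=2):
    """Query: find salient entities, then score chunks by matched-entity count."""
    lowered = query.lower()
    salient = [entity for entity in table if entity.lower() in lowered]
    scores = defaultdict(int)
    for entity in salient:
        for chunk_id in table[entity]:
            scores[chunk_id] += 1
    # Highest score first; ties broken by chunk order for determinism
    ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return [chunks[cid] for cid in ranked[:top_k]]


chunks = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "Pierre Curie worked with Marie Curie in Paris.",
    "The Eiffel Tower is located in Paris.",
]
entities = ["Marie Curie", "Paris", "Nobel Prize"]

table = build_entity_table(chunks, entities)
context = retrieve("Where did Marie Curie work, and was it in Paris?", table, chunks)
print(context)
```

No graph is built or traversed at any point: the only index structure is the flat entity-to-chunk mapping, which is the property the abstract emphasizes.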
Relation Extraction (RE) aims to extract semantic relationships in texts from given entity pairs, and has achieved significant improvements. However, in different domains, the RE task can be influenced by various factors. For example, in the financial domain, sentiment can affect RE results, yet this factor has been overlooked by modern RE models. To address this gap, this paper proposes a Sentiment-aware-SDP-Enhanced-Module (SSDP-SEM), a multi-task learning approach for enhancing financial RE. Specifically, SSDP-SEM integrates the RE models with a pluggable auxiliary sentiment perception (ASP) task, enabling the RE models to concurrently navigate their attention weights with the text's sentiment. We first generate detailed sentiment tokens through a sentiment model and insert these tokens into an instance. Then, the ASP task focuses on capturing nuanced sentiment information through predicting the sentiment token positions, combining both sentiment insights and the Shortest Dependency Path (SDP) of syntactic information. Moreover, this work employs a sentiment attention information bottleneck regularization method to regulate the reasoning process. Our experiment integrates this auxiliary task with several prevalent frameworks, and the results demonstrate that most previous models benefit from the auxiliary task, thereby achieving better results. These findings highlight the importance of effectively leveraging sentiment in the financial RE task.
关系抽取(RE)旨在从给定的实体对中提取文本中的语义关系,并已经取得了显著的进步。然而,在不同的领域,如金融领域,RE任务可能受到多种因素的影响,例如情感因素会影响RE结果,但现代RE模型对此却鲜有关注。为了解决这一问题,本文提出了一种情感感知短语依存路径增强模块(SSDP-SEM),这是一种用于增强金融关系抽取的多任务学习方法。具体来说,SSDP-SEM通过插入一个可插拔的情感感知辅助任务,使RE模型能够同时根据文本的情感调整其注意力权重。 首先,我们利用情感模型生成详细的情感标记,并将这些标记插入到实例中。然后,ASP任务专注于捕捉细微的情感信息,通过对情感标记位置的预测来结合情感洞察和短语依存路径(SDP)中的句法信息。此外,这项工作还采用了一种情感注意力瓶颈正则化方法,以规范推理过程。 我们的实验将这种辅助任务与几种流行框架相结合,并发现大多数先前的模型从该辅助任务中受益,从而实现了更好的结果。这些发现强调了在金融关系抽取任务中有效利用情感的重要性。
https://arxiv.org/abs/2506.12452
Mitigating entity bias is a critical challenge in Relation Extraction (RE), where models often rely excessively on entities, resulting in poor generalization. This paper presents a novel approach to address this issue by adapting a Variational Information Bottleneck (VIB) framework. Our method compresses entity-specific information while preserving task-relevant features. It achieves state-of-the-art performance on relation extraction datasets across general, financial, and biomedical domains, in both indomain (original test sets) and out-of-domain (modified test sets with type-constrained entity replacements) settings. Our approach offers a robust, interpretable, and theoretically grounded methodology.
缓解实体偏差是关系抽取(RE)中的一个关键挑战,因为在关系抽取中,模型往往会过度依赖于实体信息,导致泛化能力较差。本文提出了一种通过调整变分信息瓶颈(VIB)框架来解决这一问题的新型方法。我们的方法在压缩特定于实体的信息的同时保留任务相关的特征。在通用、金融和生物医学领域的关系抽取数据集上,无论是在原测试集(in-domain)还是修改后的测试集(out-of-domain,包含类型约束下的实体替换)设置下,本方法均达到了最先进的性能水平。我们的方法提供了一种稳健、可解释且理论基础扎实的方法论。
https://arxiv.org/abs/2506.11381
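For orientation, the generic Variational Information Bottleneck objective that such an adaptation starts from can be written as below. This is the standard formulation, given here as a reference point; the abstract above does not spell out the exact variant used, and all symbols are assumptions introduced for illustration.

$$
\mathcal{L}_{\mathrm{VIB}} = \mathbb{E}_{z \sim p_\theta(z \mid x)}\left[-\log q_\phi(y \mid z)\right] + \beta \, \mathrm{KL}\left(p_\theta(z \mid x) \,\|\, r(z)\right)
$$

Here $x$ is the input (for example, an entity representation), $y$ is the relation label, $z$ is the compressed latent representation, $r(z)$ is a fixed prior, and $\beta \ge 0$ controls how strongly input-specific information is compressed relative to preserving task-relevant features.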
We explore a generative relation extraction (RE) pipeline tailored to the study of interactions in the intestinal microbiome, a complex and low-resource biomedical domain. Our method leverages summarization with large language models (LLMs) to refine context before extracting relations via instruction-tuned generation. Preliminary results on a dedicated corpus show that summarization improves generative RE performance by reducing noise and guiding the model. However, BERT-based RE approaches still outperform generative models. This ongoing work demonstrates the potential of generative methods to support the study of specialized domains in low-resource settings.
我们探索了一种针对肠道微生物组相互作用研究的生成式关系抽取(RE)管道,这是一个复杂且资源匮乏的生物医学领域。我们的方法利用大型语言模型(LLMs)进行摘要提炼,在此基础上通过指令调优生成的方式提取关系。在专门构建的数据集上的初步结果显示,摘要提炼能够减少噪音并指导模型,从而提高生成式RE的性能。然而,基于BERT的关系抽取方法仍然优于生成式模型。这项正在进行的工作展示了生成式方法在资源匮乏环境中支持特定领域研究的巨大潜力。
https://arxiv.org/abs/2506.08647
Large language models (LLMs) exhibit a pronounced conservative bias in relation extraction tasks, frequently defaulting to the No_Relation label when an appropriate option is unavailable. While this behavior helps prevent incorrect relation assignments, our analysis reveals that it also leads to significant information loss when reasoning is not explicitly included in the output. We systematically evaluate this trade-off across multiple prompts, datasets, and relation types, introducing the concept of Hobson's choice to capture scenarios where models opt for safe but uninformative labels over hallucinated ones. Our findings suggest that conservative bias occurs twice as often as hallucination. To quantify this effect, we use SBERT and LLM prompts to capture the semantic similarity between conservative bias behaviors in constrained prompts and labels generated from semi-constrained and open-ended prompts.
大型语言模型(LLMs)在关系抽取任务中表现出明显的保守偏见,当没有适当选项时,往往会默认选择“No_Relation”标签。虽然这种行为有助于防止错误的关系分配,但我们的分析发现,在推理未明确包含在输出中的情况下,它也会导致大量信息丢失。我们系统地评估了这一权衡,并针对多种提示、数据集和关系类型进行了测试,引入霍布森选择(Hobson's choice)的概念来描述模型选择安全但缺乏信息量的标签而非幻觉标签的场景。我们的研究发现保守偏见发生的频率是幻觉的两倍。为了量化这种效应,我们使用SBERT和LLM提示来捕获受限提示中的保守偏见行为与半受限和开放式提示生成的标签之间的语义相似度。
https://arxiv.org/abs/2506.08120
Entity relationship classification remains a challenging task in information extraction, especially in scenarios with limited labeled data and complex relational structures. In this study, we conduct a comparative analysis of three distinct AI agent architectures designed to perform relation classification using large language models (LLMs). The agentic architectures explored include (1) reflective self-evaluation, (2) hierarchical task decomposition, and (3) a novel multi-agent dynamic example generation mechanism, each leveraging different modes of reasoning and prompt adaptation. In particular, our dynamic example generation approach introduces real-time cooperative and adversarial prompting. We systematically compare their performance across multiple domains and model backends. Our experiments demonstrate that multi-agent coordination consistently outperforms standard few-shot prompting and approaches the performance of fine-tuned models. These findings offer practical guidance for the design of modular, generalizable LLM-based systems for structured relation extraction. The source code and dataset are publicly available.
实体关系分类在信息抽取中仍是一项具有挑战性的任务,特别是在标注数据有限且关系结构复杂的情况下。在这项研究中,我们对比分析了三种不同的AI代理架构,这些架构旨在使用大型语言模型(LLMs)执行关系分类。探讨的代理架构包括:(1) 反思性自我评估、(2) 分层任务分解以及 (3) 一种新颖的多代理动态示例生成机制,每种方法都利用了不同的推理和提示调整模式。特别是,我们的动态示例生成方法引入了实时协作式和对抗式提示。我们系统地比较了这些架构在多个领域和模型后端上的性能表现。实验结果表明,多代理协调始终优于标准的少样本提示,并接近微调模型的表现水平。这些发现为设计模块化、可泛化的基于LLM的结构化关系抽取系统提供了实用指导。源代码和数据集已公开。
https://arxiv.org/abs/2506.02426
Large Language Models (LLMs) have demonstrated impressive performance in biomedical relation extraction, even in zero-shot scenarios. However, evaluating LLMs in this task remains challenging due to their ability to generate human-like text, often producing synonyms or abbreviations of gold-standard answers, making traditional automatic evaluation metrics unreliable. On the other hand, while human evaluation is more reliable, it is costly and time-consuming, making it impractical for real-world applications. This paper investigates the use of LLMs-as-the-Judge as an alternative evaluation method for biomedical relation extraction. We benchmark 8 LLMs as judges to evaluate the responses generated by 5 other LLMs across 3 biomedical relation extraction datasets. Unlike other text-generation tasks, we observe that LLM-based judges perform quite poorly (usually below 50% accuracy) in the biomedical relation extraction task. Our findings reveal that this happens mainly because relations extracted by LLMs do not adhere to any standard format. To address this, we propose structured output formatting for LLM-generated responses that helps LLM-Judges to improve their performance by about 15% (on average). We also introduce a domain adaptation technique to further enhance LLM-Judge performance by effectively transferring knowledge between datasets. We release both our human-annotated and LLM-annotated judgment data (36k samples in total) for public use.
大型语言模型(LLMs)在生物医学关系抽取任务中表现出色,即使是在零样本场景下也是如此。然而,由于LLM能够生成类似人类的文本,并且常常会产出金标准答案的同义词或缩写,使得传统的自动评估指标变得不可靠。相比之下,虽然人工评估更加可靠,但由于成本高和耗时长,在实际应用中并不实用。本文探讨了使用LLM作为评判者(LLMs-as-the-Judge)作为生物医学关系抽取的替代评估方法。 我们在三个生物医学关系抽取数据集上对8个用作评判者的LLM进行了基准测试,由它们评估另外5个LLM生成的回答。不同于其他文本生成任务,在生物医学关系抽取任务中,我们观察到基于LLM的评判者表现较差(通常低于50%准确率)。我们的研究发现表明,这主要是因为由LLM提取的关系不遵循任何标准格式。 为了解决这个问题,我们提出对LLM生成的回答采用结构化输出格式,这种方法可以帮助LLM-Judge的表现平均提升约15%。此外,我们还引入了一种领域适应技术,通过在不同数据集之间有效地迁移知识来进一步增强LLM-Judge的性能。 我们公开发布了人工标注和LLM标注的评判数据(总计3.6万个样本)。
https://arxiv.org/abs/2506.00777
Understanding complex character relations is crucial for narrative analysis and efficient script evaluation, yet existing extraction methods often fail to handle long-form narratives with nuanced interactions. To address this challenge, we present CREFT, a novel sequential framework leveraging specialized Large Language Model (LLM) agents. First, CREFT builds a base character graph through knowledge distillation, then iteratively refines character composition, relation extraction, role identification, and group assignments. Experiments on a curated Korean drama dataset demonstrate that CREFT significantly outperforms single-agent LLM baselines in both accuracy and completeness. By systematically visualizing character networks, CREFT streamlines narrative comprehension and accelerates script review -- offering substantial benefits to the entertainment, publishing, and educational sectors.
理解复杂的人物关系对于叙事分析和高效的剧本评估至关重要,然而现有的抽取方法往往难以处理包含微妙互动的长篇叙事。为了应对这一挑战,我们提出了CREFT,这是一种新型的顺序式框架,利用专门化的大型语言模型(LLM)代理。首先,CREFT通过知识蒸馏构建一个基础的人物图谱,然后迭代地改进人物组成、关系抽取、角色识别和群体分配。在经过精心整理的韩剧数据集上的实验表明,与单代理LLM基线相比,CREFT在准确性和完整性方面都有显著提升。通过系统性地可视化人物网络,CREFT简化了叙事理解并加速了剧本审查,为娱乐、出版和教育行业带来了显著的益处。
https://arxiv.org/abs/2505.24553
Using Large Language Models (LLMs) to generate training data can potentially be a preferable way to improve zero-shot or few-shot NLP tasks. However, many problems remain to be investigated in this direction. For the task of Relation Extraction (RE), we find that samples generated by directly prompting LLMs may easily have high structural similarities with each other. They tend to use a limited variety of phrasing while expressing the relation between a pair of entities. Therefore, in this paper, we study how to effectively improve the diversity of the training samples generated with LLMs for RE, while also maintaining their correctness. We first try to make the LLMs produce dissimilar samples by directly giving instructions in In-Context Learning (ICL) prompts. Then, we propose an approach to fine-tune LLMs for diverse training sample generation through Direct Preference Optimization (DPO). Our experiments on commonly used RE datasets show that both attempts can improve the quality of the generated training data. We also find that, compared with directly performing RE with an LLM, training a non-LLM RE model with its generated samples may lead to better performance.
使用大型语言模型(LLMs)生成训练数据可能是一种提高零样本或少量样本自然语言处理任务性能的优选方法。然而,这一方向仍有许多问题需要研究。在关系抽取(RE)的任务上,我们发现直接提示LLM生成的样本之间可能存在高度结构上的相似性。这些样本倾向于使用有限多样的措辞来表达实体对之间的关系。因此,在这篇论文中,我们研究如何有效地提高为RE任务通过LLMs生成训练样本的多样性,同时保持它们的准确性。首先,我们在上下文学习(ICL)提示中直接给出指令以使LLM产生不相似的样本。然后,我们提出了一种通过直接偏好优化(DPO)对LLM进行微调的方法来增强多样化的训练样本生成能力。在常用的RE数据集上的实验表明,这两种尝试都可以提高生成训练数据的质量。此外,我们还发现与直接使用LLM执行RE相比,用其生成的样本训练非LLM RE模型可能会带来更好的性能。
https://arxiv.org/abs/2505.23108
Open Relation Extraction (OpenRE) seeks to identify and extract novel relational facts between named entities from unlabeled data without pre-defined relation schemas. Traditional OpenRE methods typically assume that the unlabeled data consists solely of novel relations or is pre-divided into known and novel instances. However, in real-world scenarios, novel relations are arbitrarily distributed. In this paper, we propose a generalized OpenRE setting that considers unlabeled data as a mixture of both known and novel instances. To address this, we propose MixORE, a two-phase framework that integrates relation classification and clustering to jointly learn known and novel relations. Experiments on three benchmark datasets demonstrate that MixORE consistently outperforms competitive baselines in known relation classification and novel relation clustering. Our findings contribute to the advancement of generalized OpenRE research and real-world applications.
开放关系抽取(Open Relation Extraction,简称OpenRE)的目标是从未标记的数据中识别和提取命名实体之间的新型关系事实,并且不需要预先定义的关系模式。传统的OpenRE方法通常假设未标记数据仅包含新型关系或已被预分为已知实例与新实例。然而,在现实世界的应用场景中,新的关系是随机分布的。 为此,本文提出了一种广义的OpenRE设定,该设定将未标记的数据视为已知和新实例的混合体,并提出了MixORE框架,这是一个两阶段的方法,结合了关系分类和聚类来同时学习已知和新型的关系。在三个基准数据集上的实验表明,MixORE在已知关系分类和新型关系聚类方面均优于竞争基线模型。 我们的研究发现为广义OpenRE的研究进展及现实世界的应用提供了重要的贡献。
https://arxiv.org/abs/2505.22801