Scene understanding is central to general physical intelligence, and video is a primary modality for capturing both state and temporal dynamics of a scene. Yet understanding physical processes remains difficult, as models must combine object localization, hand-object interactions, relational parsing, temporal reasoning, and step-level procedural inference. Existing benchmarks usually evaluate these capabilities separately, limiting diagnosis of why models fail on procedural tasks. We introduce BARISTA, a densely annotated egocentric dataset and benchmark of 185 real-world coffee-preparation videos covering fully automatic, portafilter-based, and capsule-based workflows. BARISTA provides verified per-frame scene graphs linking persistent object identities to masks, tracks, boxes, attributes, typed relations, hand-object interactions, activities, and process steps. From these graphs, we derive zero-shot language-based tasks spanning phrase grounding, hand-object interaction recognition, referring, activity recognition, relation extraction, and temporal visual question answering. Experiments reveal strong variation across task families and no consistently dominant model family, positioning BARISTA as a challenging diagnostic benchmark for procedural video understanding. Code and dataset available at this https URL.
https://arxiv.org/abs/2605.12074
During disasters, extracting causal relations from social media can strengthen situational awareness by identifying factors linked to casualties, physical damage, infrastructure disruption, and cascading impacts. However, disaster-related posts are often informal, fragmented, and context-dependent, and they may describe personal experiences rather than explicit causal relations. In this work, we examine whether Large Language Models (LLMs) can effectively extract causal relations from disaster-related social media posts. To this end, we (1) propose an expert-grounded evaluation framework that compares LLM-generated causal graphs with reference graphs derived from disaster-specific reports and (2) assess whether the extracted relations are supported by post-event evidence or instead reflect model priors. Our findings highlight both the potential and risks of using LLMs for causal relation extraction in disaster decision-support systems.
https://arxiv.org/abs/2605.11348
Joint named entity recognition (NER) and relation extraction (RE) is a fundamental task in natural language processing for constructing knowledge graphs from unstructured text. While recent approaches treat NER and RE as separate tasks requiring distinct models, we introduce GLiNER-Relex, a unified architecture that extends the GLiNER framework to perform both entity recognition and relation extraction in a single model. Our approach leverages a shared bidirectional transformer encoder to jointly represent text, entity type labels, and relation type labels, enabling zero-shot extraction of arbitrary entity and relation types specified at inference time. GLiNER-Relex constructs entity pair representations from recognized spans and scores them against relation type embeddings using a dedicated relation scoring module. We evaluate our model on four standard relation extraction benchmarks: CoNLL04, DocRED, FewRel, and CrossRE, and demonstrate competitive performance against both specialized relation extraction models and large language models, while maintaining the computational efficiency characteristic of the GLiNER family. The model is released as an open-source Python package with a simple inference API that allows users to specify arbitrary entity and relation type labels at inference time and obtain both entities and relation triplets in a single call. All models and code are publicly available.
https://arxiv.org/abs/2605.10108
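The pair-scoring step described in the GLiNER-Relex abstract above can be pictured with a toy sketch. This is not the GLiNER-Relex API or its scoring module: the function name `score_relations`, the concatenation-based pair representation, the dot-product scorer, and the hand-picked vectors are all assumptions made for illustration.

```python
import math
from itertools import permutations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score_relations(span_vecs, relation_vecs, threshold=0.5):
    """Score every ordered entity pair against every relation label.

    span_vecs: entity text -> embedding (hypothetical values).
    relation_vecs: relation label -> embedding twice the span dimension.
    """
    triples = []
    for head, tail in permutations(span_vecs, 2):
        # Assumed pair representation: concatenate head and tail vectors.
        pair = span_vecs[head] + span_vecs[tail]
        for label, rel in relation_vecs.items():
            prob = sigmoid(dot(pair, rel))
            if prob >= threshold:
                triples.append((head, label, tail, round(prob, 3)))
    return triples


# Hand-picked toy embeddings, not model outputs.
spans = {"Marie Curie": [1.0, 0.0], "Warsaw": [0.0, 1.0]}
relations = {"born_in": [2.0, -2.0, -2.0, 2.0]}
triples = score_relations(spans, relations)
print(triples)
```

Because the relation labels enter only as vectors, new labels can be added at inference time without retraining, which is the property the abstract attributes to the zero-shot setup.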
This paper presents a hybrid architecture for intelligent systems in which large language models (LLMs) are extended with an external ontological memory layer. Instead of relying solely on parametric knowledge and vector-based retrieval (RAG), the proposed approach constructs and maintains a structured knowledge graph using RDF/OWL representations, enabling persistent, verifiable, and semantically grounded reasoning. The core contribution is an automated pipeline for ontology construction from heterogeneous data sources, including documents, APIs, and dialogue logs. The system performs entity recognition, relation extraction, normalization, and triple generation, followed by validation using SHACL and OWL constraints, and continuous graph updates. During inference, LLMs operate over a combined context that integrates vector-based retrieval with graph-based reasoning and external tool interaction. Experimental observations on planning tasks, including the Tower of Hanoi benchmark, indicate that ontology augmentation improves performance in multi-step reasoning scenarios compared to baseline LLM systems. In addition, the ontology layer enables formal validation of generated outputs, transforming the system into a generation-verification-correction pipeline. The proposed architecture addresses key limitations of current LLM-based systems, including lack of long-term memory, weak structural understanding, and limited reasoning capabilities. It provides a foundation for building agent-based systems, robotics applications, and enterprise AI solutions that require persistent knowledge, explainability, and reliable decision-making.
https://arxiv.org/abs/2604.20795
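As a rough illustration of the validate-before-update step in the ontological memory abstract above, the sketch below checks candidate triples against per-predicate domain and range rules. It is a plain-Python stand-in, not SHACL or OWL validation; the `Triple` type, the `ENTITY_TYPES` and `CONSTRAINTS` tables, and the Tower of Hanoi vocabulary are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical type assertions and per-predicate (domain, range) rules.
ENTITY_TYPES = {"disk_1": "Disk", "peg_A": "Peg", "peg_B": "Peg"}
CONSTRAINTS = {"onPeg": ("Disk", "Peg")}


def validate(triples):
    """Split candidate triples into accepted ones and (triple, reason) rejections."""
    accepted, rejected = [], []
    for t in triples:
        rule = CONSTRAINTS.get(t.predicate)
        if rule is None:
            rejected.append((t, "unknown predicate"))
            continue
        domain, range_ = rule
        if ENTITY_TYPES.get(t.subject) != domain:
            rejected.append((t, f"subject must be a {domain}"))
        elif ENTITY_TYPES.get(t.obj) != range_:
            rejected.append((t, f"object must be a {range_}"))
        else:
            accepted.append(t)
    return accepted, rejected


candidates = [
    Triple("disk_1", "onPeg", "peg_A"),        # well-typed
    Triple("peg_A", "onPeg", "disk_1"),        # subject and object swapped
    Triple("disk_1", "heavierThan", "peg_B"),  # predicate not in the schema
]
accepted, rejected = validate(candidates)
print(accepted)
print(rejected)
```

The rejection reasons could be fed back to the generator as correction hints, which is one way to read the generation-verification-correction loop the abstract mentions.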
Radio astronomy plays a crucial role in understanding the universe, particularly within the realm of non-thermal astrophysics. Images of celestial objects are derived from the signals (called visibility) measured by radio telescopes. Such imaging results, called dirty images, contain artifacts due to factors such as sparsity and therefore require reconstruction to improve imaging quality. Existing methods typically restrict reconstruction to a unimodal domain, either to the dirty image after imaging or to the sparse visibility prior to imaging. Focusing solely on each unimodal reconstruction results in the loss of complementary in-context information in either the visibility or image domain, leading to an incomplete modeling of mutual dependency and consistency. To address these challenges, we propose CDCRec, a multimodal radio interferometric data reconstruction method that explicitly models cross-domain consistency. We design a hierarchical multi-task and multi-stage framework to enhance the exploration of interplays between domains during reconstruction. Our experimental results demonstrate that CDCRec improves imaging performance through enhanced cross-domain correlation extraction. In particular, our self-supervised complementary modeling strategy is better than current methods at interferometric domain translations that rely heavily on recovering dense information from constrained source-domain data.
https://arxiv.org/abs/2604.16794
Relation extraction is a fundamental component of knowledge graph construction, among other applications. Large language models (LLMs) have been adopted as a promising tool for relation extraction, both in supervised and in-context learning settings. However, in this work we show that their performance still lags behind much smaller architectures when the linguistic graph underlying a text is highly complex. To demonstrate this, we evaluate four LLMs against a graph-based parser on six relation extraction datasets with sentence graphs of varying sizes and complexities. Our results show that the graph-based parser increasingly outperforms the LLMs as the number of relations in the input documents increases. This makes the much lighter graph-based parser a superior choice in the presence of complex linguistic graphs.
https://arxiv.org/abs/2604.08752
Cross-document relation extraction (RE) aims to identify relations between head and tail entities located in different documents. Existing approaches typically adopt the paradigm of "Small Language Model (SLM) + Classifier". However, the limited language understanding ability of SLMs hinders further improvement of their performance. In this paper, we conduct a preliminary study to explore the performance of Large Language Models (LLMs) in cross-document RE. Despite their extensive parameters, our findings indicate that LLMs do not consistently surpass existing SLMs. Further analysis suggests that the underperformance is largely attributable to the challenges posed by the numerous predefined relations. To overcome this issue, we propose an LLM-based Hierarchical Classification model for cross-document RE (HCRE), which consists of two core components: 1) an LLM for relation prediction and 2) a hierarchical relation tree derived from the predefined relation set. This tree enables the LLM to perform hierarchical classification, where the target relation is inferred level by level. Since the number of child nodes is much smaller than the size of the entire predefined relation set, the hierarchical relation tree significantly reduces the number of relation options that the LLM needs to consider during inference. However, hierarchical classification introduces the risk of error propagation across levels. To mitigate this, we propose a prediction-then-verification inference strategy that improves prediction reliability through multi-view verification at each level. Extensive experiments show that HCRE outperforms existing baselines, validating its effectiveness.
https://arxiv.org/abs/2604.07937
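The level-by-level inference described in the HCRE abstract above can be sketched as a tree descent in which each candidate is checked before it is accepted. This is only an illustration: the two-level `RELATION_TREE`, the `score` and `verify` callables (stand-ins for LLM calls), and the fallback rule are assumptions, and the paper's multi-view verification is not reproduced here.

```python
# Hypothetical two-level relation tree: coarse group -> fine-grained relations.
RELATION_TREE = {
    "person": {"born_in": {}, "spouse_of": {}},
    "organization": {"founded_by": {}, "headquartered_in": {}},
}


def classify(tree, score, verify):
    """Descend the tree, choosing the best verified child at each level."""
    path, node = [], tree
    while node:
        # Only this node's children are candidates, not the full relation set.
        ranked = sorted(node, key=score, reverse=True)
        # Take the best-ranked child that passes verification; if none
        # passes, fall back to the top-ranked child.
        choice = next((c for c in ranked if verify(c)), ranked[0])
        path.append(choice)
        node = node[choice]
    return path


# Toy stand-ins for the LLM's prediction scores and verification verdicts.
toy_scores = {"person": 0.7, "organization": 0.3, "born_in": 0.4, "spouse_of": 0.6}
path = classify(
    RELATION_TREE,
    score=lambda label: toy_scores.get(label, 0.0),
    verify=lambda label: label != "spouse_of",
)
print(path)
```

In this toy run the top-scored leaf fails verification, so the second-ranked leaf is returned, which shows how a per-level check can stop an early mistake from propagating.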
Existing prompt-based fine-tuning methods typically learn task-specific prompts independently, imposing significant computing and storage overhead at scale when deploying multiple clinical natural language processing (NLP) systems. We present a multitask prompt distillation and decomposition framework that learns a single shared metaprompt from 21 diverse clinical source tasks and adapts it to unseen target tasks with fewer than 0.05% trainable parameters. Evaluated across five clinical NLP task types (named entity recognition, relation extraction, question answering, natural language inference, and summarization) on 10 held-out target datasets using three backbone models (LLaMA 3.1 8B, Meditron3 8B, gpt-oss 20B), our framework consistently outperforms LoRA by 1.5-1.7% despite using orders of magnitude fewer parameters, and exceeds single-task prompt tuning by 6.1-6.6%. The gpt-oss 20B model achieves the highest overall performance, particularly on clinical reasoning tasks. The strong zero- and few-shot performance demonstrates better transferability of the shared prompt representation.
https://arxiv.org/abs/2604.06650
Knowledge Graph construction from natural language requires extracting structured triplets from complex, information-dense sentences. In this paper, we investigate if the decomposition of text into atomic propositions (minimal, semantically autonomous units of information) can improve the triplet extraction. We introduce MPropositionneur-V2, a small multilingual model covering six European languages trained by knowledge distillation from Qwen3-32B into a Qwen3-0.6B architecture, and we evaluate its integration into two extraction paradigms: entity-centric (GLiREL) and generative (Qwen3). Experiments on SMiLER, FewRel, DocRED and CaRB show that atomic propositions benefit weaker extractors (GLiREL, CoreNLP, 0.6B models), improving relation recall and, in the multilingual setting, overall accuracy. For stronger LLMs, a fallback combination strategy recovers entity recall losses while preserving the gains in relation extraction. These results show that atomic propositions are an interpretable intermediate data structure that complements extractors without replacing them.
https://arxiv.org/abs/2604.02866
Temporal Relation Extraction (TRE) requires identifying how two events or temporal expressions are related in time. Existing attention-based models often highlight globally salient tokens but overlook the pair-specific cues that actually determine the temporal relation. We propose WISTERIA (Weak Implicit Signal-based Temporal Relation Extraction with Attention), a framework that examines whether the top-K attention components conditioned on each event pair truly encode interpretable evidence for temporal classification. Unlike prior works assuming explicit markers such as before, after, or when, WISTERIA considers signals as any lexical, syntactic, or morphological element implicitly expressing temporal order. By combining multi-head attention with pair-conditioned top-K pooling, the model isolates the most informative contextual tokens for each pair. We conduct extensive experiments on TimeBank-Dense, MATRES, TDDMan, and TDDAuto, including linguistic analyses of top-K tokens. Results show that WISTERIA achieves competitive accuracy and reveals pair-level rationales aligned with temporal linguistic cues, offering a localized and interpretable view of temporal reasoning.
https://arxiv.org/abs/2603.23319
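To make the idea of pair-conditioned top-K pooling in the WISTERIA abstract above concrete, here is a minimal single-head sketch. It is not the authors' implementation: the pair query (sum of the two event vectors), the dot-product attention, and the toy 2-dimensional vectors are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pair_topk_pool(tokens, vecs, e1, e2, k=2):
    """Return the k context tokens most attended for the pair, plus their weighted mean."""
    # Assumed pair query: sum of the two event vectors.
    query = [a + b for a, b in zip(vecs[e1], vecs[e2])]
    context = [i for i in range(len(tokens)) if i not in (e1, e2)]
    scores = [dot(query, vecs[i]) for i in context]
    # Softmax over context tokens only.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = {i: e / total for i, e in zip(context, exps)}
    top = sorted(context, key=weights.get, reverse=True)[:k]
    # Renormalise over the selected tokens and pool their vectors.
    norm = sum(weights[i] for i in top)
    pooled = [
        sum(weights[i] / norm * vecs[i][d] for i in top)
        for d in range(len(query))
    ]
    return [tokens[i] for i in top], pooled


tokens = ["She", "left", "before", "he", "arrived"]
vecs = [[0.1, 0.0], [1.0, 0.0], [0.9, 0.9], [0.0, 0.3], [0.0, 1.0]]
selected, pooled = pair_topk_pool(tokens, vecs, e1=1, e2=4, k=2)
print(selected, pooled)
```

The selected tokens double as a pair-level rationale: for the events "left" and "arrived", the connective "before" receives the highest weight in this toy setup.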
Automated knowledge graph (KG) construction is essential for navigating the rapidly expanding body of scientific literature. However, existing approaches struggle to recognize long multi-word entities, often fail to generalize across domains, and typically overlook the hierarchical nature of scientific knowledge. While general-purpose large language models (LLMs) offer adaptability, they are computationally expensive and yield inconsistent accuracy on specialized tasks. As a result, current KGs are shallow and inconsistent, limiting their utility for exploration and synthesis. We propose a two-stage framework for scalable, zero-shot scientific KG construction. The first stage, Z-NERD, introduces (i) Orthogonal Semantic Decomposition (OSD), which promotes domain-agnostic entity recognition by isolating semantic "turns" in text, and (ii) a Multi-Scale TCQK attention mechanism that captures coherent multi-word entities through n-gram-aware attention heads. The second stage, HGNet, performs relation extraction with hierarchy-aware message passing, explicitly modeling parent, child, and peer relations. To enforce global consistency, we introduce two complementary objectives: a Differentiable Hierarchy Loss to discourage cycles and shortcut edges, and a Continuum Abstraction Field (CAF) Loss that embeds abstraction levels along a learnable axis in Euclidean space. This is the first approach to formalize hierarchical abstraction as a continuous property within standard Euclidean embeddings, offering a simpler alternative to hyperbolic methods. We release SPHERE (this https URL), a multi-domain benchmark for hierarchical relation extraction. Our framework establishes a new state of the art on SciERC, SciER, and SPHERE, improving NER by 8.08% and RE by 5.99% on out-of-distribution tests. In zero-shot settings, gains reach 10.76% for NER and 26.2% for RE.
https://arxiv.org/abs/2603.23136
We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: this https URL
https://arxiv.org/abs/2603.19017
Radiology report annotation is essential for clinical NLP, yet manual labeling is slow and costly. We present RadAnnotate, an LLM-based framework that studies retrieval-augmented synthetic reports and confidence-based selective automation to reduce expert effort for labeling in RadGraph. We study RadGraph-style entity labeling (graph nodes) and leave relation extraction (edges) to future work. First, we train entity-specific classifiers on gold-standard reports and characterize their strengths and failure modes across anatomy and observation categories, with uncertain observations hardest to learn. Second, we generate RAG-guided synthetic reports and show that synthetic-only models remain within 1-2 F1 points of gold-trained models, and that synthetic augmentation is especially helpful for uncertain observations in a low-resource setting, improving F1 from 0.61 to 0.70. Finally, by learning entity-specific confidence thresholds, RadAnnotate can automatically annotate 55-90% of reports at 0.86-0.92 entity match score while routing low-confidence cases for expert review.
https://arxiv.org/abs/2603.16002
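The confidence-based selective automation in the RadAnnotate abstract above can be sketched in two steps: pick a per-entity cut-off on validation data, then route predictions by it. The selection rule (lowest cut-off that meets a target precision), the function names, and all numbers below are assumptions for illustration, not the paper's procedure or results.

```python
def learn_threshold(scored, target_precision):
    """Lowest confidence cut-off whose accepted set meets the target precision.

    scored: (confidence, is_correct) pairs from a validation set.
    Returns None when no cut-off reaches the target.
    """
    best = None
    for cutoff in sorted({c for c, _ in scored}, reverse=True):
        accepted = [ok for c, ok in scored if c >= cutoff]
        if sum(accepted) / len(accepted) >= target_precision:
            best = cutoff
    return best


def route(predictions, thresholds):
    """Split predictions into auto-accepted labels and cases for expert review."""
    auto, review = [], []
    for span, entity_type, confidence in predictions:
        # Entity types without a learned threshold always go to review.
        if confidence >= thresholds.get(entity_type, float("inf")):
            auto.append((span, entity_type))
        else:
            review.append((span, entity_type))
    return auto, review


# Hypothetical validation scores for one entity type.
validation = [(0.95, True), (0.9, True), (0.8, False), (0.7, True), (0.6, False)]
thresholds = {"Anatomy": learn_threshold(validation, target_precision=0.75)}

predictions = [
    ("left lung", "Anatomy", 0.92),
    ("effusion", "Observation", 0.99),
    ("heart", "Anatomy", 0.55),
]
auto, review = route(predictions, thresholds)
print(thresholds, auto, review)
```

Lowering the target precision widens the automatically annotated share at the cost of more label errors, which is the trade-off behind the coverage and match-score ranges the abstract reports.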
Automated Drug Combination Extraction (DCE) from large-scale biomedical literature is crucial for advancing precision medicine and pharmacological research. However, existing relation extraction methods primarily focus on binary interactions and struggle to model variable-length n-ary drug combinations, where complex compatibility logic and distributed evidence need to be considered. To address these limitations, we propose RexDrug, an end-to-end reasoning-enhanced relation extraction framework for n-ary drug combination extraction based on large language models. RexDrug adopts a two-stage training strategy. First, a multi-agent collaborative mechanism is utilized to automatically generate high-quality expert-like reasoning traces for supervised fine-tuning. Second, reinforcement learning with a multi-dimensional reward function specifically tailored for DCE is applied to further refine reasoning quality and extraction accuracy. Extensive experiments on the DrugComb dataset show that RexDrug consistently outperforms state-of-the-art baselines for n-ary extraction. Additional evaluation on the DDI13 corpus confirms its generalizability to binary drug-drug interaction tasks. Human expert assessment and automatic reasoning metrics further indicate that RexDrug produces coherent medical reasoning while accurately identifying complex therapeutic regimens. These results establish RexDrug as a scalable and reliable solution for complex biomedical relation extraction from unstructured text. The source code and data are available at this https URL
https://arxiv.org/abs/2603.08166
Clinical information extraction (e.g., the 2010 i2b2/VA challenge) usually comprises the tasks of concept recognition, assertion classification, and relation extraction. Jointly modeling these multi-stage tasks in the clinical domain is an underexplored topic. The existing independent task setting (reference inputs given in each stage) makes joint models not directly comparable to existing pipeline work. To address these issues, we define a joint task setting and propose a novel end-to-end system to jointly optimize the three-stage tasks. We empirically investigate the joint evaluation of our proposal and the pipeline baseline with various embedding techniques: word, contextual, and in-domain contextual embeddings. The proposed joint system substantially outperforms the pipeline baseline by +0.3, +1.4, and +3.1 F1 on concept, assertion, and relation extraction, respectively. This work bridges joint approaches and clinical information extraction. The proposed approach could serve as a strong joint baseline for future research. The code is publicly available.
https://arxiv.org/abs/2603.07487
Mathematical text understanding is a challenging task due to the presence of specialized entities and complex relationships between them. This study formulates mathematical problem interpretation as a Mathematical Entity Relation Extraction (MERE) task, where operands are treated as entities and operators as their relationships. Transformer-based models are applied to automatically extract these relations from mathematical text, with Bidirectional Encoder Representations from Transformers (BERT) achieving the best performance, reaching an accuracy of 99.39%. To enhance transparency and trust in the model's predictions, Explainable Artificial Intelligence (XAI) is incorporated using Shapley Additive Explanations (SHAP). The explainability analysis reveals how specific textual and mathematical features influence relation prediction, providing insights into feature importance and model behavior. By combining transformer-based learning, a task-specific dataset, and explainable modeling, this work offers an effective and interpretable framework for MERE, supporting future applications in automated problem solving, knowledge graph construction, and intelligent educational systems.
https://arxiv.org/abs/2603.06348
Zero-shot relation extraction aims to identify relations between entity mentions using textual descriptions of novel types (i.e., previously unseen) instead of labeled training examples. Previous works often rely on unrealistic assumptions: (1) pairs of mentions are often encoded directly in the input, which prevents offline pre-computation for large scale document database querying; (2) no rejection mechanism is introduced, biasing the evaluation when using these models in a retrieval scenario where some (and often most) inputs are irrelevant and must be ignored. In this work, we study the robustness of existing zero-shot relation extraction models when adapting them to a realistic extraction scenario. To this end, we introduce a typology of existing models, and propose several strategies to build single pass models and models with a rejection mechanism. We adapt several state-of-the-art tools, and compare them in this challenging setting, showing that no existing work is really robust to realistic assumptions, but overall AlignRE (Li et al., 2024) performs best along all criteria.
https://arxiv.org/abs/2603.01266
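The two requirements raised in the zero-shot relation extraction abstract above (mention pairs encoded once, independently of the queried relation, and a rejection mechanism) can be pictured with a similarity threshold over precomputed vectors. This sketch is not AlignRE or any of the adapted systems: the cosine scorer, the `min_similarity` value, and the toy vectors are assumptions.

```python
import math


def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm


def predict_or_reject(pair_vec, relation_vecs, min_similarity=0.6):
    """Return the best-matching relation label, or None to reject the input."""
    label, best = max(
        ((name, cosine(pair_vec, vec)) for name, vec in relation_vecs.items()),
        key=lambda item: item[1],
    )
    return label if best >= min_similarity else None


# Relation description embeddings, computed at query time (toy values).
relation_vecs = {"born_in": [1.0, 0.0], "works_for": [0.0, 1.0]}

# Mention-pair embeddings, assumed to be precomputed offline in a single pass.
relevant_pair = [0.9, 0.1]
irrelevant_pair = [-1.0, -1.0]

print(predict_or_reject(relevant_pair, relation_vecs))
print(predict_or_reject(irrelevant_pair, relation_vecs))
```

Without the threshold, the second input would still be assigned whichever label scores highest, which is the evaluation bias the abstract attributes to models that lack a rejection mechanism.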
This paper introduces FRAME (Fine-grained Recognition of Art-historical Metadata and Entities), a manually annotated dataset of art-historical image descriptions for Named Entity Recognition (NER) and Relation Extraction (RE). Descriptions were collected from museum catalogs, auction listings, open-access platforms, and scholarly databases, then filtered to ensure that each text focuses on a single artwork and contains explicit statements about its material, composition, or iconography. FRAME provides stand-off annotations in three layers: a metadata layer for object-level properties, a content layer for depicted subjects and motifs, and a co-reference layer linking repeated mentions. Across layers, entity spans are labeled with 37 types and connected by typed RE links between mentions. Entity types are aligned with Wikidata to support Named Entity Linking (NEL) and downstream knowledge-graph construction. The dataset is released as UIMA XMI Common Analysis Structure (CAS) files with accompanying images and bibliographic metadata, and can be used to benchmark and fine-tune NER and RE systems, including zero- and few-shot setups with Large Language Models (LLMs).
https://arxiv.org/abs/2602.19133
HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person-place associations in multiple languages and time periods. Systems are asked to classify relations of two types, 'at' ("Has the person ever been at this place?") and 'isAt' ("Is the person located at this place around publication time?"), which requires reasoning over temporal and geographical cues. The lab introduces a three-fold evaluation profile that jointly assesses accuracy, computational efficiency, and domain generalization. By linking relation extraction to large-scale historical data processing, HIPE-2026 aims to support downstream applications in knowledge-graph construction, historical biography reconstruction, and spatial analysis in digital humanities.
https://arxiv.org/abs/2602.17663
Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks spanning Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers a strong lower-resource alternative. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score 9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian hospital, and 175M words from various sources that we used for continual pretraining.
https://arxiv.org/abs/2602.17475