Propaganda is a form of persuasion that has been used throughout history with the goal of influencing people's opinions through rhetorical and psychological techniques for specific ends. Although Arabic ranks as the fourth most-used language on the internet, resources for propaganda detection in languages other than English, especially Arabic, remain extremely limited. To address this gap, the first Arabic dataset for Multi-label Propaganda, Sentiment, and Emotion (MultiProSE) has been introduced. MultiProSE is an open-source extension of the existing Arabic propaganda dataset, ArPro, with the addition of sentiment and emotion annotations for each text. The dataset comprises 8,000 annotated news articles, making it the largest propaganda dataset to date. For each task, several baselines have been developed using large language models (LLMs), such as GPT-4o-mini, and pre-trained language models (PLMs), including three BERT-based models. The dataset, annotation guidelines, and source code are all publicly released to facilitate future research and development in Arabic language models and to contribute to a deeper understanding of how various opinion dimensions interact in news media.
https://arxiv.org/abs/2502.08319
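As a rough illustration of the multi-label setup behind the BERT-based baselines above, here is a minimal sketch using Hugging Face Transformers; the AraBERT checkpoint and the label count are assumptions, not the paper's exact configuration.

```python
# Minimal multi-label fine-tuning sketch (assumed setup, not the paper's
# exact baseline): one Arabic BERT encoder with an independent sigmoid
# output per propaganda-technique label.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "aubmindlab/bert-base-arabertv2"  # assumed Arabic BERT checkpoint
NUM_LABELS = 23                           # illustrative label count

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL,
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # BCE-with-logits loss
)

texts = ["..."]                                # a batch of news paragraphs
labels = torch.zeros(len(texts), NUM_LABELS)   # multi-hot target vectors
labels[0, [2, 7]] = 1.0                        # e.g. two techniques co-occur

batch = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
out = model(**batch, labels=labels)
out.loss.backward()                   # plug into any standard training loop

probs = torch.sigmoid(out.logits)     # per-label probabilities
preds = (probs > 0.5).int()           # threshold each label independently
```

The same head structure covers the sentiment and emotion tasks by swapping the label set.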
Hate speech detection is a crucial task, especially on social media, where harmful content can spread quickly. Implementing machine learning models to automatically identify and address hate speech is essential for mitigating its impact and preventing its proliferation. The first step in developing an effective hate speech detection model is to acquire a high-quality dataset for training. Labeled data is foundational for most natural language processing tasks, but categorizing hate speech is difficult due to its diverse and often subjective nature, which can lead to varying interpretations and disagreements among annotators. This paper examines strategies for addressing annotator disagreement, an issue that has been largely overlooked. In particular, we evaluate different approaches to dealing with annotator disagreement in hate speech classification of Turkish tweets, based on a fine-tuned BERT model. Our work highlights the importance of the problem and provides state-of-the-art benchmark results for the detection and understanding of hate speech in online discourse.
https://arxiv.org/abs/2502.08266
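The abstract does not enumerate the disagreement strategies it evaluates, so the sketch below shows three generic ways such disagreement is commonly resolved before fine-tuning, purely as a hedged stand-in.

```python
# Three common ways to turn conflicting annotator votes into training
# targets (illustrative, not the paper's exact list of strategies).
import numpy as np

# Each row: one tweet; each column: one annotator's 0/1 hate-speech vote.
votes = np.array([
    [1, 1, 0],   # 2-of-3 agreement
    [1, 0, 0],   # strong disagreement
    [0, 0, 0],   # full agreement
])

# Strategy 1: majority vote -> hard labels for standard cross-entropy.
hard_labels = (votes.mean(axis=1) >= 0.5).astype(int)

# Strategy 2: keep the vote distribution -> soft labels, so the model
# is trained against annotation uncertainty rather than a forced choice.
soft_labels = votes.mean(axis=1)               # P(hate) per tweet

# Strategy 3: discard high-disagreement items before training.
keep = np.abs(soft_labels - 0.5) > 0.2
print(hard_labels, soft_labels, keep)
```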
This paper presents a novel Natural Language Processing (NLP) framework for enhancing medical diagnosis through the integration of advanced techniques in data augmentation, feature extraction, and classification. The proposed approach employs back-translation to generate diverse paraphrased datasets, improving robustness and mitigating overfitting in classification tasks. Leveraging Decoding-enhanced BERT with Disentangled Attention (DeBERTa) with Dynamic Contextual Positional Gating (DCPG), the model captures fine-grained contextual and positional relationships, dynamically adjusting the influence of positional information based on semantic context to produce high-quality text embeddings. For classification, an Attention-Based Feedforward Neural Network (ABFNN) is utilized, effectively focusing on the most relevant features to improve decision-making accuracy. Applied to the classification of symptoms, clinical notes, and other medical texts, this architecture demonstrates its ability to address the complexities of medical data. The combination of data augmentation, contextual embedding generation, and advanced classification mechanisms offers a robust and accurate diagnostic tool, with potential applications in automated medical diagnosis and clinical decision support. This method demonstrates the effectiveness of the proposed NLP framework for medical diagnosis, achieving remarkable results with an accuracy of 99.78%, recall of 99.72%, precision of 99.79%, and an F1-score of 99.75%. These metrics not only underscore the model's robust performance in classifying medical texts with exceptional precision and reliability but also highlight its superiority over existing methods, making it a highly promising tool for automated diagnostic systems.
https://arxiv.org/abs/2502.07755
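A minimal sketch of the back-translation step the paper above uses for augmentation; the English-French pivot and the MarianMT checkpoints are assumptions for illustration.

```python
# Back-translation for paraphrase-style augmentation: translate out to a
# pivot language and back, keeping the round-trip as a new training sample.
from transformers import pipeline

en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(text: str) -> str:
    """Round-trip EN -> FR -> EN to obtain a paraphrase of `text`."""
    pivot = en_to_fr(text, max_length=256)[0]["translation_text"]
    return fr_to_en(pivot, max_length=256)[0]["translation_text"]

note = "Patient reports persistent chest pain radiating to the left arm."
augmented = back_translate(note)  # paraphrase to add to the training set
print(augmented)
```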
Negation has been a long-standing challenge for language models. Previous studies have shown that they struggle with negation in many natural language understanding tasks. In this work, we propose a self-supervised method to make language models more robust against negation. We introduce a novel task, Next Sentence Polarity Prediction (NSPP), and a variation of the Next Sentence Prediction (NSP) task. We show that BERT and RoBERTa further pre-trained on our tasks outperform the off-the-shelf versions on nine negation-related benchmarks. Most notably, our pre-training tasks yield between 1.8% and 9.1% improvement on CondaQA, a large question-answering corpus requiring reasoning over negation.
https://arxiv.org/abs/2502.07717
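The abstract does not spell out how NSPP targets are derived, so the sketch below silver-labels the next sentence's polarity with an off-the-shelf sentiment pipeline; treat that labeling choice as an assumption.

```python
# Building Next Sentence Polarity Prediction (NSPP) examples: given
# sentence s_i, the model must predict the polarity of s_{i+1}.
from transformers import pipeline

polarity = pipeline("sentiment-analysis")   # default English SST-2 model

document = [
    "The trial results were announced yesterday.",
    "The drug did not reduce symptoms in any cohort.",
    "Investors were nevertheless pleased with the safety profile.",
]

nspp_examples = []
for cur, nxt in zip(document, document[1:]):
    label = polarity(nxt)[0]["label"]       # POSITIVE / NEGATIVE
    nspp_examples.append({"input": cur, "target": label})

for ex in nspp_examples:                    # encoder learns to anticipate
    print(ex)                               # upcoming polarity shifts
```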
We present FoQA, a Faroese extractive question-answering (QA) dataset with 2,000 samples, created using a semi-automated approach combining Large Language Models (LLMs) and human validation. The dataset was generated from Faroese Wikipedia articles using GPT-4-turbo for initial QA generation, followed by question rephrasing to increase complexity and native speaker validation to ensure quality. We provide baseline performance metrics for FoQA across multiple models, including LLMs and BERT, demonstrating its effectiveness in evaluating Faroese QA performance. The dataset is released in three versions: a validated set of 2,000 samples, a complete set of all 10,001 generated samples, and a set of 2,395 rejected samples for error analysis.
https://arxiv.org/abs/2502.07642
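A schematic of the three-stage pipeline the abstract outlines; `call_llm` is a hypothetical placeholder for the GPT-4-turbo calls, and the prompts are invented for illustration.

```python
# FoQA-style pipeline: (1) generate a QA pair from an article, (2) rephrase
# the question for complexity, (3) route to native-speaker validation.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up an LLM client here")  # hypothetical

def generate_candidate(article: str) -> dict:
    qa = call_llm(f"Write a question and an extractive answer from:\n{article}")
    question, answer = qa.split("\n", 1)
    harder = call_llm(f"Rephrase this question to be more complex: {question}")
    return {"question": harder, "answer": answer, "context": article}

def triage(sample: dict, approved_by_native_speaker: bool) -> str:
    # Approved samples form the validated 2,000-sample release; rejected
    # ones are kept as a separate split for error analysis.
    return "validated" if approved_by_native_speaker else "rejected"
```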
The imitation of voice, targeted at specific speech attributes such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data and struggle to effectively disentangle timbre and style, making controllable generation difficult to achieve, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zero-shot voice imitation framework with controllable timbre and style. Vevo operates in two core stages: (1) Content-Style Modeling: given either text or a speech utterance's content tokens as input, an autoregressive transformer, prompted by a style reference, generates the content-style tokens; (2) Acoustic Modeling: given the content-style tokens as input, a flow-matching transformer, prompted by a timbre reference, produces acoustic representations. To obtain the content and content-style tokens of speech, we design a fully self-supervised approach that progressively decouples the timbre, style, and linguistic content of speech. Specifically, we adopt VQ-VAE as the tokenizer for the continuous hidden features of HuBERT. We treat the vocabulary size of the VQ-VAE codebook as an information bottleneck and adjust it carefully to obtain disentangled speech representations. Trained solely with self-supervision on 60K hours of audiobook speech data, without any fine-tuning on style-specific corpora, Vevo matches or surpasses existing methods in accent and emotion conversion tasks. Additionally, Vevo's effectiveness in zero-shot voice conversion and text-to-speech tasks further demonstrates its strong generalization and versatility. Audio samples are available at this https URL.
https://arxiv.org/abs/2502.07243
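The codebook-size-as-bottleneck idea can be made concrete with a toy vector quantizer over HuBERT-like features; the shapes, sizes, and straight-through trick below are standard VQ-VAE machinery assumed for illustration, not Vevo's exact implementation.

```python
# Quantizing continuous features with a VQ codebook: a small codebook
# forces the tokens to keep mostly linguistic content, a large one lets
# style information through as well.
import torch

class VectorQuantizer(torch.nn.Module):
    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = torch.nn.Embedding(codebook_size, dim)

    def forward(self, h):                     # h: (batch, frames, dim)
        # Squared distances to every codebook entry: (B, T, K).
        dists = (h.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        codes = dists.argmin(dim=-1)          # discrete token ids
        quantized = self.codebook(codes)
        # Straight-through estimator: gradients flow back to the encoder.
        quantized = h + (quantized - h).detach()
        return codes, quantized

hubert_feats = torch.randn(1, 200, 768)       # stand-in for HuBERT output
tight = VectorQuantizer(codebook_size=32, dim=768)    # narrow bottleneck
loose = VectorQuantizer(codebook_size=4096, dim=768)  # wide bottleneck
content_codes, _ = tight(hubert_feats)
content_style_codes, _ = loose(hubert_feats)
```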
Transparent object manipulation remains a significant challenge in robotics due to the difficulty of acquiring accurate and dense depth measurements. Conventional depth sensors often fail with transparent objects, resulting in incomplete or erroneous depth data. Existing depth completion methods struggle with interframe consistency and incorrectly model transparent objects as Lambertian surfaces, leading to poor depth reconstruction. To address these challenges, we propose TranSplat, a surface embedding-guided 3D Gaussian Splatting method tailored for transparent objects. TranSplat uses a latent diffusion model to generate surface embeddings that provide consistent and continuous representations, making it robust to changes in viewpoint and lighting. By integrating these surface embeddings with input RGB images, TranSplat effectively captures the complexities of transparent surfaces, enhancing the splatting of 3D Gaussians and improving depth completion. Evaluations on synthetic and real-world transparent object benchmarks, as well as robot grasping tasks, show that TranSplat achieves accurate and dense depth completion, demonstrating its effectiveness in practical applications. We open-source our synthetic dataset and model: https://github.com/jeongyun0609/TranSplat
https://arxiv.org/abs/2502.07840
Embodied intelligence integrates multiple modalities, enabling agents to understand images, language, and actions simultaneously. However, existing models typically depend on additional datasets or extensive pre-training to maximize performance, consuming substantial training time and incurring high hardware costs. To tackle this issue, we present RoboBERT, a novel end-to-end robotic manipulation model paired with a unique training strategy. The model utilizes a CNN-based diffusion policy, whose effectiveness is enhanced and stabilized by separating the training processes for different modalities. It also underscores the importance of data augmentation, verifying various techniques that significantly boost performance. Unlike models that depend on extra data or large foundation models, RoboBERT achieves a highly competitive success rate while using only language-labeled expert demonstrations and maintaining a relatively small model size. Specifically, RoboBERT achieves an average length of 4.52 on the CALVIN benchmark for the \(ABCD \rightarrow D\) task, setting a new state-of-the-art (SOTA) record. Furthermore, when tested on a real robot, the model demonstrates superior performance, achieving a higher success rate than other methods trained with the same data. We believe these concepts and methodologies of RoboBERT demonstrate extensive versatility and compatibility, contributing significantly to the development of lightweight multimodal robotic models. The code can be accessed on this https URL
https://arxiv.org/abs/2502.07837
Large Language Models (LLMs) are emerging as transformative tools for software vulnerability detection, addressing critical challenges in the security domain. Traditional methods, such as static and dynamic analysis, often falter due to inefficiencies, high false positive rates, and the growing complexity of modern software systems. By leveraging their ability to analyze code structures, identify patterns, and generate repair suggestions, LLMs, exemplified by models like GPT, BERT, and CodeBERT, present a novel and scalable approach to mitigating vulnerabilities. This paper provides a detailed survey of LLMs in vulnerability detection. It examines key aspects, including model architectures, application methods, target languages, fine-tuning strategies, datasets, and evaluation metrics. We also analyze the scope of current research problems, highlighting the strengths and weaknesses of existing approaches. Further, we address challenges such as cross-language vulnerability detection, multimodal data integration, and repository-level analysis. Based on these findings, we propose solutions for issues like dataset scalability, model interpretability, and applications in low-resource scenarios. Our contributions are threefold: (1) a systematic review of how LLMs are applied in vulnerability detection; (2) an analysis of shared patterns and differences across studies, with a unified framework for understanding the field; and (3) a summary of key challenges and future research directions. This work provides valuable insights for advancing LLM-based vulnerability detection. We also maintain and regularly update the latest selected papers at this https URL
https://arxiv.org/abs/2502.07049
This study explores strategies for efficiently classifying scientific full texts using both small, BERT-based models and local large language models such as Llama-3.1 8B. We focus on developing methods for selecting subsets of input sentences to reduce input size while simultaneously enhancing classification performance. To this end, we compile a novel dataset consisting of full-text scientific papers from the field of invasion biology, specifically addressing the impacts of invasive species. These papers are aligned with publicly available impact assessments created by researchers for the International Union for Conservation of Nature (IUCN). Through extensive experimentation, we demonstrate that various sources, such as human evidence annotations, LLM-generated annotations, or explainability scores, can be used to train sentence selection models that improve the performance of both encoder- and decoder-based language models while optimizing efficiency by reducing input length, yielding improved results even when compared to models like ModernBERT that can handle the complete text as input. Additionally, we find that repeated sampling of shorter inputs proves to be a very effective strategy that, at a slightly increased cost, can further improve classification performance.
https://arxiv.org/abs/2502.06551
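A toy version of the sentence-selection step; the keyword scorer below is a stand-in for the trained selection models, which the paper derives from human evidence annotations, LLM-generated annotations, or explainability scores.

```python
# Score every sentence for task relevance, keep the top-k in document
# order, and feed only the shortened text to the classifier.
def select_sentences(sentences: list[str], k: int) -> str:
    keywords = {"invasive", "impact", "species", "spread"}  # illustrative
    def score(s: str) -> int:
        return sum(w.lower().strip(".,") in keywords for w in s.split())
    top = set(sorted(sentences, key=score, reverse=True)[:k])
    return " ".join(s for s in sentences if s in top)

paper = [
    "The study surveys wetlands in three regions.",
    "The invasive species reduced native fish populations.",
    "Funding was provided by a national grant.",
]
short_input = select_sentences(paper, k=1)  # pass to BERT / Llama-3.1 8B
print(short_input)
```

Repeated sampling, the strategy noted at the end of the abstract, amounts to drawing several such subsets and aggregating the classifier's predictions.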
Unsupervised Continuous Anomaly Detection (UCAD) faces significant challenges in multi-task representation learning, with existing methods suffering from incomplete representations and catastrophic forgetting. Unlike supervised models, unsupervised scenarios lack prior information, making it difficult to effectively distinguish redundant from complementary multimodal features. To address this, we propose the Multimodal Task Representation Memory Bank (MTRMB) method, built on two key technical innovations: (1) a Key-Prompt-Multimodal Knowledge (KPMK) mechanism that uses concise key prompts to guide cross-modal feature interaction between BERT and ViT; and (2) Refined Structure-based Contrastive Learning (RSCL), which leverages Grounding DINO and SAM to generate precise segmentation masks, pulling features of the same structural region closer while pushing features of different structural regions apart. Experiments on the MVTec AD and VisA datasets demonstrate MTRMB's superiority, achieving an average detection accuracy of 0.921 at the lowest forgetting rate, significantly outperforming state-of-the-art methods. We plan to open-source the code on GitHub.
https://arxiv.org/abs/2502.06194
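A hedged sketch of the RSCL objective in a standard supervised-contrastive form: features sharing a segmentation-mask region are positives, everything else negatives. The random region labels stand in for Grounding DINO + SAM masks, and the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def region_contrastive_loss(feats, region_ids, temperature=0.1):
    """feats: (N, D) patch features; region_ids: (N,) structural labels."""
    z = F.normalize(feats, dim=1)
    sim = z @ z.t() / temperature                    # pairwise similarity
    same = region_ids.unsqueeze(0) == region_ids.unsqueeze(1)
    eye = torch.eye(len(z), dtype=torch.bool)
    pos = same & ~eye                                # same region, not self
    # Log-softmax over each row, excluding self-similarity.
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(eye, float("-inf")), dim=1, keepdim=True)
    return -log_prob[pos].mean()

feats = torch.randn(16, 128, requires_grad=True)
regions = torch.randint(0, 3, (16,))                 # 3 structural regions
loss = region_contrastive_loss(feats, regions)
loss.backward()
```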
Social media has become a crucial open-access platform for individuals to express opinions and share experiences. However, leveraging low-resource language data from Twitter is challenging due to scarce, poor-quality content and major variations in language use, such as slang and code-switching. Identifying tweets in these languages can be difficult, as Twitter primarily supports high-resource languages. We analyze Kenyan code-switched data and evaluate four state-of-the-art (SOTA) transformer-based pretrained models for sentiment and emotion classification, using supervised and semi-supervised methods. We detail the methodology behind data collection and annotation, and the challenges encountered during the data curation phase. Our results show that XLM-R outperforms the other models; for sentiment analysis, the supervised XLM-R model achieves the highest accuracy (69.2%) and F1 score (66.1%), followed by semi-supervised XLM-R (67.2% accuracy, 64.1% F1 score). In emotion analysis, supervised DistilBERT leads in accuracy (59.8%) and F1 score (31%), followed by semi-supervised mBERT (59% accuracy, 26.5% F1 score). The AfriBERTa models show the lowest accuracy and F1 scores. All models tend to predict neutral sentiment, with AfriBERTa showing the highest bias and a unique sensitivity to the empathy emotion. this https URL
https://arxiv.org/abs/2502.06180
Metonymy plays an important role in our daily communication. People naturally think about things using their most salient properties or commonly related concepts. For example, by saying "The bus decided to skip our stop today," we actually mean that the bus driver made the decision, not the bus. Prior work on metonymy resolution has mainly focused on named entities. However, metonymy involving common nouns (such as desk, baby, and school) is also a frequent and challenging phenomenon. We argue that NLP systems should be capable of identifying the metonymic use of common nouns in context. We create a new metonymy dataset ConMeC, which consists of 6,000 sentences, where each sentence is paired with a target common noun and annotated by humans to indicate whether that common noun is used metonymically or not in that context. We also introduce a chain-of-thought based prompting method for detecting metonymy using large language models (LLMs). We evaluate our LLM-based pipeline, as well as a supervised BERT model on our dataset and three other metonymy datasets. Our experimental results demonstrate that LLMs could achieve performance comparable to the supervised BERT model on well-defined metonymy categories, while still struggling with instances requiring nuanced semantic understanding. Our dataset is publicly available at: this https URL.
https://arxiv.org/abs/2502.06087
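One possible shape for the chain-of-thought prompt the abstract mentions; the wording is an assumption, not the paper's template.

```python
def metonymy_cot_prompt(sentence: str, noun: str) -> str:
    # Walk the LLM through the literal referent vs. the actual agent
    # before asking for a verdict.
    return (
        f'Sentence: "{sentence}"\n'
        f'Target noun: "{noun}"\n'
        "Think step by step:\n"
        "1. What does the noun literally refer to?\n"
        "2. What is actually performing the action or being described?\n"
        "3. If the two differ but are closely associated (container/"
        "contents, place/institution, object/user), the use is metonymic.\n"
        "Answer METONYMIC or LITERAL, then give a one-line justification."
    )

print(metonymy_cot_prompt("The bus decided to skip our stop today.", "bus"))
```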
Smart word substitution aims to enhance sentence quality by improving word choices; however, current benchmarks rely on human-labeled data. Since word choices are inherently subjective, ground-truth word substitutions generated by a small group of annotators are often incomplete and likely not generalizable. To circumvent this issue, we instead employ a model-based score (BARTScore) to quantify sentence quality, thus forgoing the need for human annotations. Specifically, we use this score to define a distribution for each word substitution, allowing one to test whether a substitution is statistically superior to others. In addition, we propose a loss function that directly optimizes the alignment between model predictions and sentence scores, while also enhancing the overall quality score of a substitution. Crucially, model learning no longer requires human labels, thus avoiding the cost of annotation while maintaining the quality of the text modified with substitutions. Experimental results show that the proposed approach outperforms both masked language models (BERT, BART) and large language models (GPT-4, LLaMA). The source code is available at this https URL.
https://arxiv.org/abs/2502.05933
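BARTScore reduces to the mean target-token log-likelihood under a seq2seq BART model; a simplified scorer under that reading might look as follows. The CNN/DailyMail checkpoint matches the official BARTScore default, but treat the details as assumptions.

```python
# Rank candidate substitutions by how likely BART finds the rewritten
# sentence given the original.
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained(
    "facebook/bart-large-cnn").eval()

@torch.no_grad()
def bart_score(src: str, tgt: str) -> float:
    enc = tok(src, return_tensors="pt")
    labels = tok(tgt, return_tensors="pt").input_ids
    out = model(**enc, labels=labels)
    return -out.loss.item()        # mean log-prob per target token

original = "He gave a speech that was very good."
candidates = ["He gave a speech that was excellent.",
              "He gave a speech that was tasty."]
best = max(candidates, key=lambda c: bart_score(original, c))
print(best)                        # the statistically preferred substitution
```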
In this paper, we address the task of semantic segmentation of legal documents through rhetorical role classification, with a focus on Indian legal judgments. We introduce LegalSeg, the largest annotated dataset for this task, comprising over 7,000 documents and 1.4 million sentences, labeled with 7 rhetorical roles. To benchmark performance, we evaluate multiple state-of-the-art models, including Hierarchical BiLSTM-CRF, TransformerOverInLegalBERT (ToInLegalBERT), Graph Neural Networks (GNNs), and Role-Aware Transformers, alongside an exploratory RhetoricLLaMA, an instruction-tuned large language model. Our results demonstrate that models incorporating broader context, structural relationships, and sequential sentence information outperform those relying solely on sentence-level features. Additionally, we conducted experiments using surrounding context and predicted or actual labels of neighboring sentences to assess their impact on classification accuracy. Despite these advancements, challenges persist in distinguishing between closely related roles and addressing class imbalance. Our work underscores the potential of advanced techniques for improving legal document understanding and sets a strong foundation for future research in legal NLP.
https://arxiv.org/abs/2502.05836
Europe's healthcare systems require enhanced interoperability and digitalization, driving a demand for innovative solutions to process legacy clinical data. This paper presents the results of our project, which aims to leverage Large Language Models (LLMs) to extract structured information from unstructured clinical reports, focusing on patient history, diagnoses, treatments, and other predefined categories. We developed a workflow with a user interface and evaluated LLMs of varying sizes through prompting strategies and fine-tuning. Our results show that fine-tuned smaller models match or surpass larger counterparts in performance, offering efficiency for resource-limited settings. A new dataset of 60,000 annotated English clinical summaries and 24,000 German translations was validated with automated and manual checks. The evaluations used ROUGE, BERTScore, and entity-level metrics. The work highlights the approach's viability and outlines future improvements.
https://arxiv.org/abs/2502.05638
Accurate greenhouse gas (GHG) emission reporting is critical for governments, businesses, and investors. However, adoption remains limited, particularly among small and medium enterprises, due to high implementation costs, fragmented emission factor databases, and a lack of robust sector classification methods. To address these challenges, we introduce Group Reasoning Emission Estimation Networks (GREEN), an AI-driven carbon accounting framework that standardizes enterprise-level emission estimation, constructs a large-scale benchmark dataset, and leverages a novel reasoning approach with large language models (LLMs). Specifically, we compile textual descriptions for 20,850 companies with validated North American Industry Classification System (NAICS) labels and align these with an economic model of carbon intensity factors. By reframing sector classification as an information retrieval task, we fine-tune Sentence-BERT models using a contrastive learning loss. To overcome the limitations of single-stage models in handling thousands of hierarchical categories, we propose a Group Reasoning method that ensembles LLM classifiers based on the natural NAICS ontology, decomposing the task into multiple sub-classification steps. We theoretically prove that this approach reduces classification uncertainty and computational complexity. Experiments on 1,114 NAICS categories yield state-of-the-art performance (83.68% Top-1, 91.47% Top-10 accuracy), and case studies on 20 companies report a mean absolute percentage error (MAPE) of 45.88%. The project is available at: this https URL.
https://arxiv.org/abs/2502.06874
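A sketch of the retrieval reframing: fine-tune a Sentence-BERT model with a contrastive loss so company descriptions embed near their sector descriptions. The base checkpoint, example pairs, and in-batch-negatives loss are assumptions standing in for the paper's setup.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")      # assumed checkpoint

train_pairs = [
    InputExample(texts=["Maker of lithium-ion battery packs for EVs",
                        "NAICS 335910: Battery Manufacturing"]),
    InputExample(texts=["Operates regional grain elevators",
                        "NAICS 424510: Grain and Field Bean Wholesalers"]),
]
loader = DataLoader(train_pairs, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)    # in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)

# At inference, embed a company description and retrieve the nearest
# sector text; the paper's Group Reasoning step then refines the result
# by walking the NAICS hierarchy with ensembled LLM classifiers.
```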
Synthetic data generation is widely recognized as a way to enhance the quality of neural grammatical error correction (GEC) systems. However, current approaches often lack diversity or are too simplistic to generate the wide range of grammatical errors made by humans, especially for low-resource languages such as Arabic. In this paper, we develop an error tagging model and a synthetic data generation model to create a large synthetic Arabic dataset for grammatical error correction. In the error tagging model, the correct sentence is categorized into multiple error types using the DeBERTav3 model. The Arabic Error Type Annotation tool (ARETA) is used to guide the multi-label classification task, in which each sentence is classified into 26 error tags. The synthetic data generation model is a back-translation-based model that generates incorrect sentences by prepending the error tags produced by the error tagging model to the correct sentence, using the AraT5 model. On the QALB-14 and QALB-15 test sets, the error tagging model achieved 94.42% F1, which is state-of-the-art in identifying error tags in clean sentences. Training on our synthetic data for grammatical error correction yields a new state-of-the-art F1 score of 79.36% on the QALB-14 test set. Using the synthetic data generation model, we generate 30,219,310 synthetic sentence pairs.
https://arxiv.org/abs/2502.05312
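A sketch of the tag-conditioned corruption input; the tag token format is an assumption, and `corruptor` is a hypothetical handle for the fine-tuned AraT5 model.

```python
# Prepend ARETA-style error tags to a correct sentence; the seq2seq model
# learns to emit an incorrect sentence exhibiting exactly those errors.
def build_generation_input(error_tags: list[str], correct: str) -> str:
    return " ".join(f"<{t}>" for t in error_tags) + " " + correct

src = build_generation_input(["OH", "MG"], "ذهب الطالب إلى المدرسة")
# incorrect = corruptor.generate(src)   # hypothetical AraT5 call that
#                                       # yields the corrupted sentence
print(src)
```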
Machine learning is widely believed to be one of the most promising practical applications of quantum computing. Existing quantum machine learning schemes typically employ a quantum-classical hybrid approach that relies crucially on gradients of model parameters. Such an approach lacks provable convergence to global minima and will become infeasible as quantum learning models scale up. Here, we introduce quantum automated learning, where no variational parameter is involved and the training process is converted to quantum state preparation. In particular, we encode training data into unitary operations and iteratively evolve a random initial state under these unitaries and their inverses, with a target-oriented perturbation towards higher prediction accuracy sandwiched in between. Under reasonable assumptions, we rigorously prove that the evolution converges exponentially to the desired state corresponding to the global minimum of the loss function. We show that such a training process can be understood from the perspective of preparing quantum states by imaginary time evolution, where the data-encoded unitaries together with target-oriented perturbations would train the quantum learning model in an automated fashion. We further prove that the quantum automated learning paradigm features good generalization ability with the generalization error upper bounded by the ratio between a logarithmic function of the Hilbert space dimension and the number of training samples. In addition, we carry out extensive numerical simulations on real-life images and quantum data to demonstrate the effectiveness of our approach and validate the assumptions. Our results establish an unconventional quantum learning strategy that is gradient-free with provable and explainable trainability, which would be crucial for large-scale practical applications of quantum computing in machine learning scenarios.
https://arxiv.org/abs/2502.05264
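The stated generalization bound can be written out compactly; the form below paraphrases the abstract with constants omitted, and is not the paper's theorem verbatim.

```latex
% Generalization error of quantum automated learning: upper bounded by a
% logarithm of the Hilbert-space dimension over the sample count N.
\[
  \epsilon_{\mathrm{gen}} \;\lesssim\; \frac{\log\big(\dim \mathcal{H}\big)}{N}
\]
% Since dim H = 2^n for n qubits, log(dim H) = n log 2: the bound grows
% only linearly in qubit count while shrinking inversely in sample count.
```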
Recent advancements in large language models have demonstrated significant potential in the automated construction of knowledge graphs from unstructured text. This paper builds upon our previous work [16], which evaluated various models using metrics like precision, recall, F1 score, triple matching, and graph matching, and introduces a refined approach to address the critical issues of hallucination and omission. We propose an enhanced evaluation framework incorporating BERTScore for graph similarity, setting a practical threshold of 95% for graph matching. Our experiments focus on the Mistral model, comparing its original and fine-tuned versions in zero-shot and few-shot settings. We further extend our experiments using examples from the KELM-sub training dataset, illustrating that the fine-tuned model significantly improves knowledge graph construction accuracy while reducing hallucination and omission. However, our findings also reveal that the fine-tuned models perform worse on generalization tasks on the KELM-sub dataset. This study underscores the importance of comprehensive evaluation metrics in advancing the state-of-the-art in knowledge graph construction from textual data.
https://arxiv.org/abs/2502.05239
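A sketch of the thresholded similarity check the abstract describes, using the `bert_score` package; the way triples are verbalized into strings is an assumption.

```python
# Count a generated triple as a graph match only when its BERTScore F1
# against the reference meets the 95% threshold.
from bert_score import score

refs = ["Marie Curie | award | Nobel Prize in Physics"]
cands = ["Marie Curie | received award | Nobel Prize in Physics"]

P, R, F1 = score(cands, refs, lang="en")   # tensors, one entry per pair
is_match = bool((F1 >= 0.95).item())
print(F1.item(), is_match)   # below threshold -> counted as hallucination
                             # or omission rather than a match
```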