In this paper we investigate the use of decoder-based generative transformers for extracting sentiment towards named entities in Russian news articles. We study the sentiment analysis capabilities of instruction-tuned large language models (LLMs), using the RuSentNE-2023 dataset. The first group of experiments evaluates the zero-shot capabilities of both closed- and open-source LLMs. The second covers fine-tuning Flan-T5 using the "chain-of-thought" (CoT) three-hop reasoning framework (THoR). We found that the zero-shot approaches perform on par with baseline fine-tuned encoder-based transformers (BERT-base). Fine-tuning Flan-T5 with THoR yields at least a 5% improvement over the zero-shot results even with the base-size model. The best results for sentiment analysis on RuSentNE-2023 were achieved by the fine-tuned Flan-T5-xl, which surpassed previous state-of-the-art transformer-based classifiers. Our CoT application framework is publicly available: this https URL
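As a minimal sketch of how such a three-hop CoT chain for entity-oriented sentiment might be assembled (the `generate` function is a stand-in for any instruction-tuned LLM call, e.g. a Flan-T5 wrapper, and the prompt wording is illustrative rather than the authors' exact templates):

```python
# Hypothetical three-hop prompt chain for entity-oriented sentiment.
# `generate` is a placeholder for an instruction-tuned LLM call.

def generate(prompt: str) -> str:
    """Placeholder: replace with a real model call (e.g. Flan-T5 generate)."""
    raise NotImplementedError

def three_hop_sentiment(text: str, entity: str) -> str:
    # Hop 1: identify which aspect of the entity the text discusses.
    aspect = generate(
        f'Given the sentence: "{text}"\n'
        f"Which aspect of {entity} is being discussed? Answer briefly."
    )
    # Hop 2: extract the opinion expressed about that aspect.
    opinion = generate(
        f'Given the sentence: "{text}"\n'
        f"The aspect of {entity} under discussion is: {aspect}\n"
        "What opinion does the text express about this aspect?"
    )
    # Hop 3: map the accumulated reasoning to a polarity label.
    return generate(
        f'Given the sentence: "{text}"\n'
        f"Aspect: {aspect}\nOpinion: {opinion}\n"
        f"Is the sentiment towards {entity} positive, negative or neutral?"
    )
```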
https://arxiv.org/abs/2404.12342
This study introduces a novel method for irony detection, applying Large Language Models (LLMs) with prompt-based learning to facilitate emotion-centric text augmentation. Traditional irony detection techniques typically fall short due to their reliance on static linguistic features and predefined knowledge bases, often overlooking the nuanced emotional dimensions integral to irony. In contrast, our methodology enriches the detection process by integrating subtle emotional cues, generated through LLMs, into three benchmark pre-trained NLP models - BERT, T5, and GPT-2 - which are widely recognized as foundational in irony detection. We assessed our method on the SemEval-2018 Task 3 dataset and observed substantial enhancements in irony detection capabilities.
https://arxiv.org/abs/2404.12291
Data analysts have long sought to turn unstructured text data into meaningful concepts. Though common, topic modeling and clustering focus on lower-level keywords and require significant interpretative work. We introduce concept induction, a computational process that instead produces high-level concepts, defined by explicit inclusion criteria, from unstructured text. For a dataset of toxic online comments, where a state-of-the-art BERTopic model outputs "women, power, female," concept induction produces high-level concepts such as "Criticism of traditional gender roles" and "Dismissal of women's concerns." We present LLooM, a concept induction algorithm that leverages large language models to iteratively synthesize sampled text and propose human-interpretable concepts of increasing generality. We then instantiate LLooM in a mixed-initiative text analysis tool, enabling analysts to shift their attention from interpreting topics to engaging in theory-driven analysis. Through technical evaluations and four analysis scenarios ranging from literature review to content moderation, we find that LLooM's concepts improve upon the prior art of topic models in terms of quality and data coverage. In expert case studies, LLooM helped researchers to uncover new insights even from familiar datasets, for example by suggesting a previously unnoticed concept of attacks on out-party stances in a political social media dataset.
https://arxiv.org/abs/2404.12259
Stance detection, a key task in natural language processing, determines an author's viewpoint based on textual analysis. This study evaluates the evolution of stance detection methods, transitioning from early machine learning approaches to the groundbreaking BERT model, and eventually to modern Large Language Models (LLMs) such as ChatGPT, LLaMa-2, and Mistral-7B. While ChatGPT's closed-source nature and associated costs present challenges, open-source models like LLaMa-2 and Mistral-7B offer an encouraging alternative. Initially, our research focused on fine-tuning ChatGPT, LLaMa-2, and Mistral-7B using several publicly available datasets. Subsequently, to provide a comprehensive comparison, we assess the performance of these models in zero-shot and few-shot learning scenarios. The results underscore the exceptional ability of LLMs to accurately detect stance, with all tested models surpassing existing benchmarks. Notably, LLaMa-2 and Mistral-7B demonstrate remarkable efficiency and potential for stance detection, despite their smaller sizes compared to ChatGPT. This study emphasizes the potential of LLMs in stance detection and calls for more extensive research in this field.
https://arxiv.org/abs/2404.12171
Lexicon-based retrieval has gained significant popularity in text retrieval due to its efficient and robust performance. To further enhance the performance of lexicon-based retrieval, researchers have been diligently incorporating state-of-the-art methodologies like neural retrieval and text-level contrastive learning approaches. Nonetheless, despite the promising outcomes, current lexicon-based retrieval methods have paid limited attention to the potential benefits of feature context representations and term-level knowledge guidance. In this paper, we propose an innovative method that introduces FEature Context and TErm-level Knowledge modules (FecTek). To effectively enrich the feature context representations of term weights, the Feature Context Module (FCM) is introduced, which leverages the power of BERT's representations to determine dynamic weights for each element in the embedding. Additionally, we develop a term-level knowledge guidance module (TKGM) for effectively utilizing term-level knowledge to intelligently guide the modeling of term weights. Evaluation of the proposed method on the MS MARCO benchmark demonstrates its superiority over previous state-of-the-art approaches.
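As a rough illustration of the term-weighting idea (not the FecTek implementation), a small head can map each contextual BERT token representation to a non-negative term weight; the module and layer choices below are assumptions:

```python
# Illustrative term-weighting head on top of BERT (not the FecTek code):
# each contextual token embedding is mapped to a non-negative term weight.
from torch import nn
from transformers import AutoModel, AutoTokenizer

class TermWeightHead(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.scorer = nn.Sequential(nn.Linear(hidden, 1), nn.Softplus())

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        weights = self.scorer(states).squeeze(-1)   # (batch, seq_len)
        return weights * attention_mask             # zero out padding tokens

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["cheap flights to rome"], return_tensors="pt")
print(TermWeightHead()(batch["input_ids"], batch["attention_mask"]).shape)
```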
https://arxiv.org/abs/2404.12152
Machine Reading Comprehension (MRC) holds a pivotal role in shaping Medical Question Answering Systems (QAS) and transforming the landscape of accessing and applying medical information. However, the inherent challenges in the medical field, such as complex terminology and question ambiguity, necessitate innovative solutions. One key solution involves integrating specialized medical datasets and creating dedicated datasets. This strategic approach enhances the accuracy of QAS, contributing to advancements in clinical decision-making and medical research. To address the intricacies of medical terminology, a specialized dataset was integrated: a novel span-extraction dataset derived from emrQA but restructured into 163,695 questions and 4,136 manually obtained answers, which we call the emrQA-msquad dataset. Additionally, for ambiguous questions, a dedicated medical dataset for the span extraction task was introduced, reinforcing the system's robustness. Fine-tuning models such as BERT, RoBERTa, and Tiny RoBERTa for medical contexts significantly improved the proportion of responses in the F1-score range of 0.75 to 1.00, from 10.1% to 37.4%, from 18.7% to 44.7%, and from 16.0% to 46.8%, respectively. Finally, the emrQA-msquad dataset is publicly available at this https URL.
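For readers unfamiliar with the task, a span-extraction QA model of this kind is typically queried as follows with Hugging Face transformers; the checkpoint named here is a generic SQuAD-style model used as a stand-in, not the emrQA-msquad release:

```python
# Querying a span-extraction QA model with the Hugging Face pipeline API.
# The checkpoint is a generic SQuAD-style model, used here as a stand-in.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
note = ("The patient was started on metformin 500 mg twice daily "
        "for newly diagnosed type 2 diabetes.")
answer = qa(question="What medication was the patient started on?", context=note)
print(answer["answer"], answer["score"])
```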
https://arxiv.org/abs/2404.12050
Address matching is an important task for many businesses, especially delivery and takeout companies, as it helps them retrieve a particular address from their data warehouse. Existing solutions use string similarity and edit distance algorithms to find similar addresses in the address database, but these algorithms do not work effectively with redundant, unstructured, or incomplete address data. This paper discusses a semantic address matching technique, by which we can find a particular address from a list of possible addresses. We also review existing practices and their shortcomings. Semantic address matching is essentially an NLP task in the field of deep learning, and through this technique we are able to overcome the drawbacks of existing methods, such as problems with redundant or abbreviated data. The solution uses OCR on invoices to extract addresses and create a data pool of addresses. This data is then fed to the BM-25 algorithm, which scores the best matching entries; the top candidates are then passed through BERT to obtain the best possible result from the similar queries. Our investigation shows that our methodology greatly improves both accuracy and recall over existing state-of-the-art techniques.
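One plausible way to wire such a two-stage pipeline is sketched below, with BM25 shortlisting candidates and a BERT cross-encoder re-ranking them; the libraries and checkpoint are our choices for illustration, not necessarily those used in the paper:

```python
# Two-stage address matching sketch: BM25 shortlist, BERT cross-encoder re-rank.
# Library and checkpoint choices are ours, for illustration only.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

addresses = [
    "221B Baker Street, London NW1 6XE",
    "221 Baker St, Marylebone, London",
    "10 Downing Street, Westminster, London",
]
query = "221b baker st london"

# Stage 1: lexical scoring with BM25 over whitespace-tokenized addresses.
bm25 = BM25Okapi([a.lower().split() for a in addresses])
candidates = bm25.get_top_n(query.split(), addresses, n=2)

# Stage 2: semantic re-ranking of the shortlist with a cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, c) for c in candidates])
best = candidates[max(range(len(candidates)), key=lambda i: scores[i])]
print(best)
```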
https://arxiv.org/abs/2404.11691
Inquisitive questions -- open-ended, curiosity-driven questions people ask as they read -- are an integral part of discourse processing (Kehler and Rohde, 2017; Onea, 2016) and comprehension (Prince, 2004). Recent work in NLP has taken advantage of question generation capabilities of LLMs to enhance a wide range of applications. But the space of inquisitive questions is vast: many questions can be evoked from a given context. So which of those should be prioritized to find answers? Linguistic theories, unfortunately, have not yet provided an answer to this question. This paper presents QSALIENCE, a salience predictor of inquisitive questions. QSALIENCE is instruction-tuned over our dataset of linguist-annotated salience scores of 1,766 (context, question) pairs. A question scores high on salience if answering it would greatly enhance the understanding of the text (Van Rooy, 2003). We show that highly salient questions are empirically more likely to be answered in the same article, bridging potential questions (Onea, 2016) with Questions Under Discussion (Roberts, 2012). We further validate our findings by showing that answering salient questions is an indicator of summarization quality in news.
https://arxiv.org/abs/2404.10917
Diffusion models have exhibited remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation, specifically image captioning, has lagged behind Auto-Regressive (AR) models, casting doubt on their applicability for such tasks. In this work, we revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. With these benefits, diffusion models can alleviate the inherent limitations of AR methods, including their slow inference speed, error propagation, and unidirectional constraints. Furthermore, we identify the prior underperformance of diffusion models stemming from the absence of an effective latent space for image-text alignment, and the discrepancy between continuous diffusion processes and discrete textual data. In response, we introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions and integrates a regularization module to manage varying text lengths. Our framework also includes a diffuser for semantic image-to-text conversion and a Back&Refine technique to enhance token interactivity during inference. LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr, demonstrating exceptional performance without pre-training or ancillary modules. This indicates strong competitiveness with AR models, revealing the previously untapped potential of diffusion models in image-to-text generation.
https://arxiv.org/abs/2404.10763
The field of natural language processing (NLP) has made significant progress with the rapid development of deep learning technologies. One of the research directions in text sentiment analysis is sentiment analysis of medical texts, which holds great potential for application in clinical diagnosis. However, the medical field currently lacks sufficient text datasets, and the effectiveness of sentiment analysis is greatly impacted by different model design approaches, which presents challenges. Therefore, this paper focuses on the medical domain, using bidirectional encoder representations from transformers (BERT) as the basic pre-trained model and experimenting with modules such as convolutional neural network (CNN), fully connected network (FCN), and graph convolutional networks (GCN) at the output layer. Experiments and analyses were conducted on the METS-CoV dataset to explore the training performance after integrating different deep learning networks. The results indicate that CNN models outperform other networks when trained on smaller medical text datasets in combination with pre-trained models like BERT. This study highlights the significance of model selection in achieving effective sentiment analysis in the medical domain and provides a reference for future research to develop more efficient model architectures.
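A rough sketch of what a CNN head on top of BERT token outputs can look like; the layer sizes and pooling below are illustrative assumptions, not the paper's exact configuration:

```python
# Sketch of a BERT encoder with a 1-D CNN classification head
# (layer sizes are illustrative, not the paper's configuration).
import torch
from torch import nn
from transformers import AutoModel

class BertCnnClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_labels=3):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        self.conv = nn.Conv1d(hidden, 128, kernel_size=3, padding=1)
        self.classifier = nn.Linear(128, num_labels)

    def forward(self, input_ids, attention_mask):
        tokens = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # Conv1d expects (batch, channels, seq_len), so swap the last two dims.
        features = torch.relu(self.conv(tokens.transpose(1, 2)))
        pooled = features.max(dim=-1).values   # max-pool over the sequence
        return self.classifier(pooled)
```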
https://arxiv.org/abs/2404.10503
A contract is a type of legal document commonly used in organizations. Contract review is an integral and repetitive process to avoid business risk and liability. Contract analysis requires the identification and classification of key provisions and paragraphs within an agreement. Identification and validation of contract clauses can be a time-consuming and challenging task demanding the services of trained and expensive lawyers, paralegals or other legal assistants. Classification of legal provisions in contracts using artificial intelligence and natural language processing is complex due to the requirement of domain-specialized legal language for model training and the scarcity of sufficient labeled data in the legal domain. Using general-purpose models is not effective in this context due to the use of specialized legal vocabulary in contracts, which may not be recognized by a general model. To address this problem, we propose the use of a pre-trained large language model which is subsequently calibrated on a legal taxonomy. We propose LegalPro-BERT, a BERT transformer architecture model that we fine-tune to efficiently handle the classification task for legal provisions. We conducted experiments to measure and compare metrics with current benchmark results. We found that LegalPro-BERT outperforms the previous benchmark used for comparison in this research.
https://arxiv.org/abs/2404.10097
In the last few years, the research interest in Vision-and-Language Navigation (VLN) has grown significantly. VLN is a challenging task that involves an agent following human instructions and navigating in a previously unknown environment to reach a specified goal. Recent work in the literature focuses on different ways to augment the available datasets of instructions for improving navigation performance by exploiting synthetic training data. In this work, we propose AIGeN, a novel architecture inspired by Generative Adversarial Networks (GANs) that produces meaningful and well-formed synthetic instructions to improve navigation agents' performance. The model is composed of a Transformer decoder (GPT-2) and a Transformer encoder (BERT). During the training phase, the decoder generates sentences for a sequence of images describing the agent's path to a particular point while the encoder discriminates between real and fake instructions. Experimentally, we evaluate the quality of the generated instructions and perform extensive ablation studies. Additionally, we generate synthetic instructions for 217K trajectories using AIGeN on the Habitat-Matterport 3D Dataset (HM3D) and show an improvement in the performance of an off-the-shelf VLN method. The validation analysis is conducted on REVERIE and R2R and highlights the promising aspects of our proposal, achieving state-of-the-art performance.
https://arxiv.org/abs/2404.10054
Recent advances in natural language processing (NLP) may enable artificial intelligence (AI) models to generate writing that is identical to human-written text in the future. This might have profound ethical, legal, and social repercussions. This study aims to address this problem by offering an accurate AI detector model that can differentiate between electronically produced text and human-written text. Our approach includes machine learning methods such as an XGB classifier, an SVM, and BERT-architecture deep learning models. Furthermore, our results show that the BERT model performs better than the other models in distinguishing information generated by AI from information provided by humans. We provide a comprehensive analysis of the current state of AI-generated text identification in our assessment of pertinent studies. Our testing yielded positive findings, showing that our strategy is successful, with BERT emerging as the most promising candidate. We analyze the research's societal implications, highlighting the possible advantages for various industries while addressing sustainability issues pertaining to morality and the environment. The XGB classifier and the SVM achieve accuracies of 0.84 and 0.81, respectively. The highest accuracy in this research is achieved by the BERT model, at 0.93.
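For context, the classical baselines of this kind can be reproduced in a few lines with TF-IDF features; the toy data and hyperparameters below are stand-ins, not those of the study:

```python
# Classical baselines on TF-IDF features; toy data, illustrative hyperparameters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

texts = ["The committee convened to discuss the budget.",
         "As an AI language model, I can summarize the budget as follows."]
labels = [0, 1]   # 0 = human-written, 1 = AI-generated

features = TfidfVectorizer().fit_transform(texts)
xgb = XGBClassifier(n_estimators=100).fit(features, labels)
svm = LinearSVC().fit(features, labels)
print(xgb.predict(features), svm.predict(features))
```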
https://arxiv.org/abs/2404.10032
StackOverflow, with its vast question repository and limited labeled examples, raises an annotation challenge for us. We address this gap by proposing RoBERTa+MAML, a few-shot named entity recognition (NER) method leveraging meta-learning. Our approach, evaluated on the StackOverflow NER corpus (27 entity types), achieves a 5% F1 score improvement over the baseline. We further improved the results with domain-specific phrase processing.
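A heavily simplified sketch of a first-order meta-learning loop (a Reptile-style outer update rather than full second-order MAML) over a token-classification model; it illustrates the general recipe only and is not the authors' implementation:

```python
# First-order meta-learning loop (Reptile-style outer update) for a
# token-classification model; a sketch of the recipe, not the authors' code.
import copy
import torch
from transformers import AutoModelForTokenClassification

def meta_step(model, task_batches, inner_lr=1e-4, meta_lr=1e-5, inner_steps=3):
    # Each element of task_batches holds input_ids, attention_mask and labels
    # for one few-shot task.
    for batch in task_batches:
        task_model = copy.deepcopy(model)
        opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):          # inner-loop adaptation
            loss = task_model(**batch).loss
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Outer update: nudge the meta-parameters towards the adapted ones.
        with torch.no_grad():
            for p, p_task in zip(model.parameters(), task_model.parameters()):
                p.data += meta_lr * (p_task.data - p.data)

model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base", num_labels=27)   # 27 entity types (a BIO scheme needs more labels)
```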
https://arxiv.org/abs/2404.09405
Safe and reliable natural language inference is critical for extracting insights from clinical trial reports but poses challenges due to biases in large pre-trained language models. This paper presents a novel data augmentation technique to improve model robustness for biomedical natural language inference in clinical trials. By generating synthetic examples through semantic perturbations and domain-specific vocabulary replacement and adding a new task for numerical and quantitative reasoning, we introduce greater diversity and reduce shortcut learning. Our approach, combined with multi-task learning and the DeBERTa architecture, achieved significant performance gains on the NLI4CT 2024 benchmark compared to the original language models. Ablation studies validate the contribution of each augmentation method in improving robustness. Our best-performing model ranked 12th in terms of faithfulness and 8th in terms of consistency out of the 32 participants.
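A toy sketch of domain-specific vocabulary replacement as one augmentation step; the synonym table and replacement probability are illustrative assumptions, not the paper's resources:

```python
# Toy domain-vocabulary replacement for augmentation; the synonym table
# and replacement probability are illustrative assumptions.
import random

CLINICAL_SYNONYMS = {
    "participants": ["subjects", "patients", "enrolled individuals"],
    "decrease": ["reduction", "decline", "drop"],
    "adverse": ["unfavourable", "harmful"],
}

def augment(sentence: str, p: float = 0.5) -> str:
    out = []
    for token in sentence.split():
        key = token.lower().strip(".,")
        if key in CLINICAL_SYNONYMS and random.random() < p:
            out.append(random.choice(CLINICAL_SYNONYMS[key]))
        else:
            out.append(token)
    return " ".join(out)

print(augment("The trial reported a decrease in adverse events among participants."))
```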
https://arxiv.org/abs/2404.09206
This paper introduces novel methodologies for the Natural Language Inference for Clinical Trials (NLI4CT) task. We present TLDR (T5-generated clinical-Language summaries for DeBERTa Report Analysis) which incorporates T5-model generated premise summaries for improved entailment and contradiction analysis in clinical NLI tasks. This approach overcomes the challenges posed by small context windows and lengthy premises, leading to a substantial improvement in Macro F1 scores: a 0.184 increase over truncated premises. Our comprehensive experimental evaluation, including detailed error analysis and ablations, confirms the superiority of TLDR in achieving consistency and faithfulness in predictions against semantically altered inputs.
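The summarize-then-infer idea can be sketched with off-the-shelf checkpoints; the T5 and DeBERTa models named below are generic stand-ins, not the fine-tuned TLDR models:

```python
# Summarize-then-infer sketch with stand-in checkpoints (not the TLDR models):
# T5 compresses the long premise, DeBERTa performs NLI on the summary.
from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

premise = ("Eligible patients were adults with stage II breast cancer. "
           "Arm A received the study drug for 12 weeks and arm B received "
           "placebo. The primary endpoint was progression-free survival.")
hypothesis = "The trial compared the study drug against a placebo."

summary = summarizer(premise, max_length=60, min_length=10)[0]["summary_text"]
print(nli({"text": summary, "text_pair": hypothesis}))
```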
https://arxiv.org/abs/2404.09136
Timely identification is essential for the efficient handling of mental health illnesses such as depression. However, current research fails to adequately address the prediction of mental health conditions from social media data in low-resource African languages like Swahili. This study introduces two distinct approaches, utilising model-agnostic meta-learning and leveraging large language models (LLMs), to address this gap. Experiments are conducted on three datasets translated into the low-resource language and applied to four mental health tasks: stress, depression, depression severity and suicidal ideation prediction. We first apply a meta-learning model with self-supervision, which results in improved model initialisation for rapid adaptation and cross-lingual transfer. The results show that our meta-trained model performs significantly better than standard fine-tuning methods, outperforming baseline fine-tuning in macro F1 score by 18% and 0.8% over XLM-R and mBERT, respectively. In parallel, we use LLMs' in-context learning capabilities to assess their accuracy on the Swahili mental health prediction tasks by analysing different cross-lingual prompting approaches. Our analysis showed that Swahili prompts performed better than cross-lingual prompts but worse than English prompts. Our findings show that in-context learning can be achieved through cross-lingual transfer with carefully crafted prompt templates containing examples and instructions.
https://arxiv.org/abs/2404.09045
This study introduces a novel BERT-LSH model that incorporates Locality Sensitive Hashing (LSH) to approximate the attention mechanism in the BERT architecture. We examine the computational efficiency and performance of this model compared to a standard baseline BERT model. Our findings reveal that BERT-LSH significantly reduces computational demand for the self-attention layer while unexpectedly outperforming the baseline model in pretraining and fine-tuning tasks. These results suggest that the LSH-based attention mechanism not only offers computational advantages but also may enhance the model's ability to generalize from its training data. For more information, visit our GitHub repository: this https URL
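A conceptual sketch of LSH-restricted attention (ours, not the BERT-LSH implementation): random-hyperplane hashing assigns queries and keys to buckets, and attention scores are only kept for pairs whose hash bits agree:

```python
# Conceptual LSH-restricted attention (not the BERT-LSH implementation):
# queries and keys are hashed with random hyperplanes, and attention is
# only computed for pairs whose hash bits all agree.
import torch
import torch.nn.functional as F

def lsh_attention(q, k, v, n_bits=4):
    d = q.size(-1)
    planes = torch.randn(d, n_bits)          # random hyperplanes
    bits_q = (q @ planes) > 0                # (seq_len, n_bits) hash bits
    bits_k = (k @ planes) > 0
    same_bucket = (bits_q.unsqueeze(1) == bits_k.unsqueeze(0)).all(dim=-1)
    scores = (q @ k.T) / d ** 0.5
    scores = scores.masked_fill(~same_bucket, float("-inf"))
    weights = torch.nan_to_num(F.softmax(scores, dim=-1))  # empty rows -> 0
    return weights @ v

q = k = v = torch.randn(8, 16)   # toy sequence: 8 tokens, dimension 16
print(lsh_attention(q, k, v).shape)
```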
https://arxiv.org/abs/2404.08836
In this paper, we focus on generating a synthetic question answering (QA) dataset using an adapted Translate-Align-Retrieve method. Using this method, we created the largest Serbian QA dataset of more than 87K samples, which we name SQuAD-sr. To acknowledge the script duality in Serbian, we generated both Cyrillic and Latin versions of the dataset. We investigate the dataset quality and use it to fine-tune several pre-trained QA models. Best results were obtained by fine-tuning the BERTić model on our Latin SQuAD-sr dataset, achieving 73.91% Exact Match and 82.97% F1 score on the benchmark XQuAD dataset, which we translated into Serbian for the purpose of evaluation. The results show that our model exceeds zero-shot baselines, but fails to go beyond human performance. We note the advantage of using a monolingual pre-trained model over multilingual, as well as the performance increase gained by using Latin over Cyrillic. By performing additional analysis, we show that questions about numeric values or dates are more likely to be answered correctly than other types of questions. Finally, we conclude that SQuAD-sr is of sufficient quality for fine-tuning a Serbian QA model, in the absence of a manually crafted and annotated dataset.
https://arxiv.org/abs/2404.08617
Text classification systems have continuously improved in performance over the years. However, nearly all current SOTA classifiers have a similar shortcoming: they process text in a horizontal manner. Vertically written words will not be recognized by a classifier. In contrast, humans are easily able to recognize and read words written both horizontally and vertically. Hence, a human adversary could write problematic words vertically and the meaning would still be preserved for other humans. We simulate such an attack, VertAttack. VertAttack identifies which words a classifier is reliant on and then rewrites those words vertically. We find that VertAttack is able to greatly drop the accuracy of 4 different transformer models on 5 datasets. For example, on the SST2 dataset, VertAttack is able to drop RoBERTa's accuracy from 94% to 13%. Furthermore, since VertAttack does not replace the word, meaning is easily preserved. We verify this via a human study and find that crowdworkers are able to correctly label 77% of perturbed texts, compared to 81% of the original texts. We believe VertAttack offers a look into how humans might circumvent classifiers in the future and thus inspires a look into more robust algorithms.
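A toy illustration of the vertical rewriting step; the word-importance selection is stubbed out here, whereas the actual attack identifies important words by querying the victim classifier:

```python
# Toy vertical rewriting; real VertAttack picks the words the classifier
# relies on by querying it, which is stubbed out here.
def write_vertically(word: str) -> str:
    return "\n".join(word)

def vert_attack(text: str, important_words: set) -> str:
    return " ".join(
        write_vertically(w) if w.lower() in important_words else w
        for w in text.split()
    )

print(vert_attack("The movie was absolutely terrible", {"terrible"}))
```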
https://arxiv.org/abs/2404.08538