In recent years, Large Language Models (LLMs) have become integrated into our daily lives, serving as invaluable assistants in completing tasks. As LLMs are widely embraced by users, their abuse is inevitable, particularly their use to generate text content for various purposes, which makes it difficult to distinguish LLM-generated text from human-written text. In this study, we present a dataset named ViDetect, comprising 6,800 samples of Vietnamese essays, 3,400 of which were authored by humans and the remainder generated by LLMs, for the purpose of detecting AI-generated text. We conducted evaluations using state-of-the-art methods, including ViT5, BartPho, PhoBERT, mDeBERTa V3, and mBERT. These results not only contribute to the growing body of research on detecting AI-generated text but also demonstrate the adaptability and effectiveness of different methods in the Vietnamese language context. This research lays the foundation for future advancements in AI-generated text detection and provides valuable insights for researchers in the field of natural language processing.
https://arxiv.org/abs/2405.03206
Despite their improved capabilities in generation and reasoning, adapting large language models (LLMs) to the biomedical domain remains challenging due to their immense size and corporate privacy. In this work, we propose MedAdapter, a unified post-hoc adapter for test-time adaptation of LLMs towards biomedical applications. Instead of fine-tuning the entire LLM, MedAdapter effectively adapts the original model by fine-tuning only a small BERT-sized adapter to rank candidate solutions generated by LLMs. Experiments demonstrate that MedAdapter effectively adapts both white-box and black-box LLMs in biomedical reasoning, achieving average performance improvements of 25.48% and 11.31%, respectively, without requiring extensive computational resources or sharing data with third parties. MedAdapter also yields superior performance when combined with train-time adaptation, highlighting a flexible and complementary solution to existing adaptation methods. Faced with the challenges of balancing model performance, computational resources, and data privacy, MedAdapter provides an efficient, privacy-preserving, cost-effective, and transparent solution for adapting LLMs to the biomedical domain.
https://arxiv.org/abs/2405.03000
Natural Language Inference (NLI) is a cornerstone of Natural Language Processing (NLP), providing insights into the entailment relationships between text pairings. It is a critical component of Natural Language Understanding (NLU), demonstrating the ability to extract information from spoken or written interactions. NLI is mainly concerned with determining the entailment relationship between two statements, known as the premise and hypothesis. When the premise logically implies the hypothesis, the pair is labeled ``entailment''. If the hypothesis contradicts the premise, the pair receives the ``contradiction'' label. When there is insufficient evidence to establish a connection, the pair is described as ``neutral''. Despite the success of Large Language Models (LLMs) in various tasks, their effectiveness in NLI remains constrained by issues like low-resource domain accuracy, model overconfidence, and difficulty in capturing human judgment disagreements. This study addresses the underexplored area of evaluating LLMs in low-resourced languages such as Bengali. Through a comprehensive evaluation, we assess the performance of prominent LLMs and state-of-the-art (SOTA) models in Bengali NLP tasks, focusing on natural language inference. Utilizing the XNLI dataset, we conduct zero-shot and few-shot evaluations, comparing LLMs like GPT-3.5 Turbo and Gemini 1.5 Pro with models such as BanglaBERT, Bangla BERT Base, DistilBERT, mBERT, and sahajBERT. Our findings reveal that while LLMs can achieve comparable or superior performance to fine-tuned SOTA models in few-shot scenarios, further research is necessary to enhance our understanding of LLMs in languages with modest resources like Bengali. This study underscores the importance of continued efforts in exploring LLM capabilities across diverse linguistic contexts.
https://arxiv.org/abs/2405.02937
Artificial neural networks trained on large, expert-labelled datasets are considered state-of-the-art for a range of medical image recognition tasks. However, categorically labelled datasets are time-consuming to generate and constrain classification to a pre-defined, fixed set of classes. For neuroradiological applications in particular, this represents a barrier to clinical adoption. To address these challenges, we present a self-supervised text-vision framework that learns to detect clinically relevant abnormalities in brain MRI scans by directly leveraging the rich information contained in accompanying free-text neuroradiology reports. Our training approach consisted of two steps. First, a dedicated neuroradiological language model - NeuroBERT - was trained to generate fixed-dimensional vector representations of neuroradiology reports (N = 50,523) via domain-specific self-supervised learning tasks. Next, convolutional neural networks (one per MRI sequence) learnt to map individual brain scans to their corresponding text vector representations by optimising a mean square error loss. Once trained, our text-vision framework can be used to detect abnormalities in unreported brain MRI examinations by scoring scans against suitable query sentences (e.g., 'there is an acute stroke', 'there is hydrocephalus', etc.), enabling a range of classification-based applications including automated triage. Potentially, our framework could also serve as a clinical decision support tool, not only by suggesting findings to radiologists and detecting errors in provisional reports, but also by retrieving and displaying examples of pathologies from historical examinations that could be relevant to the current case based on textual descriptors.
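The deployment-time scoring step, comparing an unreported scan's predicted text embedding against query-sentence embeddings, can be sketched as follows. The 3-dimensional toy vectors and the use of cosine similarity are illustrative assumptions: the framework itself is trained with an MSE loss, and the abstract does not specify the exact scoring function.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def score_scan(scan_vector, query_vectors):
    """Score one scan embedding against each query-sentence embedding;
    a higher score suggests the queried abnormality is present."""
    return {query: cosine_similarity(scan_vector, vec)
            for query, vec in query_vectors.items()}

# Toy 3-d embeddings standing in for the CNN output and NeuroBERT sentence vectors.
scan = [0.9, 0.1, 0.0]
queries = {
    "there is an acute stroke": [1.0, 0.0, 0.0],
    "there is hydrocephalus":   [0.0, 1.0, 0.0],
}
scores = score_scan(scan, queries)  # the stroke query scores highest here
```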
https://arxiv.org/abs/2405.02782
The vast collection of Holocaust survivor testimonies presents invaluable historical insights but poses challenges for manual analysis. This paper leverages advanced Natural Language Processing (NLP) techniques to explore the USC Shoah Foundation Holocaust testimony corpus. By treating testimonies as structured question-and-answer sections, we apply topic modeling to identify key themes. We experiment with BERTopic, which leverages recent advances in language modeling technology. We align testimony sections into fixed parts, revealing the evolution of topics across the corpus of testimonies. This highlights both a common narrative schema and divergences between subgroups based on age and gender. We introduce a novel method to identify testimonies within groups that exhibit atypical topic distributions resembling those of other groups. This study offers unique insights into the complex narratives of Holocaust survivors, demonstrating the power of NLP to illuminate historical discourse and identify potential deviations in survivor experiences.
https://arxiv.org/abs/2405.02650
Recently, many studies have shown the efficiency of using Bidirectional Encoder Representations from Transformers (BERT) in various Natural Language Processing (NLP) tasks. Specifically, the English spelling correction task that uses an Encoder-Decoder architecture and takes advantage of BERT has achieved state-of-the-art results. However, to our knowledge, there is no such implementation for Vietnamese yet. Therefore, in this study, a combination of the Transformer architecture (state-of-the-art for Encoder-Decoder models) and BERT is proposed to deal with Vietnamese spelling correction. The experimental results show that our model outperforms other approaches as well as the Google Docs Spell Checking tool, achieving an 86.24 BLEU score on this task.
https://arxiv.org/abs/2405.02573
Pre-trained language models (PLMs), for example BERT or RoBERTa, mark the state-of-the-art for natural language understanding tasks when fine-tuned on labeled data. However, their large size poses challenges in deploying them for inference in real-world applications, due to significant GPU memory requirements and high inference latency. This paper explores neural architecture search (NAS) for structural pruning to find sub-parts of the fine-tuned network that optimally trade off efficiency, for example in terms of model size or latency, against generalization performance. We also show how more recently developed two-stage weight-sharing NAS approaches can be utilized in this setting to accelerate the search process. Unlike traditional pruning methods with fixed thresholds, we propose to adopt a multi-objective approach that identifies the Pareto-optimal set of sub-networks, allowing for a more flexible and automated compression process.
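Concretely, the multi-objective step keeps the Pareto-optimal sub-networks: candidates that no other candidate beats on both objectives at once. A minimal sketch with made-up (size, error) pairs; the real search scores sub-networks discovered by weight-sharing NAS rather than a hand-written list:

```python
def pareto_front(candidates):
    """Return names of sub-networks not dominated by any other candidate.
    Each candidate is (name, size_mb, error); smaller is better on both axes."""
    front = []
    for name, size, err in candidates:
        dominated = any(
            (s <= size and e <= err) and (s < size or e < err)
            for _, s, e in candidates
        )
        if not dominated:
            front.append(name)
    return front

subnets = [
    ("A", 400, 0.08),  # large but accurate
    ("B", 200, 0.10),  # mid-range trade-off
    ("C", 100, 0.15),  # small, less accurate
    ("D", 300, 0.16),  # dominated by B (bigger AND worse)
]
```

`pareto_front(subnets)` keeps A, B, and C: each offers a trade-off no other candidate strictly improves on, while D is both larger and less accurate than B.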
https://arxiv.org/abs/2405.02267
The already complex challenge of detecting sarcasm in Arabic speech on social media is compounded by the language's diversity and the nature of sarcastic expressions. There is a significant gap in the capability of existing models to effectively interpret sarcasm in Arabic, which necessitates more sophisticated and precise detection methods. In this paper, we investigate the impact of a fundamental preprocessing component on sarcasm speech detection. While emojis play a crucial role in compensating for the absence of body language and facial expressions in modern communication, their impact on automated text analysis, particularly in sarcasm detection, remains underexplored. We investigate the impact of excluding emojis from datasets on the performance of sarcasm detection models in social media content for Arabic, a vocabulary-rich language. This investigation includes the adaptation and enhancement of AraBERT pre-trained models, specifically by excluding emojis, to improve sarcasm detection capabilities. We use AraBERT pre-training to refine the specified models, demonstrating that the removal of emojis can significantly boost the accuracy of sarcasm detection. This approach facilitates a more refined interpretation of language, eliminating the potential confusion introduced by non-textual elements. The evaluated AraBERT models, through the focused strategy of emoji removal, adeptly navigate the complexities of Arabic sarcasm. This study establishes new benchmarks in Arabic natural language processing and presents valuable insights for social media platforms.
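The preprocessing step studied here, excluding emojis before fine-tuning, can be sketched with a regular expression over the main emoji code-point blocks. This is a pragmatic approximation; the paper does not specify its exact removal procedure, and the ranges below are not an exhaustive Unicode emoji list:

```python
import re

# Common emoji blocks; an approximation, not the full Unicode emoji property.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"   # symbols & pictographs
    "\U0001F600-\U0001F64F"   # emoticons
    "\U0001F680-\U0001F6FF"   # transport & map symbols
    "\U0001F900-\U0001FAFF"   # supplemental symbols & pictographs
    "\u2600-\u27BF"           # miscellaneous symbols and dingbats
    "]+",
    flags=re.UNICODE,
)

def strip_emojis(text: str) -> str:
    """Remove emoji characters and collapse the whitespace they leave behind."""
    return re.sub(r"\s+", " ", EMOJI_PATTERN.sub("", text)).strip()
```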
https://arxiv.org/abs/2405.02195
The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, there is still potential for enhancement in the performance of these methods. In this paper, we present GMP-ATL (Gender-augmented Multi-scale Pseudo-label Adaptive Transfer Learning), a novel HuBERT-based adaptive transfer learning framework for SER. Specifically, GMP-ATL initially employs the pre-trained HuBERT, implementing multi-task learning and multi-scale k-means clustering to acquire frame-level gender-augmented multi-scale pseudo-labels. Then, to fully leverage both obtained frame-level and utterance-level emotion labels, we incorporate model retraining and fine-tuning methods to further optimize GMP-ATL. Experiments on IEMOCAP show that our GMP-ATL achieves superior recognition performance, with a WAR of 80.0% and a UAR of 82.0%, surpassing state-of-the-art unimodal SER methods, while also yielding comparable results with multimodal SER approaches.
https://arxiv.org/abs/2405.02151
The capability of accurately determining code similarity is crucial in many tasks related to software development. For example, it might be essential to identify code duplicates for performing software maintenance. This research introduces a novel ensemble learning approach for code similarity assessment, combining the strengths of multiple unsupervised similarity measures. The key idea is that the strengths of a diverse set of similarity measures can complement each other and mitigate individual weaknesses, leading to improved performance. Preliminary results show that while Transformers-based CodeBERT and its variant GraphCodeBERT are undoubtedly the best option in the presence of abundant training data, in the case of specific small datasets (up to 500 samples), our ensemble achieves similar results, without prejudice to the interpretability of the resulting solution, and with a much lower associated carbon footprint due to training. The source code of this novel approach can be downloaded from this https URL.
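As an illustration of the ensemble idea, averaging several cheap unsupervised similarity measures so their strengths complement each other, here is a minimal sketch. The two measures (token Jaccard and a character-level ratio) are stand-ins chosen for brevity, not necessarily the measures combined in the paper:

```python
import difflib

def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0

def char_ratio(a: str, b: str) -> float:
    """Character-level similarity from difflib's SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def ensemble_similarity(a: str, b: str, measures=(token_jaccard, char_ratio)) -> float:
    """Unweighted mean of the individual measures; a weighted or learned
    combination would be a natural refinement."""
    return sum(m(a, b) for m in measures) / len(measures)
```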
https://arxiv.org/abs/2405.02095
The ability to transmit and receive complex information via language is unique to humans and is the basis of traditions, culture and versatile social interactions. Through the disruptive introduction of transformer-based large language models (LLMs), humans are no longer the only entity to "understand" and produce language. In the present study, we have performed the first steps toward using LLMs as a model to understand fundamental mechanisms of language processing in neural networks, in order to make predictions and generate hypotheses on how the human brain does language processing. Thus, we used ChatGPT to generate seven different stylistic variations of ten different narratives (Aesop's fables). We used these stories as input for the open-source LLM BERT and analyzed the activation patterns of BERT's hidden units using multi-dimensional scaling and cluster analysis. We found that the activation vectors of the hidden units cluster according to stylistic variation in earlier layers of BERT (layer 1) than according to narrative content (layers 4-5). Although BERT consists of 12 identical building blocks that are stacked and trained on large text corpora, the different layers perform different tasks. This makes it a very useful model of the human brain, where self-similar structures, i.e. different areas of the cerebral cortex, can have different functions and are therefore well suited to processing language in a very efficient way. The proposed approach has the potential to open the black box of LLMs on the one hand, and might, on the other, be a further step toward unraveling the neural processes underlying human language processing and cognition in general.
https://arxiv.org/abs/2405.02024
This paper introduces OARelatedWork, the first large-scale multi-document summarization dataset for related work generation containing whole related work sections and full-texts of cited papers. The dataset includes 94,450 papers and 5,824,689 unique referenced papers. It was designed for the task of automatically generating related work to shift the field toward generating entire related work sections from all available content instead of generating parts of related work sections from abstracts only, which is the current mainstream in this field for abstractive approaches. We show that the estimated upper bound for extractive summarization increases by 217% in the ROUGE-2 score, when using full content instead of abstracts. Furthermore, we show the benefits of full content data on naive, oracle, traditional, and transformer-based baselines. Long outputs, such as related work sections, pose challenges for automatic evaluation metrics like BERTScore due to their limited input length. We tackle this issue by proposing and evaluating a meta-metric using BERTScore. Despite operating on smaller blocks, we show this meta-metric correlates with human judgment, comparably to the original BERTScore.
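A block-based meta-metric of the kind described might look like the sketch below: split both texts into blocks that fit the scorer's input limit, score block pairs, and aggregate. The splitting granularity and the best-match-then-average aggregation are assumptions for illustration (the abstract does not give the exact scheme), and `overlap` is a toy stand-in for BERTScore on one block pair:

```python
def split_blocks(text, max_words=50):
    """Split a long text into word-bounded blocks that fit a scorer's input limit."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def meta_metric(candidate, reference, block_score, max_words=50):
    """Score each candidate block against its best-matching reference block,
    then average the matched scores."""
    cand = split_blocks(candidate, max_words)
    ref = split_blocks(reference, max_words)
    return sum(max(block_score(c, r) for r in ref) for c in cand) / len(cand)

def overlap(a, b):
    """Toy stand-in for BERTScore: token-overlap score on one block pair."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta), len(tb))
```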
https://arxiv.org/abs/2405.01930
Large language models (LLMs) increasingly serve as the backbone for classifying text associated with distinct domains and simultaneously several labels (classes). When encountering domain shifts, e.g., a classifier of movie reviews moving from IMDb to Rotten Tomatoes, adapting such an LLM-based multi-label classifier is challenging due to incomplete label sets at the target domain and daunting training overhead. Existing domain adaptation methods address either image multi-label classifiers or text binary classifiers. In this paper, we design DALLMi, a Domain Adaptation Large Language Model interpolator, a first-of-its-kind semi-supervised domain adaptation method for text data models based on LLMs, specifically BERT. The core of DALLMi is the novel variation loss and MixUp regularization, which jointly leverage the limited positively-labeled and large quantity of unlabeled text and, importantly, their interpolation from the BERT word embeddings. DALLMi also introduces a label-balanced sampling strategy to overcome the imbalance between labeled and unlabeled data. We evaluate DALLMi against partially-supervised and unsupervised approaches on three datasets under different scenarios of label availability for the target domain. Our results show that DALLMi achieves higher mAP than the unsupervised and partially-supervised approaches by 19.9% and 52.2%, respectively.
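Two of the ingredients named above can be sketched generically. The MixUp step below is the standard Beta-interpolation formulation applied to embedding vectors, and the sampler simply draws half of each batch from the scarce labeled pool; DALLMi's exact variation loss and sampling details go beyond what the abstract states:

```python
import random

def mixup_embeddings(emb_a, emb_b, alpha=0.4, rng=random):
    """Standard MixUp: draw lambda ~ Beta(alpha, alpha) and linearly
    interpolate two embedding vectors (labels would be mixed the same way)."""
    lam = rng.betavariate(alpha, alpha)
    mixed = [lam * a + (1 - lam) * b for a, b in zip(emb_a, emb_b)]
    return lam, mixed

def label_balanced_batch(labeled, unlabeled, batch_size, rng=random):
    """Draw half the batch from the scarce labeled pool and half from the
    abundant unlabeled pool, so labeled examples are not drowned out."""
    half = batch_size // 2
    return ([rng.choice(labeled) for _ in range(half)] +
            [rng.choice(unlabeled) for _ in range(batch_size - half)])
```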
https://arxiv.org/abs/2405.01883
Diagnosing language disorders associated with autism is a complex and nuanced challenge, often hindered by the subjective nature and variability of traditional assessment methods. Traditional diagnostic methods not only require intensive human effort but also often result in delayed interventions due to their lack of speed and specificity. In this study, we explored the application of ChatGPT, a state-of-the-art large language model, to overcome these obstacles by enhancing diagnostic accuracy and profiling specific linguistic features indicative of autism. Leveraging ChatGPT's advanced natural language processing capabilities, this research aims to streamline and refine the diagnostic process. Specifically, we compared ChatGPT's performance with that of conventional supervised learning models, including BERT, a model acclaimed for its effectiveness in various natural language processing tasks. We showed that ChatGPT substantially outperformed these models, achieving over 13% improvement in both accuracy and F1 score in a zero-shot learning configuration. This marked enhancement highlights the model's potential as a superior tool for neurological diagnostics. Additionally, we identified ten distinct features of autism-associated language disorders that vary significantly across different experimental scenarios. These features, which included echolalia, pronoun reversal, and atypical language usage, were crucial for accurately diagnosing ASD and customizing treatment plans. Together, our findings advocate for adopting sophisticated AI tools like ChatGPT in clinical settings to assess and diagnose developmental disorders. Our approach not only promises greater diagnostic precision but also aligns with the goals of personalized medicine, potentially transforming the evaluation landscape for autism and similar neurological conditions.
https://arxiv.org/abs/2405.01799
The training of Transformer models has revolutionized natural language processing and computer vision, but it remains a resource-intensive and time-consuming process. This paper investigates the applicability of the early-bird ticket hypothesis to optimize the training efficiency of Transformer models. We propose a methodology that combines iterative pruning, masked distance calculation, and selective retraining to identify early-bird tickets in various Transformer architectures, including ViT, Swin-T, GPT-2, and RoBERTa. Our experimental results demonstrate that early-bird tickets can be consistently found within the first few epochs of training or fine-tuning, enabling significant resource optimization without compromising performance. The pruned models obtained from early-bird tickets achieve comparable or even superior accuracy to their unpruned counterparts while substantially reducing memory usage. Furthermore, our comparative analysis highlights the generalizability of the early-bird ticket phenomenon across different Transformer models and tasks. This research contributes to the development of efficient training strategies for Transformer models, making them more accessible and resource-friendly. By leveraging early-bird tickets, practitioners can accelerate the progress of natural language processing and computer vision applications while reducing the computational burden associated with training Transformer models.
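The masked-distance criterion can be illustrated with magnitude pruning: compute a binary keep/prune mask each epoch and declare an early-bird ticket once consecutive masks stop changing. A minimal sketch; the threshold `eps` and the pure-Python weight lists are illustrative simplifications of the paper's methodology:

```python
def prune_mask(weights, keep_ratio):
    """Magnitude-pruning mask: 1 keeps a weight, 0 prunes it."""
    k = max(1, int(len(weights) * keep_ratio))
    threshold = sorted((abs(w) for w in weights), reverse=True)[k - 1]
    return [1 if abs(w) >= threshold else 0 for w in weights]

def mask_distance(m1, m2):
    """Normalized Hamming distance between two pruning masks."""
    return sum(a != b for a, b in zip(m1, m2)) / len(m1)

def found_early_bird(mask_history, eps=0.1):
    """An early-bird ticket emerges once consecutive epoch masks barely differ."""
    return any(mask_distance(a, b) < eps
               for a, b in zip(mask_history, mask_history[1:]))
```

Once the masks stabilize, training can switch to the pruned sub-network, saving the remaining epochs' cost on the full model.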
https://arxiv.org/abs/2405.02353
Reliability of AI systems is a fundamental concern for the successful deployment and widespread adoption of AI technologies. Unfortunately, the escalating complexity and heterogeneity of AI hardware systems make them inevitably and increasingly susceptible to hardware faults (e.g., bit flips) that can potentially corrupt model parameters. Given this challenge, this paper aims to answer a critical question: How likely is a parameter corruption to result in an incorrect model output? To systematically answer this question, we propose a novel quantitative metric, Parameter Vulnerability Factor (PVF), inspired by architectural vulnerability factor (AVF) in computer architecture community, aiming to standardize the quantification of AI model resilience/vulnerability against parameter corruptions. We define a model parameter's PVF as the probability that a corruption in that particular model parameter will result in an incorrect output. Similar to AVF, this statistical concept can be derived from statistically extensive and meaningful fault injection (FI) experiments. In this paper, we present several use cases on applying PVF to three types of tasks/models during inference -- recommendation (DLRM), vision classification (CNN), and text classification (BERT). PVF can provide pivotal insights to AI hardware designers in balancing the tradeoff between fault protection and performance/efficiency such as mapping vulnerable AI parameter components to well-protected hardware modules. PVF metric is applicable to any AI model and has a potential to help unify and standardize AI vulnerability/resilience evaluation practice.
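The fault-injection estimate behind PVF can be sketched end-to-end: flip a random bit in the float32 encoding of one parameter, rerun inference, and count how often the outputs change. The one-parameter threshold "model" below is a toy stand-in for the paper's DLRM/CNN/BERT workloads, and the trial count is arbitrary:

```python
import random
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit in the IEEE-754 float32 encoding of a parameter."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped

def estimate_pvf(model, param_index, params, inputs, expected, trials, rng):
    """Monte-Carlo PVF estimate: the fraction of random single-bit flips in one
    parameter that change the model's outputs on a fixed input set."""
    failures = 0
    for _ in range(trials):
        corrupted = list(params)
        corrupted[param_index] = flip_bit(params[param_index], rng.randrange(32))
        if [model(corrupted, x) for x in inputs] != expected:
            failures += 1
    return failures / trials

# Toy one-parameter threshold classifier standing in for real inference.
model = lambda p, x: 1 if p[0] * x > 0.5 else 0
params, inputs = [1.0], [0.3, 0.9]
expected = [model(params, x) for x in inputs]
pvf = estimate_pvf(model, 0, params, inputs, expected, 200, random.Random(0))
```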
https://arxiv.org/abs/2405.01741
Automatic conversion of free-text radiology reports into structured data using Natural Language Processing (NLP) techniques is crucial for analyzing diseases on a large scale. While effective for tasks in widely spoken languages like English, generative large language models (LLMs) typically underperform with less common languages and can pose potential risks to patient privacy. Fine-tuning local NLP models is hindered by the skewed nature of real-world medical datasets, where rare findings represent a significant data imbalance. We introduce SMP-BERT, a novel prompt learning method that leverages the structured nature of reports to overcome these challenges. In our studies involving a substantial collection of Crohn's disease radiology reports in Hebrew (over 8,000 patients and 10,000 reports), SMP-BERT greatly surpassed traditional fine-tuning methods in performance, notably in detecting infrequent conditions (AUC: 0.99 vs 0.94, F1: 0.84 vs 0.34). SMP-BERT makes more accurate AI diagnostics available for low-resource languages.
https://arxiv.org/abs/2405.01682
Recent Large Language Models (LLMs) have shown the ability to generate content that is difficult or impossible to distinguish from human writing. We investigate the ability of differently-sized LLMs to replicate human writing style in short, creative texts in the domain of Showerthoughts, thoughts that may occur during mundane activities. We compare GPT-2 and GPT-Neo fine-tuned on Reddit data as well as GPT-3.5 invoked in a zero-shot manner, against human-authored texts. We measure human preference on the texts across the specific dimensions that account for the quality of creative, witty texts. Additionally, we compare the ability of humans versus fine-tuned RoBERTa classifiers to detect AI-generated texts. We conclude that human evaluators rate the generated texts slightly worse on average regarding their creative quality, but they are unable to reliably distinguish between human-written and AI-generated texts. We further provide a dataset for creative, witty text generation based on Reddit Showerthoughts posts.
https://arxiv.org/abs/2405.01660
This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. UQA is generated by translating the Stanford Question Answering Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in the translated context paragraphs. The paper describes the process of selecting and evaluating the best translation model among two candidates: Google Translator and Seamless M4T. The paper also benchmarks several state-of-the-art multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and reports promising results. For XLM-RoBERTa-XL, we obtain an F1 score of 85.99 and an EM of 74.56. UQA is a valuable resource for developing and testing multilingual NLP systems for Urdu and for enhancing the cross-lingual transferability of existing models. Further, the paper demonstrates the effectiveness of EATS for creating high-quality datasets for other languages and domains. The UQA dataset and the code are publicly available at this http URL.
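Based on the acronym expansion given above, EATS can be sketched as: enclose the answer span in anchor markers, translate the anchored context, then seek the markers to recover the span. The `<a>`/`</a>` markers and the identity "translation" below are illustrative assumptions; the paper's actual markers and MT pipeline may differ:

```python
import re

ANCHOR_OPEN, ANCHOR_CLOSE = "<a>", "</a>"   # markers assumed to survive translation

def enclose(context: str, answer_start: int, answer_end: int) -> str:
    """Enclose to Anchor: wrap the answer span in marker tags."""
    return (context[:answer_start] + ANCHOR_OPEN +
            context[answer_start:answer_end] + ANCHOR_CLOSE +
            context[answer_end:])

def seek(translated: str):
    """Seek: recover the answer span from the translated, anchored context."""
    match = re.search(re.escape(ANCHOR_OPEN) + "(.*?)" + re.escape(ANCHOR_CLOSE),
                      translated)
    answer = match.group(1)
    start = match.start()  # text before the open anchor is unchanged by cleanup
    clean = translated.replace(ANCHOR_OPEN, "").replace(ANCHOR_CLOSE, "")
    return clean, answer, start

# With an identity "translation" (a real pipeline would call an MT system here):
enclosed = enclose("The capital of France is Paris.", 25, 30)
clean, answer, start = seek(enclosed)
```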
https://arxiv.org/abs/2405.01458
Recent work in cross-language information retrieval (CLIR), where queries and documents are in different languages, has shown the benefit of the Translate-Distill framework that trains a cross-language neural dual-encoder model using translation and distillation. However, Translate-Distill only supports a single document language. Multilingual information retrieval (MLIR), which ranks a multilingual document collection, is harder to train than CLIR because the model must assign comparable relevance scores to documents in different languages. This work extends Translate-Distill and proposes Multilingual Translate-Distill (MTD) for MLIR. We show that ColBERT-X models trained with MTD outperform their counterparts trained with Multilingual Translate-Train, the previous state-of-the-art training approach, by 5% to 25% in nDCG@20 and 15% to 45% in MAP. We also show that the model is robust to the way languages are mixed in training batches. Our implementation is available on GitHub.
https://arxiv.org/abs/2405.00977