The capability of accurately determining code similarity is crucial in many tasks related to software development. For example, identifying code duplicates is often essential for software maintenance. This research introduces a novel ensemble learning approach for code similarity assessment, combining the strengths of multiple unsupervised similarity measures. The key idea is that the strengths of a diverse set of similarity measures can complement each other and mitigate individual weaknesses, leading to improved performance. Preliminary results show that while the Transformer-based CodeBERT and its variant GraphCodeBERT are undoubtedly the best option in the presence of abundant training data, on specific small datasets (up to 500 samples) our ensemble achieves similar results while preserving the interpretability of the resulting solution, and with a much lower carbon footprint from training. The source code of this novel approach can be downloaded from this https URL.
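To make the ensemble idea concrete, here is a minimal sketch that averages two simple unsupervised similarity measures; the token- and trigram-based measures and the uniform weighting are illustrative assumptions, not the paper's actual measures:

```python
# Hypothetical sketch: combine simple unsupervised code-similarity measures
# by averaging, so their strengths complement each other.

def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated tokens."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def trigram_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over character trigrams (robust to renamings)."""
    ga = {a[i:i + 3] for i in range(max(len(a) - 2, 0))}
    gb = {b[i:i + 3] for i in range(max(len(b) - 2, 0))}
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

def ensemble_similarity(a: str, b: str,
                        measures=(token_jaccard, trigram_jaccard)) -> float:
    """Average the individual measures into one ensemble score in [0, 1]."""
    return sum(m(a, b) for m in measures) / len(measures)

s = ensemble_similarity("def add(x, y): return x + y",
                        "def add(a, b): return a + b")
```

In a fuller version, each measure's weight could itself be learned, which is where the ensemble-learning aspect comes in.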
https://arxiv.org/abs/2405.02095
The ability to transmit and receive complex information via language is unique to humans and is the basis of traditions, culture, and versatile social interactions. Through the disruptive introduction of transformer-based large language models (LLMs), humans are no longer the only entity that can "understand" and produce language. In the present study, we have taken the first steps toward using LLMs as a model for understanding fundamental mechanisms of language processing in neural networks, in order to make predictions and generate hypotheses on how the human brain does language processing. Thus, we used ChatGPT to generate seven different stylistic variations of ten different narratives (Aesop's fables). We used these stories as input for the open-source LLM BERT and analyzed the activation patterns of its hidden units using multidimensional scaling and cluster analysis. We found that the activation vectors of the hidden units cluster according to stylistic variation in earlier layers of BERT (layer 1) than according to narrative content (layers 4-5). Although BERT consists of 12 identical building blocks that are stacked and trained on large text corpora, its different layers perform different tasks. This makes it a useful model of the human brain, where self-similar structures, i.e. different areas of the cerebral cortex, can have different functions and are therefore well suited to processing language in a very efficient way. The proposed approach has the potential to open the black box of LLMs on the one hand, and may be a further step toward unraveling the neural processes underlying human language processing and cognition in general.
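The dimensionality-reduction step of such an analysis can be sketched with classical multidimensional scaling (MDS); the random activation vectors below are synthetic stand-ins for real BERT hidden states, and the sizes are illustrative:

```python
import numpy as np

# Hypothetical sketch: classical MDS applied to hidden-unit activation
# vectors, producing 2-D coordinates whose clusters can then be inspected.

def classical_mds(X: np.ndarray, k: int = 2) -> np.ndarray:
    """Project the row vectors of X to k dimensions via classical MDS."""
    n = X.shape[0]
    D2 = np.square(np.linalg.norm(X[:, None] - X[None, :], axis=-1))  # squared pairwise distances
    J = np.eye(n) - np.ones((n, n)) / n                               # double-centering matrix
    B = -0.5 * J @ D2 @ J
    w, v = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]                                     # top-k eigenpairs
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

rng = np.random.default_rng(0)
acts = rng.normal(size=(10, 768))    # 10 fake "hidden unit" activation vectors
coords = classical_mds(acts, k=2)    # 2-D embedding for visual clustering
```

A cluster analysis (e.g. hierarchical clustering) would then be run on `coords` per layer to see whether style or content dominates.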
https://arxiv.org/abs/2405.02024
This paper introduces OARelatedWork, the first large-scale multi-document summarization dataset for related work generation that contains whole related work sections and the full texts of cited papers. The dataset includes 94,450 papers and 5,824,689 unique referenced papers. It was designed for the task of automatically generating related work, to shift the field toward generating entire related work sections from all available content instead of generating parts of related work sections from abstracts only, which is the current mainstream for abstractive approaches in this field. We show that the estimated upper bound for extractive summarization increases by 217% in ROUGE-2 score when using full content instead of abstracts. Furthermore, we show the benefits of full-content data on naive, oracle, traditional, and transformer-based baselines. Long outputs, such as related work sections, pose challenges for automatic evaluation metrics like BERTScore due to its limited input length. We tackle this issue by proposing and evaluating a meta-metric built on BERTScore. Despite operating on smaller blocks, we show this meta-metric correlates with human judgment comparably to the original BERTScore.
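The block-wise meta-metric idea might be sketched as follows; the token-overlap F1 is a lightweight stand-in for BERTScore, and the block size and length-weighted aggregation are assumptions:

```python
# Hypothetical sketch of a block-wise meta-metric: split a long output into
# blocks that fit a scorer's input limit, score each block against the
# reference, and aggregate. `cheap_score` stands in for BERTScore here.

def split_blocks(text: str, max_tokens: int = 50) -> list[str]:
    toks = text.split()
    return [" ".join(toks[i:i + max_tokens]) for i in range(0, len(toks), max_tokens)]

def cheap_score(candidate: str, reference: str) -> float:
    """Token-overlap F1 as a placeholder for BERTScore."""
    c, r = set(candidate.split()), set(reference.split())
    if not c or not r:
        return 0.0
    p, rec = len(c & r) / len(c), len(c & r) / len(r)
    return 2 * p * rec / (p + rec) if p + rec else 0.0

def meta_metric(candidate: str, reference: str, max_tokens: int = 50) -> float:
    """Length-weighted mean of per-block scores against the full reference."""
    blocks = split_blocks(candidate, max_tokens)
    weights = [len(b.split()) for b in blocks]
    scores = [cheap_score(b, reference) for b in blocks]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

Swapping `cheap_score` for real BERTScore calls recovers the paper's setting while keeping each call within the model's input limit.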
https://arxiv.org/abs/2405.01930
Large language models (LLMs) increasingly serve as the backbone for classifying text associated with distinct domains and, simultaneously, several labels (classes). When encountering domain shifts, e.g., a classifier of movie reviews moving from IMDb to Rotten Tomatoes, adapting such an LLM-based multi-label classifier is challenging due to incomplete label sets at the target domain and daunting training overhead. Existing domain adaptation methods address either image multi-label classifiers or text binary classifiers. In this paper, we design DALLMi, a Domain Adaptation Large Language Model interpolator, a first-of-its-kind semi-supervised domain adaptation method for LLM-based text data models, specifically BERT. The core of DALLMi is a novel variation loss and MixUp regularization, which jointly leverage the limited positively labeled and large quantity of unlabeled text and, importantly, their interpolation from the BERT word embeddings. DALLMi also introduces a label-balanced sampling strategy to overcome the imbalance between labeled and unlabeled data. We evaluate DALLMi against partially-supervised and unsupervised approaches on three datasets under different scenarios of label availability for the target domain. Our results show that DALLMi achieves higher mAP than the unsupervised and partially-supervised approaches, by 19.9% and 52.2%, respectively.
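The MixUp interpolation at the heart of such an approach can be sketched on embedding tensors; the shapes and the Beta-distribution sampling below are conventional MixUp assumptions, not details taken from DALLMi:

```python
import numpy as np

# Hypothetical sketch of MixUp on word-embedding tensors: interpolate a
# labeled and an unlabeled sample with a Beta-sampled coefficient.

def mixup(emb_a: np.ndarray, emb_b: np.ndarray, alpha: float = 0.4, rng=None):
    """Return the interpolated embeddings and the mixing coefficient."""
    rng = rng or np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))          # mixing coefficient in [0, 1]
    return lam * emb_a + (1.0 - lam) * emb_b, lam

rng = np.random.default_rng(0)
a = rng.normal(size=(16, 768))   # embeddings of a positively labeled sample
b = rng.normal(size=(16, 768))   # embeddings of an unlabeled sample
mixed, lam = mixup(a, b, rng=rng)
```

The interpolated embeddings would then be fed through the rest of the encoder, with the loss weighted by `lam` accordingly.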
https://arxiv.org/abs/2405.01883
Diagnosing language disorders associated with autism is a complex and nuanced challenge, often hindered by the subjective nature and variability of traditional assessment methods. Traditional diagnostic methods not only require intensive human effort but also often result in delayed interventions due to their lack of speed and specificity. In this study, we explored the application of ChatGPT, a state-of-the-art large language model, to overcome these obstacles by enhancing diagnostic accuracy and profiling specific linguistic features indicative of autism. Leveraging ChatGPT's advanced natural language processing capabilities, this research aims to streamline and refine the diagnostic process. Specifically, we compared ChatGPT's performance with that of conventional supervised learning models, including BERT, a model acclaimed for its effectiveness in various natural language processing tasks. We showed that ChatGPT substantially outperformed these models, achieving over 13% improvement in both accuracy and F1 score in a zero-shot learning configuration. This marked enhancement highlights the model's potential as a superior tool for neurological diagnostics. Additionally, we identified ten distinct features of autism-associated language disorders that vary significantly across different experimental scenarios. These features, which included echolalia, pronoun reversal, and atypical language usage, were crucial for accurately diagnosing ASD and customizing treatment plans. Together, our findings advocate for adopting sophisticated AI tools like ChatGPT in clinical settings to assess and diagnose developmental disorders. Our approach not only promises greater diagnostic precision but also aligns with the goals of personalized medicine, potentially transforming the evaluation landscape for autism and similar neurological conditions.
https://arxiv.org/abs/2405.01799
The training of Transformer models has revolutionized natural language processing and computer vision, but it remains a resource-intensive and time-consuming process. This paper investigates the applicability of the early-bird ticket hypothesis to optimize the training efficiency of Transformer models. We propose a methodology that combines iterative pruning, masked distance calculation, and selective retraining to identify early-bird tickets in various Transformer architectures, including ViT, Swin-T, GPT-2, and RoBERTa. Our experimental results demonstrate that early-bird tickets can be consistently found within the first few epochs of training or fine-tuning, enabling significant resource optimization without compromising performance. The pruned models obtained from early-bird tickets achieve comparable or even superior accuracy to their unpruned counterparts while substantially reducing memory usage. Furthermore, our comparative analysis highlights the generalizability of the early-bird ticket phenomenon across different Transformer models and tasks. This research contributes to the development of efficient training strategies for Transformer models, making them more accessible and resource-friendly. By leveraging early-bird tickets, practitioners can accelerate the progress of natural language processing and computer vision applications while reducing the computational burden associated with training Transformer models.
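The masked-distance test for detecting an early-bird ticket might look like this minimal sketch; magnitude pruning, the 50% pruning ratio, and the 0.1 stabilization threshold are assumptions:

```python
import numpy as np

# Hypothetical sketch of the mask-distance check for early-bird tickets:
# derive a pruning mask at successive epochs and compare them; when the
# normalized Hamming distance stabilizes below a threshold, a ticket has
# emerged and training can proceed on the pruned network.

def magnitude_mask(weights: np.ndarray, prune_ratio: float = 0.5) -> np.ndarray:
    """1 where a weight survives magnitude pruning, 0 where it is removed."""
    k = int(weights.size * prune_ratio)
    thresh = np.sort(np.abs(weights), axis=None)[k]
    return (np.abs(weights) >= thresh).astype(np.uint8)

def mask_distance(m1: np.ndarray, m2: np.ndarray) -> float:
    """Fraction of positions where two pruning masks disagree."""
    return float(np.mean(m1 != m2))

w_epoch1 = np.random.default_rng(0).normal(size=1000)
w_epoch2 = w_epoch1 + 0.01 * np.random.default_rng(1).normal(size=1000)  # small update
d = mask_distance(magnitude_mask(w_epoch1), magnitude_mask(w_epoch2))
found_ticket = d < 0.1   # masks have stabilized between epochs
```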
https://arxiv.org/abs/2405.02353
Reliability of AI systems is a fundamental concern for the successful deployment and widespread adoption of AI technologies. Unfortunately, the escalating complexity and heterogeneity of AI hardware systems make them inevitably and increasingly susceptible to hardware faults (e.g., bit flips) that can potentially corrupt model parameters. Given this challenge, this paper aims to answer a critical question: How likely is a parameter corruption to result in an incorrect model output? To systematically answer this question, we propose a novel quantitative metric, the Parameter Vulnerability Factor (PVF), inspired by the architectural vulnerability factor (AVF) from the computer architecture community, aiming to standardize the quantification of AI model resilience/vulnerability against parameter corruptions. We define a model parameter's PVF as the probability that a corruption in that particular parameter will result in an incorrect output. Similar to AVF, this statistical concept can be derived from statistically extensive and meaningful fault injection (FI) experiments. In this paper, we present several use cases of applying PVF to three types of tasks/models during inference: recommendation (DLRM), vision classification (CNN), and text classification (BERT). PVF can provide pivotal insights to AI hardware designers in balancing the tradeoff between fault protection and performance/efficiency, such as mapping vulnerable AI parameter components to well-protected hardware modules. The PVF metric is applicable to any AI model and has the potential to help unify and standardize AI vulnerability/resilience evaluation practice.
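A PVF-style fault-injection loop can be sketched on a toy linear classifier; the model, trial count, and single-bit-flip fault model are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Hypothetical sketch of PVF estimation by fault injection: flip one random
# bit of one random float32 parameter, re-run inference, and count how often
# the prediction deviates from the fault-free ("golden") output.

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3)).astype(np.float32)   # toy "model parameters"
x = rng.normal(size=4).astype(np.float32)        # a fixed inference input
golden = int(np.argmax(x @ W))                   # fault-free prediction

def flip_bit(arr: np.ndarray, idx, bit: int) -> None:
    """Flip one bit of a float32 array element in place via an integer view."""
    arr.view(np.uint32)[idx] ^= np.uint32(1 << bit)

trials, incorrect = 2000, 0
for _ in range(trials):
    Wf = W.copy()
    flip_bit(Wf, (int(rng.integers(4)), int(rng.integers(3))), int(rng.integers(32)))
    if int(np.argmax(x @ Wf)) != golden:         # did the corruption flip the output?
        incorrect += 1

pvf = incorrect / trials   # estimated P(parameter corruption -> incorrect output)
```

Per-parameter PVF values would come from restricting the injection site to one parameter at a time and averaging over inputs.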
https://arxiv.org/abs/2405.01741
Automatic conversion of free-text radiology reports into structured data using Natural Language Processing (NLP) techniques is crucial for analyzing diseases on a large scale. While effective for tasks in widely spoken languages like English, generative large language models (LLMs) typically underperform on less common languages and can pose potential risks to patient privacy. Fine-tuning local NLP models is hindered by the skewed nature of real-world medical datasets, where rare findings represent a significant data imbalance. We introduce SMP-BERT, a novel prompt learning method that leverages the structured nature of reports to overcome these challenges. In our studies involving a substantial collection of Crohn's disease radiology reports in Hebrew (over 8,000 patients and 10,000 reports), SMP-BERT greatly surpassed traditional fine-tuning methods in performance, notably in detecting infrequent conditions (AUC: 0.99 vs 0.94, F1: 0.84 vs 0.34). SMP-BERT makes more accurate AI diagnostics available for low-resource languages.
https://arxiv.org/abs/2405.01682
Recent Large Language Models (LLMs) have shown the ability to generate content that is difficult or impossible to distinguish from human writing. We investigate the ability of differently-sized LLMs to replicate human writing style in short, creative texts in the domain of Showerthoughts, thoughts that may occur during mundane activities. We compare GPT-2 and GPT-Neo fine-tuned on Reddit data as well as GPT-3.5 invoked in a zero-shot manner, against human-authored texts. We measure human preference on the texts across the specific dimensions that account for the quality of creative, witty texts. Additionally, we compare the ability of humans versus fine-tuned RoBERTa classifiers to detect AI-generated texts. We conclude that human evaluators rate the generated texts slightly worse on average regarding their creative quality, but they are unable to reliably distinguish between human-written and AI-generated texts. We further provide a dataset for creative, witty text generation based on Reddit Showerthoughts posts.
https://arxiv.org/abs/2405.01660
This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. UQA is generated by translating the Stanford Question Answering Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in the translated context paragraphs. The paper describes the process of selecting and evaluating the best translation model among two candidates: Google Translator and Seamless M4T. The paper also benchmarks several state-of-the-art multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and reports promising results. XLM-RoBERTa-XL achieves an F1 score of 85.99 and an EM of 74.56. UQA is a valuable resource for developing and testing multilingual NLP systems for Urdu and for enhancing the cross-lingual transferability of existing models. Further, the paper demonstrates the effectiveness of EATS for creating high-quality datasets for other languages and domains. The UQA dataset and the code are publicly available at this http URL.
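The EATS pipeline can be sketched as enclose, translate, seek; `fake_translate` below is a placeholder for a real MT system, and the `<< >>` anchor tokens are assumptions:

```python
import re

# Hypothetical sketch of the EATS idea: enclose the answer span in anchor
# markers, translate the whole paragraph, then seek the markers to recover
# the translated span and its character offsets.

ANCHOR_L, ANCHOR_R = "<<", ">>"

def enclose(context: str, start: int, end: int) -> str:
    """Wrap context[start:end] (the answer span) in anchor markers."""
    return context[:start] + ANCHOR_L + context[start:end] + ANCHOR_R + context[end:]

def fake_translate(text: str) -> str:
    """Placeholder MT system: uppercases the text but, like a real MT
    system is assumed to, passes the anchor markers through unchanged."""
    return text.upper()

def seek(translated: str):
    """Recover the answer span and its offsets from the translated text."""
    m = re.search(re.escape(ANCHOR_L) + "(.*?)" + re.escape(ANCHOR_R), translated)
    span = m.group(1)
    return span, m.start(), m.start() + len(span)

ctx = "The capital of France is Paris."
marked = enclose(ctx, 25, 30)                 # answer span "Paris"
span, s, e = seek(fake_translate(marked))
```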
https://arxiv.org/abs/2405.01458
Recent work in cross-language information retrieval (CLIR), where queries and documents are in different languages, has shown the benefit of the Translate-Distill framework, which trains a cross-language neural dual-encoder model using translation and distillation. However, Translate-Distill only supports a single document language. Multilingual information retrieval (MLIR), which ranks a multilingual document collection, is harder to train than CLIR because the model must assign comparable relevance scores to documents in different languages. This work extends Translate-Distill and proposes Multilingual Translate-Distill (MTD) for MLIR. We show that ColBERT-X models trained with MTD outperform their counterparts trained with Multilingual Translate-Train, the previous state-of-the-art training approach, by 5% to 25% in nDCG@20 and 15% to 45% in MAP. We also show that the model is robust to the way languages are mixed in training batches. Our implementation is available on GitHub.
https://arxiv.org/abs/2405.00977
PLAID, an efficient implementation of the ColBERT late-interaction bi-encoder that uses pretrained language models for ranking, consistently achieves state-of-the-art performance in monolingual, cross-language, and multilingual retrieval. PLAID differs from ColBERT by assigning terms to clusters and representing those terms as cluster centroids plus compressed residual vectors. While PLAID is effective in batch experiments, its performance degrades in streaming settings where documents arrive over time, because representations of new tokens may be poorly modeled by the earlier tokens used to select cluster centroids. PLAID Streaming Hierarchical Indexing that Runs on Terabytes of Temporal Text (PLAID SHIRTTT) addresses this concern using multi-phase incremental indexing based on hierarchical sharding. Experiments on ClueWeb09 and the multilingual NeuCLIR collection demonstrate the effectiveness of this approach on the largest collection indexed to date by the ColBERT architecture and in the multilingual setting, respectively.
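The centroid-plus-residual representation can be sketched as follows; the quantization step, vector sizes, and pre-computed centroids are illustrative assumptions:

```python
import numpy as np

# Hypothetical sketch of a centroid-plus-residual encoding: each token vector
# is stored as its nearest centroid's ID plus a coarsely quantized residual,
# which is much smaller than storing the full float vector.

rng = np.random.default_rng(0)
centroids = rng.normal(size=(8, 16))   # pretend k-means centroids
tokens = rng.normal(size=(5, 16))      # token embeddings to compress

def compress(vecs, cents, step=0.25):
    """Return (centroid ids, int8-quantized residuals) for each vector."""
    ids = np.argmin(((vecs[:, None] - cents[None]) ** 2).sum(-1), axis=1)
    residuals = np.round((vecs - cents[ids]) / step).astype(np.int8)
    return ids, residuals

def decompress(ids, residuals, cents, step=0.25):
    """Approximately reconstruct the original vectors."""
    return cents[ids] + residuals.astype(np.float32) * step

ids, res = compress(tokens, centroids)
approx = decompress(ids, res, centroids)
err = float(np.abs(approx - tokens).max())   # bounded by step / 2
```

The streaming problem described above arises because `centroids` are fixed from early data; later tokens may land far from every centroid, inflating the residuals.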
https://arxiv.org/abs/2405.00975
This study introduces a systematic framework to compare the efficacy of Large Language Models (LLMs) for fine-tuning across various cheminformatics tasks. Employing a uniform training methodology, we assessed three well-known models, RoBERTa, BART, and LLaMA, on their ability to predict molecular properties using the Simplified Molecular Input Line Entry System (SMILES) as a universal molecular representation format. Our comparative analysis involved pre-training 18 configurations of these models, with varying parameter sizes and dataset scales, followed by fine-tuning them on six benchmarking tasks from DeepChem. We maintained consistent training environments across models to ensure reliable comparisons. This approach allowed us to assess the influence of model type, size, and training dataset size on model performance. Specifically, we found that LLaMA-based models generally offered the lowest validation loss, suggesting their superior adaptability across tasks and scales. However, we observed that absolute validation loss is not a definitive indicator of model performance, contradicting previous research, at least for fine-tuning tasks: instead, model size plays a crucial role. Through rigorous replication and validation, involving multiple training and fine-tuning cycles, our study not only delineates the strengths and limitations of each model type but also provides a robust methodology for selecting the most suitable LLM for specific cheminformatics applications. This research underscores the importance of considering model architecture and dataset characteristics in deploying AI for molecular property prediction, paving the way for more informed and effective utilization of AI in drug discovery and related fields.
https://arxiv.org/abs/2405.00949
Over the last decade, similar to other application domains, social media content has proven very effective in disaster informatics. However, due to the unstructured nature of the data, several challenges are associated with disaster analysis of social media content. To fully explore the potential of social media content in disaster informatics, access to relevant content and correct geo-location information is critical. In this paper, we propose a three-step solution to tackling these challenges. Firstly, the proposed solution classifies social media posts into relevant and irrelevant posts, followed by the automatic extraction of location information from the posts' text through Named Entity Recognition (NER) analysis. Finally, to quickly analyze the topics covered in large volumes of social media posts, we perform topic modeling, resulting in a list of top keywords that highlight the issues discussed in the tweets. For the Relevant Classification of Twitter Posts (RCTP), we propose a merit-based fusion framework combining the capabilities of four different models, namely BERT, RoBERTa, DistilBERT, and ALBERT, obtaining the highest F1-score of 0.933 on a benchmark dataset. For the Location Extraction from Twitter Text (LETT), we evaluated four models, namely BERT, RoBERTa, DistilBERT, and ELECTRA, in an NER framework, obtaining the highest F1-score of 0.960. For topic modeling, we used the BERTopic library to discover the hidden topic patterns in the relevant tweets. The experimental results of all the components of the proposed end-to-end solution are very encouraging and hint at the potential of social media content and NLP in disaster management.
https://arxiv.org/abs/2405.00903
Semantic textual relatedness is a broader concept than semantic similarity. It measures the extent to which two chunks of text convey similar meaning or topics, or share related concepts or contexts. This notion of relatedness can be applied in various applications, such as document clustering and summarization. SemRel-2024, a shared task in SemEval-2024, aims to reduce the gap in the semantic relatedness task by providing datasets for fourteen languages and dialects, including Arabic. This paper reports on our participation in Track A (Algerian and Moroccan dialects) and Track B (Modern Standard Arabic). A BERT-based model is augmented and fine-tuned for regression scoring in the supervised track (A), while BERT-based cosine similarity is employed for the unsupervised track (B). Our system ranked 1st in SemRel-2024 for MSA with a Spearman correlation score of 0.49. We ranked 5th for Moroccan and 12th for Algerian, with scores of 0.83 and 0.53, respectively.
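The unsupervised track's scoring reduces to cosine similarity over sentence embeddings; here is a minimal sketch in which random vectors stand in for mean-pooled BERT sentence vectors:

```python
import numpy as np

# Hypothetical sketch of unsupervised relatedness scoring: cosine similarity
# between sentence embeddings (random stand-ins for BERT vectors).

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
emb_a = rng.normal(size=768)                 # sentence A
emb_b = emb_a + 0.1 * rng.normal(size=768)   # a closely related sentence
emb_c = rng.normal(size=768)                 # an unrelated sentence

sim_related = cosine(emb_a, emb_b)
sim_unrelated = cosine(emb_a, emb_c)
```

Ranking pairs by this score is what a Spearman correlation against human relatedness judgments would then evaluate.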
https://arxiv.org/abs/2405.00659
Voice conversion is the task of transforming the voice characteristics of source speech while preserving content information. Nowadays, self-supervised representation learning models are increasingly utilized for content extraction. However, these representations carry a lot of residual speaker information, which leads to timbre leakage, while the prosodic information of the hidden units goes unused. To address these issues, we propose a novel framework for expressive voice conversion called "SAVC", based on soft speech units from HuBert-soft. Taking soft speech units as input, we design an attribute encoder to extract content and prosody features respectively. Specifically, we first introduce statistic perturbation imposed by adversarial style augmentation to eliminate speaker information. Then prosody is implicitly modeled on the soft speech units with knowledge distillation. Experimental results show that the intelligibility and naturalness of the converted speech outperform previous work.
https://arxiv.org/abs/2405.00603
In the distributed systems landscape, Blockchain has catalyzed the rise of cryptocurrencies, merging enhanced security and decentralization with significant investment opportunities. Despite their potential, current research on cryptocurrency trend forecasting often falls short by simplistically merging sentiment data without fully considering the nuanced interplay between financial market dynamics and external sentiment influences. This paper presents a novel Dual Attention Mechanism (DAM) for forecasting cryptocurrency trends using multimodal time-series data. Our approach, which integrates critical cryptocurrency metrics with sentiment data from news and social media analyzed through CryptoBERT, addresses the inherent volatility and prediction challenges in cryptocurrency markets. By combining elements of distributed systems, natural language processing, and financial forecasting, our method outperforms conventional models like LSTM and Transformer by up to 20% in prediction accuracy. This advancement deepens the understanding of distributed systems and has practical implications in financial markets, benefiting stakeholders in cryptocurrency and blockchain technologies. Moreover, our enhanced forecasting approach can significantly support decentralized science (DeSci) by facilitating strategic planning and the efficient adoption of blockchain technologies, improving operational efficiency and financial risk management in the rapidly evolving digital asset domain, thus ensuring optimal resource allocation.
https://arxiv.org/abs/2405.00522
The use of automatic short answer grading (ASAG) models may help alleviate the time burden of grading while encouraging educators to frequently incorporate open-ended items in their curriculum. However, current state-of-the-art ASAG models are large neural networks (NNs) often described as "black boxes", providing no explanation for which characteristics of an input are important for the produced output. This inexplicable nature can be frustrating to teachers and students when trying to interpret, or learn from, an automatically generated grade. To create a powerful yet intelligible ASAG model, we experiment with a type of model called a Neural Additive Model (NAM) that combines the performance of an NN with the explainability of an additive model. We use a Knowledge Integration (KI) framework from the learning sciences to guide feature engineering to create inputs that reflect whether a student includes certain ideas in their response. We hypothesize that indicating the inclusion (or exclusion) of predefined ideas as features will be sufficient for the NAM to have good predictive power and interpretability, as this may guide a human scorer using a KI rubric. We compare the performance of the NAM with another explainable model, logistic regression, using the same features, and with a non-explainable neural model, DeBERTa, that does not require feature engineering.
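A NAM forward pass can be sketched directly: one tiny network per feature, with the per-feature contributions summed so each one can be inspected; the sizes, activations, and random weights below are illustrative assumptions:

```python
import numpy as np

# Hypothetical sketch of a Neural Additive Model forward pass: each input
# feature is fed through its own small (1 -> hidden -> 1) network, and the
# scalar contributions are summed, making every feature's effect inspectable.

rng = np.random.default_rng(0)
n_features, hidden = 3, 8
W1 = rng.normal(size=(n_features, 1, hidden))   # per-feature first layer
W2 = rng.normal(size=(n_features, hidden, 1))   # per-feature second layer

def nam_forward(x: np.ndarray):
    """Return the per-feature contributions and the summed logit."""
    contribs = []
    for j in range(n_features):
        h = np.tanh(x[j:j + 1, None] @ W1[j])        # (1, hidden) activation
        contribs.append((h @ W2[j]).item())          # scalar contribution of feature j
    return contribs, sum(contribs)

contribs, logit = nam_forward(np.array([0.5, -1.0, 2.0]))
```

Because the output is the sum of `contribs`, a teacher can see exactly how much each KI-rubric feature pushed the grade up or down.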
https://arxiv.org/abs/2405.00489
This study addresses the challenge of detecting semantic column types in relational tables, a key task in many real-world applications. While language models like BERT have improved prediction accuracy, their token input constraints limit the simultaneous processing of intra-table and inter-table information. We propose a novel approach using Graph Neural Networks (GNNs) to model intra-table dependencies, allowing language models to focus on inter-table information. Our proposed method not only outperforms existing state-of-the-art algorithms but also offers novel insights into the utility and functionality of various GNN types for semantic type detection. The code is available at this https URL
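One round of GNN message passing over a table's columns might be sketched like this, assuming columns of the same table form a fully connected graph; the feature sizes and mean aggregation are illustrative assumptions:

```python
import numpy as np

# Hypothetical sketch of intra-table message passing: each column node
# averages its neighbours' features (columns of the same table), then applies
# a learned linear transform and ReLU.

rng = np.random.default_rng(0)
n_cols, dim = 4, 16
H = rng.normal(size=(n_cols, dim))                # per-column feature vectors
A = np.ones((n_cols, n_cols)) - np.eye(n_cols)    # fully connected, no self-loops

def gnn_layer(H: np.ndarray, A: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Mean-aggregate neighbour features, then linear transform + ReLU."""
    msgs = (A @ H) / A.sum(axis=1, keepdims=True)
    return np.maximum(msgs @ W, 0.0)

W = rng.normal(size=(dim, dim))
H1 = gnn_layer(H, A, W)    # updated column representations
```

The updated column vectors would then be passed to the language model, which is free to spend its token budget on inter-table information.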
https://arxiv.org/abs/2405.00123
Current text generation models are trained on real data that can contain sensitive information, such as confidential patient information. Under certain conditions, a model can be triggered to output training data it has memorised, exposing sensitive data. To mitigate this risk, we propose a safer alternative in which fragmented data, in the form of domain-specific short phrases randomly grouped together, is shared instead of full texts. Thus, text fragments that could re-identify an individual cannot be reproduced by the model in one sequence, giving significant protection against linkage attacks. We fine-tune several state-of-the-art LLMs using meaningful syntactic chunks to explore their utility. In particular, we fine-tune BERT-based models to predict two cardiovascular diagnoses. Our results demonstrate the capacity of LLMs to benefit from their pre-trained knowledge and, when fine-tuned with fragmented data, deliver classification results comparable to fine-tuning with the full training data.
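The fragmentation scheme might be sketched as follows; the chunk length, group size, and `|` separator are assumptions, not details from the paper:

```python
import random

# Hypothetical sketch of the fragmentation idea: break each document into
# short phrase chunks, pool and shuffle the chunks across documents, and emit
# randomly grouped fragments so no single training sequence reproduces a
# full text.

def fragment(docs: list[str], chunk_len: int = 4, group_size: int = 5,
             seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    chunks = []
    for doc in docs:
        toks = doc.split()
        chunks += [" ".join(toks[i:i + chunk_len])
                   for i in range(0, len(toks), chunk_len)]
    rng.shuffle(chunks)   # decouple chunks from their source documents
    return [" | ".join(chunks[i:i + group_size])
            for i in range(0, len(chunks), group_size)]

docs = ["patient presents with chest pain and shortness of breath",
        "history of hypertension managed with medication"]
fragments = fragment(docs)
```

All the domain vocabulary survives for fine-tuning, but a linkage attack can no longer recover any full record from one training sequence.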
https://arxiv.org/abs/2404.19486