With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for effective detection methods. Unlike traditional deepfake audio generation, which often involves multi-step processes culminating in vocoder usage, ALMs directly utilize neural codec methods to decode discrete codes into audio. Moreover, driven by large-scale data, ALMs exhibit remarkable robustness and versatility, posing a significant challenge to current audio deepfake detection (ADD) models. To effectively detect ALM-based deepfake audio, we focus on the mechanism of ALM-based audio generation: the conversion from neural codec to waveform. We first construct the Codecfake dataset, an open-source large-scale dataset tailored for ALM-based audio detection, covering two languages, millions of audio samples, and various test conditions. Additionally, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original SAM (sharpness-aware minimization), we propose the CSAM strategy to learn a domain-balanced and generalized minimum. Experimental results demonstrate that co-training on the Codecfake and vocoded datasets with the CSAM strategy yields the lowest average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models.
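The headline metric here is the Equal Error Rate. As a reference point, below is a minimal, self-contained sketch of how EER is typically computed from detector scores; the scores are synthetic, and this is not the authors' evaluation code:

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """EER: the operating point where the false acceptance rate (spoofs
    accepted as bonafide) equals the false rejection rate (bonafide rejected)."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

rng = np.random.default_rng(0)
bonafide = rng.normal(1.0, 0.5, 1000)   # higher score = more likely genuine
spoof = rng.normal(-1.0, 0.5, 1000)
print(f"EER: {compute_eer(bonafide, spoof):.3%}")
```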
https://arxiv.org/abs/2405.04880
In this report, we present ChuXin, an entirely open-source language model with a size of 1.6 billion parameters. Unlike the majority of works that only open-source the model weights and architecture, we have made everything needed to train a model available, including the training data, the training process, and the evaluation code. Our goal is to empower and strengthen the open research community, fostering transparency and enabling a new wave of innovation in the field of language modeling. Furthermore, we extend the context length to 1M tokens through lightweight continual pretraining and demonstrate strong needle-in-a-haystack retrieval performance. The weights for both models are available on Hugging Face to download and use.
https://arxiv.org/abs/2405.04828
Generalized Entity Matching (GEM), which aims at judging whether two records represented in different formats refer to the same real-world entity, is an essential task in data management. The prompt tuning paradigm for pre-trained language models (PLMs), including the recent PromptEM model, effectively addresses the challenges of low-resource GEM in practical applications, offering a robust solution when labeled data is scarce. However, existing prompt tuning models for GEM face the challenges of prompt design and the information gap. This paper introduces an augmented prompt tuning framework to address these challenges, which consists of two main improvements. The first is an augmented contextualized soft token-based prompt tuning method that extracts a guiding soft token to benefit the PLMs' prompt tuning, and the second is a cost-effective information augmentation strategy leveraging large language models (LLMs). Our approach performs well on the low-resource GEM challenges. Extensive experiments show that our basic model, without information augmentation, achieves promising advancements over existing methods based on moderate-size PLMs (5.24%+ on average), and our model with information augmentation achieves performance comparable to fine-tuned LLMs while using less than 14% of the API fee.
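For readers unfamiliar with the soft-token mechanism this framework builds on, the sketch below shows generic soft prompt tuning in PyTorch: trainable embeddings prepended to the input embeddings of a (typically frozen) PLM. The paper's contextualized guiding-token extraction is not reproduced here; this is only the underlying mechanism:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Generic soft prompt: n_tokens trainable vectors prepended to the
    PLM's input embeddings (illustrative, not the paper's exact model)."""
    def __init__(self, n_tokens: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, d_model)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

sp = SoftPrompt(n_tokens=8, d_model=32)
print(sp(torch.randn(4, 10, 32)).shape)   # torch.Size([4, 18, 32])
```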
https://arxiv.org/abs/2405.04820
Recent advancements in large language models (LLMs) have achieved promising performance across various applications. Nonetheless, the ongoing challenge of integrating long-tail knowledge continues to impede the seamless adoption of LLMs in specialized domains. In this work, we introduce DALK, a.k.a. Dynamic Co-Augmentation of LLMs and KG, to address this limitation and demonstrate its ability in studying Alzheimer's Disease (AD), a specialized sub-field in biomedicine and a global health priority. With a synergized framework in which the LLM and KG mutually enhance each other, we first leverage the LLM to construct an evolving AD-specific knowledge graph (KG) sourced from AD-related scientific literature, and then we utilize a coarse-to-fine sampling method with a novel self-aware knowledge retrieval approach to select appropriate knowledge from the KG to augment LLM inference capabilities. Experiments conducted on our constructed AD question answering (ADQA) benchmark underscore the efficacy of DALK. Additionally, we perform a series of detailed analyses that can offer valuable insights and guidelines for the emerging topic of mutually enhancing KG and LLM. We will release the code and data at this https URL.
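To make the coarse-to-fine idea concrete, here is a toy two-stage selection over KG triples: a cheap lexical filter followed by embedding-based ranking. The triples, scoring, and stand-in embeddings are illustrative assumptions, not DALK's actual pipeline:

```python
import numpy as np

triples = [
    ("amyloid beta", "accumulates_in", "hippocampus"),
    ("APOE4", "increases_risk_of", "Alzheimer's disease"),
    ("tau protein", "forms", "neurofibrillary tangles"),
    ("caffeine", "found_in", "coffee"),
]

def coarse_filter(query, triples):
    # Coarse stage: keep triples sharing at least one word with the query.
    q = set(query.lower().replace("?", "").split())
    return [t for t in triples if q & set(" ".join(t).lower().split())]

def fine_rank(query_vec, cand_vecs, candidates, k=2):
    # Fine stage: rank surviving candidates by embedding similarity.
    order = np.argsort(-(cand_vecs @ query_vec))[:k]
    return [candidates[i] for i in order]

query = "Which protein changes increase the risk of Alzheimer's disease?"
cands = coarse_filter(query, triples)
rng = np.random.default_rng(1)
print(fine_rank(rng.normal(size=8), rng.normal(size=(len(cands), 8)), cands))
```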
https://arxiv.org/abs/2405.04819
Evaluating free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to gain insights into how LLMs evaluate explanations. We observed that replacing one of the human ratings with an LLM rating sometimes maintained, but more often lowered, inter-annotator agreement across different settings and quality aspects, suggesting that LLM judgments are not always consistent with those of human raters. We further quantified this difference by comparing the correlation between LLM-generated ratings and majority-voted human ratings across different quality aspects. With the best system, Spearman's rank correlation ranged from 0.53 to 0.95, averaging 0.72 across aspects, indicating moderately high but imperfect alignment. Finally, we considered the alternative of using an LLM as an additional rater when human raters are scarce, and measured how well majority-voted labels from a limited human pool augmented with an LLM rater correlated with the original gold labels. While GPT-4 improved the outcome when there were only two human raters, in all other observed cases with three or more human raters, LLMs ranged from neutral to detrimental. We publicly release the dataset to support future improvements in LLM-in-the-loop evaluation here: this https URL.
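The correlation analysis described above can be reproduced in outline with a few lines of NumPy/SciPy. Everything below is synthetic stand-in data, shown only to make the majority-voting and Spearman-correlation setup concrete:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
n_items = 200
human = rng.integers(1, 6, size=(n_items, 3))     # 3 human raters, 1-5 scale

def majority_vote(ratings):
    # Mode per item; ties resolve to the smallest rating (one simple convention).
    return np.array([np.bincount(row).argmax() for row in ratings])

gold = majority_vote(human)
# A synthetic "LLM rater" that mostly agrees with the gold labels.
llm = np.clip(gold + rng.integers(-1, 2, size=n_items), 1, 5)

rho, p = spearmanr(llm, gold)
print(f"Spearman rho = {rho:.2f} (p = {p:.2e})")
```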
https://arxiv.org/abs/2405.04818
Counterfactual examples are frequently used for model development and evaluation in many natural language processing (NLP) tasks. Although methods for automated counterfactual generation have been explored, such methods depend on models such as pre-trained language models that are then fine-tuned on auxiliary, often task-specific datasets. Collecting and annotating such datasets for counterfactual generation is labor intensive and therefore infeasible in practice. In this work, we therefore focus on a novel problem setting: zero-shot counterfactual generation. To this end, we propose a structured way to utilize large language models (LLMs) as general purpose counterfactual example generators. We hypothesize that the instruction-following and textual understanding capabilities of recent LLMs can be effectively leveraged for generating high quality counterfactuals in a zero-shot manner, without requiring any training or fine-tuning. Through comprehensive experiments on various downstream tasks in natural language processing (NLP), we demonstrate the efficacy of LLMs as zero-shot counterfactual generators in evaluating and explaining black-box NLP models.
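As a rough illustration of the zero-shot setting, the snippet below builds a counterfactual-generation prompt for an instruction-following LLM. The instruction wording is our own assumption, not the paper's template, and `generate` stands in for whatever LLM call is used:

```python
def counterfactual_prompt(text: str, source_label: str, target_label: str) -> str:
    return (
        "You are given a piece of text and its label. Apply minimal edits "
        f"so that the label changes from '{source_label}' to '{target_label}', "
        "while keeping the text fluent and otherwise unchanged.\n\n"
        f"Text: {text}\nEdited text:"
    )

prompt = counterfactual_prompt(
    "The movie was an absolute delight from start to finish.",
    source_label="positive",
    target_label="negative",
)
# response = generate(prompt)  # zero-shot: no training or fine-tuning involved
print(prompt)
```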
https://arxiv.org/abs/2405.04793
Change Detection (CD) aims to identify pixels with semantic changes between images. However, annotating massive numbers of pixel-level images is labor-intensive and costly, especially for multi-temporal images, which require pixel-wise comparisons by human experts. Considering the excellent performance of vision-language models (VLMs) on zero-shot and open-vocabulary tasks with prompt-based reasoning, it is promising to utilize VLMs to improve CD under limited labeled data. In this paper, we propose a VLM guidance-based semi-supervised CD method, namely DiffMatch. The insight of DiffMatch is to synthesize free change labels using VLMs to provide additional supervision signals for unlabeled data. However, almost all current VLMs are designed for single-temporal images and cannot be directly applied to bi- or multi-temporal images. Motivated by this, we first propose a VLM-based mixed change event generation (CEG) strategy to yield pseudo labels for unlabeled CD data. Since the additional supervision signals provided by these VLM-driven pseudo labels may conflict with the pseudo labels from the consistency regularization paradigm (e.g., FixMatch), we propose a dual projection head to disentangle the different signal sources. Further, we explicitly decouple the semantic representations of the bi-temporal images through two auxiliary segmentation decoders, which are also guided by the VLM. Finally, to help the model capture change representations more adequately, we introduce metric-aware supervision via a feature-level contrastive loss in the auxiliary branches (see the sketch below). Extensive experiments show the advantage of DiffMatch. For instance, DiffMatch improves the FixMatch baseline by +5.3 IoU on WHU-CD and by +2.4 IoU on LEVIR-CD with 5% labels. In addition, our CEG strategy, in an unsupervised manner, can achieve performance far superior to state-of-the-art unsupervised CD methods.
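The metric-aware supervision above is a feature-level contrastive loss; the generic InfoNCE-style sketch below illustrates the mechanism. This is a common formulation and may differ from DiffMatch's exact loss:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, tau: float = 0.1):
    """anchor, positive: (N, D) feature batches. Row i of `positive` is the
    positive for row i of `anchor`; all other rows serve as negatives."""
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.t() / tau              # (N, N) cosine similarities / temperature
    targets = torch.arange(a.size(0))     # diagonal entries = matching pairs
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```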
https://arxiv.org/abs/2405.04788
Image anomaly detection has been a challenging task in the computer vision field. The advent of vision-language models, particularly the rise of CLIP-based frameworks, has opened new avenues for zero-shot anomaly detection. Recent studies have explored the use of CLIP by aligning images with normal and abnormal prompt descriptions. However, the exclusive dependence on textual guidance often falls short, highlighting the critical importance of additional visual references. In this work, we introduce a Dual-Image Enhanced CLIP approach, leveraging a joint vision-language scoring system. Our method processes pairs of images, utilizing each as a visual reference for the other, thereby enriching the inference process with visual context. This dual-image strategy markedly enhances both anomaly classification and localization performance. Furthermore, we have strengthened our model with a test-time adaptation module that incorporates synthesized anomalies to refine localization capabilities. Our approach significantly exploits the potential of vision-language joint anomaly detection and demonstrates performance comparable to current SOTA methods across various datasets.
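A minimal sketch of the dual-image scoring idea follows, under our own simplifying assumptions: image and text features are presumed already extracted by a CLIP-style encoder and L2-normalized, and the two score terms are combined with equal weights:

```python
import torch
import torch.nn.functional as F

def dual_image_scores(feat_a, feat_b, text_normal, text_abnormal):
    """Each image in the pair serves as the visual reference for the other:
    its anomaly score mixes text alignment with distance to the reference."""
    def score(feat, ref):
        sims = torch.stack([feat @ text_normal, feat @ text_abnormal])
        text_score = torch.softmax(100 * sims, dim=0)[1]   # P(abnormal | text)
        visual_score = 1 - (feat @ ref).clamp(min=0)       # dissimilarity to pair
        return 0.5 * text_score + 0.5 * visual_score       # joint scoring
    return score(feat_a, feat_b), score(feat_b, feat_a)

d = 512
unit = lambda: F.normalize(torch.randn(d), dim=0)          # stand-in features
s_a, s_b = dual_image_scores(unit(), unit(), unit(), unit())
print(f"anomaly scores: {s_a:.3f}, {s_b:.3f}")
```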
https://arxiv.org/abs/2405.04782
Large language models (LLMs) have demonstrated astonishing capabilities in natural language processing (NLP) tasks, sparking interest in their application to professional domains with higher specialized requirements. However, restricted access to closed-source LLMs via APIs and the difficulty of collecting massive high-quality datasets pose obstacles to the development of large language models for education across various courses. Given these challenges, we propose CourseGPT-zh, a course-oriented education LLM that supports customization and low-cost deployment. To address the comprehensiveness and diversity requirements of course-specific corpora, we design a high-quality question-answering corpus distillation framework incorporating prompt optimization, which effectively mines textbook knowledge and enhances its diversity. Moreover, considering the alignment of LLM responses with user needs, a novel method for discrete prompt optimization based on LLM-as-Judge is introduced. During optimization, this framework leverages the LLM's ability to reflect on and exploit error feedback and patterns, yielding prompts that meet user needs and preferences while reducing response length. Lastly, we obtain CourseGPT-zh from an open-source LLM using parameter-efficient fine-tuning. Experimental results show that our discrete prompt optimization framework effectively improves the response quality of ChatGPT, and CourseGPT-zh exhibits strong professional capabilities in specialized knowledge question-answering, significantly outperforming comparable open-source models.
https://arxiv.org/abs/2405.04781
Agents represent one of the most prominent emerging applications of Large Language Models (LLMs) and Generative AI, with their effectiveness hinging on multimodal capabilities to navigate complex user environments. Conversational Health Agents (CHAs), a prime example of this, are redefining healthcare by offering nuanced support that transcends textual analysis to incorporate emotional intelligence. This paper introduces an LLM-based CHA engineered for rich, multimodal dialogue, especially in the realm of mental health support. It adeptly interprets and responds to users' emotional states by analyzing multimodal cues, thus delivering contextually aware and empathetically resonant verbal responses. Our implementation leverages the versatile openCHA framework, and our comprehensive evaluation involves neutral prompts expressed in diverse emotional tones: sadness, anger, and joy. We evaluate the consistency and repeatability of the planning capability of the proposed CHA. Furthermore, human evaluators critique the CHA's empathic delivery, with findings revealing a striking concordance between the CHA's outputs and evaluators' assessments. These results affirm the indispensable role of vocal (soon multimodal) emotion recognition in strengthening the empathetic connection built by CHAs, cementing their place at the forefront of interactive, compassionate digital health solutions.
https://arxiv.org/abs/2405.04777
Large language model (LLM) performance on reasoning problems typically does not generalize out of distribution. Previous work has claimed that this can be mitigated by modifying prompts to include examples with chains of thought (demonstrations of solution procedures), with the intuition that it is possible to in-context teach an LLM an algorithm for solving the problem. This paper presents a case study of chain of thought on problems from Blocksworld, a classical planning domain, and examines the performance of two state-of-the-art LLMs across two axes: generality of examples given in the prompt, and complexity of problems queried with each prompt. While our problems are very simple, we only find meaningful performance improvements from chain of thought prompts when those prompts are exceedingly specific to their problem class, and we find that those improvements quickly deteriorate as the size n of the query-specified stack grows past the size of stacks shown in the examples. Our results hint that, contrary to previous claims in the literature, CoT's performance improvements do not stem from the model learning general algorithmic procedures via demonstrations but instead depend on carefully engineering highly problem-specific prompts. This spotlights drawbacks of chain of thought, especially because of the sharp tradeoff between possible performance gains and the amount of human labor necessary to generate examples with correct reasoning traces.
https://arxiv.org/abs/2405.04776
To build a cross-modal latent space between 3D human motion and language, acquiring large-scale and high-quality human motion data is crucial. However, unlike the abundance of image data, the scarcity of motion data has limited the performance of existing motion-language models. To counter this, we introduce "motion patches", a new representation of motion sequences, and propose using Vision Transformers (ViT) as motion encoders via transfer learning, aiming to extract useful knowledge from the image domain and apply it to the motion domain. These motion patches, created by dividing and sorting skeleton joints based on body parts in motion sequences, are robust to varying skeleton structures, and can be regarded as color image patches in ViT. We find that transfer learning with pre-trained weights of ViT obtained through training with 2D image data can boost the performance of motion analysis, presenting a promising direction for addressing the issue of limited motion data. Our extensive experiments show that the proposed motion patches, used jointly with ViT, achieve state-of-the-art performance in the benchmarks of text-to-motion retrieval, and other novel challenging tasks, such as cross-skeleton recognition, zero-shot motion classification, and human interaction recognition, which are currently impeded by the lack of data.
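A hedged sketch of the "motion patch" construction follows: group skeleton joints by body part, slice the motion along time, and treat each (joints x time) slice as an image-like patch with (x, y, z) as color channels. The 20-joint grouping and window size are our illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

# Illustrative body-part grouping for a 20-joint skeleton (an assumption
# for this sketch; the paper's dividing/sorting may differ).
BODY_PARTS = {
    "torso":     [0, 1, 2, 3],
    "left_arm":  [4, 5, 6, 7],
    "right_arm": [8, 9, 10, 11],
    "left_leg":  [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}

def motion_patches(motion, patch_len=16):
    """motion: (T, J, 3) joint positions.
    Returns (n_windows * n_parts, joints_per_part, patch_len, 3): one
    image-like patch per (body part, time window), xyz as channels."""
    T = (motion.shape[0] // patch_len) * patch_len
    patches = []
    for start in range(0, T, patch_len):
        window = motion[start:start + patch_len]          # (patch_len, J, 3)
        for idx in BODY_PARTS.values():
            part = window[:, idx]                         # (patch_len, 4, 3)
            patches.append(part.transpose(1, 0, 2))       # (4, patch_len, 3)
    return np.stack(patches)

patches = motion_patches(np.random.default_rng(0).normal(size=(64, 20, 3)))
print(patches.shape)   # (20, 4, 16, 3) -> embedded like ViT color image patches
```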
https://arxiv.org/abs/2405.04771
The rapid advancement of Large Language Models (LLMs) has opened up new opportunities for leveraging artificial intelligence in various domains, including cybersecurity. As the volume and sophistication of cyber threats continue to grow, there is an increasing need for intelligent systems that can automatically detect vulnerabilities, analyze malware, and respond to attacks. In this survey, we conduct a comprehensive review of the literature on the application of LLMs in cybersecurity (LLM4Security). By comprehensively collecting over 30K relevant papers and systematically analyzing 127 papers from top security and software engineering venues, we aim to provide a holistic view of how LLMs are being used to solve diverse problems across the cybersecurity domain. Through our analysis, we identify several key findings. First, we observe that LLMs are being applied to a wide range of cybersecurity tasks, including vulnerability detection, malware analysis, network intrusion detection, and phishing detection. Second, we find that the datasets used for training and evaluating LLMs in these tasks are often limited in size and diversity, highlighting the need for more comprehensive and representative datasets. Third, we identify several promising techniques for adapting LLMs to specific cybersecurity domains, such as fine-tuning, transfer learning, and domain-specific pre-training. Finally, we discuss the main challenges and opportunities for future research in LLM4Security, including the need for more interpretable and explainable models, the importance of addressing data privacy and security concerns, and the potential for leveraging LLMs for proactive defense and threat hunting. Overall, our survey provides a comprehensive overview of the current state-of-the-art in LLM4Security and identifies several promising directions for future research.
https://arxiv.org/abs/2405.04760
Modern large language models (LLMs) have a significant amount of world knowledge, which enables strong performance in commonsense reasoning and knowledge-intensive tasks when harnessed properly. The language model can also learn social biases, which has a significant potential for societal harm. There have been many mitigation strategies proposed for LLM safety, but it is unclear how effective they are for eliminating social biases. In this work, we propose a new methodology for attacking language models with knowledge graph augmented generation. We refactor natural language stereotypes into a knowledge graph, and use adversarial attacking strategies to induce biased responses from several open- and closed-source language models. We find our method increases bias in all models, even those trained with safety guardrails. This demonstrates the need for further research in AI safety, and further work in this new adversarial space.
https://arxiv.org/abs/2405.04756
Attack knowledge graph construction seeks to convert textual cyber threat intelligence (CTI) reports into structured representations, portraying the evolutionary traces of cyber attacks. Even though previous research has proposed various methods to construct attack knowledge graphs, they generally suffer from limited generalization capability to diverse knowledge types as well as a requirement for expertise in model design and tuning. Addressing these limitations, we seek to utilize Large Language Models (LLMs), which have achieved enormous success in a broad range of tasks given their exceptional capabilities in both language understanding and zero-shot task fulfillment. Thus, we propose a fully automatic LLM-based framework to construct attack knowledge graphs, named AttacKG+. Our framework consists of four consecutive modules: rewriter, parser, identifier, and summarizer, each of which is implemented by instruction prompting and in-context learning empowered by LLMs. Furthermore, we upgrade the existing attack knowledge schema and propose a comprehensive version. We represent a cyber attack as a temporally unfolding event, each temporal step of which encapsulates three layers of representation, including behavior graph, MITRE TTP labels, and state summary. Extensive evaluation demonstrates that: 1) our formulation seamlessly satisfies the information needs in threat event analysis, 2) our construction framework is effective in faithfully and accurately extracting the information defined by AttacKG+, and 3) our attack graph directly benefits downstream security practices such as attack reconstruction. All the code and datasets will be released upon acceptance.
https://arxiv.org/abs/2405.04753
Large Language Models (LLMs) deployed on edge devices learn through fine-tuning and updating a certain portion of their parameters. Although such learning methods can be optimized to reduce resource utilization, the overall required resources remain a heavy burden on edge devices. Instead, Retrieval-Augmented Generation (RAG), a resource-efficient LLM learning method, can improve the quality of the LLM-generated content without updating model parameters. However, the RAG-based LLM may involve repetitive searches on the profile data in every user-LLM interaction. This search can lead to significant latency along with the accumulation of user data. Conventional efforts to decrease latency result in restricting the size of saved user data, thus reducing the scalability of RAG as user data continuously grows. It remains an open question: how to free RAG from the constraints of latency and scalability on edge devices? In this paper, we propose a novel framework to accelerate RAG via Computing-in-Memory (CiM) architectures. It accelerates matrix multiplications by performing in-situ computation inside the memory while avoiding the expensive data transfer between the computing unit and memory. Our framework, Robust CiM-backed RAG (RoCR), utilizing a novel contrastive learning-based training method and noise-aware training, can enable RAG to efficiently search profile data with CiM. To the best of our knowledge, this is the first work utilizing CiM to accelerate RAG.
https://arxiv.org/abs/2405.04700
Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages. This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations, with a special focus on Turkish. We conduct an in-depth analysis to evaluate the impact of training strategies, model choices, and data availability on the performance of LLMs designed for underrepresented languages. Our approach includes two methodologies: (i) adapting existing LLMs originally pretrained in English to understand Turkish, and (ii) developing a model from the ground up using Turkish pretraining data, both supplemented with supervised fine-tuning on a novel Turkish instruction-tuning dataset aimed at enhancing reasoning capabilities. The relative performance of these methods is evaluated through the creation of a new leaderboard for Turkish LLMs, featuring benchmarks that assess different reasoning and knowledge skills. Furthermore, we conducted experiments on data and model scaling, both during pretraining and fine-tuning, simultaneously emphasizing the capacity for knowledge transfer across languages and addressing the challenges of catastrophic forgetting encountered during fine-tuning on a different language. Our goal is to offer a detailed guide for advancing the LLM framework in low-resource linguistic contexts, thereby making natural language processing (NLP) benefits more globally accessible.
https://arxiv.org/abs/2405.04685
Large language models (LLMs) have demonstrated substantial commonsense understanding through numerous benchmark evaluations. However, their understanding of cultural commonsense remains largely unexamined. In this paper, we conduct a comprehensive examination of the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks. Using several general and cultural commonsense benchmarks, we find that (1) LLMs have a significant discrepancy in performance when tested on culture-specific commonsense knowledge for different cultures; (2) LLMs' general commonsense capability is affected by cultural context; and (3) The language used to query the LLMs can impact their performance on cultural-related tasks. Our study points to the inherent bias in the cultural understanding of LLMs and provides insights that can help develop culturally aware language models.
https://arxiv.org/abs/2405.04655
We propose a novel tensor network language model based on the simplest tensor network (i.e., tensor trains), called the `Tensor Train Language Model' (TTLM). TTLM represents sentences in an exponential space constructed by the tensor product of words, while computing the probabilities of sentences in a low-dimensional fashion. We demonstrate that the architectures of Second-order RNNs, Recurrent Arithmetic Circuits (RACs), and Multiplicative Integration RNNs are, essentially, special cases of TTLM. Experimental evaluations on real language modeling tasks show that the proposed variants of TTLM (i.e., TTLM-Large and TTLM-Tiny) outperform vanilla Recurrent Neural Networks (RNNs) when the scale of hidden units is low. (The code is available at this https URL.)
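To give a feel for the mechanism, the toy scorer below contracts one (rank x rank) core per word along the sentence: sequence scores implicitly live in the exponential tensor-product space, yet each step is only a low-dimensional matrix product. This illustrates the tensor-train idea only; TTLM's actual parameterization and probability normalization differ:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, rank = 10, 4
cores = rng.normal(0, 0.5, size=(vocab_size, rank, rank))  # one core per word
alpha = rng.normal(size=rank)                              # left boundary vector
omega = rng.normal(size=rank)                              # right boundary vector

def tt_score(token_ids):
    """Chain of rank x rank contractions: cost is O(len * rank^2), although
    the implicit representation space has dimension vocab_size ** len."""
    h = alpha
    for t in token_ids:
        h = h @ cores[t]
    return h @ omega

print(tt_score([1, 3, 5, 2]))
```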
https://arxiv.org/abs/2405.04590
Numerous methods have been proposed to detect, estimate, and analyze properties of people in images, including the estimation of 3D pose, shape, contact, human-object interaction, emotion, and more. Each of these methods works in isolation instead of synergistically. Here we address this problem and build a language-driven human understanding system -- ChatHuman, which combines and integrates the skills of many different methods. To do so, we finetune a Large Language Model (LLM) to select and use a wide variety of existing tools in response to user inputs. In doing so, ChatHuman is able to combine information from multiple tools to solve problems more accurately than the individual tools themselves and to leverage tool output to improve its ability to reason about humans. The novel features of ChatHuman include leveraging academic publications to guide the application of 3D human-related tools, employing a retrieval-augmented generation model to generate in-context-learning examples for handling new tools, and discriminating and integrating tool results to enhance 3D human understanding. Our experiments show that ChatHuman outperforms existing models in both tool selection accuracy and performance across multiple 3D human-related tasks. ChatHuman is a step towards consolidating diverse methods for human analysis into a single, powerful, system for 3D human reasoning.
https://arxiv.org/abs/2405.04533