Autonomous agents powered by large vision and language models (VLMs) have demonstrated significant potential in completing daily computer tasks, such as browsing the web to book travel and operating desktop software, which requires agents to understand these interfaces. Despite such visual inputs becoming more integrated into agentic applications, the types of risks and attacks that exist around them remain unclear. In this work, we demonstrate that VLM agents can be easily attacked by a set of carefully designed adversarial pop-ups, which human users would typically recognize and ignore. This distraction leads agents to click these pop-ups instead of performing the tasks as usual. Integrating these pop-ups into existing agent testing environments like OSWorld and VisualWebArena leads to an attack success rate (the frequency of the agent clicking the pop-ups) of 86% on average and decreases the task success rate by 47%. Basic defense techniques, such as asking the agent to ignore pop-ups or including an advertisement notice, are ineffective against the attack.
https://arxiv.org/abs/2411.02391
Decoder-only Transformers often struggle with complex reasoning tasks, particularly arithmetic reasoning requiring multiple sequential operations. In this work, we identify representation collapse in the model's intermediate layers as a key factor limiting their reasoning capabilities. To address this, we propose Sequential Variance-Covariance Regularization (Seq-VCR), which enhances the entropy of intermediate representations and prevents collapse. Combined with dummy pause tokens as substitutes for chain-of-thought (CoT) tokens, our method significantly improves performance in arithmetic reasoning problems. In the challenging $5 \times 5$ integer multiplication task, our approach achieves $99.5\%$ exact match accuracy, outperforming models of the same size (which yield $0\%$ accuracy) and GPT-4 with five-shot CoT prompting ($44\%$). We also demonstrate superior results on arithmetic expression and longest increasing subsequence (LIS) datasets. Our findings highlight the importance of preventing intermediate layer representation collapse to enhance the reasoning capabilities of Transformers and show that Seq-VCR offers an effective solution without requiring explicit CoT supervision.
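The regularizer's core idea can be sketched with a VICReg-style variance-covariance penalty on intermediate representations: push per-dimension variance up (anti-collapse) and off-diagonal covariance down (decorrelation). The hinge form and loss weighting below are assumptions for illustration, not Seq-VCR's exact implementation.

```python
import numpy as np

def seq_vcr_penalty(h, gamma=1.0):
    """Variance-covariance penalty on a batch of intermediate
    representations h with shape (batch, dim). Illustrative sketch:
    penalize dimensions whose standard deviation falls below gamma
    (collapse) and penalize off-diagonal covariance (redundancy)."""
    h = h - h.mean(axis=0, keepdims=True)
    n, d = h.shape
    var = h.var(axis=0)
    # Hinge on the standard deviation: only under-dispersed dims are penalized.
    var_loss = np.mean(np.maximum(0.0, gamma - np.sqrt(var + 1e-8)))
    cov = (h.T @ h) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = (off_diag ** 2).sum() / d
    return var_loss + cov_loss
```

A fully collapsed batch (all rows identical) incurs a penalty near `gamma`, while well-spread, decorrelated representations incur almost none, which is the gradient signal the method relies on.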
https://arxiv.org/abs/2411.02344
Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks. However, existing LLM web agents heavily rely on expensive proprietary LLM APIs, while open LLMs lack the necessary decision-making capabilities. This paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open LLMs. WebRL addresses three key challenges in building LLM web agents, including the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. Specifically, WebRL incorporates 1) a self-evolving curriculum that generates new tasks from unsuccessful attempts, 2) a robust outcome-supervised reward model (ORM), and 3) adaptive reinforcement learning strategies to ensure consistent improvements. We apply WebRL to transform open Llama-3.1 and GLM-4 models into proficient web agents. On WebArena-Lite, WebRL improves the success rate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM-4-9B. These open models significantly surpass the performance of GPT-4-Turbo (17.6%) and GPT-4o (13.9%) and outperform previous state-of-the-art web agents trained on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL's effectiveness in bridging the gap between open and proprietary LLM-based web agents, paving the way for more accessible and powerful autonomous web interaction systems.
https://arxiv.org/abs/2411.02337
Activation sparsity denotes the existence of substantial weakly-contributing elements within activation outputs that can be eliminated, benefiting many important applications concerned with large language models (LLMs). Although promoting greater activation sparsity within LLMs deserves deep studies, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study on the quantitative scaling properties and influential factors of the activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. Firstly, different activation functions exhibit comparable performance but opposite training-time sparsity trends. The activation ratio (i.e., $1-\mathrm{sparsity\ ratio}$) evolves as a convergent increasing power-law and decreasing logspace power-law with the amount of training data for SiLU-activated and ReLU-activated LLMs, respectively. These demonstrate that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity. Secondly, the activation ratio linearly increases with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies weakly with the parameter scale, i.e., the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws towards LLMs with greater activation sparsity have important implications for making LLMs more efficient and interpretable.
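The activation-ratio notion can be illustrated with a toy threshold-based measure; the fixed `eps` cutoff below is a simplification standing in for the paper's performance-aware PPL-p% metric, which instead selects the largest threshold that keeps perplexity within p% of the baseline.

```python
import numpy as np

def activation_ratio(acts, eps=1e-3):
    """Fraction of activation entries whose magnitude exceeds eps;
    the sparsity ratio is 1 - activation_ratio. Toy stand-in for a
    performance-aware threshold choice."""
    return float(np.mean(np.abs(acts) > eps))

# ReLU zeroes negative pre-activations exactly, while SiLU only shrinks
# them, so the same pre-activations yield a much lower activation ratio
# (greater sparsity) under ReLU than under SiLU.
def relu(x):
    return np.maximum(x, 0.0)

def silu(x):
    return x / (1.0 + np.exp(-x))
```

On standard-normal pre-activations, ReLU leaves roughly half the entries nonzero while SiLU leaves nearly all of them above the threshold, matching the abstract's point that ReLU is the more sparsity-friendly activation.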
https://arxiv.org/abs/2411.02335
Code large language models (LLMs) have made significant progress in code debugging by directly generating the correct code based on the buggy code snippet. Programming benchmarks, typically consisting of buggy code snippets and their associated test cases, are used to assess the debugging capabilities of LLMs. However, many existing benchmarks primarily focus on Python and are often limited in terms of language diversity (e.g., DebugBench and DebugEval). To advance the field of multilingual debugging with LLMs, we propose the first massively multilingual debugging benchmark, which includes 3.6K test samples across 18 programming languages and covers the automated program repair (APR) task, the code review (CR) task, and the bug identification (BI) task. Further, we introduce the debugging instruction corpus MDEVAL-INSTRUCT by injecting bugs into correct multilingual queries and solutions (xDebugGen). We also train a multilingual debugger, xDebugCoder, on MDEVAL-INSTRUCT as a strong baseline designed specifically to handle bugs across a wide range of programming languages (e.g., "Missing Mut" in Rust and "Misused Macro Definition" in C). Our extensive experiments on MDEVAL reveal a notable performance gap between open-source models and closed-source LLMs (e.g., the GPT and Claude series), highlighting huge room for improvement in multilingual code debugging scenarios.
https://arxiv.org/abs/2411.02310
Large language models (LLMs) exhibit remarkable capabilities on not just language tasks, but also various tasks that are not linguistic in nature, such as logical reasoning and social inference. In the human brain, neuroscience has identified a core language system that selectively and causally supports language processing. We here ask whether similar specialization for language emerges in LLMs. We identify language-selective units within 18 popular LLMs, using the same localization approach that is used in neuroscience. We then establish the causal role of these units by demonstrating that ablating LLM language-selective units -- but not random units -- leads to drastic deficits in language tasks. Correspondingly, language-selective LLM units are more aligned to brain recordings from the human language system than random units. Finally, we investigate whether our localization method extends to other cognitive domains: while we find specialized networks in some LLMs for reasoning and social capabilities, there are substantial differences among models. These findings provide functional and causal evidence for specialization in large language models, and highlight parallels with the functional organization in the brain.
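The localize-then-ablate logic can be sketched as follows; the simple mean-difference selectivity contrast and the `top_frac` cutoff are schematic assumptions, standing in for the neuroscience functional localizer applied to real model activations.

```python
import numpy as np

def language_selective_units(lang_acts, other_acts, top_frac=0.1):
    """Rank units by a selectivity contrast (mean response to language
    inputs minus mean response to control inputs) and return the
    indices of the top fraction. Inputs have shape (samples, units)."""
    contrast = lang_acts.mean(axis=0) - other_acts.mean(axis=0)
    k = max(1, int(top_frac * contrast.size))
    return np.argsort(contrast)[::-1][:k]

def ablate(acts, units):
    """Zero out the selected units (columns), mimicking the causal
    ablation used to test whether they support language tasks."""
    out = acts.copy()
    out[:, units] = 0.0
    return out
```

In the paper's setting, ablating the units found this way degrades language performance drastically while ablating random units does not; the sketch only shows the bookkeeping, not the behavioral evaluation.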
https://arxiv.org/abs/2411.02280
This study examines the impact of DevOps practices on enterprise software delivery success, focusing on enhancing R&D efficiency and source code management (SCM). Using a qualitative methodology, data were collected from case studies of large-scale enterprises implementing DevOps to explore how these practices streamline software development processes. Findings reveal that DevOps significantly improves R&D productivity by fostering cross-functional collaboration, reducing development cycle times, and enhancing software quality through effective SCM practices, such as version control and continuous integration. Additionally, SCM tools within DevOps enable precise change tracking and reliable code maintenance, further supporting faster, more robust software delivery. However, the study identifies challenges, including cultural resistance and tool integration issues, that can hinder DevOps implementation. This research contributes to the growing body of DevOps literature by highlighting the role of R&D efficiency and SCM as crucial factors for software delivery success. Future studies should investigate these factors across diverse industries to validate the findings.
https://arxiv.org/abs/2411.02209
Designing and displaying haptic signals with sensory and emotional attributes can improve the user experience in various applications. Free-form user language provides rich sensory and emotional information for haptic design (e.g., ``This signal feels smooth and exciting''), but little work exists on linking user descriptions to haptic signals (i.e., language grounding). To address this gap, we conducted a study where 12 users described the feel of 32 signals perceived on a surface haptics (i.e., electrovibration) display. We developed a computational pipeline using natural language processing (NLP) techniques, such as GPT-3.5 Turbo and word embedding methods, to extract sensory and emotional keywords and group them into semantic clusters (i.e., concepts). We linked the keyword clusters to haptic signal features (e.g., pulse count) using correlation analysis. The proposed pipeline demonstrates the viability of a computational approach to analyzing haptic experiences. We discuss our future plans for creating a predictive model of haptic experience.
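The final correlation-analysis step might look roughly like this, assuming per-signal keyword-cluster usage scores and signal features have already been tabulated; the names and shapes are illustrative, not the study's actual pipeline code.

```python
import numpy as np

def link_clusters_to_features(cluster_scores, feature_values):
    """Pearson correlation between each keyword-cluster usage score and
    each haptic signal feature across signals.
    cluster_scores: (signals, clusters); feature_values: (signals, features).
    Returns a (clusters, features) correlation matrix."""
    cs = cluster_scores - cluster_scores.mean(axis=0)
    fv = feature_values - feature_values.mean(axis=0)
    num = cs.T @ fv
    den = np.outer(np.linalg.norm(cs, axis=0), np.linalg.norm(fv, axis=0))
    return num / den
```

A large positive entry would suggest, for example, that signals with higher pulse counts attract more "rough"-cluster descriptions; the study draws exactly this kind of link from its correlation analysis.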
https://arxiv.org/abs/2411.02118
The evaluation of layer importance in deep learning has been an active area of research, with significant implications for model optimization and interpretability. Recently, large language models (LLMs) have gained prominence across various domains, yet limited studies have explored the functional importance and performance contributions of individual layers within LLMs, especially from the perspective of activation distribution. In this work, we propose the Activation Variance-Sparsity Score (AVSS), a novel metric combining normalized activation variance and sparsity to assess each layer's contribution to model performance. By identifying and removing approximately the lowest 25% of layers based on AVSS, we achieve over 90% of original model performance across tasks such as question answering, language modeling, and sentiment classification, indicating that these layers may be non-essential. Our approach provides a systematic method for identifying less critical layers, contributing to efficient large language model architectures.
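One plausible reading of the AVSS metric is sketched below, scoring each layer by activation variance relative to its sparsity and normalizing across layers; the exact combination and normalization are assumptions, and the paper's formula may differ.

```python
import numpy as np

def avss(layer_acts, eps=1e-8):
    """Activation Variance-Sparsity Score sketch for a list of per-layer
    activation matrices: high variance and low sparsity yield a high
    score; the lowest-scoring ~25% of layers would be removal candidates."""
    var = np.array([a.var() for a in layer_acts])
    sparsity = np.array([np.mean(np.abs(a) < 1e-3) for a in layer_acts])
    score = var / (sparsity + eps)
    return score / (score.sum() + eps)  # normalize to a distribution over layers
```

Under this sketch, a layer whose activations are nearly all zero contributes almost nothing to the normalized score and would be pruned first.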
https://arxiv.org/abs/2411.02117
Trained on vast corpora of human language, language models demonstrate emergent human-like reasoning abilities. Yet they are still far from true intelligence, which opens up intriguing opportunities to explore the parallels of humans and model behaviors. In this work, we study the ability to skip steps in reasoning - a hallmark of human expertise developed through practice. Unlike humans, who may skip steps to enhance efficiency or to reduce cognitive load, models do not inherently possess such motivations to minimize reasoning steps. To address this, we introduce a controlled framework that stimulates step-skipping behavior by iteratively refining models to generate shorter and accurate reasoning paths. Empirical results indicate that models can develop the step skipping ability under our guidance. Moreover, after fine-tuning on expanded datasets that include both complete and skipped reasoning sequences, the models can not only resolve tasks with increased efficiency without sacrificing accuracy, but also exhibit comparable and even enhanced generalization capabilities in out-of-domain scenarios. Our work presents the first exploration into human-like step-skipping ability and provides fresh perspectives on how such cognitive abilities can benefit AI models.
https://arxiv.org/abs/2411.01855
Accurate annotation of educational resources is critical in the rapidly advancing field of online education due to the complexity and volume of content. Existing classification methods face challenges with semantic overlap and distribution imbalance of labels in the multi-label context, which impedes effective personalized learning and resource recommendation. This paper introduces RR2QC, a novel Retrieval Reranking method for multi-label Question Classification that leverages label semantics and meta-label refinement. Firstly, RR2QC leverages semantic relationships within and across label groups to enhance pre-training strategies in the multi-label context. Next, a class center learning task is introduced, integrating label texts into downstream training to ensure questions consistently align with label semantics, retrieving the most relevant label sequences. Finally, this method decomposes labels into meta-labels and trains a meta-label classifier to rerank the retrieved label sequences. In doing so, RR2QC enhances the understanding and prediction capability of long-tail labels by learning from meta-labels frequently appearing in other labels. Additionally, a Math LLM is used to generate solutions for questions, extracting latent information to further refine the model's insights. Experimental results demonstrate that RR2QC outperforms existing classification methods in Precision@k and F1 scores across multiple educational datasets, establishing it as a potent enhancement for online educational content utilization.
https://arxiv.org/abs/2411.01841
Discontinuous Named Entity Recognition (DNER) presents a challenging problem where entities may be scattered across multiple non-adjacent tokens, making traditional sequence labelling approaches inadequate. Existing methods predominantly rely on custom tagging schemes to handle these discontinuous entities, resulting in models tightly coupled to specific tagging strategies and lacking generalisability across diverse datasets. To address these challenges, we propose TriG-NER, a novel Triplet-Grid Framework that introduces a generalisable approach to learning robust token-level representations for discontinuous entity extraction. Our framework applies triplet loss at the token level, where similarity is defined by word pairs existing within the same entity, effectively pulling together similar and pushing apart dissimilar ones. This approach enhances entity boundary detection and reduces the dependency on specific tagging schemes by focusing on word-pair relationships within a flexible grid structure. We evaluate TriG-NER on three benchmark DNER datasets and demonstrate significant improvements over existing grid-based architectures. These results underscore our framework's effectiveness in capturing complex entity structures and its adaptability to various tagging schemes, setting a new benchmark for discontinuous entity extraction.
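The token-level triplet objective at the heart of the framework can be sketched minimally as follows; the candidate-mining strategy over the word-pair grid, which the paper treats as central, is omitted here.

```python
import numpy as np

def token_triplet_loss(emb, anchor, positive, negative, margin=1.0):
    """Token-level triplet loss: pull the anchor token's embedding
    toward a token from the same entity (positive) and away from a
    token outside it (negative). emb has shape (num_tokens, dim)."""
    d_pos = np.linalg.norm(emb[anchor] - emb[positive])
    d_neg = np.linalg.norm(emb[anchor] - emb[negative])
    # Hinge: zero loss once the negative is at least `margin` farther away.
    return max(0.0, d_pos - d_neg + margin)
```

Because similarity is defined by co-membership in an entity rather than by a tagging scheme, the same objective transfers across datasets with different discontinuity patterns, which is the generalizability claim the abstract makes.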
https://arxiv.org/abs/2411.01839
While textless Spoken Language Models (SLMs) have shown potential in end-to-end speech-to-speech modeling, they still lag behind text-based Large Language Models (LLMs) in terms of semantic coherence and relevance. This work introduces the Align-SLM framework, which leverages preference optimization inspired by Reinforcement Learning with AI Feedback (RLAIF) to enhance the semantic understanding of SLMs. Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation. Experimental results show that our method achieves state-of-the-art performance for SLMs on most benchmarks, highlighting the importance of preference optimization to improve the semantics of SLMs.
https://arxiv.org/abs/2411.01834
This paper investigates supervised fine-tuning of large language models (LLMs) to improve their pedagogical alignment in computing education, addressing concerns that LLMs may hinder learning outcomes. The project utilised a proprietary dataset of 2,500 high-quality question/answer pairs from programming course forums, and explores two research questions: the suitability of university course forums in contributing to fine-tuning datasets, and how supervised fine-tuning can improve LLMs' alignment with educational principles such as constructivism. Initial findings suggest benefits in the pedagogical alignment of LLMs, with deeper evaluations required.
https://arxiv.org/abs/2411.01765
Existing LLM agent systems typically select actions from a fixed and predefined set at every step. While this approach is effective in closed, narrowly-scoped environments, we argue that it presents two major challenges when deploying LLM agents in real-world scenarios: (1) selecting from a fixed set of actions significantly restricts the planning and acting capabilities of LLM agents, and (2) this approach requires substantial human effort to enumerate and implement all possible actions, which becomes impractical in complex environments with a vast number of potential actions. In this work, we propose an LLM agent framework that enables the dynamic creation and composition of actions in an online manner. In this framework, the agent interacts with the environment by generating and executing programs written in a general-purpose programming language at each step. Furthermore, generated actions are accumulated over time for future reuse. Our extensive experiments on the GAIA benchmark demonstrate that this framework offers significantly greater flexibility and outperforms previous methods. Notably, it allows an LLM agent to recover in scenarios where no relevant action exists in the predefined set or when existing actions fail due to unforeseen edge cases. At the time of writing, we hold the top position on the GAIA public leaderboard. Our code can be found in \href{this https URL}{this https URL}.
https://arxiv.org/abs/2411.01747
Spurred by the demand for interpretable models, research on eXplainable AI for language technologies has experienced significant growth, with feature attribution methods emerging as a cornerstone of this progress. While prior work in NLP explored such methods for classification tasks and textual applications, explainability intersecting generation and speech is lagging, with existing techniques failing to account for the autoregressive nature of state-of-the-art models and to provide fine-grained, phonetically meaningful explanations. We address this gap by introducing Spectrogram Perturbation for Explainable Speech-to-text Generation (SPES), a feature attribution technique applicable to sequence generation tasks with autoregressive models. SPES provides explanations for each predicted token based on both the input spectrogram and the previously generated tokens. Extensive evaluation on speech recognition and translation demonstrates that SPES generates explanations that are faithful and plausible to humans.
https://arxiv.org/abs/2411.01710
Complex Word Identification (CWI) is an essential step in the lexical simplification task and has recently become a task on its own. Some variations of this binary classification task have emerged, such as lexical complexity prediction (LCP) and complexity evaluation of multi-word expressions (MWE). Large language models (LLMs) recently became popular in the Natural Language Processing community because of their versatility and capability to solve unseen tasks in zero/few-shot settings. Our work investigates LLM usage, specifically open-source models such as Llama 2, Llama 3, and Vicuna v1.5, and closed-source, such as ChatGPT-3.5-turbo and GPT-4o, in the CWI, LCP, and MWE settings. We evaluate zero-shot, few-shot, and fine-tuning settings and show that LLMs struggle in certain conditions or achieve comparable results against existing methods. In addition, we provide some views on meta-learning combined with prompt learning. In the end, we conclude that the current state of LLMs cannot or barely outperform existing methods, which are usually much smaller.
https://arxiv.org/abs/2411.01706
Despite significant advancements, large language models (LLMs) still struggle with providing accurate answers when lacking domain-specific or up-to-date knowledge. Retrieval-Augmented Generation (RAG) addresses this limitation by incorporating external knowledge bases, but it also introduces new attack surfaces. In this paper, we investigate data extraction attacks targeting the knowledge databases of RAG systems. We demonstrate that previous attacks on RAG largely depend on the instruction-following capabilities of LLMs, and that simple fine-tuning can reduce the success rate of such attacks to nearly zero. This makes these attacks impractical since fine-tuning is a common practice when deploying LLMs in specific domains. To further reveal the vulnerability, we propose to backdoor RAG, where a small portion of poisoned data is injected during the fine-tuning phase to create a backdoor within the LLM. When this compromised LLM is integrated into a RAG system, attackers can exploit specific triggers in prompts to manipulate the LLM to leak documents from the retrieval database. By carefully designing the poisoned data, we achieve both verbatim and paraphrased document extraction. We show that with only 3\% poisoned data, our method achieves an average success rate of 79.7\% in verbatim extraction on Llama2-7B, with a ROUGE-L score of 64.21, and a 68.6\% average success rate in paraphrased extraction, with an average ROUGE score of 52.6 across four datasets. These results underscore the privacy risks associated with the supply chain when deploying RAG systems.
https://arxiv.org/abs/2411.01705
This research investigates the implementation of a real-time, microservices-oriented dynamic pricing system for the travel sector. The system is designed to respond to demand, competitor pricing, and other external factors in real time. Both controlled simulation and real-life application showed a respectable 22% gain in revenue generation and a 17% improvement in pricing response time, addressing the scaling and flexibility issues of classical pricing mechanisms. Demand forecasting, competitor pricing strategies, and event-based pricing were implemented as separate microservices to enhance scalability and reduce resource consumption by 30% during peak loads. Customers were also more satisfied, as reflected in a 15% increase in satisfaction scores post-implementation, owing to more appropriate pricing. This research enhances the existing literature with practical illustrations of how microservices technology can be applied to develop dynamic pricing solutions in a complex, data-driven context. Areas for improvement remain, for instance inter-service latency and the need for extensive real-time data pipelines. The research further suggests combining these systems with direct capture of customer-behavior data, alongside machine-learning advances in pricing algorithms, to support more accurate real-time pricing. It concludes that microservices are a reasonable and efficient model for dynamic pricing, allowing the tourism sector to employ evidence-based, customer-centric pricing techniques without jeopardizing profits while meeting customer needs.
https://arxiv.org/abs/2411.01636
Contrastive decoding (CD) (Li et al., 2023) improves the next-token distribution of a large expert language model (LM) using a small amateur LM. Although CD is applied to various LMs and domains to enhance open-ended text generation, it is still unclear why CD often works well, when it could fail, and how we can make it better. To deepen our understanding of CD, we first theoretically prove that CD could be viewed as linearly extrapolating the next-token logits from a huge and hypothetical LM. We also highlight that the linear extrapolation could make CD unable to output the most obvious answers that have already been assigned high probabilities by the amateur LM. To overcome CD's limitation, we propose a new unsupervised decoding method called $\mathbf{A}$symptotic $\mathbf{P}$robability $\mathbf{D}$ecoding (APD). APD explicitly extrapolates the probability curves from the LMs of different sizes to infer the asymptotic probabilities from an infinitely large LM without inducing more inference costs than CD. In FactualityPrompts, an open-ended text generation benchmark, sampling using APD significantly boosts factuality in comparison to the CD sampling and its variants, and achieves state-of-the-art results for Pythia 6.9B and OPT 6.7B. Furthermore, in five commonsense QA datasets, APD is often significantly better than CD and achieves a similar effect of using a larger LLM. For example, the perplexity of APD on top of Pythia 6.9B is even lower than the perplexity of Pythia 12B in CommonsenseQA and LAMBADA.
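The linear-extrapolation view of CD, and the failure mode it implies, can be made concrete in a few lines; the `beta` parameterization is one common form of CD, and APD's fitting of probability curves across model sizes is not shown.

```python
import numpy as np

def contrastive_decoding_logits(expert, amateur, beta=1.0):
    """Contrastive decoding as linear extrapolation of next-token logits:
    start at the amateur's logits and move past the expert's,
    (1 + beta) * expert - beta * amateur. With beta = 1 this reduces to
    2 * expert - amateur, i.e., extrapolating toward a hypothetically
    larger LM along the amateur-to-expert direction."""
    return (1.0 + beta) * np.asarray(expert) - beta * np.asarray(amateur)
```

The extrapolation can demote an obvious answer that the amateur already rates highly: if the amateur assigns its top logit to the same token as the expert, subtracting it can flip CD's argmax to a different token, which is exactly the limitation APD is designed to avoid.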
https://arxiv.org/abs/2411.01610