Despite notable successes of Reinforcement Learning (RL), the prevalent use of an online learning paradigm prevents its widespread adoption, especially in hazardous or costly scenarios. Offline RL has emerged as an alternative solution, learning from pre-collected static datasets. However, this offline learning introduces a new challenge known as distributional shift, degrading the performance when the policy is evaluated on scenarios that are Out-Of-Distribution (OOD) from the training dataset. Most existing offline RL methods resolve this issue by regularizing policy learning within the information supported by the given dataset. However, such regularization overlooks the potential for high-reward regions that may exist beyond the dataset. This motivates exploring novel offline learning techniques that can make improvements beyond the data support without compromising policy performance, potentially by learning causation (cause-and-effect) instead of correlation from the dataset. In this paper, we propose the MOOD-CRL (Model-based Offline OOD-Adapting Causal RL) algorithm, which aims to address the challenge of extrapolation for offline policy training through causal inference instead of policy-regularizing methods. Specifically, a Causal Normalizing Flow (CNF) is developed to learn the transition and reward functions for data generation and augmentation in offline policy evaluation and training. Based on the data-invariant, physics-based qualitative causal graph and the observational data, we develop a novel learning scheme for CNF to learn the quantitative structural causal model. As a result, CNF gains predictive and counterfactual reasoning capabilities for sequential decision-making tasks, revealing a high potential for OOD adaptation. Our CNF-based offline RL approach is validated through empirical evaluations, outperforming model-free and model-based methods by a significant margin.
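To make the flow-based transition model concrete, here is a minimal sketch, assuming a single conditional affine flow trained by maximum likelihood to model p(s' | s, a); the causal-graph-based structure that MOOD-CRL actually learns is omitted, and all class and variable names are illustrative rather than taken from the paper.

```python
# A minimal sketch of a conditional affine normalizing flow for a transition
# model p(s' | s, a), in the spirit of the CNF component described above.
# The causal-graph masking of MOOD-CRL is omitted; names are illustrative.
import torch
import torch.nn as nn

class ConditionalAffineFlow(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        # Conditioner maps (s, a) to per-dimension scale and shift.
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * state_dim),
        )

    def log_prob(self, next_state, state, action):
        scale, shift = self.net(torch.cat([state, action], -1)).chunk(2, -1)
        z = (next_state - shift) * torch.exp(-scale)          # inverse map
        base = torch.distributions.Normal(0.0, 1.0)
        # log p(s') = log N(z) + log|det dz/ds'|
        return (base.log_prob(z) - scale).sum(-1)

    def sample(self, state, action):
        scale, shift = self.net(torch.cat([state, action], -1)).chunk(2, -1)
        z = torch.randn_like(shift)
        return z * torch.exp(scale) + shift                   # forward map

flow = ConditionalAffineFlow(state_dim=4, action_dim=2)
s, a, s_next = torch.randn(8, 4), torch.randn(8, 2), torch.randn(8, 4)
loss = -flow.log_prob(s_next, s, a).mean()                    # max likelihood
loss.backward()
```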
https://arxiv.org/abs/2405.03892
This paper presents a framework for learning state and action abstractions in sequential decision-making domains. Our framework, planning abstraction from language (PARL), utilizes language-annotated demonstrations to automatically discover a symbolic and abstract action space and induce a latent state abstraction based on it. PARL consists of three stages: 1) recovering object-level and action concepts, 2) learning state abstractions, abstract action feasibility, and transition models, and 3) applying low-level policies for abstract actions. During inference, given the task description, PARL first makes abstract action plans using the latent transition and feasibility functions, then refines the high-level plan using low-level policies. PARL generalizes across scenarios involving novel object instances and environments, unseen concept compositions, and tasks that require longer planning horizons than settings it is trained on.
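A toy sketch of the inference-time planning loop described above, assuming learned feasibility and transition functions over a small discrete abstract action set; all functions here are illustrative stand-ins, not PARL's actual models.

```python
# Breadth-first search over abstract action sequences using hypothetical
# learned feasibility and transition models, then refinement by low-level
# policies. Stand-ins below use toy integer "latent states".
from collections import deque

def feasible(state, action):      # stand-in for the learned feasibility model
    return (state + action) % 5 != 0

def transition(state, action):    # stand-in for the learned latent transition
    return state + action

def plan(start, goal, actions=(1, 2, 3), max_depth=6):
    queue = deque([(start, [])])
    while queue:
        state, seq = queue.popleft()
        if state == goal:
            return seq            # abstract plan, refined by low-level policies
        if len(seq) < max_depth:
            for a in actions:
                if feasible(state, a):
                    queue.append((transition(state, a), seq + [a]))
    return None

print(plan(0, 7))                 # an abstract action sequence reaching the goal
```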
https://arxiv.org/abs/2405.03864
Zero-shot learning has been extensively investigated in the broader field of visual recognition, attracting significant interest recently. However, the current work on zero-shot learning in document image classification remains scarce. The existing studies either focus exclusively on zero-shot inference, or their evaluation does not align with the established criteria of zero-shot evaluation in the visual recognition domain. We provide a comprehensive document image classification analysis in Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL) settings to address this gap. Our methodology and evaluation align with the established practices of this domain. Additionally, we propose zero-shot splits for the RVL-CDIP dataset. Furthermore, we introduce CICA (pronounced 'ki-ka'), a framework that enhances the zero-shot learning capabilities of CLIP. CICA consists of a novel 'content module' designed to leverage any generic document-related textual information. The discriminative features extracted by this module are aligned with CLIP's text and image features using a novel 'coupled-contrastive' loss. Our module improves CLIP's ZSL top-1 accuracy by 6.7% and GZSL harmonic mean by 24% on the RVL-CDIP dataset. Our module is lightweight and adds only 3.3% more parameters to CLIP. Our work sets the direction for future research in zero-shot document classification.
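As a rough illustration of how the content module's features might be coupled to CLIP's two modalities, here is a sketch using a symmetric InfoNCE-style objective; the exact form of the paper's 'coupled-contrastive' loss may differ, and the feature dimensions are assumed.

```python
# A hedged sketch of aligning a content module's features with image and text
# embeddings via an InfoNCE-style contrastive loss, applied to both modalities.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(a.size(0))         # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

def coupled_contrastive(content_feat, image_feat, text_feat):
    # Couple the content features to both CLIP modalities.
    return info_nce(content_feat, image_feat) + info_nce(content_feat, text_feat)

c = torch.randn(16, 512)   # content-module output
i = torch.randn(16, 512)   # CLIP image features
t = torch.randn(16, 512)   # CLIP text features
loss = coupled_contrastive(c, i, t)
```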
https://arxiv.org/abs/2405.03660
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity. We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method and sparse pretraining of those models on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset. We exhibit training acceleration due to sparsity on Cerebras CS-3 chips that closely matches theoretical scaling. In addition, we establish inference acceleration of up to 3x on CPUs by utilizing Neural Magic's DeepSparse engine and 1.7x on GPUs through Neural Magic's nm-vllm engine. The above gains are realized via sparsity alone, thus enabling further gains through additional use of quantization. Specifically, we show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x. We demonstrate these results across diverse, challenging tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization to prove their generality. This work paves the way for rapidly creating smaller and faster LLMs without sacrificing accuracy.
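SparseGPT performs Hessian-aware one-shot pruning; the sketch below shows only the simpler skeleton of that workflow under stated assumptions: form a one-shot unstructured mask at a target sparsity (here by weight magnitude, a stand-in for SparseGPT's criterion) and keep it fixed during subsequent sparse pretraining.

```python
# Simplified one-shot pruning to a target sparsity with a per-layer magnitude
# threshold; the mask is then frozen and re-applied during sparse pretraining.
import torch
import torch.nn as nn

def prune_to_sparsity(layer: nn.Linear, sparsity: float = 0.7):
    w = layer.weight.data
    k = int(sparsity * w.numel())
    threshold = w.abs().flatten().kthvalue(k).values   # k-th smallest |w|
    mask = (w.abs() > threshold).float()
    layer.weight.data.mul_(mask)                       # zero out pruned weights
    return mask                                        # kept fixed in training

layer = nn.Linear(256, 256)
mask = prune_to_sparsity(layer, 0.7)
print(f"sparsity: {1 - mask.mean().item():.2%}")
# During sparse pretraining, re-apply the mask after each optimizer step:
#   layer.weight.data.mul_(mask)
```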
https://arxiv.org/abs/2405.03594
Recent advancements in large language models (LLMs) have substantially enhanced their mathematical reasoning abilities. However, these models still struggle with complex problems that require multiple reasoning steps, frequently leading to logical or numerical errors. While numerical mistakes can largely be addressed by integrating a code interpreter, identifying logical errors within intermediate steps is more challenging. Moreover, manually annotating these steps for training is not only expensive but also demands specialized expertise. In this study, we introduce an innovative approach that eliminates the need for manual annotation by leveraging the Monte Carlo Tree Search (MCTS) framework to generate both the process supervision and evaluation signals automatically. Essentially, when an LLM is well pre-trained, only the mathematical questions and their final answers are required to generate our training data, without requiring the solutions. We proceed to train a step-level value model designed to improve the LLM's inference process in mathematical domains. Our experiments indicate that using automatically generated solutions by LLMs enhanced with MCTS significantly improves the model's proficiency in dealing with intricate mathematical reasoning tasks.
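A heavily simplified sketch of the underlying idea follows: since only questions and final answers are available, a step's value target can be estimated by sampling completions from the partial solution and measuring how often they reach the known answer; the full MCTS machinery (selection, expansion, backup) is abstracted away, and `sample_completion` is a hypothetical LLM call.

```python
# Monte Carlo estimation of step-level value targets from final answers only.
import random

def sample_completion(question, steps):
    # Placeholder for an LLM continuation; returns a final numeric answer.
    return random.choice([42, 41, 42, 42])

def step_value(question, steps, gold_answer, n_rollouts=16):
    hits = sum(sample_completion(question, steps) == gold_answer
               for _ in range(n_rollouts))
    return hits / n_rollouts   # Monte Carlo estimate of P(correct | steps)

v = step_value("What is 6*7?", ["6*7 = 6*7"], gold_answer=42)
print(v)  # training target for a step-level value model
```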
https://arxiv.org/abs/2405.03553
Transformers have recently gained prominence in long time series forecasting by elevating accuracies in a variety of use cases. Regrettably, in the race for better predictive performance, the overhead of model architectures has grown onerous, leading to models whose computational demands are infeasible for most practical applications. To bridge the gap between high method complexity and realistic computational resources, we introduce the Residual Cyclic Transformer, ReCycle. ReCycle utilizes primary cycle compression to address the computational complexity of the attention mechanism in long time series. By learning residuals from refined smoothing average techniques, ReCycle surpasses state-of-the-art accuracy in a variety of application use cases. The reliable and explainable fallback behavior ensured by simple, yet robust, smoothing average techniques additionally lowers the barrier for user acceptance. At the same time, our approach reduces the run time and energy consumption by more than an order of magnitude, making both training and inference feasible on low-performance, low-power and edge computing devices. Code is available at this https URL
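A minimal sketch of the residual idea, assuming a daily primary cycle: the cyclic profile is a simple per-position average, the learned model only has to fit the residual series, and the profile alone serves as the explainable fallback forecast.

```python
# Extract a primary-cycle profile with a simple average; the forecasting model
# then learns only the residuals, and the profile is the fallback prediction.
import numpy as np

def cyclic_profile(series, cycle=24):
    # Average each position within the cycle (e.g., hour-of-day).
    n = len(series) // cycle
    return series[:n * cycle].reshape(n, cycle).mean(axis=0)

rng = np.random.default_rng(0)
t = np.arange(24 * 30)
series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)

profile = cyclic_profile(series, cycle=24)
residuals = series - np.tile(profile, t.size // 24)   # what the model learns
forecast_fallback = profile                           # robust, explainable baseline
print(np.abs(residuals).mean())                       # residuals are small
```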
https://arxiv.org/abs/2405.03429
Fine-tuned Large Language Models (LLMs) often suffer from overconfidence and poor calibration, particularly when fine-tuned on small datasets. To address these challenges, we propose a simple combination of Low-Rank Adaptation (LoRA) with Gaussian Stochastic Weight Averaging (SWAG), facilitating approximate Bayesian inference in LLMs. Through extensive testing across several Natural Language Processing (NLP) benchmarks, we demonstrate that our straightforward and computationally efficient approach improves model generalization and calibration. We further show that our method exhibits greater robustness against distribution shift, as reflected in its performance on out-of-distribution tasks.
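As a sketch of the combination, assuming diagonal SWAG applied only to the LoRA adapter weights: running first and second moments are collected during fine-tuning, and weight samples are drawn from the fitted Gaussian at inference (full SWAG additionally keeps a low-rank covariance, which is omitted here).

```python
# Diagonal SWAG over LoRA parameters only: track running moments during
# fine-tuning, then sample adapter weights for approximate Bayesian averaging.
import torch

class DiagonalSWAG:
    def __init__(self, params):
        self.params = list(params)                 # e.g., only LoRA A/B matrices
        self.mean = [torch.zeros_like(p) for p in self.params]
        self.sq_mean = [torch.zeros_like(p) for p in self.params]
        self.n = 0

    @torch.no_grad()
    def collect(self):
        self.n += 1
        for p, m, s in zip(self.params, self.mean, self.sq_mean):
            m.mul_((self.n - 1) / self.n).add_(p / self.n)
            s.mul_((self.n - 1) / self.n).add_(p.pow(2) / self.n)

    @torch.no_grad()
    def sample(self):
        for p, m, s in zip(self.params, self.mean, self.sq_mean):
            var = (s - m.pow(2)).clamp_min(1e-8)
            p.copy_(m + var.sqrt() * torch.randn_like(p))

lora_params = [torch.nn.Parameter(torch.randn(8, 64))]   # stand-in adapter
swag = DiagonalSWAG(lora_params)
swag.collect()      # call periodically during fine-tuning
swag.sample()       # draw a weight sample per forward pass at test time
```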
https://arxiv.org/abs/2405.03425
This work studies ensemble learning for graph neural networks (GNNs) under the popular semi-supervised setting. Ensemble learning has shown superiority in improving the accuracy and robustness of traditional machine learning by combining the outputs of multiple weak learners. However, adopting a similar idea to integrate different GNN models is challenging for two reasons. First, GNNs are notorious for their poor inference efficiency, so naively assembling multiple GNN models would further deteriorate the inference efficiency. Second, when GNN models are trained with few labeled nodes, their performance is limited. In this case, the vanilla ensemble approach, e.g., majority vote, may be sub-optimal since most base models, i.e., GNNs, may make wrong predictions. To this end, in this paper, we propose an efficient ensemble learner, E2GNN, to assemble multiple GNNs in a learnable way by leveraging both labeled and unlabeled nodes. Specifically, we first pre-train different GNN models on a given data scenario according to the labeled nodes. Next, instead of directly combining their outputs for label inference, we train a simple multi-layer perceptron (MLP) model to mimic their predictions on both labeled and unlabeled nodes. Then the unified MLP model is deployed to infer labels for unlabeled or new nodes. Since the predictions of unlabeled nodes from different GNN models may be incorrect, we develop a reinforced discriminator to effectively filter out those wrongly predicted nodes to boost the performance of the MLP. By doing this, we suggest a principled approach to tackle the inference issues of GNN ensembles while maintaining the merit of ensemble learning: improved performance. Comprehensive experiments over both transductive and inductive settings, across different GNN backbones and 8 benchmark datasets, demonstrate the superiority of E2GNN.
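The distillation step might look roughly like the following sketch, where a plain MLP on node features mimics averaged teacher predictions and a simple confidence filter stands in for the reinforced discriminator; all tensors are synthetic stand-ins.

```python
# An MLP student mimics the soft predictions of pre-trained GNN teachers over
# labeled and unlabeled nodes; a confidence filter stands in for the paper's
# reinforced discriminator.
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(100, 16)                                    # node features
gnn_probs = torch.softmax(torch.randn(3, 100, 7), dim=-1)   # 3 teacher GNNs
ensemble = gnn_probs.mean(dim=0)                            # averaged teachers

# Keep nodes where the ensemble is confident (proxy for the discriminator).
keep = ensemble.max(dim=-1).values > 0.25

mlp = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 7))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
for _ in range(100):
    logp = F.log_softmax(mlp(x[keep]), dim=-1)
    loss = F.kl_div(logp, ensemble[keep], reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
# At deployment, only the MLP runs, avoiding GNN message passing at inference.
```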
https://arxiv.org/abs/2405.03401
The advent of data-driven weather forecasting models, which learn from hundreds of terabytes (TB) of reanalysis data, has significantly advanced forecasting capabilities. However, the substantial costs associated with data storage and transmission present a major challenge for data providers and users, affecting resource-constrained researchers and limiting their accessibility to participate in AI-based meteorological research. To mitigate this issue, we introduce an efficient neural codec, the Variational Autoencoder Transformer (VAEformer), for extreme compression of climate data to significantly reduce data storage cost, making AI-based meteorological research portable to researchers. Our approach diverges from recent complex neural codecs by utilizing a low-complexity Auto-Encoder transformer. This encoder produces a quantized latent representation through variational inference, which reparameterizes the latent space as a Gaussian distribution. This method improves the estimation of distributions for cross-entropy coding. Extensive experiments demonstrate that our VAEformer outperforms existing state-of-the-art compression methods in the context of climate data. By applying our VAEformer, we compressed the most popular ERA5 climate dataset (226 TB) into a new dataset, CRA5 (0.7 TB). This translates to a compression ratio of over 300 while retaining the dataset's utility for accurate scientific analysis. Further, downstream experiments show that global weather forecasting models trained on the compact CRA5 dataset achieve forecasting accuracy comparable to the model trained on the original dataset. Code, the CRA5 dataset, and the pre-trained model are available at this https URL.
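A minimal sketch of the variational step described above: the encoder predicts a Gaussian over the latent, a reparameterized sample is quantized, and the predicted distribution supplies the bin probabilities used for cross-entropy (entropy) coding. Shapes and modules are illustrative, not the paper's transformer.

```python
# Gaussian reparameterization plus latent quantization, with code length
# estimated from the predicted distribution for entropy coding.
import torch
import torch.nn as nn

encoder = nn.Linear(64, 2 * 16)          # stand-in: predicts mean and log-var
x = torch.randn(8, 64)                   # a patch of climate fields, flattened

mu, log_var = encoder(x).chunk(2, dim=-1)
std = (0.5 * log_var).exp()
z = mu + torch.randn_like(mu) * std      # reparameterization trick
z_hat = torch.round(z)                   # quantized latent

# Bits for entropy coding: -log2 P(z_hat) under the predicted Gaussian,
# integrated over each quantization bin.
normal = torch.distributions.Normal(mu, std)
bin_prob = normal.cdf(z_hat + 0.5) - normal.cdf(z_hat - 0.5)
bits = -(bin_prob.clamp_min(1e-9).log2()).sum()
print(bits.item())                       # estimated code length for this batch
```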
https://arxiv.org/abs/2405.03376
Most fake news detection methods learn latent feature representations based on neural networks, which makes them black boxes that classify a piece of news without giving any justification. Existing explainable systems generate veracity justifications from investigative journalism, which suffer from delayed debunking and low efficiency. Recent studies simply assume that the justification is equivalent to the majority opinions expressed in the wisdom of crowds. However, the opinions typically contain some inaccurate or biased information since the wisdom of crowds is uncensored. To detect fake news from a sea of diverse, crowded and even competing narratives, in this paper, we propose a novel defense-based explainable fake news detection framework. Specifically, we first propose an evidence extraction module to split the wisdom of crowds into two competing parties and respectively detect salient evidences. To gain concise insights from evidences, we then design a prompt-based module that utilizes a large language model to generate justifications by inferring reasons towards two possible veracities. Finally, we propose a defense-based inference module to determine veracity via modeling the defense among these justifications. Extensive experiments conducted on two real-world benchmarks demonstrate that our proposed method outperforms state-of-the-art baselines in terms of fake news detection and provides high-quality justifications.
https://arxiv.org/abs/2405.03371
Q-learning excels at learning from feedback within sequential decision-making tasks but requires extensive sampling to achieve significant improvements. Although reward shaping is a powerful technique for enhancing learning efficiency, it can introduce biases that affect agent performance. Furthermore, potential-based reward shaping is constrained because it does not allow for reward modifications based on actions or terminal states, potentially limiting its effectiveness in complex environments. Additionally, large language models (LLMs) can achieve zero-shot learning, but this is generally limited to simpler tasks. They also exhibit low inference speeds and occasionally produce hallucinations. To address these issues, we propose \textbf{LLM-guided Q-learning}, which employs LLMs as a heuristic to aid in learning the Q-function for reinforcement learning. It combines the advantages of both technologies without introducing performance bias. Our theoretical analysis demonstrates that the LLM heuristic provides action-level guidance. Additionally, our architecture can convert the impact of hallucinations into exploration costs. Moreover, the converged Q-function corresponds to the optimal Q-function of the MDP. Experimental results demonstrate that our algorithm enables agents to avoid ineffective exploration, enhances sampling efficiency, and is well-suited for complex control tasks.
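A tabular sketch of the guidance mechanism, under the assumption that the heuristic enters only action selection: the LLM's scores bias exploration, but the TD target is untouched, so hallucinated advice costs exploration rather than biasing the converged Q-function; `llm_action_scores` is a hypothetical cached LLM call.

```python
# Tabular Q-learning with an LLM heuristic biasing exploration only.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))

def llm_action_scores(state):
    # Placeholder for cached LLM preferences over actions in this state.
    return rng.random(n_actions)

def select_action(state, beta=1.0, eps=0.1):
    if rng.random() < eps:
        prefs = Q[state] + beta * llm_action_scores(state)  # heuristic-guided
        return int(np.argmax(prefs))
    return int(np.argmax(Q[state]))

def update(s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Standard Q-learning target: the heuristic does not appear here.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

a = select_action(0)
update(0, a, r=1.0, s_next=1)
```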
https://arxiv.org/abs/2405.03341
Deep neural networks are applied in more and more areas of everyday life. However, they still lack essential abilities, such as robustly dealing with spatially transformed input signals. Approaches to mitigate this severe robustness issue are limited to two pathways: either models are implicitly regularised by increased sample variability (data augmentation) or explicitly constrained by hard-coded inductive biases. The limiting factor of the former is the size of the data space, which renders sufficient sample coverage intractable. The latter is limited by the engineering effort required to develop such inductive biases for every possible scenario. Instead, we take inspiration from human behaviour, where percepts are modified by mental or physical actions during inference. We propose a novel technique to emulate such an inference process for neural nets. This is achieved by traversing a sparsified inverse transformation tree during inference using parallel energy-based evaluations. Our proposed inference algorithm, called Inverse Transformation Search (ITS), is model-agnostic and equips the model with zero-shot pseudo-invariance to spatially transformed inputs. We evaluated our method on several benchmark datasets, including a synthesised ImageNet test set. ITS outperforms the utilised baselines on all zero-shot test scenarios.
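A flattened sketch of the search, assuming the candidate set is a small list of rotations rather than the paper's sparsified transformation tree: each inverse transform is applied in parallel and scored with an energy, here taken to be the negative maximum logit of a frozen classifier.

```python
# Inverse-transformation search over a small candidate set, scored by an
# energy; the lowest-energy candidate is kept as the canonicalized input.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # stand-in net
x = torch.randn(1, 1, 28, 28)                                 # transformed input

candidates = [lambda im, k=k: torch.rot90(im, k, dims=(-2, -1)) for k in range(4)]

with torch.no_grad():
    batch = torch.cat([t(x) for t in candidates])   # evaluate all in parallel
    energy = -model(batch).max(dim=-1).values       # low energy = confident
    best = int(energy.argmin())

x_canonical = candidates[best](x)   # input for the pseudo-invariant prediction
```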
https://arxiv.org/abs/2405.03730
Model editing aims to correct outdated or erroneous knowledge in large language models (LLMs) without the need for costly retraining. Lifelong model editing is the most challenging task that caters to the continuous editing requirements of LLMs. Prior works primarily focus on single or batch editing; nevertheless, these methods fall short in lifelong editing scenarios due to catastrophic knowledge forgetting and the degradation of model performance. Although retrieval-based methods alleviate these issues, they are impeded by slow and cumbersome processes of integrating the retrieved knowledge into the model. In this work, we introduce RECIPE, a RetriEval-augmented ContInuous Prompt lEarning method, to boost editing efficacy and inference efficiency in lifelong learning. RECIPE first converts knowledge statements into short and informative continuous prompts, prefixed to the LLM's input query embedding, to efficiently refine the response grounded on the knowledge. It further integrates the Knowledge Sentinel (KS) that acts as an intermediary to calculate a dynamic threshold, determining whether the retrieval repository contains relevant knowledge. Our retriever and prompt encoder are jointly trained to achieve editing properties, i.e., reliability, generality, and locality. In our experiments, RECIPE is assessed extensively across multiple LLMs and editing datasets, where it achieves superior editing performance. RECIPE also demonstrates its capability to maintain the overall performance of LLMs alongside showcasing fast editing and inference speed.
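A rough sketch of the retrieval gate and prompt prefixing, with all embeddings as random stand-ins: a query is compared to the repository, the Knowledge Sentinel supplies a dynamic threshold, and the retrieved knowledge's continuous prompt is prepended to the query embedding only when the similarity clears that threshold.

```python
# Dynamic-threshold retrieval and continuous-prompt prefixing, as a sketch.
import torch
import torch.nn.functional as F

d = 64
repo_keys = F.normalize(torch.randn(100, d), dim=-1)     # knowledge embeddings
repo_prompts = torch.randn(100, 4, d)                    # continuous prompts (len 4)
ks_embedding = F.normalize(torch.randn(d), dim=-1)       # Knowledge Sentinel

query = F.normalize(torch.randn(d), dim=-1)
sims = repo_keys @ query
best = int(sims.argmax())
threshold = ks_embedding @ query                         # dynamic, query-specific

query_tokens = torch.randn(10, d)                        # input query embedding
if sims[best] > threshold:
    inputs = torch.cat([repo_prompts[best], query_tokens])  # prefix the prompt
else:
    inputs = query_tokens                                # no relevant knowledge
```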
https://arxiv.org/abs/2405.03279
Large Language Models (LLMs), such as the GPT-4 and LLaMA families, have demonstrated considerable success across diverse tasks, including multiple-choice questions (MCQs). However, these models exhibit a positional bias, particularly an even worse anchored bias in the GPT-2 family, where they consistently favour the first choice 'A' in MCQs during inference. This anchored bias challenges the integrity of GPT-2's decision-making process, as it skews performance based on the position rather than the content of the choices in MCQs. In this study, we utilise the mechanistic interpretability approach to identify the internal modules within GPT-2 models responsible for this bias. We focus on the Multi-Layer Perceptron (MLP) layers and attention heads, using the "logit lens" method to trace and modify the specific value vectors that contribute to the bias. By updating these vectors within the MLP and recalibrating attention patterns to neutralise the preference for the first choice 'A', we effectively mitigate the anchored bias. Our interventions not only correct the bias but also improve the overall MCQ prediction accuracy for the GPT-2 family across various datasets. This work represents the first comprehensive mechanistic analysis of anchored bias in MCQs within the GPT-2 models, introducing targeted, minimal-intervention strategies that significantly enhance GPT-2 model robustness and accuracy in MCQs. Our code is available at this https URL.
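A sketch of the kind of inspection this enables, with random tensors standing in for real GPT-2 weights and a hypothetical token id: a value vector is projected through the unembedding matrix to read off how much logit mass it pushes toward 'A', and is damped if that contribution is an outlier.

```python
# "Logit lens"-style inspection of an MLP value vector's contribution to the
# logit of a single token, followed by a minimal damping intervention.
import torch

d_model, vocab = 768, 50257
W_U = torch.randn(d_model, vocab)      # unembedding matrix (stand-in)
value_vec = torch.randn(d_model)       # one MLP value vector (a W_out column)
token_A = 32                           # hypothetical id of the token ' A'

contribution = value_vec @ W_U         # logits this vector writes via residual
print("logit pushed toward 'A':", contribution[token_A].item())

if contribution[token_A] > contribution.mean() + 2 * contribution.std():
    value_vec *= 0.1                   # minimal intervention: damp the vector
```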
https://arxiv.org/abs/2405.03205
Large Language Models (LLMs) have achieved significant success in various natural language processing tasks, but how wireless communications can support LLMs has not been extensively studied. In this paper, we propose a wireless distributed LLM paradigm based on Mixture of Experts (MoE), named WDMoE, which deploys LLMs collaboratively across the edge servers of the base station (BS) and mobile devices in the wireless communication system. Specifically, we decompose the MoE layer in LLMs by deploying the gating network and the preceding neural network layer at the BS, while distributing the expert networks across the devices. This arrangement leverages the parallel capabilities of expert networks on distributed devices. Moreover, to overcome the instability of wireless communications, we design an expert selection policy that takes into account both the performance of the model and the end-to-end latency, which includes both transmission delay and inference delay. Evaluations conducted across various LLMs and multiple datasets demonstrate that WDMoE not only outperforms existing models, such as Llama 2 with 70 billion parameters, but also significantly reduces end-to-end latency.
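A small sketch of a latency-aware selection rule of this kind, with illustrative gating scores and latency estimates: the gating weight of each expert is penalized by its estimated end-to-end delay before the top-k experts are chosen.

```python
# Latency-aware expert selection: trade gating score against estimated delay.
import numpy as np

gate_scores = np.array([0.40, 0.25, 0.20, 0.15])   # from the gating network at the BS
latency_ms = np.array([12.0, 80.0, 15.0, 20.0])    # per-device transmission + inference

def select_experts(scores, latency, k=2, lam=0.002):
    adjusted = scores - lam * latency              # accuracy vs. delay trade-off
    return np.argsort(adjusted)[::-1][:k]

print(select_experts(gate_scores, latency_ms))     # e.g., skips the slow device
```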
https://arxiv.org/abs/2405.03131
In this paper, we propose modelling human translation production as a hierarchy of three embedded translation processes. The proposed architecture replicates the temporal dynamics of keystroke production across sensorimotor, cognitive, and phenomenal layers. Utilizing data from the CRITT TPR-DB, the Task Segment Framework, and the HOF taxonomy, we demonstrate the temporal breakdown of the typing flow on distinct timelines within these three layers.
https://arxiv.org/abs/2405.03111
Large Language Models (LLMs) have made significant strides in information acquisition. However, their overreliance on potentially flawed parametric knowledge leads to hallucinations and inaccuracies, particularly when handling long-tail, domain-specific queries. Retrieval Augmented Generation (RAG) addresses this limitation by incorporating external, non-parametric knowledge. Nevertheless, the retrieved long-context documents often contain noisy, irrelevant information alongside vital knowledge, negatively diluting LLMs' attention. Inspired by the supportive role of essential concepts in individuals' reading comprehension, we propose a novel concept-based RAG framework with an Abstract Meaning Representation (AMR)-based concept distillation algorithm. The proposed algorithm compresses the cluttered raw retrieved documents into a compact set of crucial concepts distilled from the informative nodes of the AMR by referring to reliable linguistic features. The concepts explicitly constrain LLMs to focus solely on vital information in the inference process. We conduct extensive experiments on open-domain question-answering datasets to empirically evaluate the proposed method's effectiveness. The results indicate that the concept-based RAG framework outperforms other baseline methods, particularly as the number of supporting documents increases, while also exhibiting robustness across various backbone LLMs. This emphasizes that the distilled concepts are informative for augmenting the RAG process by filtering out interference information. To the best of our knowledge, this is the first work introducing AMR to enhance RAG, presenting a potential solution to augment inference performance with semantic-based context compression.
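A toy sketch of the distillation direction, with a hand-written stand-in for an AMR parse (a real system would use an AMR parser): concept nodes are collected from the graph and used as a compact context in place of the raw document.

```python
# Hypothetical AMR-style parse of "The committee approved the new budget in July".
amr = {
    "approve-01": {"ARG0": "committee", "ARG1": "budget", "time": "july"},
    "budget": {"mod": "new"},
}

def distill_concepts(graph):
    concepts = []
    for node, edges in graph.items():
        concepts.append(node.split("-")[0])       # drop the PropBank sense suffix
        concepts.extend(v for v in edges.values())
    return sorted(set(concepts))

context = ", ".join(distill_concepts(amr))
prompt = f"Answer using only these concepts: {context}\nQuestion: ..."
print(prompt)
```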
https://arxiv.org/abs/2405.03085
Ibeling et al. (2023) axiomatize increasingly expressive languages of causation and probability, and Mosse et al. (2024) show that reasoning (specifically the satisfiability problem) in each causal language is as difficult, from a computational complexity perspective, as reasoning in its merely probabilistic or "correlational" counterpart. Introducing a summation operator to capture common devices that appear in applications -- such as the $do$-calculus of Pearl (2009) for causal inference, which makes ample use of marginalization -- van der Zander et al. (2023) partially extend these earlier complexity results to causal and probabilistic languages with marginalization. We complete this extension, fully characterizing the complexity of probabilistic and causal reasoning with summation, demonstrating that these again remain equally difficult. Surprisingly, allowing free variables for random variable values results in a system that is undecidable, so long as the ranges of these random variables are unrestricted. We finally axiomatize these languages featuring marginalization (or, more generally, summation), resolving open questions posed by Ibeling et al. (2023).
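An illustrative example of the kind of expression such a summation operator licenses is the back-door adjustment of the do-calculus, which equates an interventional probability with a marginalized observational one:

```latex
% Back-door adjustment: an interventional term rewritten via marginalization.
P(y \mid \mathit{do}(x)) \;=\; \sum_{z} P(y \mid x, z)\, P(z)
```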
https://arxiv.org/abs/2405.03069
Colorectal cancer contributes significantly to cancer-related mortality. Timely identification and elimination of polyps through colonoscopy screening is crucial in order to decrease mortality rates. Accurately detecting polyps in colonoscopy images is difficult because of differences in characteristics such as size, shape, texture, and similarity to surrounding tissues. Current deep-learning methods often face difficulties in capturing the long-range connections necessary for segmentation. This research presents BetterNet, a convolutional neural network (CNN) architecture that combines residual learning and attention methods to enhance the accuracy of polyp segmentation. Its primary characteristics are (1) a residual decoder architecture that facilitates efficient gradient propagation and integration of multiscale features; (2) channel and spatial attention blocks within the decoder that concentrate the learning process on relevant polyp regions; (3) state-of-the-art performance on polyp segmentation benchmarks while still ensuring computational efficiency; (4) thorough ablation tests confirming the influence of the architectural components; and (5) openly released model code to enable further contributions. Extensive evaluations conducted on datasets such as Kvasir-SEG, CVC-ClinicDB, Endoscene, EndoTect, and Kvasir-Sessile demonstrate that BetterNet outperforms current SOTA models in terms of segmentation accuracy by significant margins. The lightweight design enables real-time inference for various applications. BetterNet shows promise in integrating computer-assisted diagnosis techniques to enhance the detection of polyps and the early recognition of cancer. Link to the code: this https URL
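A hedged sketch of decoder attention blocks of this kind, in the style of CBAM; BetterNet's actual blocks may differ in pooling choices, kernel sizes, and placement.

```python
# Channel attention from global pooling followed by spatial attention from
# channel-wise mean/max maps, reweighting the decoder feature map.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # Channel attention from global average pooling.
        ca = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))))[..., None, None]
        x = x * ca
        # Spatial attention from channel-wise mean and max maps.
        sa = torch.sigmoid(self.spatial(
            torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True).values], 1)))
        return x * sa

block = ChannelSpatialAttention(32)
out = block(torch.randn(2, 32, 64, 64))   # same shape, attention-reweighted
```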
https://arxiv.org/abs/2405.04288
Efficient Image Super-Resolution (SR) aims to accelerate SR network inference by minimizing computational complexity and network parameters while preserving performance. Existing state-of-the-art efficient image super-resolution methods are based on convolutional neural networks. Few attempts have been made to harness Mamba's long-range modeling capability and efficient computational complexity, which have shown impressive performance on high-level vision tasks. In this paper, we propose DVMSR, a novel lightweight image SR network that incorporates Vision Mamba and a distillation strategy. The DVMSR network consists of three modules: feature extraction convolution, multiple stacked Residual State Space Blocks (RSSBs), and a reconstruction module. Specifically, the deep feature extraction module is composed of several residual state space blocks (RSSB), each of which contains several Vision Mamba Modules (ViMM) together with a residual connection. To improve efficiency while maintaining comparable performance, we apply a distillation strategy to the Vision Mamba network. Specifically, we leverage the rich representation knowledge of the teacher network as additional supervision for the output of the lightweight student network. Extensive experiments demonstrate that our proposed DVMSR outperforms state-of-the-art efficient SR methods in terms of model parameters while maintaining comparable PSNR and SSIM performance. The source code is available at this https URL
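A minimal sketch of the distillation supervision under simplifying assumptions: the student is trained on a reconstruction loss plus an L1 match to the frozen teacher's deep features, with single convolutions standing in for the Mamba-based blocks and no upsampling shown.

```python
# Feature distillation for SR: reconstruction loss plus teacher-feature match.
import torch
import torch.nn as nn

teacher = nn.Conv2d(3, 64, 3, padding=1).eval()    # frozen, rich features
student = nn.Conv2d(3, 64, 3, padding=1)           # lightweight network
head = nn.Conv2d(64, 3, 3, padding=1)              # reconstruction head

lr_img, hr_img = torch.randn(2, 3, 32, 32), torch.randn(2, 3, 32, 32)
with torch.no_grad():
    t_feat = teacher(lr_img)                       # teacher supervision
s_feat = student(lr_img)
sr = head(s_feat)

loss = nn.functional.l1_loss(sr, hr_img) + 0.5 * nn.functional.l1_loss(s_feat, t_feat)
loss.backward()
```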
https://arxiv.org/abs/2405.03008