Humans and many animal species have the skills to gather, transfer, process, fine-tune, and generate information throughout their lifetimes. This ability to learn over an entire lifespan is referred to as continual learning and relies on neurocognitive mechanisms. Consequently, real-world computational systems of incrementally learning autonomous agents also need such a continual learning mechanism, providing retrieval of information and long-term memory consolidation. However, a main challenge in artificial intelligence is enabling incremental learning in autonomous agents as new data are confronted. In such scenarios, the main concern is catastrophic forgetting (CF): when learning sequentially, a neural network underfits old data as it is confronted with new data. Numerous studies have been proposed to tackle this CF problem; however, it is very difficult to compare their performance due to dissimilarities in their evaluation mechanisms. Here we focus on comparing algorithms that share a similar type of evaluation mechanism, covering three types of incremental learning methods: (1) exemplar-based methods, (2) memory-based methods, and (3) network-based methods. This survey paper presents a methodology-oriented study of catastrophic forgetting in incremental deep neural networks. Furthermore, it contains a mathematical overview of the most impactful methods, which can help researchers deal with CF.
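To make the first family concrete, here is a minimal sketch of exemplar-based rehearsal, assuming a generic PyTorch classifier; the buffer class, reservoir policy, and loss mixing below are our illustration, not a specific method from the survey.

```python
# Minimal sketch of exemplar-based rehearsal, one of the three method
# families the survey compares. All names here are illustrative; the
# surveyed methods (e.g., herding-based exemplar selection) are more elaborate.
import random
import torch
import torch.nn.functional as F

class ExemplarBuffer:
    """Fixed-size reservoir of (input, label) pairs from past tasks."""
    def __init__(self, capacity=2000):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, x, y):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append((x, y))
        else:  # reservoir sampling keeps a uniform sample of the stream
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = (x, y)

    def sample(self, k):
        batch = random.sample(self.data, min(k, len(self.data)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.tensor(ys)

def train_step(model, opt, x_new, y_new, buffer, replay_size=32):
    """One incremental step: mix new-task data with replayed exemplars."""
    opt.zero_grad()
    loss = F.cross_entropy(model(x_new), y_new)
    if buffer.data:  # rehearsal term counteracts forgetting of old tasks
        x_old, y_old = buffer.sample(replay_size)
        loss = loss + F.cross_entropy(model(x_old), y_old)
    loss.backward()
    opt.step()
    for x, y in zip(x_new, y_new):
        buffer.add(x, int(y))
    return loss.item()
```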
https://arxiv.org/abs/2405.08015
While our understanding of fairness in machine learning has significantly progressed, our understanding of fairness in reinforcement learning (RL) remains nascent. Most of the attention has been on fairness in one-shot classification tasks; however, real-world, RL-enabled systems (e.g., autonomous vehicles) are much more complicated in that agents operate in dynamic environments over a long period of time. To ensure the responsible development and deployment of these systems, we must better understand fairness in RL. In this paper, we survey the literature to provide the most up-to-date snapshot of the frontiers of fairness in RL. We start by reviewing where fairness considerations can arise in RL, then discuss the various definitions of fairness in RL that have been put forth thus far. We continue to highlight the methodologies researchers used to implement fairness in single- and multi-agent RL systems before showcasing the distinct application domains that fair RL has been investigated in. Finally, we critically examine gaps in the literature, such as understanding fairness in the context of RLHF, that still need to be addressed in future work to truly operationalize fair RL in real-world systems.
https://arxiv.org/abs/2405.06909
Since their inception, programming languages have trended towards greater readability and lower barriers for programmers. Following this trend, natural language can be a promising type of programming language that provides great flexibility and usability and helps democratize programming. However, the inherent vagueness, ambiguity, and verbosity of natural language pose significant challenges in developing an interpreter that can accurately understand the programming logic and execute instructions written in natural language. Fortunately, recent advancements in Large Language Models (LLMs) have demonstrated remarkable proficiency in interpreting complex natural language. Inspired by this, we develop a novel system for Code Representation and Execution (CoRE), which employs an LLM as an interpreter to interpret and execute natural language instructions. The proposed system unifies natural language programming, pseudo-code programming, and flow programming under the same representation for constructing language agents, while the LLM serves as the interpreter that interprets and executes the agent programs. In this paper, we begin by defining the programming syntax that structures natural language instructions logically. During execution, we incorporate external memory to minimize redundancy. Furthermore, we equip the designed interpreter with the capability to invoke external tools, compensating for the limitations of LLMs in specialized domains or when accessing real-time information. This work is open-source at this https URL.
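The abstract does not reproduce CoRE's concrete syntax, so the sketch below only illustrates the general shape of an LLM-as-interpreter loop with external memory and tool invocation; the step format, the `llm_complete` placeholder, and the tool registry are assumptions, not the paper's API.

```python
# Hypothetical sketch of an LLM-as-interpreter loop in the spirit of CoRE.
# The step schema, llm_complete(), and the tool registry are assumptions;
# the actual CoRE syntax and execution engine are defined in the paper.
from typing import Callable, Dict

def llm_complete(prompt: str) -> str:
    """Placeholder for any chat/completion backend."""
    raise NotImplementedError

TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda q: f"(search results for {q!r})",  # stand-in tool
}

def run_program(steps, query: str) -> str:
    memory = []  # external memory: keeps only step outputs, not full history
    state = query
    for step in steps:  # each step is a natural-language instruction
        prompt = (
            f"Instruction: {step}\n"
            f"Relevant memory: {memory[-3:]}\n"
            f"Current input: {state}\n"
            "If a tool is needed, answer exactly: CALL <tool> <args>."
        )
        out = llm_complete(prompt)
        if out.startswith("CALL "):  # interpreter dispatches tool calls
            _, tool, args = out.split(" ", 2)
            out = TOOLS[tool](args)
        memory.append(out)
        state = out
    return state

# Usage: run_program(["Extract the city name", "search for its population",
#                     "Summarize in one sentence"], "Where is the Louvre?")
```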
https://arxiv.org/abs/2405.06907
Generative document retrieval, an emerging paradigm in information retrieval, learns to build connections between documents and identifiers within a single model and has garnered significant attention. However, two challenges remain: (1) neglect of inner-content correlation during document representation; (2) lack of explicit semantic structure during identifier construction. Events, however, have rich relations and a well-defined taxonomy, which can facilitate addressing both challenges. Inspired by this, we propose Event GDR, an event-centric generative document retrieval model that integrates event knowledge into this task. Specifically, we utilize a multi-agent exchange-then-reflection method for event knowledge extraction. For document representation, we employ events and relations to model the document, guaranteeing comprehensiveness and inner-content correlation. For identifier construction, we map the events to a well-defined event taxonomy to construct identifiers with explicit semantic structure. Our method achieves significant improvement over the baselines on two datasets, and we hope it provides insights for future research.
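As a toy illustration of the identifier-construction idea (the taxonomy, event fields, and separator here are hypothetical, not Event GDR's), one can prefix a document identifier with the taxonomy path of its extracted events:

```python
# Illustrative sketch of the identifier-construction idea: map extracted
# events onto a taxonomy path so the docid carries explicit semantics.
# The taxonomy, event fields, and separator are hypothetical, not Event GDR's.
TAXONOMY = {
    "acquire": ("Business", "Transaction", "Acquisition"),
    "elect":   ("Politics", "Election", "Vote"),
}

def semantic_identifier(events, doc_seq: int) -> str:
    """Prefix the docid with taxonomy nodes of its most salient events."""
    path = []
    for ev in events:                       # e.g., [{"trigger": "acquire"}]
        node = TAXONOMY.get(ev["trigger"])
        if node:
            path.extend(n for n in node if n not in path)
    return "/".join(path + [f"doc{doc_seq}"])

print(semantic_identifier([{"trigger": "acquire"}], 17))
# -> Business/Transaction/Acquisition/doc17
```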
https://arxiv.org/abs/2405.06886
As machine learning (ML) gains widespread adoption, practitioners are increasingly seeking means to quantify and control the risk these systems incur. This challenge is especially salient when ML systems have autonomy to collect their own data, such as in black-box optimization and active learning, where their actions induce sequential feedback-loop shifts in the data distribution. Conformal prediction has emerged as a promising approach to uncertainty and risk quantification, but existing variants either fail to accommodate sequences of data-dependent shifts, or do not fully exploit the fact that agent-induced shift is under our control. In this work we prove that conformal prediction can theoretically be extended to \textit{any} joint data distribution, not just exchangeable or quasi-exchangeable ones, although it is exceedingly impractical to compute in the most general case. For practical applications, we outline a procedure for deriving specific conformal algorithms for any data distribution, and we use this procedure to derive tractable algorithms for a series of agent-induced covariate shifts. We evaluate the proposed algorithms empirically on synthetic black-box optimization and active learning tasks.
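For intuition, here is a sketch of weighted split-conformal prediction under a single, known covariate shift (in the spirit of Tibshirani et al., 2019), the simplest member of the family of shifts the paper generalizes; the paper's sequential, agent-induced constructions go well beyond this.

```python
# Sketch of weighted split-conformal prediction under a *known* covariate
# shift -- the simplest case of the distribution shifts the paper handles;
# the agent-induced shifts there are sequential and data-dependent.
import numpy as np

def weighted_conformal_quantile(scores, x_cal, x_test, w, alpha=0.1):
    """scores: nonconformity scores |y_i - mu(x_i)| on calibration data;
    w: likelihood ratio dP_test/dP_train over covariates. Returns q such
    that [mu(x_test) - q, mu(x_test) + q] has >= 1-alpha coverage."""
    wc = np.asarray([w(x) for x in x_cal], dtype=float)
    wt = float(w(x_test))
    p = np.append(wc, wt)
    p = p / p.sum()                      # normalized weights
    order = np.argsort(scores)
    cdf = np.cumsum(p[:-1][order])       # weighted CDF of calib scores
    idx = np.searchsorted(cdf, 1 - alpha)
    if idx >= len(scores):               # test-point mass too large:
        return np.inf                    # interval is the whole line
    return np.sort(scores)[idx]

rng = np.random.default_rng(0)
x_cal = rng.normal(size=200)
scores = np.abs(rng.normal(size=200))
q = weighted_conformal_quantile(scores, x_cal, 0.5,
                                w=lambda x: np.exp(0.3 * x))
print(q)
```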
https://arxiv.org/abs/2405.06627
Large language models (LLMs) have emerged, presenting general problem-solving capabilities within a single model. However, model sizes have increased dramatically, to billions of parameters, to enable such broad problem-solving capabilities. In addition, due to the dominance of matrix-matrix and matrix-vector multiplications in LLMs, the compute-to-model-size ratio is significantly lower than that of CNNs. This shift pushes LLMs from a computation-bound regime to a memory-bound regime. Therefore, optimizing the memory footprint and traffic is an important optimization direction for LLMs today. Model compression methods such as quantization and parameter pruning have been actively explored for memory footprint and traffic optimization. However, the accuracy-efficiency trade-off of rank pruning for LLMs is not yet well understood. Therefore, we characterize the accuracy-efficiency trade-off of a low-rank decomposition method, specifically Tucker decomposition, on recent language models, including an open-source LLM, Llama 2. We formalize the low-rank decomposition design space and show that it is enormous (e.g., O($2^{37}$) for Llama2-7B). To navigate this vast design space, we perform thorough case studies of accuracy-efficiency trade-offs using six widely used LLM benchmarks on BERT and Llama 2 models. Our results show that we can achieve a 9\% model size reduction with minimal accuracy drops, ranging from 4\%p to 10\%p depending on the difficulty of the benchmark, without any retraining to recover accuracy after decomposition. The results show that low-rank decomposition can be a promising direction for LLM-based applications that require real-time service at scale (e.g., AI agent assist and real-time coding assistants), where latency is as important as model accuracy.
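As a minimal illustration of why rank choice drives the trade-off, consider the 2-D special case of Tucker decomposition: a truncated SVD of a single weight matrix. The sizes and rank below are arbitrary, and per-layer rank selection across a whole model is exactly the enormous design space the paper formalizes.

```python
# Low-rank factorization of one weight matrix via truncated SVD -- the 2-D
# special case of the Tucker decomposition studied in the paper. Choosing a
# rank per layer is the design-space navigation problem they address.
import numpy as np

def factorize(W: np.ndarray, rank: int):
    """Replace W (d_out x d_in) with A (d_out x r) @ B (r x d_in)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # absorb singular values into A
    B = Vt[:rank]
    return A, B

W = np.random.randn(512, 512)
A, B = factorize(W, rank=64)
params_before = W.size
params_after = A.size + B.size          # 2 * 512 * 64
print(f"compression: {params_after / params_before:.2%}")  # 25.00%
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.3f}")
```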
https://arxiv.org/abs/2405.06626
Reference path following is a key component in the functioning of almost all engineered autonomous agents. Among the several path-following guidance methods in the existing literature, the vector-field-based guidance approach has received wide attention because of its simplicity and guaranteed stability under a broad class of scenarios. However, most of the existing literature uses the same cross-track-error-dependent structure of the desired vector field irrespective of the instantaneous cross-track error and course angle of the unmanned vehicle, which is quite restrictive in attaining faster convergence and also leads to infeasibly high turn-rate commands in many scenarios. To this end, this paper presents a novel switched vector-field-based guidance law for following a general reference path, in which the structure of the desired vector field depends on the instantaneous cross-track error and the vehicle's course angle. While the developed method ensures faster convergence, it also ensures that the guidance command always stays within a realistic threshold satisfying its curvature constraint, thus making it more implementable in real life for autonomous vehicles with kino-dynamic constraints. A theoretical analysis of the convergence of the developed guidance scheme is presented. Possibilities of undesirable chattering at phase transitions are also eliminated. Numerical simulation studies validate the satisfactory performance of the developed algorithm.
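A toy sketch of the switched idea for a straight-line path follows; the gains, switching surface, and saturation values are illustrative, and unlike the paper this sketch does not handle general curved paths or chattering elimination.

```python
# Toy sketch of a switched vector-field guidance law for straight-line
# following. Gains, the switching surface, and the turn-rate limit are
# illustrative; the paper derives these for general paths with proofs.
import numpy as np

CHI_INF = np.pi / 2   # approach angle far from the path
K_FAR, K_NEAR = 0.05, 0.5
E_SWITCH = 10.0       # cross-track error (m) where the field switches
R_MAX = 0.3           # turn-rate limit (rad/s) from curvature constraint

def desired_course(e: float) -> float:
    """Desired course relative to the path as a function of cross-track
    error e: a far-field profile drives hard toward the path, a near-field
    higher-gain profile sharpens convergence close to it."""
    if abs(e) > E_SWITCH:
        return -CHI_INF * np.tanh(K_FAR * e)
    return -2.0 / np.pi * CHI_INF * np.arctan(K_NEAR * e)

def course_rate_cmd(chi: float, e: float, kp: float = 1.0) -> float:
    """P-control toward the field direction, saturated at R_MAX."""
    err = np.arctan2(np.sin(desired_course(e) - chi),
                     np.cos(desired_course(e) - chi))  # wrap to [-pi, pi]
    return float(np.clip(kp * err, -R_MAX, R_MAX))
```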
https://arxiv.org/abs/2405.06355
Visual Model-Based Reinforcement Learning (MBRL) promises to encapsulate an agent's knowledge about the underlying dynamics of the environment, enabling a world model to be learned as a useful planner. However, top MBRL agents such as Dreamer often struggle with visual pixel-based inputs in the presence of exogenous or irrelevant noise in the observation space, as they fail to capture task-specific features while filtering out irrelevant spatio-temporal details. To tackle this problem, we apply a spatio-temporal masking strategy and a bisimulation principle, combined with latent reconstruction, to capture endogenous, task-specific aspects of the environment in world models, effectively eliminating non-essential information. Joint training of representations, dynamics, and policy often leads to instabilities. To further address this issue, we develop a Hybrid Recurrent State-Space Model (HRSSM) structure, enhancing state-representation robustness for effective policy learning. Our empirical evaluation demonstrates significant performance improvements over existing methods in a range of visually complex control tasks such as Maniskill \cite{gu2023maniskill2} with exogenous distractors from the Matterport environment. Our code is available at this https URL.
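A minimal sketch of the spatio-temporal masking step is shown below, assuming patch-cube masking of a pixel sequence; the patch size and mask ratio are our assumptions, and the paper pairs this with bisimulation and latent-reconstruction losses rather than pixel reconstruction.

```python
# Sketch of the spatio-temporal masking idea: hide random space-time patches
# of a pixel sequence so the latent model is trained to reconstruct latents
# rather than exogenous detail. Shapes and the mask ratio are assumptions.
import torch

def spatiotemporal_mask(video: torch.Tensor, patch=8, ratio=0.5):
    """video: (T, C, H, W). Returns the masked video and the boolean mask
    over the (T, H/patch, W/patch) patch grid."""
    T, C, H, W = video.shape
    gh, gw = H // patch, W // patch
    mask = torch.rand(T, gh, gw) < ratio              # True = masked out
    m = mask.repeat_interleave(patch, 1).repeat_interleave(patch, 2)
    return video * (~m).unsqueeze(1), mask

video = torch.rand(16, 3, 64, 64)
masked, mask = spatiotemporal_mask(video)
print(masked.shape, mask.float().mean().item())       # ~0.5 of cubes hidden
```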
https://arxiv.org/abs/2405.06263
Open-ended worlds are those in which there are no pre-specified goals or environmental reward signal. As a consequence, an agent must know how to perform a multitude of tasks. However, when a new task is presented to an agent, we expect it to be able to reuse some of what it knows from previous tasks to rapidly learn that new task. We introduce a novel technique whereby policies for different a priori known tasks are combined into a Mixture-of-Experts model with an attention mechanism across a mix of frozen and unfrozen experts. The model learns when to attend to frozen task-specific experts and learns new experts to handle novel situations. We work in an open-ended text-based environment in which the agent is tasked with behaving like different types of character roles and must rapidly learn behaviors associated with new character role types. We show that our agent both obtains more rewards in the zero-shot setting and discovers these rewards with greater sample efficiency in the few-shot setting.
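A minimal PyTorch sketch of such a mixture, with attention over a mix of frozen and trainable experts, might look as follows; the dimensions, the single new expert, and the linear experts are illustrative simplifications of the paper's text-based RL setup.

```python
# Minimal sketch of a mixture-of-experts head with attention over a mix of
# frozen (old-task) and trainable (new) experts. Sizes are illustrative.
import torch
import torch.nn as nn

class AttentionMoE(nn.Module):
    def __init__(self, frozen_experts, d_model=128, n_actions=8):
        super().__init__()
        for e in frozen_experts:           # keep old-task knowledge intact
            e.requires_grad_(False)
        new_expert = nn.Linear(d_model, n_actions)
        self.experts = nn.ModuleList(list(frozen_experts) + [new_expert])
        self.keys = nn.Parameter(torch.randn(len(self.experts), d_model))
        self.query = nn.Linear(d_model, d_model)

    def forward(self, h):                  # h: (batch, d_model) state enc.
        attn = torch.softmax(self.query(h) @ self.keys.T, dim=-1)
        outs = torch.stack([e(h) for e in self.experts], dim=1)
        return (attn.unsqueeze(-1) * outs).sum(dim=1)  # weighted logits

old = [nn.Linear(128, 8) for _ in range(3)]            # stand-in task experts
policy = AttentionMoE(old)
logits = policy(torch.randn(4, 128))
print(logits.shape)                                    # torch.Size([4, 8])
```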
https://arxiv.org/abs/2405.06059
The emergence of large language models (LLMs) has opened up unprecedented possibilities for automating complex tasks, often at a level comparable to human performance. Despite their capabilities, LLMs still encounter difficulties in completing tasks that require high levels of accuracy and complexity due to their inherent limitations in handling multifaceted problems single-handedly. This paper introduces "Smurfs", a cutting-edge multi-agent framework designed to revolutionize the application of LLMs. By transforming a conventional LLM into a synergistic multi-agent ensemble, Smurfs enhances task decomposition and execution without necessitating extra training. This is achieved through innovative prompting strategies that allocate distinct roles within the model, thereby facilitating collaboration among specialized agents. The framework gives agents access to external tools to efficiently solve complex tasks. Our empirical investigation, featuring the mistral-7b-instruct model as a case study, showcases Smurfs' superior capability in intricate tool-utilization scenarios. Notably, Smurfs outmatches ChatGPT-ReACT on the ToolBench I2 and I3 benchmarks with a remarkable 84.4% win rate, surpassing the highest recorded performance of a GPT-4 model at 73.5%. Furthermore, through comprehensive ablation studies, we dissect the contribution of the core components of the multi-agent framework to its overall efficacy. This not only verifies the effectiveness of the framework but also charts a route for future exploration of multi-agent LLM systems.
https://arxiv.org/abs/2405.05955
This paper introduces a federated learning framework tailored for online combinatorial optimization with bandit feedback. In this setting, agents select subsets of arms, observe noisy rewards for these subsets without accessing individual arm information, and can cooperate and share information at specific intervals. Our framework transforms any offline resilient single-agent $(\alpha-\epsilon)$-approximation algorithm, having a complexity of $\tilde{\mathcal{O}}(\frac{\psi}{\epsilon^\beta})$, where the logarithm is omitted, for some function $\psi$ and constant $\beta$, into an online multi-agent algorithm with $m$ communicating agents and an $\alpha$-regret of no more than $\tilde{\mathcal{O}}(m^{-\frac{1}{3+\beta}} \psi^\frac{1}{3+\beta} T^\frac{2+\beta}{3+\beta})$. This approach not only eliminates the $\epsilon$ approximation error but also ensures sublinear growth with respect to the time horizon $T$ and demonstrates a linear speedup with an increasing number of communicating agents. Additionally, the algorithm is notably communication-efficient, requiring only a sublinear number of communication rounds, quantified as $\tilde{\mathcal{O}}\left(\psi T^\frac{\beta}{\beta+1}\right)$. Furthermore, the framework has been successfully applied to online stochastic submodular maximization using various offline algorithms, yielding the first results for both single-agent and multi-agent settings and recovering specialized single-agent theoretical guarantees. We empirically validate our approach to a stochastic data summarization problem, illustrating the effectiveness of the proposed framework, even in single-agent scenarios.
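For concreteness, instantiating the stated bounds at $\beta = 1$ (an arbitrary example value chosen by us, not one singled out by the paper) gives:

```latex
% Example instantiation (ours, for illustration): with \beta = 1 the
% framework's guarantees specialize to
\begin{align*}
  \alpha\text{-regret} &\le \tilde{\mathcal{O}}\!\left(m^{-1/4}\,\psi^{1/4}\,T^{3/4}\right),\\
  \text{communication rounds} &\le \tilde{\mathcal{O}}\!\left(\psi\, T^{1/2}\right),
\end{align*}
% so doubling the number of agents m shrinks the regret by a factor 2^{1/4}.
```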
https://arxiv.org/abs/2405.05950
We address the challenge of aggregating the preferences of multiple agents over LLM-generated replies to user queries, where agents might modify or exaggerate their preferences. New agents may participate for each new query, making fine-tuning LLMs on these preferences impractical. To overcome these challenges, we propose an auction mechanism that operates without fine-tuning or access to model weights. This mechanism is designed to provably converge to the output of the optimally fine-tuned LLM as computational resources are increased. The mechanism can also incorporate contextual information about the agents when available, which significantly accelerates its convergence. A well-designed payment rule ensures that truthful reporting is the optimal strategy for all agents, while also promoting an equity property by aligning each agent's utility with her contribution to social welfare - an essential feature for the mechanism's long-term viability. While our approach can be applied whenever monetary transactions are permissible, our flagship application is in online advertising. In this context, advertisers try to steer LLM-generated responses towards their brand interests, while the platform aims to maximize advertiser value and ensure user satisfaction. Experimental results confirm that our mechanism not only converges efficiently to the optimally fine-tuned LLM but also significantly boosts advertiser value and platform revenue, all with minimal computational overhead.
https://arxiv.org/abs/2405.05905
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. Such capabilities are difficult to learn solely from task-specific data. This has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used contrastively trained representations such as in CLIP have been shown to fail at enabling embodied agents to gain a sufficiently fine-grained scene understanding -- a capability vital for control. To address this shortcoming, we consider representations from pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts and as such, contain text-conditioned representations that reflect highly fine-grained visuo-spatial information. Using pre-trained text-to-image diffusion models, we construct Stable Control Representations which allow learning downstream control policies that generalize to complex, open-ended environments. We show that policies learned using Stable Control Representations are competitive with state-of-the-art representation learning approaches across a broad range of simulated control settings, encompassing challenging manipulation and navigation tasks. Most notably, we show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
https://arxiv.org/abs/2405.05852
Large language models (LLMs) like ChatGPT have shown significant advancements across diverse natural language understanding (NLU) tasks, including intelligent dialogue and autonomous agents. Yet, lacking widely acknowledged testing mechanisms, the question of whether LLMs are stochastic parrots or genuinely comprehend the world remains open, fostering numerous studies and sparking heated debates. Prevailing research mainly focuses on surface-level NLU, neglecting fine-grained explorations. However, such explorations are crucial for understanding LLMs' unique comprehension mechanisms, aligning with human cognition, and ultimately enhancing their general NLU capacities. To address this gap, our study delves into LLMs' nuanced semantic comprehension capabilities, particularly regarding common words with uncommon meanings. The idea stems from foundational principles of human communication within psychology, which underscore accurate shared understandings of word semantics. Specifically, this paper presents the innovative construction of a Lexical Semantic Comprehension (LeSC) dataset with novel evaluation metrics, the first benchmark encompassing both fine-grained and cross-lingual dimensions. Evaluating both open-source and closed-source models of varied scales and architectures, our extensive empirical experiments demonstrate the inferior performance of existing models on this basic lexical-meaning understanding task. Notably, even the state-of-the-art LLMs GPT-4 and GPT-3.5 lag behind 16-year-old humans by 3.9% and 22.3%, respectively. Additionally, multiple advanced prompting techniques and retrieval-augmented generation are introduced to help alleviate this problem, yet limitations persist. By highlighting these critical shortcomings, this research motivates further investigation and offers novel insights for developing more intelligent LLMs.
https://arxiv.org/abs/2405.05741
The article is an attempt to contribute to explorations of a common origin for language and planned collaborative action. It gives `semantics of change' the central stage in the synthesis, from its history and recordkeeping to its development, its syntax, delivery, and reception, including substratal aspects. It is suggested that, to arrive at a common core, linguistic semantics must be understood as studying, through syntax, a mobile agent's representing, tracking, and coping with change and no change. Semantics of actions can be conceived the same way, but through plans instead of syntax. The key point is the following: sequencing itself, of words and of action sequences, brings more structural interpretation to the sequence than is immediately evident from the sequents themselves. Mobile sequencers can be understood as subjects structuring the reporting, understanding, and tracking of change and no change. The idea invites rethinking of the notion of category, both in language and in planning. Understanding the understanding of change by mobile agents is suggested to be about human extended practice, not extended-human practice. That is why linguistics is as important as computer science in the synthesis. It must rely on the representational history of acts, thoughts, and expressions, personal and public, crosscutting the overtness and covertness of these phenomena. This has implications for anthropology within the extended practice, which are covered briefly.
https://arxiv.org/abs/2405.06710
We present an A*-based algorithm to compute policies for finite-horizon Dec-POMDPs. Our goal is to sacrifice optimality in favor of scalability for larger horizons. The main ingredients of our approach are (1) using clustered sliding-window memory, (2) pruning the A* search tree, and (3) using novel A* heuristics. Our experiments show performance competitive with the state-of-the-art, and for multiple benchmarks we achieve superior performance. In addition, we provide an A* algorithm that finds upper bounds on the optimum, tailored towards problems with long horizons. The main ingredient is a new heuristic that periodically reveals the state, thereby limiting the number of reachable beliefs. Our experiments demonstrate the efficacy and scalability of the approach.
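For orientation, the generic A* skeleton underlying such approaches, with a pruning hook, is sketched below; the paper's search nodes are partial Dec-POMDP policies with clustered sliding-window memory, and its heuristics and pruning rules are far more specific than this sketch.

```python
# Generic A* skeleton with a pruning hook, to fix ideas. This is not the
# paper's algorithm: there, nodes are partial policies and heuristics may
# be inadmissible by design (optimality is traded for scalability).
import heapq

def a_star(start, is_goal, successors, h, prune=lambda n, g: False):
    """successors(n) yields (child, cost) pairs; h estimates cost-to-go."""
    frontier = [(h(start), 0.0, 0, start)]   # (f, g, tiebreak, node)
    best_g = {start: 0.0}
    tie = 0
    while frontier:
        f, g, _, node = heapq.heappop(frontier)
        if is_goal(node):
            return node, g
        for child, cost in successors(node):
            g2 = g + cost
            if prune(child, g2):              # e.g., bound-based pruning
                continue
            if g2 < best_g.get(child, float("inf")):
                best_g[child] = g2
                tie += 1
                heapq.heappush(frontier, (g2 + h(child), g2, tie, child))
    return None, float("inf")
```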
https://arxiv.org/abs/2405.05662
This work introduces a novel value decomposition algorithm, termed \textit{Dynamic Deep Factor Graphs} (DDFG). Unlike traditional coordination graphs, DDFG leverages factor graphs to articulate the decomposition of value functions, offering enhanced flexibility and adaptability to complex value function structures. Central to DDFG is a graph structure generation policy that innovatively generates factor graph structures on-the-fly, effectively addressing the dynamic collaboration requirements among agents. DDFG strikes an optimal balance between the computational overhead associated with aggregating value functions and the performance degradation inherent in their complete decomposition. Through the application of the max-sum algorithm, DDFG efficiently identifies optimal policies. We empirically validate DDFG's efficacy in complex scenarios, including higher-order predator-prey tasks and the StarCraft II Multi-agent Challenge (SMAC), thus underscoring its capability to surmount the limitations faced by existing value decomposition algorithms. DDFG emerges as a robust solution for MARL challenges that demand nuanced understanding and facilitation of dynamic agent collaboration. The implementation of DDFG is made publicly accessible, with the source code available at \url{this https URL}.
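To fix ideas about the inference primitive, here is a tiny max-sum computation on a single pairwise payoff factor; this is the degenerate one-factor case, whereas DDFG iterates messages on loopy, dynamically generated graphs.

```python
# Tiny max-sum pass on a pairwise factor graph, the inference primitive
# DDFG runs on its generated graphs. Two agents, one payoff factor; the
# real algorithm iterates messages over many factors.
import numpy as np

def max_sum_pairwise(theta: np.ndarray, iters: int = 10):
    """theta[a1, a2]: joint payoff factor over two agents' actions.
    Returns the argmax joint action via max-sum messages."""
    n1, n2 = theta.shape
    m_f_to_1 = np.zeros(n1)   # factor -> variable messages
    m_f_to_2 = np.zeros(n2)
    for _ in range(iters):
        # variable->factor messages sum incoming messages from other
        # factors; with a single factor they are zero, so one sweep
        # already converges here.
        m_f_to_1 = theta.max(axis=1)
        m_f_to_2 = theta.max(axis=0)
    a1 = int(np.argmax(m_f_to_1))
    a2 = int(np.argmax(theta[a1]))        # decode consistently with a1
    return a1, a2

payoff = np.array([[3.0, 0.0],
                   [0.0, 5.0]])           # coordination game
print(max_sum_pairwise(payoff))           # -> (1, 1)
```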
https://arxiv.org/abs/2405.05542
Auction-based Federated Learning (AFL) has attracted extensive research interest due to its ability to motivate data owners (DOs) to join FL through economic means. While many existing AFL methods focus on providing decision support to model users (MUs) and the AFL auctioneer, decision support for data owners remains an open problem. To bridge this gap, we propose a first-of-its-kind agent-oriented joint Pricing, Acceptance and Sub-delegation decision support approach for data owners in AFL (PAS-AFL). By considering a DO's current reputation, pending FL tasks, willingness to train FL models, and its trust relationships with other DOs, it provides a systematic approach for a DO to make joint decisions on AFL bid acceptance, task sub-delegation, and pricing based on Lyapunov optimization to maximize its utility. It is the first approach to enable each DO to take on multiple FL tasks simultaneously, earning higher income for DOs and enhancing the throughput of FL tasks in the AFL ecosystem. Extensive experiments based on six benchmarking datasets demonstrate significant advantages of PAS-AFL compared to six alternative strategies, beating the best baseline by 28.77% and 2.64% on average in terms of utility and test accuracy of the resulting FL models, respectively.
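A toy sketch of the drift-plus-penalty flavor of such Lyapunov-based decisions is given below; the quantities, scales, and single-queue model are our illustration rather than PAS-AFL's actual formulation, which jointly handles pricing and sub-delegation.

```python
# Drift-plus-penalty sketch of a Lyapunov-based decision rule: at each
# round, a data owner accepts the bid maximizing V*payoff minus the
# queue-weighted workload it adds. All quantities are illustrative.
import random

V = 10.0                # utility weight vs. queue stability
Q = 0.0                 # backlog of accepted-but-unfinished FL work

for t in range(5):
    bids = [(random.uniform(1, 5), random.uniform(1, 3))  # (price, workload)
            for _ in range(4)]
    capacity = 2.0       # training workload completed this round
    # choose the bid (or none) maximizing V*price - Q*workload
    best = max(bids + [(0.0, 0.0)], key=lambda b: V * b[0] - Q * b[1])
    Q = max(Q - capacity, 0.0) + best[1]  # standard queue update
    print(f"t={t}: accept price={best[0]:.2f} load={best[1]:.2f} Q={Q:.2f}")
```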
https://arxiv.org/abs/2405.05991
We present a multi-agent reinforcement learning approach to solve a pursuit-evasion game between two players with car-like dynamics and sensing limitations. We develop a curriculum for an existing multi-agent deterministic policy gradient algorithm to simultaneously obtain strategies for both players, and deploy the learned strategies on real robots moving as fast as 2 m/s in indoor environments. Through experiments we show that the learned strategies improve over existing baselines by up to 30% in terms of capture rate for the pursuer. The learned evader model has up to a 5% better escape rate than the baselines, even against our competitive pursuer model. We also present experimental results showing how the pursuit-evasion game and its outcomes evolve as the player dynamics and sensor constraints are varied. Finally, we deploy the learned policies on physical robots for a game between the F1TENTH and JetRacer platforms and show that the learned strategies can be executed on real robots. Our code and supplementary material, including videos from experiments, are available at https://gonultasbu.this http URL.
https://arxiv.org/abs/2405.05372
Powered by large language models (LLMs), AI agents have become capable of many human tasks. Using the most canonical definitions of the Big Five personality traits, we measure the ability of LLMs to negotiate within a game-theoretical framework, as well as the methodological challenges of measuring notions of fairness and risk. Simulations (n=1,500) of both single-issue and multi-issue negotiation reveal that increased domain complexity with asymmetric issue valuations improves agreement rates but decreases the surplus obtained from aggressive negotiation. Through gradient-boosted regression and Shapley explainers, we find that high openness, conscientiousness, and neuroticism are associated with fair tendencies; low agreeableness and low openness are associated with rational tendencies. Low conscientiousness is associated with high toxicity. These results indicate that LLMs may have built-in guardrails that default to fair behavior, but can be "jailbroken" to exploit agreeable opponents. We also offer pragmatic insight into how negotiation bots can be designed, and a framework for assessing negotiation behavior based on game theory and computational social science.
https://arxiv.org/abs/2405.05248