In this paper, we introduce a new method for the task of interaction transfer. Given an example interaction between a source object and an agent, our method can automatically infer both surface and spatial relationships for the agent and target objects within the same category, yielding more accurate and valid transfers. Specifically, our method characterizes the example interaction using a combined spatial and surface representation. We correspond the agent points and object points related to the representation to the target object space using a learned spatial and surface correspondence field, which represents objects as deformed and rotated signed distance fields. With the corresponded points, an optimization is performed under the constraints of our spatial and surface interaction representation and additional regularization. Experiments conducted on human-chair and hand-mug interaction transfer tasks show that our approach can handle larger geometry and topology variations between source and target shapes, significantly outperforming state-of-the-art methods.
https://arxiv.org/abs/2405.03221
In this paper, we propose modelling human translation production as a hierarchy of three embedded translation processes. The proposed architecture replicates the temporal dynamics of keystroke production across sensorimotor, cognitive, and phenomenal layers. Utilizing data from the CRITT TPR-DB, the Task Segment Framework, and the HOF taxonomy, we demonstrate the temporal breakdown of the typing flow on distinct timelines within these three layers.
https://arxiv.org/abs/2405.03111
Detecting stereotypes and biases in Large Language Models (LLMs) is crucial for enhancing fairness and reducing adverse impacts on individuals or groups when these models are applied. Traditional methods, which rely on embedding spaces or probability metrics, fall short in revealing the nuanced and implicit biases present in various contexts. To address this challenge, we propose the FairMonitor framework and adopt a static-dynamic detection method for a comprehensive evaluation of stereotypes and biases in LLMs. The static component consists of a direct inquiry test, an implicit association test, and an unknown situation test, comprising 10,262 open-ended questions covering 9 sensitive factors and 26 educational scenarios; it is effective for evaluating both explicit and implicit biases. Moreover, we utilize a multi-agent system to construct dynamic scenarios for detecting subtle biases in more complex and realistic settings. This component detects biases based on the interaction behaviors of LLMs across 600 varied educational scenarios. The experimental results show that the combination of static and dynamic methods can detect more stereotypes and biases in LLMs.
https://arxiv.org/abs/2405.03098
Deep reinforcement learning (DRL) is playing an increasingly important role in real-world applications. However, obtaining an optimally performing DRL agent for complex tasks, especially with sparse rewards, remains a significant challenge. The training of a DRL agent can often be trapped in a bottleneck without further progress. In this paper, we propose RICE, an innovative refining scheme for reinforcement learning that incorporates explanation methods to break through the training bottlenecks. The high-level idea of RICE is to construct a new initial state distribution that combines both the default initial states and critical states identified through explanation methods, thereby encouraging the agent to explore from the mixed initial states. Through careful design, we can theoretically guarantee that our refining scheme has a tighter sub-optimality bound. We evaluate RICE in various popular RL environments and real-world applications. The results demonstrate that RICE significantly outperforms existing refining schemes in enhancing agent performance.
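The core idea, restarting episodes from a mixture of default and explanation-identified critical states, can be sketched in a few lines. This is an illustrative sketch only; the function names, the critical-state buffer, and the mixing probability are our assumptions, not RICE's actual interface:

```python
import random

def make_mixed_reset(default_reset, critical_states, mix_prob=0.5):
    """Reset episodes either from the environment's default initial-state
    distribution or from a buffer of critical states identified by an
    explanation method (illustrative sketch, not RICE's actual API)."""
    def reset():
        if critical_states and random.random() < mix_prob:
            return random.choice(critical_states)  # restart from a critical state
        return default_reset()                     # default initial-state distribution
    return reset

# Toy usage: the default reset always returns state 0; two critical
# states (7 and 13) were hypothetically identified by an explanation method.
mixed_reset = make_mixed_reset(lambda: 0, critical_states=[7, 13], mix_prob=1.0)
```

With `mix_prob` between 0 and 1, the agent keeps visiting the default start distribution while also practicing from the states where training previously stalled.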
https://arxiv.org/abs/2405.03064
Mean field games (MFGs) are a promising framework for modeling the behavior of large-population systems. However, solving MFGs can be challenging due to the coupling of forward population evolution and backward agent dynamics. Typically, obtaining mean field Nash equilibria (MFNE) involves an iterative approach where the forward and backward processes are solved alternately, known as fixed-point iteration (FPI). This method requires fully observed population propagation and agent dynamics over the entire spatial domain, which could be impractical in some real-world scenarios. To overcome this limitation, this paper introduces a novel online single-agent model-free learning scheme, which enables a single agent to learn MFNE using online samples, without prior knowledge of the state-action space, reward function, or transition dynamics. Specifically, the agent updates its policy through the value function (Q), while simultaneously evaluating the mean field state (M), using the same batch of observations. We develop two variants of this learning scheme: off-policy and on-policy QM iteration. We prove that they efficiently approximate FPI, and a sample complexity guarantee is provided. The efficacy of our methods is confirmed by numerical experiments.
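The flavor of evaluating the value function Q and the mean-field state M from the same batch of observations can be conveyed with a toy tabular sketch. Everything here (the two-action space, the learning rates, the moving-average mean-field estimate) is our own illustrative construction, not the paper's exact off-policy or on-policy QM update:

```python
from collections import defaultdict

def qm_update(batch, Q, M, alpha=0.1, beta=0.1, gamma=0.95, actions=(0, 1)):
    """One QM-style iteration (illustrative sketch): the same batch of
    (s, a, r, s') samples updates both the action-value table Q and the
    mean-field state distribution M."""
    # Value step: greedy one-step backup on Q.
    for s, a, r, s_next in batch:
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    # Mean-field step: exponential moving average toward the batch's
    # empirical state distribution; M stays normalized if it starts so.
    for s in list(M):
        M[s] *= (1.0 - beta)
    for s, _, _, _ in batch:
        M[s] += beta / len(batch)
    return Q, M

Q = defaultdict(float)
M = defaultdict(float, {0: 1.0})
Q, M = qm_update([(0, 1, 1.0, 0)], Q, M)
```

The point of the sketch is the coupling: no separate forward pass over the population is needed, since the single agent's own observations drive both estimates.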
https://arxiv.org/abs/2405.03718
This work considers the problem of optimal lane changing in a structured multi-agent road environment. A novel motion planning algorithm that can capture long-horizon dependencies as well as short-horizon dynamics is presented. Pivotal to our approach is a geometric approximation of the long-horizon combinatorial transition problem, which we formulate in the continuous time-space domain. Moreover, a discrete-time formulation of a short-horizon optimal motion planning problem is formulated and combined with the long-horizon planner. Both individual problems, as well as their combination, are formulated as MIQPs and solved in real-time by using state-of-the-art solvers. We show how the presented algorithm outperforms two other state-of-the-art motion planning algorithms in closed-loop performance and computation time in lane-changing problems. Evaluations are performed using the traffic simulator SUMO, a custom low-level tracking model predictive controller, and high-fidelity vehicle models and scenarios, provided by the CommonRoad environment.
https://arxiv.org/abs/2405.02979
Currently, the generative model has garnered considerable attention due to its application in addressing the challenge of scarcity of abnormal samples in the industrial Internet of Things (IoT). However, challenges persist regarding the edge deployment of generative models and the optimization of joint edge AI-generated content (AIGC) tasks. In this paper, we focus on the edge optimization of AIGC task execution and propose GMEL, a generative model-driven industrial AIGC collaborative edge learning framework. This framework aims to facilitate efficient few-shot learning by leveraging realistic sample synthesis and edge-based optimization capabilities. First, a multi-task AIGC computational offloading model is presented to ensure the efficient execution of heterogeneous AIGC tasks on edge servers. Then, we propose an attention-enhanced multi-agent reinforcement learning (AMARL) algorithm aimed at refining offloading policies within the IoT system, thereby supporting generative model-driven edge learning. Finally, our experimental results demonstrate the effectiveness of the proposed algorithm in optimizing the total system latency of the edge-based AIGC task completion.
https://arxiv.org/abs/2405.02972
A consistent spatial-temporal coordination across multiple agents is fundamental for collaborative perception, which seeks to improve perception abilities through information exchange among agents. To achieve this spatial-temporal alignment, traditional methods depend on external devices to provide localization and clock signals. However, hardware-generated signals could be vulnerable to noise and potentially malicious attacks, jeopardizing the precision of spatial-temporal alignment. Rather than relying on external hardware, this work proposes a novel approach: aligning by recognizing the inherent geometric patterns within the perceptual data of various agents. Following this spirit, we propose a robust collaborative perception system that operates independently of external localization and clock devices. The key module of our system, \emph{FreeAlign}, constructs a salient object graph for each agent based on its detected boxes and uses a graph neural network to identify common subgraphs between agents, leading to accurate relative pose and time. We validate \emph{FreeAlign} on both real-world and simulated datasets. The results show that the \emph{FreeAlign}-empowered robust collaborative perception system performs comparably to systems relying on precise localization and clock devices.
https://arxiv.org/abs/2405.02965
In this paper, we introduce a simulacrum of a hospital, called Agent Hospital, that simulates the entire process of treating illness. All patients, nurses, and doctors are autonomous agents powered by large language models (LLMs). Our central goal is to enable a doctor agent to learn how to treat illness within the simulacrum. To do so, we propose a method called MedAgent-Zero. As the simulacrum can simulate disease onset and progression based on knowledge bases and LLMs, doctor agents can keep accumulating experience from both successful and unsuccessful cases. Simulation experiments show that the treatment performance of doctor agents consistently improves on various tasks. More interestingly, the knowledge the doctor agents have acquired in Agent Hospital is applicable to real-world medical benchmarks. After treating around ten thousand patients (a caseload that may take real-world doctors over two years), the evolved doctor agent achieves a state-of-the-art accuracy of 93.06% on a subset of the MedQA dataset that covers major respiratory diseases. This work paves the way for advancing the applications of LLM-powered agent techniques in medical scenarios.
https://arxiv.org/abs/2405.02957
Conversational recommender systems have emerged as a potent solution for efficiently eliciting user preferences. These systems interactively present queries associated with "key terms" to users and leverage user feedback to estimate user preferences more efficiently. Nonetheless, most existing algorithms adopt a centralized approach. In this paper, we introduce FedConPE, a phase elimination-based federated conversational bandit algorithm, where $M$ agents collaboratively solve a global contextual linear bandit problem with the help of a central server while ensuring secure data management. To effectively coordinate all the clients and aggregate their collected data, FedConPE uses an adaptive approach to construct key terms that minimize uncertainty across all dimensions in the feature space. Furthermore, compared with existing federated linear bandit algorithms, FedConPE offers improved computational and communication efficiency as well as enhanced privacy protections. Our theoretical analysis shows that FedConPE is minimax near-optimal in terms of cumulative regret. We also establish upper bounds for communication costs and conversation frequency. Comprehensive evaluations demonstrate that FedConPE outperforms existing conversational bandit algorithms while using fewer conversations.
https://arxiv.org/abs/2405.02881
Social media platforms such as Twitter, Reddit, and Sina Weibo play a crucial role in global communication but often encounter strict regulations in geopolitically sensitive regions. This situation has prompted users to ingeniously modify their way of communicating, frequently resorting to coded language in these regulated social media environments. This shift in communication is not merely a strategy to counteract regulation, but a vivid manifestation of language evolution, demonstrating how language naturally evolves under societal and technological pressures. Studying the evolution of language in regulated social media contexts is of significant importance for ensuring freedom of speech, optimizing content moderation, and advancing linguistic research. This paper proposes a multi-agent simulation framework using Large Language Models (LLMs) to explore the evolution of user language in regulated social media environments. The framework employs LLM-driven agents: a supervisory agent that enforces dialogue supervision, and participant agents that evolve their language strategies while engaging in conversation, simulating how communication styles evolve to evade strict social media regulation. The study evaluates the framework's effectiveness through a range of scenarios, from abstract settings to real-world situations. Key findings indicate that LLMs are capable of simulating nuanced language dynamics and interactions in constrained settings, showing improvement in both evading supervision and information accuracy as evolution progresses. Furthermore, it was found that LLM agents adopt different strategies for different scenarios.
https://arxiv.org/abs/2405.02858
Exploring complex adaptive financial trading environments through multi-agent based simulation methods presents an innovative approach within the realm of quantitative finance. Despite the dominance of multi-agent reinforcement learning approaches in financial markets with observable data, there exists a set of systemically significant financial markets that pose challenges due to their partial or obscured data availability. We, therefore, devise a multi-agent simulation approach employing small-scale meta-heuristic methods. This approach aims to represent the opaque bilateral market for Australian government bond trading, capturing the bilateral nature of bank-to-bank trading, also referred to as "over-the-counter" (OTC) trading, and commonly occurring between "market makers". The uniqueness of the bilateral market, characterized by negotiated transactions and a limited number of agents, yields valuable insights for agent-based modelling and quantitative finance. The inherent rigidity of this market structure, which is at odds with the global proliferation of multilateral platforms and the decentralization of finance, underscores the unique insights offered by our agent-based model. We explore the implications of market rigidity on market structure and consider the element of stability in market design. This extends the ongoing discourse on complex financial trading environments, providing an enhanced understanding of their dynamics and implications.
https://arxiv.org/abs/2405.02849
Sim2real transfer has received increasing attention lately due to the success of learning robotic tasks in simulation end-to-end. While there has been a lot of progress in transferring vision-based navigation policies, the existing sim2real strategy for audio-visual navigation performs data augmentation empirically without measuring the acoustic gap. The sound differs from light in that it spans across much wider frequencies and thus requires a different solution for sim2real. We propose the first treatment of sim2real for audio-visual navigation by disentangling it into acoustic field prediction (AFP) and waypoint navigation. We first validate our design choice in the SoundSpaces simulator and show improvement on the Continuous AudioGoal navigation benchmark. We then collect real-world data to measure the spectral difference between the simulation and the real world by training AFP models that only take a specific frequency subband as input. We further propose a frequency-adaptive strategy that intelligently selects the best frequency band for prediction based on both the measured spectral difference and the energy distribution of the received audio, which improves the performance on the real data. Lastly, we build a real robot platform and show that the transferred policy can successfully navigate to sounding objects. This work demonstrates the potential of building intelligent agents that can see, hear, and act entirely from simulation, and transferring them to the real world.
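A minimal sketch of what frequency-adaptive band selection could look like, assuming per-subband measurements of the sim-to-real spectral gap and of the received audio's energy. The additive scoring rule and all names here are our illustrative choices, not the paper's actual strategy:

```python
def select_band(spectral_gap, energy):
    """Frequency-adaptive selection sketch: given, per subband, the
    measured sim-to-real spectral gap and the received-audio energy,
    prefer bands with high energy and low gap. The score
    (energy minus gap) is an assumed, illustrative rule."""
    bands = range(len(spectral_gap))
    return max(bands, key=lambda b: energy[b] - spectral_gap[b])

# Three subbands: band 1 combines a small sim2real gap with decent energy.
best = select_band(spectral_gap=[0.8, 0.1, 0.5], energy=[0.3, 0.6, 0.9])
```

The intuition: a band where simulation matches reality poorly is unreliable for the AFP model even if it carries energy, so both signals enter the score.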
https://arxiv.org/abs/2405.02821
Deep reinforcement learning (DRL) has demonstrated remarkable performance in many continuous control tasks. However, a significant obstacle to the real-world application of DRL is the lack of safety guarantees. Although DRL agents can satisfy system safety in expectation through reward shaping, designing agents to consistently meet hard constraints (e.g., safety specifications) at every time step remains a formidable challenge. In contrast, existing work in the field of safe control provides guarantees on persistent satisfaction of hard safety constraints. However, these methods require explicit analytical system dynamics models to synthesize safe control, which are typically inaccessible in DRL settings. In this paper, we present a model-free safe control algorithm, the implicit safe set algorithm, for synthesizing safeguards for DRL agents that ensure provable safety throughout training. The proposed algorithm synthesizes a safety index (barrier certificate) and a subsequent safe control law solely by querying a black-box dynamic function (e.g., a digital twin simulator). Moreover, we theoretically prove that the implicit safe set algorithm guarantees finite time convergence to the safe set and forward invariance for both continuous-time and discrete-time systems. We validate the proposed algorithm on the state-of-the-art Safety Gym benchmark, where it achieves zero safety violations while gaining $95\% \pm 9\%$ cumulative reward compared to state-of-the-art safe DRL methods. Furthermore, the resulting algorithm scales well to high-dimensional systems with parallel computing.
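The safeguard idea, validating or overriding a nominal action using only queries to a black-box dynamics function, can be sketched as a one-step filter. This is an illustrative sketch under our own conventions (phi <= 0 taken to mean "safe"); the paper's safety-index synthesis is more involved:

```python
def safe_filter(state, nominal_action, candidate_actions, step, phi):
    """One-step safety filter (sketch): accept the nominal action if the
    black-box dynamics `step` (playing the role of a digital twin
    simulator) predicts a next state with safety index phi <= 0;
    otherwise fall back to the candidate minimizing phi at the next state."""
    if phi(step(state, nominal_action)) <= 0:
        return nominal_action
    return min(candidate_actions, key=lambda a: phi(step(state, a)))

# Toy 1-D system: the state moves by the action; the safe set is |x| <= 1.
step = lambda x, a: x + a
phi = lambda x: abs(x) - 1.0
```

Because only forward queries of `step` are needed, no analytical dynamics model has to be written down, which is the point of the model-free setting.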
https://arxiv.org/abs/2405.02754
The significance of network structures in promoting group cooperation within social dilemmas has been widely recognized. Prior studies attribute this facilitation to the assortment of strategies driven by spatial interactions. Although reinforcement learning has been employed to investigate the impact of dynamic interaction on the evolution of cooperation, there remains a lack of understanding about how agents develop neighbour selection behaviours and the formation of strategic assortment within an explicit interaction structure. To address this, our study introduces a computational framework based on multi-agent reinforcement learning in the spatial Prisoner's Dilemma game. This framework allows agents to select dilemma strategies and interacting neighbours based on their long-term experiences, differing from existing research that relies on preset social norms or external incentives. By modelling each agent using two distinct Q-networks, we disentangle the coevolutionary dynamics between cooperation and interaction. The results indicate that long-term experience enables agents to develop the ability to identify non-cooperative neighbours and exhibit a preference for interaction with cooperative ones. This emergent self-organizing behaviour leads to the clustering of agents with similar strategies, thereby increasing network reciprocity and enhancing group cooperation.
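A toy sketch of the disentanglement via two Q-functions, one scoring the dilemma strategy and one scoring which neighbour to interact with. The tabular representation, epsilon-greedy rule, and all names are our illustrative choices standing in for the paper's two Q-networks:

```python
import random

class DualQAgent:
    """Sketch of disentangling cooperation from interaction: one Q-table
    scores dilemma strategies (C/D), a second independently scores which
    neighbour to interact with (illustrative, not the paper's networks)."""
    def __init__(self, neighbours, epsilon=0.1):
        self.q_strategy = {"C": 0.0, "D": 0.0}
        self.q_neighbour = {n: 0.0 for n in neighbours}
        self.epsilon = epsilon

    def act(self):
        # Epsilon-greedy choice over each table independently.
        pick = lambda q: (random.choice(list(q)) if random.random() < self.epsilon
                          else max(q, key=q.get))
        return pick(self.q_strategy), pick(self.q_neighbour)

    def learn(self, strategy, neighbour, payoff, alpha=0.1):
        # The same payoff updates both tables, but along separate axes,
        # so strategy choice and partner choice coevolve yet stay distinct.
        self.q_strategy[strategy] += alpha * (payoff - self.q_strategy[strategy])
        self.q_neighbour[neighbour] += alpha * (payoff - self.q_neighbour[neighbour])

agent = DualQAgent(["a", "b"], epsilon=0.0)
agent.learn("C", "a", payoff=1.0)
```

After repeated payoffs, the neighbour table drifts toward cooperative partners while the strategy table tracks which dilemma action pays, mirroring the emergent assortment described above.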
https://arxiv.org/abs/2405.02654
Conversational information seeking has evolved rapidly in the last few years with the development of Large Language Models (LLMs), providing the basis for interpreting and responding in a naturalistic manner to user requests. The extended TREC Interactive Knowledge Assistance Track (iKAT) collection aims to enable researchers to test and evaluate their Conversational Search Agents (CSA). The collection contains a set of 36 personalized dialogues over 20 different topics, each coupled with a Personal Text Knowledge Base (PTKB) that defines the bespoke user personas. A total of 344 turns with approximately 26,000 passages are provided with assessments of relevance, as well as additional assessments of generated responses over four key dimensions: relevance, completeness, groundedness, and naturalness. The collection challenges CSAs to efficiently navigate diverse personal contexts, elicit pertinent persona information, and employ context for relevant conversations. The integration of a PTKB and the emphasis on decisional search tasks contribute to the uniqueness of this test collection, making it an essential benchmark for advancing research in conversational and interactive knowledge assistants.
https://arxiv.org/abs/2405.02637
Categorical Distributional Reinforcement Learning (CDRL) has demonstrated superior sample efficiency in learning complex tasks compared to conventional Reinforcement Learning (RL) approaches. However, the practical application of CDRL is encumbered by challenging projection steps, detailed parameter tuning, and domain knowledge. This paper addresses these challenges by introducing a pioneering Continuous Distributional Model-Free RL algorithm tailored for continuous action spaces. The proposed algorithm simplifies the implementation of distributional RL, adopting an actor-critic architecture wherein the critic outputs a continuous probability distribution. Additionally, we propose an ensemble of multiple critics fused through a Kalman fusion mechanism to mitigate overestimation bias. Through a series of experiments, we validate that our proposed method is easy to train and serves as a sample-efficient solution for executing complex continuous-control tasks.
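Inverse-variance fusion, the classic Kalman-style way of combining independent estimates, gives a flavor of how an ensemble of critics might be fused to temper overestimation. This is a generic sketch under our own assumptions, not necessarily the paper's exact mechanism:

```python
def kalman_fuse(means, variances):
    """Kalman-style (inverse-variance) fusion of independent critic
    estimates: each critic reports a mean and a variance, and the fused
    mean weights each estimate by 1/variance, so confident critics
    dominate. Generic sketch, not the paper's exact update."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused_mean = sum(w * m for w, m in zip(weights, means)) / total
    fused_var = 1.0 / total  # fused estimate is tighter than any single critic
    return fused_mean, fused_var
```

Because an over-optimistic critic with high variance is down-weighted, the fused value estimate is less prone to the overestimation bias a single critic exhibits.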
https://arxiv.org/abs/2405.02576
Unsupervised pre-training has increasingly sought the benefits of a value-function representation referred to as successor features (SFs), which decouples the dynamics of the environment from the rewards. Owing to this decomposition, SFs have a significant impact on the process of task-specific fine-tuning. However, existing approaches struggle with local optima because they use a unified intrinsic reward for both exploration and exploitation, without considering the underlying linear-regression problem or a discriminator that supports only a small skill space. We propose a novel unsupervised pre-training model with SFs based on a non-monolithic exploration methodology. Our approach pursues the decomposition of exploitation and exploration for an agent built on SFs, which requires separate agents for the respective purposes. The idea leverages not only the inherent characteristics of SFs, such as quick adaptation to new tasks, but also their exploratory and task-agnostic capabilities. Our proposed model, termed Non-Monolithic unsupervised Pre-training with Successor features (NMPS), improves on the performance of the original monolithic exploration method of pre-training with SFs. NMPS outperforms Active Pre-training with Successor Features (APS) in a comparative experiment.
https://arxiv.org/abs/2405.02569
Curriculum design for reinforcement learning (RL) can speed up an agent's learning process and help it learn to perform well on complex tasks. However, existing techniques typically require domain-specific hyperparameter tuning, involve expensive optimization procedures for task selection, or are suitable only for specific learning objectives. In this work, we consider curriculum design in contextual multi-task settings where the agent's final performance is measured w.r.t. a target distribution over complex tasks. We base our curriculum design on the Zone of Proximal Development concept, which has proven to be effective in accelerating the learning process of RL agents for uniform distribution over all tasks. We propose a novel curriculum, ProCuRL-Target, that effectively balances the need for selecting tasks that are not too difficult for the agent while progressing the agent's learning toward the target distribution via leveraging task correlations. We theoretically justify the task selection strategy of ProCuRL-Target by analyzing a simple learning setting with REINFORCE learner model. Our experimental results across various domains with challenging target task distributions affirm the effectiveness of our curriculum strategy over state-of-the-art baselines in accelerating the training process of deep RL agents.
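A hypothetical scoring rule conveying the ZPD intuition: prefer tasks of intermediate difficulty that also correlate with the target distribution. The product form and all names below are our illustrative stand-ins, not ProCuRL-Target's actual selection criterion:

```python
def pick_task(tasks, success_prob, target_corr):
    """ZPD-style task scoring sketch: favor tasks the agent can partly
    solve (success probability near 0.5, the 'proximal' zone) and that
    correlate with the target task distribution. Illustrative only."""
    def score(t):
        p = success_prob[t]
        return p * (1.0 - p) * target_corr[t]  # p*(1-p) peaks at p = 0.5
    return max(tasks, key=score)

tasks = ["easy", "medium", "hard"]
chosen = pick_task(tasks,
                   success_prob={"easy": 0.95, "medium": 0.5, "hard": 0.05},
                   target_corr={"easy": 0.5, "medium": 0.9, "hard": 1.0})
```

Mastered tasks (p near 1) and hopeless ones (p near 0) both score low, so the curriculum climbs through tasks that are learnable now and relevant later.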
https://arxiv.org/abs/2405.02481
We present a novel agent-based approach to simulating an over-the-counter (OTC) financial market in which trades are intermediated solely by market makers and agent visibility is constrained to a network topology. Dynamics, such as changes in price, result from agent-level interactions that ubiquitously occur via market maker agents acting as liquidity providers. Two additional agents are considered: trend investors use a deep convolutional neural network paired with a deep Q-learning framework to inform trading decisions by analysing price history; and value investors use a static price-target to determine their trade directions and sizes. We demonstrate that our novel inclusion of a network topology with market makers facilitates explorations into various market structures. First, we present the model and an overview of its mechanics. Second, we validate our findings via comparison to the real-world: we demonstrate a fat-tailed distribution of price changes, auto-correlated volatility, a skew negatively correlated to market maker positioning, predictable price-history patterns and more. Finally, we demonstrate that our network-based model can lend insights into the effect of market-structure on price-action. For example, we show that markets with sparsely connected intermediaries can have a critical point of fragmentation, beyond which the market forms distinct clusters and arbitrage becomes rapidly possible between the prices of different market makers. A discussion is provided on future work that would be beneficial.
https://arxiv.org/abs/2405.02480