While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the workhorse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping) and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative reward between two completions to a prompt via a direct policy parameterization, enabling a strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation with performance stronger than or similar to that of PPO and DPO, all while being simpler to implement and more computationally tractable than PPO.
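To make the reduction concrete, here is a minimal sketch of the regression objective the abstract describes, assuming per-completion log-probabilities under the current and previous policies are already available; the function signature and the scale parameter eta are illustrative assumptions, not the paper's code.

```python
import torch

def rebel_loss(logp_new_a, logp_old_a,   # log pi_theta(y|x), log pi_t(y|x)
               logp_new_b, logp_old_b,   # log pi_theta(y'|x), log pi_t(y'|x)
               reward_a, reward_b,       # r(x, y), r(x, y')
               eta: float = 1.0) -> torch.Tensor:
    # Predicted relative reward, read off the policy's log-probability ratios.
    pred = ((logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)) / eta
    # Observed relative reward between the two completions.
    target = reward_a - reward_b
    # Plain least-squares regression: no value network, no clipping.
    return ((pred - target) ** 2).mean()
```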
https://arxiv.org/abs/2404.16767
This paper presents a novel learning approach for the Dubins Traveling Salesman Problem with Neighborhoods (DTSPN) to quickly produce a tour for a non-holonomic vehicle passing through neighborhoods of given task points. The method involves two learning phases: initially, a model-free reinforcement learning approach leverages privileged information to distill knowledge from expert trajectories generated by the Lin-Kernighan heuristic (LKH) algorithm. Subsequently, a supervised learning phase trains an adaptation network to solve problems independently of privileged information. Before the first learning phase, a parameter initialization technique using the demonstration data was devised to enhance training efficiency. The proposed learning method produces a solution about 50 times faster than LKH and substantially outperforms other imitation learning and RL-with-demonstration schemes, most of which fail to sense all the task points.
https://arxiv.org/abs/2404.16721
Autonomous navigation in dynamic environments is a complex but essential task for autonomous robots, with recent deep reinforcement learning approaches showing promising results. However, the complexity of the real world makes it infeasible to train agents in every possible scenario configuration. Moreover, existing methods typically overlook factors such as robot kinodynamic constraints, or assume perfect knowledge of the environment. In this work, we present RUMOR, a novel planner for differential-drive robots that uses deep reinforcement learning to navigate in highly dynamic environments. Unlike other end-to-end DRL planners, it uses a descriptive robocentric velocity space model to extract dynamic environment information, enhancing training effectiveness and scenario interpretation. Additionally, we propose an action space that inherently considers robot kinodynamics and train the planner in a simulator that reproduces the problematic aspects of the real world, reducing the gap between reality and simulation. We extensively compare RUMOR with other state-of-the-art approaches, demonstrating better performance, and provide a detailed analysis of the results. Finally, we validate RUMOR's performance in real-world settings by deploying it on a ground robot. Our experiments, conducted in crowded scenarios and unseen environments, confirm the algorithm's robustness and transferability.
https://arxiv.org/abs/2404.16672
The integration of Large Language Models (LLMs) into healthcare promises to transform medical diagnostics, research, and patient care. Yet, the progression of medical LLMs faces obstacles such as complex training requirements, rigorous evaluation demands, and the dominance of proprietary models that restrict academic exploration. Transparent, comprehensive access to LLM resources is essential for advancing the field, fostering reproducibility, and encouraging innovation in healthcare AI. We present Hippocrates, an open-source LLM framework specifically developed for the medical domain. In stark contrast to previous efforts, it offers unrestricted access to its training datasets, codebase, checkpoints, and evaluation protocols. This open approach is designed to stimulate collaborative research, allowing the community to build upon, refine, and rigorously evaluate medical LLMs within a transparent ecosystem. We also introduce Hippo, a family of 7B models tailored for the medical domain, fine-tuned from Mistral and LLaMA2 through continual pre-training, instruction tuning, and reinforcement learning from human and AI feedback. Our models outperform existing open medical LLMs by a large margin, even surpassing models with 70B parameters. Through Hippocrates, we aspire to unlock the full potential of LLMs not just to advance medical knowledge and patient care but also to democratize the benefits of AI research in healthcare, making them available across the globe.
https://arxiv.org/abs/2404.16621
This conceptual analysis examines the dynamics of data transmission in 5G networks. It addresses various aspects of sending data from cameras and LiDARs installed on a remote-controlled ferry to a land-based control center. The range of topics includes all stages of video and LiDAR data processing from acquisition and encoding to final decoding, all aspects of their transmission and reception via the WebRTC protocol, and all possible types of network problems such as handovers or congestion that could affect the quality of experience for end-users. A series of experiments were conducted to evaluate the key aspects of the data transmission. These include simulation-based reproducible runs and real-world experiments conducted using open-source solutions we developed: "Gymir5G" - an OMNeT++-based 5G simulation and "GstWebRTCApp" - a GStreamer-based application for adaptive control of media streams over the WebRTC protocol. One of the goals of this study is to formulate the bandwidth and latency requirements for reliable real-time communication and to estimate their approximate values. This goal was achieved through simulation-based experiments involving docking maneuvers in the Bay of Kiel, Germany. The final latency for the entire data processing pipeline was also estimated during the real tests. In addition, a series of simulation-based experiments showed the impact of key WebRTC features and demonstrated the effectiveness of the WebRTC protocol, while the conducted video codec comparison showed that the hardware-accelerated H.264 codec is the best. Finally, the research addresses the topic of adaptive communication, where the traditional congestion avoidance and deep reinforcement learning approaches were analyzed. The comparison in a sandbox scenario shows that the AI-based solution outperforms the WebRTC baseline GCC algorithm in terms of data rates, latency, and packet loss.
https://arxiv.org/abs/2404.16508
Model-free reinforcement learning methods lack an inherent mechanism to impose behavioural constraints on the trained policies. While certain extensions exist, they remain limited to specific types of constraints, such as value constraints with additional reward signals or visitation density constraints. In this work we try to unify these existing techniques and bridge the gap with classical optimization and control theory, using a generic primal-dual framework for value-based and actor-critic reinforcement learning methods. The obtained dual formulations turn out to be especially useful for imposing additional constraints on the learned policy, as an intrinsic relationship between such dual constraints (or regularization terms) and reward modifications in the primal is revealed. Furthermore, using this framework, we are able to introduce some novel types of constraints, allowing one to impose bounds on the policy's action density or on costs associated with transitions between consecutive states and actions. From the adjusted primal-dual optimization problems, a practical algorithm is derived that supports various combinations of policy constraints, which are automatically handled throughout training using trainable reward modifications. The resulting $\texttt{DualCRL}$ method is examined in more detail and evaluated under different (combinations of) constraints on two interpretable environments. The results highlight the efficacy of the method, which ultimately provides the designer of such systems with a versatile toolbox of possible policy constraints.
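As a toy illustration of the primal-dual pattern described here, the sketch below solves a two-action bandit with a cost constraint: the constraint enters the primal as a trainable reward modification r - lambda*c, while lambda follows projected dual ascent. All numbers and the REINFORCE update are illustrative assumptions, not the DualCRL implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([0.5, 1.0])     # action 1 pays more ...
costs = np.array([0.0, 1.0])       # ... but incurs a cost
cost_limit, lam = 0.3, 0.0         # constraint: E[cost] <= 0.3
theta = np.zeros(2)                # logits of a softmax policy

for _ in range(5000):
    probs = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, p=probs)
    r_mod = rewards[a] - lam * costs[a]       # reward modification (primal)
    grad_logp = -probs.copy()                 # d log pi(a) / d theta
    grad_logp[a] += 1.0
    theta += 0.05 * r_mod * grad_logp         # REINFORCE step on modified reward
    lam = max(0.0, lam + 0.01 * (probs @ costs - cost_limit))  # dual ascent

print(probs, lam)  # the policy mixes actions so that E[cost] ~ cost_limit
```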
https://arxiv.org/abs/2404.16468
Offline reinforcement learning (RL) algorithms are applied to learn performant, well-generalizing policies when provided with a static dataset of interactions. Many recent approaches to offline RL have seen substantial success, but with one key caveat: they demand considerable per-dataset hyperparameter tuning to achieve their reported performance, which requires policy rollouts in the environment to evaluate; this can rapidly become cumbersome. Furthermore, substantial tuning requirements can hamper the adoption of these algorithms in practical domains. In this paper, we present TD3 with Behavioral Supervisor Tuning (TD3-BST), an algorithm that trains an uncertainty model and uses it to guide the policy to select actions within the dataset support. TD3-BST can learn more effective policies from offline datasets compared to previous methods and achieves the best performance across challenging benchmarks without requiring per-dataset tuning.
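One plausible reading of the behavioral-supervisor idea, sketched under assumptions: an uncertainty score u(s, a) (for example, ensemble disagreement) weights a behavior-cloning pull against the usual Q-maximizing actor objective, keeping chosen actions within dataset support. The actor, critic, and uncertainty modules are hypothetical torch modules; this is not the paper's exact objective.

```python
import torch

def bst_actor_loss(actor, critic, uncertainty, states, dataset_actions):
    pi = actor(states)                                # candidate actions
    q = critic(states, pi).squeeze(-1)                # return estimate
    u = uncertainty(states, pi).squeeze(-1).detach()  # high = off-support
    bc = ((pi - dataset_actions) ** 2).sum(-1)        # behavior-cloning pull
    # Maximize Q, but let the uncertainty score act as a per-state
    # supervisor that pulls the policy back toward dataset actions.
    return (-q + u * bc).mean()
```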
https://arxiv.org/abs/2404.16399
This work introduces SwarmRL, a Python package designed to study intelligent active particles. SwarmRL provides an easy-to-use interface for developing models to control microscopic colloids using classical control and deep reinforcement learning approaches. These models may be deployed in simulations or real-world environments under a common framework. We explain the structure of the software and its key features and demonstrate how it can be used to accelerate research. With SwarmRL, we aim to streamline research into micro-robotic control while bridging the gap between experimental and simulation-driven sciences. SwarmRL is available open-source on GitHub at this https URL.
https://arxiv.org/abs/2404.16388
Foundation models contain a wealth of information from their vast number of training samples. However, most prior art fails to extract this information in a precise and efficient way for small sample sizes. In this work, we propose a framework that utilizes reinforcement learning as a controller for foundation models, allowing for the granular generation of small, focused synthetic support sets to augment the performance of neural network models on real data classification tasks. We first give a reinforcement learning agent access to a novel context-based dictionary; the agent then uses this dictionary with a novel prompt structure to form and optimize prompts as inputs to generative models, receiving feedback based on a reward function combining the change in validation accuracy and entropy. A support set is formed this way over several exploration steps. Our framework produced excellent results, increasing classification accuracy by significant margins at no additional labelling or data cost.
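The abstract specifies the reward as a combination of the change in validation accuracy and entropy; a minimal sketch of one such reward follows, with the weighting beta and the sign conventions as illustrative assumptions.

```python
import numpy as np

def support_set_reward(val_acc_new, val_acc_old, class_probs, beta=0.1):
    # Change in validation accuracy after adding the generated support set,
    # plus an entropy bonus over predicted class probabilities.
    entropy = -np.sum(class_probs * np.log(class_probs + 1e-12))
    return (val_acc_new - val_acc_old) + beta * entropy
```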
https://arxiv.org/abs/2404.16300
An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment, for any given source/receiver location. Traditional methods for constructing acoustic models involve expensive and time-consuming collection of large quantities of acoustic data at dense spatial locations in the space, or rely on privileged knowledge of scene geometry to intelligently select acoustic data sampling locations. We propose active acoustic sampling, a new task for efficiently building an environment acoustic model of an unmapped environment, in which a mobile agent equipped with visual and acoustic sensors jointly constructs the environment acoustic model and the occupancy map on the fly. We introduce ActiveRIR, a reinforcement learning (RL) policy that leverages information from audio-visual sensor streams to guide agent navigation and determine optimal acoustic data sampling positions, yielding a high-quality acoustic model of the environment from a minimal set of acoustic samples. We train our policy with a novel RL reward based on information gain in the environment acoustic model. Evaluated on diverse unseen indoor environments from a state-of-the-art acoustic simulation platform, ActiveRIR outperforms an array of methods, including traditional navigation agents based on spatial novelty and visual exploration as well as existing state-of-the-art methods.
https://arxiv.org/abs/2404.16216
Physics-based simulations have accelerated progress in robot learning for driving, manipulation, and locomotion. Yet, a fast, accurate, and robust surgical simulation environment remains a challenge. In this paper, we present ORBIT-Surgical, a physics-based surgical robot simulation framework with photorealistic rendering in NVIDIA Omniverse. We provide 14 benchmark surgical tasks for the da Vinci Research Kit (dVRK) and Smart Tissue Autonomous Robot (STAR) which represent common subtasks in surgical training. ORBIT-Surgical leverages GPU parallelization to train reinforcement learning and imitation learning algorithms to facilitate study of robot learning to augment human surgical skills. ORBIT-Surgical also facilitates realistic synthetic data generation for active perception tasks. We demonstrate ORBIT-Surgical sim-to-real transfer of learned policies onto a physical dVRK robot. Project website: this http URL
https://arxiv.org/abs/2404.16027
Reinforcement learning is a popular method of finding optimal solutions to complex problems. Algorithms like Q-learning excel at learning to solve stochastic problems without a model of their environment. However, they take longer to solve deterministic problems than is necessary. Q-learning can be improved to better solve deterministic problems by introducing a model-based approach. This paper introduces the recursive backwards Q-learning (RBQL) agent, which explores and builds a model of the environment. After reaching a terminal state, it recursively propagates its value backwards through this model. This lets each state be assigned its optimal value without a lengthy learning process. In the example of finding the shortest path through a maze, this agent greatly outperforms a regular Q-learning agent.
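A minimal sketch of the backwards value propagation for a deterministic environment, under illustrative assumptions (terminal value fixed at zero, predecessor lists as the learned model); the recursion is written iteratively with an explicit stack.

```python
import math
from collections import defaultdict

gamma = 0.99
predecessors = defaultdict(list)    # learned model: s' -> [(s, a, r), ...]
V = defaultdict(lambda: -math.inf)  # unvisited states start at -infinity

def record(s, a, r, s_next):
    predecessors[s_next].append((s, a, r))   # build the model while exploring

def propagate_back(terminal):
    V[terminal] = max(V[terminal], 0.0)      # terminal value convention
    stack = [terminal]                       # recursion written iteratively
    while stack:
        s_next = stack.pop()
        for s, a, r in predecessors[s_next]:
            candidate = r + gamma * V[s_next]
            if candidate > V[s]:             # strict improvement: keep going back
                V[s] = candidate
                stack.append(s)
```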
https://arxiv.org/abs/2404.15822
The study of variational quantum circuits (VQCs) has received significant attention from the quantum computing community in recent years. These hybrid quantum-classical algorithms, utilizing both classical and quantum components, are well-suited for noisy intermediate-scale quantum (NISQ) devices. Though estimating exact gradients using the parameter-shift rule to optimize VQCs is realizable on NISQ devices, it does not scale well for larger problem sizes. The computational complexity, in terms of the number of circuit evaluations required for gradient estimation by the parameter-shift rule, scales linearly with the number of parameters in VQCs. On the other hand, techniques that approximate the gradients of the VQCs, such as simultaneous perturbation stochastic approximation (SPSA), do not scale with the number of parameters but struggle with instability and often attain suboptimal solutions. In this work, we introduce a novel gradient estimation approach called Guided-SPSA, which meaningfully combines the parameter-shift rule and SPSA-based gradient approximation. Guided-SPSA results in a 15% to 25% reduction in the number of circuit evaluations required during training, for similar or better optimality of the solution found, compared to the parameter-shift rule. Guided-SPSA outperforms standard SPSA in all scenarios and outperforms the parameter-shift rule in scenarios such as suboptimal initialization of the parameters. We demonstrate numerically the performance of Guided-SPSA on different paradigms of quantum machine learning, such as regression, classification, and reinforcement learning.
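A sketch of one way to combine the two estimators named here: exact parameter-shift gradients for a random subset of parameters and a single SPSA sample for the remainder. The split ratio and schedule are assumptions; see the paper for the actual Guided-SPSA rule.

```python
import numpy as np

def guided_spsa_grad(f, theta, shift_frac=0.3, c=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n = theta.size
    grad = np.zeros(n)
    # Exact parameter-shift gradients for a random subset of parameters
    # (2 circuit evaluations each; valid for standard Pauli-rotation gates).
    idx = rng.choice(n, size=max(1, int(shift_frac * n)), replace=False)
    for i in idx:
        e = np.zeros(n)
        e[i] = np.pi / 2
        grad[i] = 0.5 * (f(theta + e) - f(theta - e))
    # One SPSA sample (2 evaluations total) approximates the remainder.
    rest = np.setdiff1d(np.arange(n), idx)
    delta = rng.choice([-1.0, 1.0], size=n)
    diff = f(theta + c * delta) - f(theta - c * delta)
    grad[rest] = diff / (2.0 * c * delta[rest])
    return grad
```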
https://arxiv.org/abs/2404.15751
In this work, we aim to learn a unified vision-based policy for a multi-fingered robot hand to manipulate different objects in diverse poses. Though prior work has demonstrated that human videos can benefit policy learning, performance improvement has been limited by physically implausible trajectories extracted from videos. Moreover, reliance on privileged object information such as ground-truth object states further limits the applicability in realistic scenarios. To address these limitations, we propose a new framework, ViViDex, to improve vision-based policy learning from human videos. It first uses reinforcement learning with trajectory-guided rewards to train state-based policies for each video, obtaining trajectories from the video that are both visually natural and physically plausible. We then roll out successful episodes from the state-based policies and train a unified visual policy without using any privileged information. A coordinate transformation method is proposed to significantly boost performance. We evaluate our method on three dexterous manipulation tasks and demonstrate a large improvement over state-of-the-art algorithms.
https://arxiv.org/abs/2404.15709
Cooperative Adaptive Cruise Control (CACC) represents a quintessential control strategy for orchestrating vehicular platoon movement within Connected and Automated Vehicle (CAV) systems, significantly enhancing traffic efficiency and reducing energy consumption. In recent years, data-driven methods such as reinforcement learning (RL) have been employed to address this task due to their significant advantages in terms of efficiency and flexibility. However, the delay issue, which often arises in real-world CACC systems, is rarely taken into account by current RL-based approaches. To tackle this problem, we propose a Delay-Aware Multi-Agent Reinforcement Learning (DAMARL) framework aimed at achieving safe and stable control for CACC. We model the entire decision-making process using a Multi-Agent Delay-Aware Markov Decision Process (MADA-MDP) and develop a centralized-training, decentralized-execution (CTDE) MARL framework for distributed control of CACC platoons. A policy network with an integrated attention mechanism is introduced to enhance the performance of CAV communication and decision-making. Additionally, a velocity-optimization-model-based action filter is incorporated to further ensure the stability of the platoon. Experimental results across various delay conditions and platoon sizes demonstrate that our approach consistently outperforms baseline methods in terms of platoon safety, stability, and overall performance.
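For background, a common way to make a delayed control problem Markov again, which delay-aware MDP formulations build on, is to augment the observation with the queue of actions still in flight. The wrapper below is an illustrative sketch (the env API with `action_dim` and a `(obs, reward, done)` step return is assumed), not the paper's implementation.

```python
from collections import deque
import numpy as np

class DelayedActionWrapper:
    """Augment observations with pending actions so the process stays Markov."""

    def __init__(self, env, delay: int):
        self.env, self.delay = env, delay
        self.pending = deque([np.zeros(env.action_dim)] * delay)

    def step(self, action):
        self.pending.append(action)
        delayed = self.pending.popleft()      # action chosen `delay` steps ago
        obs, reward, done = self.env.step(delayed)
        # Decision-relevant state = raw observation + actions still in flight.
        return np.concatenate([obs, *self.pending]), reward, done
```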
https://arxiv.org/abs/2404.15696
Understanding bidding behavior in multi-unit auctions remains an ongoing challenge for researchers. Despite their widespread use, theoretical insights into the bidding behavior, revenue ranking, and efficiency of commonly used multi-unit auctions are limited. This paper utilizes artificial intelligence, specifically reinforcement learning, as a model-free learning approach to simulate bidding in three prominent multi-unit auctions employed in practice. We introduce six algorithms that are suitable for learning and bidding in multi-unit auctions and compare them using an illustrative example. This paper underscores the significance of using artificial intelligence in auction design, particularly in enhancing the design of multi-unit auctions.
https://arxiv.org/abs/2404.15633
Reinforcement learning (RL) with continuous state and action spaces remains one of the most challenging problems within the field. Most current learning methods focus on integral identities such as value functions to derive an optimal strategy for the learning agent. In this paper, we instead study the dual form of the original RL formulation to propose the first differential RL framework that can handle settings with limited training samples and short-length episodes. Our approach introduces Differential Policy Optimization (DPO), a pointwise and stage-wise iteration method that optimizes policies encoded by local-movement operators. We prove a pointwise convergence estimate for DPO and provide a regret bound comparable with current theoretical works. Such a pointwise estimate ensures that the learned policy matches the optimal path uniformly across different steps. We then apply DPO to a class of practical RL problems which search for optimal configurations with Lagrangian rewards. DPO is easy to implement, scalable, and shows competitive results on benchmarking experiments against several popular RL methods.
https://arxiv.org/abs/2404.15617
Spiking neural networks (SNNs) are widely applied in various fields due to their energy efficiency and fast inference. Applying SNNs to reinforcement learning (RL) can significantly reduce the computational resource requirements for agents and improve the algorithm's performance under resource-constrained conditions. However, in current spiking reinforcement learning (SRL) algorithms, the simulation results of multiple time steps can only correspond to a single-step decision in RL. This is quite different from the real temporal dynamics in the brain and also fails to fully exploit the capacity of SNNs to process temporal data. In order to address this temporal mismatch issue and further take advantage of the inherent temporal dynamics of spiking neurons, we propose a novel temporal alignment paradigm (TAP) that leverages the single-step update of spiking neurons to accumulate historical state information in RL and introduces gated units to enhance the memory capacity of spiking neurons. Experimental results show that our method can solve partially observable Markov decision processes (POMDPs) and multi-agent cooperation problems with performance similar to that of recurrent neural networks (RNNs) but with about 50% of the power consumption.
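A sketch of the two ingredients being combined: a leaky integrate-and-fire (LIF) neuron whose single-step update accumulates history in its membrane potential, plus a learned gate modulating how much of that history is retained. The GRU-like gating form is an assumption for illustration, not the exact TAP unit.

```python
import torch

def gated_lif_step(x, v, w_in, w_gate, threshold=1.0, decay=0.9):
    gate = torch.sigmoid(x @ w_gate)    # learned retention gate
    v = gate * decay * v + x @ w_in     # single-step leaky integration
    spike = (v >= threshold).float()    # fire on crossing the threshold
    v = v * (1.0 - spike)               # hard reset where a spike occurred
    return spike, v
```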
https://arxiv.org/abs/2404.15597
The rapidly changing architecture and functionality of electrical networks and the increasing penetration of renewable and distributed energy resources have resulted in various technological and managerial challenges. These have rendered traditional centralized energy-market paradigms insufficient due to their inability to support the dynamic and evolving nature of the network. This survey explores how multi-agent reinforcement learning (MARL) can support the decentralization and decarbonization of energy networks and mitigate the 12 associated challenges. This is achieved by specifying key computational challenges in managing energy networks, reviewing recent research progress on addressing them, and highlighting open challenges that may be addressed using MARL.
https://arxiv.org/abs/2404.15583
In traditional statistical learning, data points are usually assumed to be independently and identically distributed (i.i.d.) according to an unknown probability distribution. This paper presents a contrasting viewpoint, perceiving data points as interconnected and employing a Markov reward process (MRP) for data modeling. We reformulate typical supervised learning as an on-policy policy evaluation problem within reinforcement learning (RL), introducing a generalized temporal difference (TD) learning algorithm as a resolution. Theoretically, our analysis draws connections between the solutions of linear TD learning and ordinary least squares (OLS). We also show that under specific conditions, particularly when noises are correlated, TD's solution proves to be a more effective estimator than OLS. Furthermore, we establish the convergence of our generalized TD algorithms under linear function approximation. Empirical studies verify our theoretical results, examine the vital design of our TD algorithm, and show practical utility across various datasets, encompassing tasks such as regression and image classification with deep learning.
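One way to instantiate the MRP view on a linear model, as a sketch: chain the examples (x_t, y_t) -> x_{t+1}, pick the reward so that V(x_t) = y_t satisfies the Bellman equation, and run TD(0); with gamma = 0 this reduces to LMS, whose fixed point is the OLS solution. The chaining order and gamma are illustrative assumptions.

```python
import numpy as np

def generalized_td(X, y, gamma=0.5, lr=0.01, epochs=100):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for t in range(n - 1):
            # Reward chosen so that V(x_t) = y_t solves the Bellman equation:
            # y_t = r_t + gamma * y_{t+1}.
            r = y[t] - gamma * y[t + 1]
            td_err = r + gamma * X[t + 1] @ w - X[t] @ w
            w += lr * td_err * X[t]          # linear TD(0) update
    return w                                 # gamma = 0 recovers LMS / OLS
```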
https://arxiv.org/abs/2404.15518