In complex environments with large discrete action spaces, effective decision-making is critical in reinforcement learning (RL). Despite the widespread use of value-based RL approaches like Q-learning, they come with a computational burden, necessitating the maximization of a value function over all actions in each iteration. This burden becomes particularly challenging when addressing large-scale problems and using deep neural networks as function approximators. In this paper, we present stochastic value-based RL approaches which, in each iteration, rather than optimizing over the entire set of $n$ actions, consider only a variable stochastic subset containing a sublinear number of actions, possibly as small as $\mathcal{O}(\log(n))$. The presented stochastic value-based RL methods include, among others, Stochastic Q-learning, StochDQN, and StochDDQN, all of which integrate this stochastic approach for both value-function updates and action selection. The theoretical convergence of Stochastic Q-learning is established, and an analysis of stochastic maximization is provided. Moreover, through empirical validation, we illustrate that the various proposed approaches outperform the baseline methods across diverse environments, including different control problems, achieving near-optimal average returns in significantly reduced time.
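A minimal sketch of the stochastic maximization idea described above, assuming a tabular setting: the bootstrap maximum is taken over a random subset of roughly $\mathcal{O}(\log(n))$ actions rather than over all $n$. The function names and subset-size rule are illustrative, not the paper's implementation.

```python
import math
import random

import numpy as np


def stoch_max(q_row: np.ndarray, rng: random.Random) -> int:
    """Argmax of Q(s, .) restricted to a random subset of ~log(n) actions."""
    n = len(q_row)
    k = max(1, math.ceil(math.log(n)))       # sublinear subset size
    subset = rng.sample(range(n), k)          # random candidate actions
    return max(subset, key=lambda a: q_row[a])


def stoch_q_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int,
                   rng: random.Random, alpha: float = 0.1,
                   gamma: float = 0.99) -> None:
    """One Q-learning step whose bootstrap max is over a sampled action subset."""
    a_star = stoch_max(Q[s_next], rng)
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_star] - Q[s, a])
```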
https://arxiv.org/abs/2405.10310
Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL). Specifically, our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM to efficiently explore intermediate reasoning steps that lead to the final text-based action. Next, the open-ended text output is parsed into an executable action to interact with the environment to obtain goal-directed task rewards. Finally, our framework uses these task rewards to fine-tune the entire VLM with RL. Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7b models to outperform commercial models such as GPT4-V or Gemini. Furthermore, we find that CoT reasoning is a crucial component for performance improvement, as removing the CoT reasoning results in a significant decrease in the overall performance of our method.
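A hedged sketch of the described interaction loop, with placeholder callables standing in for the VLM, the environment, and the prompting; none of these names come from the paper's code, and the collected (text, reward) pairs would subsequently feed the RL fine-tuning step.

```python
def generate_cot_and_action(vlm, task_description: str, observation: str) -> str:
    prompt = (f"Task: {task_description}\nObservation: {observation}\n"
              "Think step by step, then finish with a line 'Action: <action>'.")
    return vlm(prompt)                        # open-ended text: CoT + final action


def parse_action(text: str) -> str:
    # Keep only the final "Action: ..." line; fall back to a no-op.
    for line in reversed(text.splitlines()):
        if line.lower().startswith("action:"):
            return line.split(":", 1)[1].strip()
    return "noop"


def collect_episode(vlm, env, task_description: str):
    """Roll out one episode; the (text, reward) pairs feed the RL fine-tuning step."""
    trajectory, obs, done = [], env.reset(), False
    while not done:
        text = generate_cot_and_action(vlm, task_description, obs)
        obs, reward, done = env.step(parse_action(text))
        trajectory.append((text, reward))
    return trajectory
```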
https://arxiv.org/abs/2405.10292
Authorship obfuscation techniques hold the promise of helping people protect their privacy in online communications by automatically rewriting text to hide the identity of the original author. However, obfuscation has been evaluated in narrow settings in the NLP literature and has primarily been addressed with superficial edit operations that can lead to unnatural outputs. In this work, we introduce an automatic text privatization framework that fine-tunes a large language model via reinforcement learning to produce rewrites that balance soundness, sense, and privacy. We evaluate it extensively on a large-scale test set of short-to-medium-length English Reddit posts by 68k authors. We study how performance changes across evaluation conditions, including authorial profile length and authorship detection strategy. Our method maintains high text quality according to both automated metrics and human evaluation, and successfully evades several automated authorship attacks.
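A small illustrative sketch of how a scalar reward balancing soundness, sense, and privacy might be composed for the RL fine-tuning step; the weights and scoring callables are assumptions, not the paper's implementation.

```python
def privatization_reward(original: str, rewrite: str,
                         soundness_score, sense_score, privacy_score,
                         w_sound: float = 1.0, w_sense: float = 1.0,
                         w_priv: float = 1.0) -> float:
    """Scalar reward for one RL step that rewrites `original` into `rewrite`."""
    return (w_sound * soundness_score(rewrite)            # fluent, well-formed text
            + w_sense * sense_score(original, rewrite)    # meaning preserved
            + w_priv * privacy_score(original, rewrite))  # authorship attribution confused
```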
https://arxiv.org/abs/2405.10260
Using Unmanned Aerial Vehicles (UAVs) in search and rescue (SAR) operations to navigate challenging terrain while maintaining reliable communication with the cellular network is a promising approach. This paper suggests a novel technique employing a multi-Q-learning reinforcement learning algorithm to optimize UAV connectivity in such scenarios. We introduce a Strategic Planning Agent for efficient path planning and collision awareness and a Real-time Adaptive Agent to maintain an optimal connection with the cellular base station. The agents are trained in a simulated environment using multi-Q-learning, encouraging them to learn from experience and adjust their decision-making to diverse terrain complexities and communication scenarios. Evaluation results reveal the significance of the approach, highlighting successful navigation in environments with varying obstacle densities and the ability to maintain optimal connectivity using different frequency bands. This work paves the way for enhanced UAV autonomy and improved communication reliability in search and rescue operations.
https://arxiv.org/abs/2405.10042
We show that discounted methods for solving continuing reinforcement learning problems can perform significantly better if they center their rewards by subtracting out the rewards' empirical average. The improvement is substantial at commonly used discount factors and increases further as the discount factor approaches one. In addition, we show that if a problem's rewards are shifted by a constant, then standard methods perform much worse, whereas methods with reward centering are unaffected. Estimating the average reward is straightforward in the on-policy setting; we propose a slightly more sophisticated method for the off-policy setting. Reward centering is a general idea, so we expect almost every reinforcement-learning algorithm to benefit by the addition of reward centering.
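A minimal sketch of reward centering in a tabular Q-learning update, assuming a simple running average of observed rewards in the on-policy setting; variable names and step sizes are illustrative, not the paper's code.

```python
import numpy as np


def centered_q_update(Q: np.ndarray, r_bar: float, s: int, a: int, r: float,
                      s_next: int, alpha: float = 0.1, beta: float = 0.01,
                      gamma: float = 0.9) -> float:
    """One update with the empirical average reward subtracted out; returns
    the updated running average so the caller can carry it forward."""
    r_bar += beta * (r - r_bar)                      # running mean of rewards
    target = (r - r_bar) + gamma * Q[s_next].max()   # centered TD target
    Q[s, a] += alpha * (target - Q[s, a])
    return r_bar
```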
https://arxiv.org/abs/2405.09999
This paper proposes a method to combine reinforcement learning (RL) and imitation learning (IL) using a dynamic, performance-based modulation over learning signals. The proposed method combines RL and behavioral cloning (IL), or corrective feedback in the action space (interactive IL/IIL), by dynamically weighting the losses to be optimized, taking into account the backpropagated gradients used to update the policy and the agent's estimated performance. In this manner, RL and IL/IIL losses are combined by equalizing their impact on the policy's updates, while modulating said impact such that IL signals are prioritized at the beginning of the learning process, and as the agent's performance improves, the RL signals become progressively more relevant, allowing for a smooth transition from pure IL/IIL to pure RL. The proposed method is used to learn local planning policies for mobile robots, synthesizing IL/IIL signals online by means of a scripted policy. An extensive evaluation of the application of the proposed method to this task is performed in simulations, and it is empirically shown that it outperforms pure RL in terms of sample efficiency (achieving the same level of performance in the training environment using approximately 4 times fewer experiences), while consistently producing local planning policies with better performance metrics (achieving an average success rate of 0.959 in an evaluation environment, outperforming pure RL by 12.5% and pure IL by 13.9%). Furthermore, the obtained local planning policies are successfully deployed in the real world without any major fine-tuning. The proposed method can extend existing RL algorithms and is applicable to other problems for which generating IL/IIL signals online is feasible. A video summarizing some of the real-world experiments that were conducted can be found in this https URL.
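A hedged sketch of the modulation idea: equalize the influence of the two losses via their backpropagated gradient magnitudes, then blend them with a weight driven by the agent's estimated performance, so IL dominates early and RL takes over as performance improves. The gradient normalization and the linear performance schedule are assumptions for illustration, not the paper's exact scheme.

```python
import torch


def combined_loss(rl_loss: torch.Tensor, il_loss: torch.Tensor,
                  policy_params, performance: float) -> torch.Tensor:
    """performance in [0, 1]: 0 = untrained agent, 1 = agent at target performance."""
    g_rl = torch.autograd.grad(rl_loss, policy_params, retain_graph=True)
    g_il = torch.autograd.grad(il_loss, policy_params, retain_graph=True)
    rl_scale = torch.sqrt(sum((g ** 2).sum() for g in g_rl)) + 1e-8
    il_scale = torch.sqrt(sum((g ** 2).sum() for g in g_il)) + 1e-8
    w_rl = min(max(performance, 0.0), 1.0)            # RL weight grows with skill
    return w_rl * rl_loss / rl_scale + (1.0 - w_rl) * il_loss / il_scale
```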
https://arxiv.org/abs/2405.09760
The software industry is experiencing a surge in the adoption of Continuous Integration (CI) practices, in both commercial and open-source environments. CI practices facilitate the seamless integration of code changes by employing automated building and testing processes. Frameworks such as Travis CI and GitHub Actions have significantly contributed to simplifying and enhancing the CI process, rendering it more accessible and efficient for development teams. Despite the availability of these CI tools, developers continue to encounter difficulties in accurately flagging commits as either suitable for CI execution or as candidates for skipping, especially for large projects with many dependencies. Inaccurate flagging of commits can lead to resource-intensive test and build processes, as even minor commits may inadvertently trigger the Continuous Integration process. The problem of detecting CI-skip commits can be modeled as a binary classification task in which we decide either to build a commit or to skip it. This study proposes a novel solution that leverages Deep Reinforcement Learning techniques to construct an optimal Decision Tree classifier that addresses the imbalanced nature of the data. We evaluate our solution by running within-project and cross-project validation benchmarks on a diverse range of open-source projects hosted on GitHub, showing superior results compared with existing state-of-the-art methods.
https://arxiv.org/abs/2405.09657
To safely navigate intricate real-world scenarios, autonomous vehicles must be able to adapt to diverse road conditions and anticipate future events. World model (WM) based reinforcement learning (RL) has emerged as a promising approach by learning and predicting the complex dynamics of various environments. Nevertheless, to the best of our knowledge, there does not exist an accessible platform for training and testing such algorithms in sophisticated driving environments. To fill this void, we introduce CarDreamer, the first open-source learning platform designed specifically for developing WM based autonomous driving algorithms. It comprises three key components: 1) World model backbone: CarDreamer has integrated some state-of-the-art WMs, which simplifies the reproduction of RL algorithms. The backbone is decoupled from the rest and communicates using the standard Gym interface, so that users can easily integrate and test their own algorithms. 2) Built-in tasks: CarDreamer offers a comprehensive set of highly configurable driving tasks which are compatible with Gym interfaces and are equipped with empirically optimized reward functions. 3) Task development suite: This suite streamlines the creation of driving tasks, enabling easy definition of traffic flows and vehicle routes, along with automatic collection of multi-modal observation data. A visualization server allows users to trace real-time agent driving videos and performance metrics through a browser. Furthermore, we conduct extensive experiments using built-in tasks to evaluate the performance and potential of WMs in autonomous driving. Thanks to the richness and flexibility of CarDreamer, we also systematically study the impact of observation modality, observability, and sharing of vehicle intentions on AV safety and efficiency. All code and documents are accessible on this https URL.
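Since the abstract states that the built-in tasks are exposed through the standard Gym interface, a rollout would look roughly like the sketch below; the task identifier is a placeholder, not CarDreamer's actual task name or module path.

```python
import gym


def random_rollout(task_id: str = "SomeDrivingTask-v0", steps: int = 100) -> float:
    env = gym.make(task_id)                  # built-in task behind a Gym interface
    obs = env.reset()
    total_reward = 0.0
    for _ in range(steps):
        action = env.action_space.sample()   # stand-in for a WM-based agent
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            obs = env.reset()
    return total_reward
```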
https://arxiv.org/abs/2405.09111
Autonomous tuning of particle accelerators is an active and challenging field of research with the goal of enabling novel accelerator technologies for cutting-edge, high-impact applications, such as physics discovery, cancer research and material sciences. A key challenge with autonomous accelerator tuning remains that the most capable algorithms require an expert in optimisation, machine learning or a similar field to implement the algorithm for every new tuning task. In this work, we propose the use of large language models (LLMs) to tune particle accelerators. We demonstrate on a proof-of-principle example the ability of LLMs to successfully and autonomously tune a particle accelerator subsystem based on nothing more than a natural language prompt from the operator, and compare the performance of our LLM-based solution to state-of-the-art optimisation algorithms, such as Bayesian optimisation (BO) and reinforcement learning-trained optimisation (RLO). In doing so, we also show how LLMs can perform numerical optimisation of a highly non-linear real-world objective function. Ultimately, this work represents yet another complex task that LLMs are capable of solving and promises to help accelerate the deployment of autonomous tuning algorithms into the day-to-day operations of particle accelerators.
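An illustrative sketch of a prompt-driven tuning loop of the kind described: the LLM is shown the history of settings and measured objective values and asked for the next candidate. The prompt wording, the JSON protocol, and the `llm` and `measure_objective` callables are all assumptions, not the paper's implementation.

```python
import json


def llm_tuning_loop(llm, measure_objective, initial_settings: dict,
                    iterations: int = 20) -> dict:
    """Ask the LLM for the next settings to try, given the tuning history."""
    history, best = [], (float("-inf"), initial_settings)
    settings = initial_settings
    for _ in range(iterations):
        value = measure_objective(settings)            # e.g. measured beam quality
        history.append({"settings": settings, "objective": value})
        if value > best[0]:
            best = (value, settings)
        prompt = ("You are tuning an accelerator subsystem to maximise the objective.\n"
                  f"History: {json.dumps(history)}\n"
                  "Reply with the next settings to try as a JSON object only.")
        settings = json.loads(llm(prompt))             # next candidate to evaluate
    return best[1]
```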
https://arxiv.org/abs/2405.08888
This paper addresses the critical need for refining robot motions that, despite achieving high visual similarity through human-to-humanoid retargeting methods, fall short of practical execution in the physical realm. Existing techniques in the graphics community often prioritize visual fidelity over physics-based feasibility, posing a significant challenge for deploying bipedal systems in practical applications. Our research introduces a constrained reinforcement learning algorithm that produces physics-based, high-quality motion imitation on legged humanoid robots, enhancing motion resemblance while successfully following the reference human trajectory. We name our framework I-CTRL. By reformulating the motion imitation problem as a constrained refinement over non-physics-based retargeted motions, our framework excels in motion imitation with simple and unique rewards that generalize across four robots. Moreover, our framework can follow large-scale motion datasets with a unique RL agent. The proposed approach signifies a crucial step forward in advancing the control of bipedal robots, emphasizing the importance of aligning visual and physical realism for successful motion imitation.
https://arxiv.org/abs/2405.08726
This study investigates the computational speed and accuracy of two numerical integration methods, cubature and sampling-based, for integrating an integrand over a 2D polygon. Using a group of rovers searching the Martian surface with a limited sensor footprint as a test bed, the relative error and computational time are compared as the area is subdivided to improve accuracy in the sampling-based approach. The results show that the sampling-based approach exhibits a $14.75\%$ deviation in relative error compared to cubature when matched to the same computational performance. Furthermore, achieving a relative error below $1\%$ necessitates a $10000\%$ increase in computation time due to the $\mathcal{O}(N^2)$ complexity of the sampling-based method. It is concluded that for enhancing reinforcement learning capabilities and other high-iteration algorithms, the cubature method is preferred over the sampling-based method.
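A minimal sketch of the sampling-based alternative being compared: a Monte Carlo estimate of the integral over a 2D polygon via rejection sampling in its bounding box. This is a generic estimator for illustration under stated assumptions, not the paper's exact implementation.

```python
import numpy as np
from matplotlib.path import Path


def mc_polygon_integral(f, vertices: np.ndarray, n_samples: int = 10_000,
                        seed: int = 0) -> float:
    """Monte Carlo estimate of the integral of f(x, y) over a polygon."""
    rng = np.random.default_rng(seed)
    poly = Path(vertices)
    lo, hi = vertices.min(axis=0), vertices.max(axis=0)
    pts = rng.uniform(lo, hi, size=(n_samples, 2))     # uniform in the bounding box
    inside = poly.contains_points(pts)                 # reject points outside
    if not inside.any():
        return 0.0
    box_area = float(np.prod(hi - lo))
    mean_f = float(np.mean([f(x, y) for x, y in pts[inside]]))
    return mean_f * box_area * inside.mean()           # average of f times polygon area
```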
https://arxiv.org/abs/2405.08691
Autonomous intersection management (AIM) poses significant challenges due to the intricate nature of real-world traffic scenarios and the need for a highly expensive centralised server in charge of simultaneously controlling all the vehicles. This study addresses such issues by proposing a novel distributed approach to AIM utilizing multi-agent reinforcement learning (MARL). We show that by leveraging the 3D surround view technology for advanced assistance systems, autonomous vehicles can accurately navigate intersection scenarios without needing any centralised controller. The contributions of this paper thus include a MARL-based algorithm for the autonomous management of a 4-way intersection and also the introduction of a new strategy called prioritised scenario replay for improved training efficacy. We validate our approach as an innovative alternative to conventional centralised AIM techniques, ensuring the full reproducibility of our results. Specifically, experiments conducted in virtual environments using the SMARTS platform highlight its superiority over benchmarks across various metrics.
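A hedged sketch of what prioritised scenario replay could look like: training scenarios are re-sampled with probability proportional to a priority, here driven by the recent failure rate. The priority definition and update rule are assumptions for illustration, not the paper's specification.

```python
import numpy as np


class PrioritisedScenarioReplay:
    def __init__(self, n_scenarios: int, eps: float = 0.05, seed: int = 0):
        self.priorities = np.ones(n_scenarios)   # start uniform
        self.eps = eps                            # keeps every scenario reachable
        self.rng = np.random.default_rng(seed)

    def sample(self) -> int:
        p = self.priorities + self.eps
        return int(self.rng.choice(len(p), p=p / p.sum()))

    def update(self, scenario: int, success: bool, lr: float = 0.1) -> None:
        # Scenarios the agents still fail get replayed more often.
        target = 0.0 if success else 1.0
        self.priorities[scenario] += lr * (target - self.priorities[scenario])
```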
https://arxiv.org/abs/2405.08655
In biological evolution, complex neural structures grow from a handful of cellular ingredients. As genomes in nature are bounded in size, this complexity is achieved by a growth process in which cells communicate locally to decide whether to differentiate, proliferate and connect with other cells. This self-organisation is hypothesized to play an important part in the generalisation and robustness of biological neural networks. Artificial neural networks (ANNs), on the other hand, are traditionally optimized in the space of weights. Thus, the benefits and challenges of growing artificial neural networks remain understudied. Building on the previously introduced Neural Developmental Programs (NDP), in this work we present an algorithm for growing ANNs that solve reinforcement learning tasks. We identify a key challenge: ensuring phenotypic complexity requires maintaining neuronal diversity, but this diversity comes at the cost of optimization stability. To address this, we introduce two mechanisms: (a) equipping neurons with an intrinsic state inherited upon neurogenesis; (b) lateral inhibition, a mechanism inspired by biological growth, which controls the pace of growth, helping diversity persist. We show that both mechanisms contribute to neuronal diversity and that, equipped with them, NDPs achieve comparable results to existing direct and developmental encodings in complex locomotion tasks.
https://arxiv.org/abs/2405.08510
Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, the rising popularity of offline alignment algorithms challenges the need for on-policy sampling in RLHF. Within the context of reward over-optimization, we start with an initial set of experiments that demonstrate the clear advantage of online methods over offline methods. This prompts us to investigate the causes of the performance discrepancy through a series of carefully designed experimental ablations. We show empirically that hypotheses such as offline data coverage and data quality cannot, by themselves, convincingly explain the performance difference. We also find that while offline algorithms train the policy to become good at pairwise classification, the resulting policies are worse at generation; meanwhile, policies trained by online algorithms are good at generation but worse at pairwise classification. This hints at a unique interplay between discriminative and generative capabilities, which is greatly impacted by the sampling process. Lastly, we observe that the performance discrepancy persists for both contrastive and non-contrastive loss functions, and appears not to be addressed by simply scaling up policy networks. Taken together, our study sheds light on the pivotal role of on-policy sampling in AI alignment, and hints at certain fundamental challenges of offline alignment algorithms.
https://arxiv.org/abs/2405.08448
In the training process of Deep Reinforcement Learning (DRL), agents require repetitive interactions with the environment. As training volume and model complexity increase, enhancing the data utilization and explainability of DRL training remains a challenging problem. This paper addresses these challenges by focusing on the temporal correlations within the time dimension of time series. We propose a novel approach to segment multivariate time series into meaningful subsequences and represent the time series based on these subsequences. Furthermore, the subsequences are employed for causal inference to identify fundamental causal factors that significantly impact training outcomes. We design a module to provide feedback on causality during DRL training. Several experiments demonstrate the feasibility of our approach in common environments, confirming its ability to enhance the effectiveness of DRL training and impart a certain level of explainability to the training process. Additionally, we extended our approach with the prioritized experience replay algorithm, and experimental results demonstrate its continued effectiveness.
https://arxiv.org/abs/2405.08380
Safe maneuvering capability is critical for mobile robots in complex environments. However, robotic system dynamics are often time-varying, uncertain, or even unknown during the motion planning and control process. Therefore, many existing model-based reinforcement learning (RL) methods could not achieve satisfactory reliability in guaranteeing safety. To address this challenge, we propose a two-level Vector Field-guided Learning Predictive Control (VF-LPC) approach that guarantees safe maneuverability. The first level, the guiding level, generates safe desired trajectories using the designed kinodynamic guiding vector field, enabling safe motion in obstacle-dense environments. The second level, the Integrated Motion Planning and Control (IMPC) level, first uses the deep Koopman operator to learn a nominal dynamics model offline and then updates the model uncertainties online using sparse Gaussian processes (GPs). The learned dynamics and a game-based safe barrier function are then incorporated into the learning predictive control framework to generate near-optimal control sequences. We conducted tests to compare the performance of VF-LPC with existing advanced planning methods in an obstacle-dense environment. The simulation results show that it can generate feasible trajectories quickly. Then, VF-LPC is evaluated against motion planning methods that employ model predictive control (MPC) and RL in the high-fidelity CarSim software. The results show that VF-LPC outperforms them under metrics of completion time, route length, and average solution time. We also carried out path-tracking control tests on a racing road to validate the model-uncertainty learning capability. Finally, we conducted real-world experiments on a Hongqi E-HS3 vehicle, further validating the effectiveness of the VF-LPC approach.
https://arxiv.org/abs/2405.08283
Although Federated Learning (FL) is promising for knowledge sharing among heterogeneous Artificial Intelligence of Things (AIoT) devices, their training performance and energy efficiency are severely restricted in practical battery-driven scenarios due to the ``wooden barrel effect'' caused by the mismatch between homogeneous model paradigms and heterogeneous device capabilities. As a result, due to the various differences among devices, it is hard for existing FL methods to conduct training effectively in energy-constrained scenarios, such as under device battery constraints. To tackle these issues, we propose an energy-aware FL framework named DR-FL, which considers the energy constraints of both clients and heterogeneous deep learning models to enable energy-efficient FL. Unlike vanilla FL, DR-FL adopts our proposed Multi-Agent Reinforcement Learning (MARL)-based dual-selection method, which allows participating devices to contribute to the global model effectively and adaptively based on their computing capabilities and energy capacities. Experiments on various well-known datasets show that DR-FL can not only maximise knowledge sharing among heterogeneous models under the energy constraints of large-scale AIoT systems but also improve the model performance of each involved heterogeneous device.
https://arxiv.org/abs/2405.08183
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed recipe that is easy to reproduce for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM, SFR-Iterative-DPO-LLaMA-3-8B-R, achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. Please refer to this https URL and this https URL for more detailed information.
https://arxiv.org/abs/2405.07863
General Value Functions (GVFs) (Sutton et al., 2011) are an established way to represent predictive knowledge in reinforcement learning. Each GVF computes the expected return for a given policy, based on a unique pseudo-reward. Multiple GVFs can be estimated in parallel using off-policy learning from a single stream of data, often sourced from a fixed behavior policy or pre-collected dataset. This leaves an open question: how can the behavior policy be chosen for data-efficient GVF learning? To address this gap, we propose GVFExplorer, which aims at learning a behavior policy that efficiently gathers data for evaluating multiple GVFs in parallel. This behavior policy selects actions in proportion to the total variance in the return across all GVFs, reducing the number of environmental interactions. To enable accurate variance estimation, we use a recently proposed temporal-difference-style variance estimator. We prove that each behavior policy update reduces the mean squared error in the summed predictions over all GVFs. We empirically demonstrate our method's performance in both tabular representations and nonlinear function approximation.
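An illustrative sketch of the behaviour-policy idea, assuming tabular variance estimates: actions are sampled in proportion to the total estimated return variance summed across all GVFs. The array shapes and the smoothing constant are assumptions, not the paper's implementation.

```python
import numpy as np


def behavior_probs(variances: np.ndarray, state: int,
                   eps: float = 1e-6) -> np.ndarray:
    """variances: shape (n_gvfs, n_states, n_actions) of return-variance estimates."""
    total = variances[:, state, :].sum(axis=0) + eps   # total variance per action
    return total / total.sum()


def select_action(variances: np.ndarray, state: int,
                  rng=np.random.default_rng(0)) -> int:
    return int(rng.choice(variances.shape[2], p=behavior_probs(variances, state)))
```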
https://arxiv.org/abs/2405.07838
Over the last few years, 360$\degree$ video traffic on the network has grown significantly. A key challenge of 360$\degree$ video playback is ensuring a high quality of experience (QoE) with limited network bandwidth. Currently, most studies focus on tile-based adaptive bitrate (ABR) streaming based on single-viewport prediction to reduce bandwidth consumption. However, the performance of single-viewpoint prediction models is severely limited by the inherent uncertainty in head movement, so they cannot cope well with sudden user movements. This paper first presents a multimodal spatial-temporal attention transformer that generates multiple viewpoint trajectories with their probabilities given a historical trajectory. The proposed method models viewpoint prediction as a classification problem and uses attention mechanisms to capture the spatial and temporal characteristics of input video frames and viewpoint trajectories for multi-viewpoint prediction. After that, a multi-agent deep reinforcement learning (MADRL)-based ABR algorithm utilizing multi-viewpoint prediction for 360$\degree$ video streaming is proposed for maximizing different QoE objectives under various network conditions. We formulate the ABR problem as a decentralized partially observable Markov decision process (Dec-POMDP) and present a MAPPO algorithm based on the centralized training and decentralized execution (CTDE) framework to solve it. The experimental results show that our proposed method improves the defined QoE metric by up to 85.5\% compared to existing ABR methods.
https://arxiv.org/abs/2405.07759