Robots can influence people to accomplish their tasks more efficiently: autonomous cars can inch forward at an intersection to pass through, and tabletop manipulators can go for an object on the table first. However, a robot's ability to influence can also compromise the safety of nearby people if naively executed. In this work, we pose and solve a novel robust reach-avoid dynamic game which enables robots to be maximally influential, but only when a safety backup control exists. On the human side, we model the human's behavior as goal-driven but conditioned on the robot's plan, enabling us to capture influence. On the robot side, we solve the dynamic game in the joint physical and belief space, enabling the robot to reason about how its uncertainty in human behavior will evolve over time. We instantiate our method, called SLIDE (Safely Leveraging Influence in Dynamic Environments), in a high-dimensional (39-D) simulated human-robot collaborative manipulation task solved via offline game-theoretic reinforcement learning. We compare our approach to a robust baseline that treats the human as a worst-case adversary, a safety controller that does not explicitly reason about influence, and an energy-function-based safety shield. We find that SLIDE consistently enables the robot to leverage the influence it has on the human when it is safe to do so, ultimately allowing the robot to be less conservative while still ensuring a high safety rate during task execution.
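For context, one common discrete-time fixed-point characterization of a robust reach-avoid value function is shown below; the exact game SLIDE solves augments the physical state with the robot's belief over human behavior, and its formulation may differ in detail. Here ℓ(x) > 0 exactly when x lies in the target (task-completion) set, g(x) > 0 exactly when x satisfies the safety constraint, u is the robot control, and d is the adversarial (human) input:

```latex
V(x) \;=\; \min\Big\{\, g(x),\; \max\big\{\, \ell(x),\; \max_{u \in \mathcal{U}} \min_{d \in \mathcal{D}} V\big(f(x,u,d)\big) \big\} \Big\}
```

Roughly, a state with V(x) > 0 admits a robot strategy that reaches the target without ever violating the safety constraint under worst-case human behavior, which is the kind of certificate under which an influential (rather than conservative) action can be selected safely.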
https://arxiv.org/abs/2409.12153
Temporal difference (TD) learning with linear function approximation, abbreviated as linear TD, is a classic and powerful prediction algorithm in reinforcement learning. While it is well understood that linear TD converges almost surely to a unique point, this convergence traditionally requires the assumption that the features used by the approximator are linearly independent. However, this linear independence assumption does not hold in many practical scenarios. This work is the first to establish the almost sure convergence of linear TD without requiring linearly independent features. In fact, we do not make any assumptions on the features. We prove that the approximated value function converges to a unique point and the weight iterates converge to a set. We also establish a notion of local stability of the weight iterates. Importantly, we do not need to introduce any other additional assumptions and do not need to make any modification to the linear TD algorithm. Key to our analysis is a novel characterization of bounded invariant sets of the mean ODE of linear TD.
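For reference, the algorithm under analysis is the standard linear TD(0) update and its associated mean ODE (generic notation, not necessarily the paper's):

```latex
w_{t+1} = w_t + \alpha_t\, \phi(S_t)\Big(R_{t+1} + \gamma\, w_t^{\top}\phi(S_{t+1}) - w_t^{\top}\phi(S_t)\Big),
\qquad
\dot{w} = b - A w,
```

with $A = \Phi^{\top} D_\pi (I - \gamma P_\pi)\Phi$ and $b = \Phi^{\top} D_\pi r_\pi$, where $\Phi$ stacks the feature vectors, $D_\pi$ is the diagonal matrix of the stationary state distribution, $P_\pi$ the transition matrix, and $r_\pi$ the expected rewards. With linearly dependent features the quadratic form of $A$ is only positive semi-definite and $A$ is singular, so the ODE admits a set of equilibria rather than a unique one; this is why the weight iterates can only be expected to converge to a set, even though the approximated value function $\Phi w$ still converges to a unique point.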
https://arxiv.org/abs/2409.12135
In this report, we present a series of math-specific large language models: Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. The core innovation of the Qwen2.5 series lies in integrating the philosophy of self-improvement throughout the entire pipeline, from pre-training and post-training to inference: (1) During the pre-training phase, Qwen2-Math-Instruct is utilized to generate large-scale, high-quality mathematical data. (2) In the post-training phase, we develop a reward model (RM) by conducting massive sampling from Qwen2-Math-Instruct. This RM is then applied to the iterative evolution of data in supervised fine-tuning (SFT). With a stronger SFT model, it is possible to iteratively train and update the RM, which in turn guides the next round of SFT data iteration. On the final SFT model, we employ the resulting RM for reinforcement learning, yielding Qwen2.5-Math-Instruct. (3) Furthermore, during the inference stage, the RM is used to guide sampling, optimizing the model's performance. Qwen2.5-Math-Instruct supports both Chinese and English, and possesses advanced mathematical reasoning capabilities, including Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR). We evaluate our models on 10 mathematics datasets in both English and Chinese, such as GSM8K, MATH, GaoKao, AMC23, and AIME24, covering a range of difficulties from grade-school level to math competition problems.
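As a concrete illustration of step (3), RM-guided sampling at inference time is often realized as best-of-n selection; the sketch below is a generic version with hypothetical `generate` and `reward_model_score` callables, not the actual Qwen2.5-Math API:

```python
# Minimal sketch of reward-model-guided (best-of-n) sampling at inference time.
# `generate` and `reward_model_score` are hypothetical stand-ins for the actual
# generation and RM scoring interfaces, which the report does not specify here.
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward_model_score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate solutions and return the one the RM scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward_model_score(prompt, c) for c in candidates]
    return candidates[max(range(n), key=lambda i: scores[i])]
```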
https://arxiv.org/abs/2409.12122
Robotic assistive feeding holds significant promise for improving the quality of life for individuals with eating disabilities. However, acquiring diverse food items under varying conditions and generalizing to unseen food present unique challenges. Existing methods that rely on surface-level geometric information (e.g., bounding box and pose) derived from visual cues (e.g., color, shape, and texture) often lack adaptability and robustness, especially when foods share similar physical properties but differ in visual appearance. We employ imitation learning (IL) to learn a food-acquisition policy. Existing IL and Reinforcement Learning (RL) methods typically learn such policies on top of off-the-shelf image encoders such as ResNet-50; however, those representations are not robust and struggle to generalize across diverse acquisition scenarios. To address these limitations, we propose a novel approach, IMRL (Integrated Multi-Dimensional Representation Learning), which integrates visual, physical, temporal, and geometric representations to enhance the robustness and generalizability of IL for food acquisition. Our approach captures food types and physical properties (e.g., solid, semi-solid, granular, liquid, and mixture), models temporal dynamics of acquisition actions, and introduces geometric information to determine optimal scooping points and assess bowl fullness. IMRL enables IL to adaptively adjust scooping strategies based on context, improving the robot's capability to handle diverse food acquisition scenarios. Experiments on a real robot demonstrate our approach's robustness and adaptability across various foods and bowl configurations, including zero-shot generalization to unseen settings. Our approach achieves an improvement of up to $35\%$ in success rate compared with the best-performing baseline.
https://arxiv.org/abs/2409.12092
Safety is one of the key issues preventing the deployment of reinforcement learning techniques in real-world robots. While most approaches in the Safe Reinforcement Learning area do not require prior knowledge of constraints and robot kinematics and rely solely on data, it is often difficult to deploy them in complex real-world settings. Instead, model-based approaches that incorporate prior knowledge of the constraints and dynamics into the learning framework have proven capable of deploying the learning algorithm directly on the real robot. Unfortunately, while an approximated model of the robot dynamics is often available, the safety constraints are task-specific and hard to obtain: they may be too complicated to encode analytically, too expensive to compute, or it may be difficult to envision a priori the long-term safety requirements. In this paper, we bridge this gap by extending the safe exploration method, ATACOM, with learnable constraints, with a particular focus on ensuring long-term safety and handling of uncertainty. Our approach is competitive or superior to state-of-the-art methods in final performance while maintaining safer behavior during training.
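ATACOM keeps exploration on the tangent space of a constraint manifold; a rough numpy illustration of that idea, with the constraint function and its Jacobian supplied as callables (which, as in this paper, could be learned models), is given below. This is a schematic of the mechanism, not the authors' implementation:

```python
import numpy as np
from typing import Callable

def safe_action(state: np.ndarray,
                raw_action: np.ndarray,
                constraint: Callable[[np.ndarray], np.ndarray],
                jacobian: Callable[[np.ndarray], np.ndarray],
                gain: float = 1.0) -> np.ndarray:
    """Project an exploratory action onto the tangent space of c(state) = 0 and add
    an error-correction term, in the spirit of ATACOM. `constraint` and `jacobian`
    may be a learned constraint model and its automatic-differentiation Jacobian."""
    c = constraint(state)                          # (m,) constraint values
    J = jacobian(state)                            # (m, n) Jacobian of the constraints
    J_pinv = np.linalg.pinv(J)
    null_proj = np.eye(J.shape[1]) - J_pinv @ J    # motions that keep c unchanged to first order
    correction = -gain * (J_pinv @ c)              # push any constraint violation back toward zero
    return null_proj @ raw_action + correction
```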
https://arxiv.org/abs/2409.12045
Offline multi-agent reinforcement learning (MARL) is an exciting direction of research that uses static datasets to find optimal control policies for multi-agent systems. Though the field is by definition data-driven, efforts have thus far neglected data in their drive to achieve state-of-the-art results. We first substantiate this claim by surveying the literature, showing how the majority of works generate their own datasets without consistent methodology and provide sparse information about the characteristics of these datasets. We then show why neglecting the nature of the data is problematic, through salient examples of how tightly algorithmic performance is coupled to the dataset used, necessitating a common foundation for experiments in the field. In response, we take a big step towards improving data usage and data awareness in offline MARL, with three key contributions: (1) a clear guideline for generating novel datasets; (2) a standardisation of over 80 existing datasets, hosted in a publicly available repository, using a consistent storage format and easy-to-use API; and (3) a suite of analysis tools that allow us to understand these datasets better, aiding further development.
https://arxiv.org/abs/2409.12001
Handling orientations of robots and objects is a crucial aspect of many applications. Yet, all too often, there is a lack of mathematical correctness when dealing with orientations, especially in learning pipelines involving, for example, artificial neural networks. In this paper, we investigate reinforcement learning with orientations and propose a simple modification of the network's input and output that adheres to the Lie group structure of orientations. As a result, we obtain an easy and efficient implementation that is directly usable with existing learning libraries and achieves significantly better performance than other common orientation representations. We briefly introduce Lie theory specifically for orientations in robotics to motivate and outline our approach. Subsequently, a thorough empirical evaluation of different combinations of orientation representations for states and actions demonstrates the superior performance of our proposed approach in different scenarios, including direct orientation control, end-effector orientation control, and pick-and-place tasks.
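One concrete way to respect the Lie group structure, in the spirit of (though not necessarily identical to) the proposed modification: feed orientations to the network as flattened rotation matrices and interpret the network's 3-D output as a tangent-space (rotation-vector) increment mapped onto SO(3) via the exponential map:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def orientation_features(rot: R) -> np.ndarray:
    """State input: flatten the rotation matrix (a smooth, unambiguous embedding)."""
    return rot.as_matrix().reshape(-1)           # shape (9,)

def apply_tangent_action(current: R, net_output: np.ndarray) -> R:
    """Action output: treat the 3-D network output as a rotation vector in the
    tangent space and compose it with the current orientation via the exp map."""
    delta = R.from_rotvec(net_output)            # exp map: so(3) -> SO(3)
    return delta * current
```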
https://arxiv.org/abs/2409.11935
The problem of safety for robotic systems has been extensively studied. However, little attention has been given to security issues for three-dimensional systems, such as quadrotors. Malicious adversaries can compromise robot sensors and communication networks, causing incidents, achieving illegal objectives, or even injuring people. This study first designs an intelligent control system for autonomous quadrotors. Then, it investigates the problems of optimal false data injection attack scheduling and countermeasure design for unmanned aerial vehicles. Using a state-of-the-art deep learning-based approach, an optimal false data injection attack scheme is proposed to deteriorate a quadrotor's tracking performance with limited attack energy. Subsequently, an optimal tracking control strategy is learned to mitigate attacks and recover the quadrotor's tracking performance. We base our work on Agilicious, a state-of-the-art quadrotor recently deployed for autonomous settings. This paper is the first in the United Kingdom to deploy this quadrotor and implement reinforcement learning on its platform. Therefore, to promote easy reproducibility with minimal engineering overhead, we further provide (1) a comprehensive breakdown of this quadrotor, including software stacks and hardware alternatives; (2) a detailed reinforcement-learning framework to train autonomous controllers on Agilicious agents; and (3) a new open-source environment that builds upon PyFlyt for future reinforcement learning research on Agilicious platforms. Both simulated and real-world experiments are conducted to show the effectiveness of the proposed frameworks in section 5.2.
https://arxiv.org/abs/2409.11897
Non-stationarity poses a fundamental challenge in Multi-Agent Reinforcement Learning (MARL), arising from agents simultaneously learning and altering their policies. This creates a non-stationary environment from the perspective of each individual agent, often leading to suboptimal or even unconverged learning outcomes. We propose an open-source framework named XP-MARL, which augments MARL with auxiliary prioritization to address this challenge in cooperative settings. XP-MARL is 1) founded upon our hypothesis that prioritizing agents and letting higher-priority agents establish their actions first would stabilize the learning process and thus mitigate non-stationarity and 2) enabled by our proposed mechanism called action propagation, where higher-priority agents act first and communicate their actions, providing a more stationary environment for others. Moreover, instead of using a predefined or heuristic priority assignment, XP-MARL learns priority-assignment policies with an auxiliary MARL problem, leading to a joint learning scheme. Experiments in a motion-planning scenario involving Connected and Automated Vehicles (CAVs) demonstrate that XP-MARL improves the safety of a baseline model by 84.4% and outperforms a state-of-the-art approach, which improves the baseline by only 12.8%. Code: this http URL
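The action-propagation mechanism is straightforward to express; the schematic below (with a hypothetical `policy` callable and learned `priority` values) shows higher-priority agents acting first and lower-priority agents conditioning on those communicated actions:

```python
from typing import Callable, Dict, List
import numpy as np

def step_with_action_propagation(obs: Dict[str, np.ndarray],
                                 priority: Dict[str, float],
                                 policy: Callable[[str, np.ndarray, List[np.ndarray]], np.ndarray]
                                 ) -> Dict[str, np.ndarray]:
    """Higher-priority agents act first; their actions are appended to the inputs
    of lower-priority agents, which then act on a more stationary view of the
    world (schematic of XP-MARL's action propagation, not the authors' code)."""
    actions: Dict[str, np.ndarray] = {}
    propagated: List[np.ndarray] = []
    for agent in sorted(obs, key=lambda a: -priority[a]):   # highest priority first
        action = policy(agent, obs[agent], propagated)       # conditions on earlier actions
        actions[agent] = action
        propagated.append(action)
    return actions
```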
https://arxiv.org/abs/2409.11852
This paper explores the potential application of Deep Reinforcement Learning in the furniture industry. To offer a broad product portfolio, most furniture manufacturers are organized as a job shop, which ultimately results in the Job Shop Scheduling Problem (JSSP). The JSSP is addressed with a focus on extending traditional models to better represent the complexities of real-world production environments. Existing approaches frequently fail to consider critical factors such as machine setup times or varying batch sizes. A concept for a model is proposed that provides a higher level of information detail to enhance scheduling accuracy and efficiency. The concept introduces the integration of DRL for production planning, particularly suited to batch production industries such as the furniture industry. The model extends traditional approaches to JSSPs by including job volumes, buffer management, transportation times, and machine setup times. This enables more precise forecasting and analysis of production flows and processes, accommodating the variability and complexity inherent in real-world manufacturing processes. The RL agent learns to optimize scheduling decisions. It operates within a discrete action space, making decisions based on detailed observations. A reward function guides the agent's decision-making process, thereby promoting efficient scheduling and meeting production deadlines. Two integration strategies for implementing the RL agent are discussed: episodic planning, which is suitable for low-automation environments, and continuous planning, which is ideal for highly automated plants. While episodic planning can be employed as a standalone solution, the continuous planning approach necessitates the integration of the agent with ERP and Manufacturing Execution Systems. This integration enables real-time adjustments to production schedules based on dynamic changes.
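To make the proposed concept more tangible, a minimal gymnasium-style skeleton of such a scheduling environment is sketched below; all observation fields, sizes, and reward terms are illustrative assumptions, not the paper's specification:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class FurnitureJobShopEnv(gym.Env):
    """Illustrative JSSP environment: per-job features include batch size, buffer
    level, transport time, and setup time; the agent picks which queued job to
    dispatch next. A schematic skeleton, not the paper's actual model."""
    def __init__(self, n_machines: int = 5, max_jobs: int = 20):
        self.n_machines, self.max_jobs = n_machines, max_jobs
        # Per-job features (illustrative): remaining ops, batch size, due-date slack,
        # buffer level, transport time to next machine, setup time on next machine.
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(max_jobs, 6), dtype=np.float32)
        self.action_space = spaces.Discrete(max_jobs)        # index of the job to dispatch

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return np.zeros((self.max_jobs, 6), dtype=np.float32), {}

    def step(self, action: int):
        obs = np.zeros((self.max_jobs, 6), dtype=np.float32)
        # The reward could combine tardiness, setup-time, and buffer penalties (illustrative).
        reward, terminated, truncated = 0.0, False, False
        return obs, reward, terminated, truncated, {}
```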
https://arxiv.org/abs/2409.11820
Recently, AI systems have made remarkable progress in various tasks. Deep Reinforcement Learning (DRL) is an effective tool for agents to learn policies in low-level state spaces to solve highly complex tasks. Researchers have introduced Intrinsic Motivation (IM) into the RL mechanism, which simulates the agent's curiosity, encouraging agents to explore interesting areas of the environment. This new feature has proved vital in enabling agents to learn policies without being given specific goals. However, even though DRL intelligence emerges through a sub-symbolic model, there is still a need for some form of abstraction to understand the knowledge collected by the agent. To this end, the classical planning formalism has been used in recent research to explicitly represent the knowledge an autonomous agent acquires and to effectively reach extrinsic goals. Although classical planning usually presents limited expressive capabilities, PPDDL has demonstrated its usefulness in reviewing the knowledge gathered by an autonomous system and making causal correlations explicit, and it can be exploited to find a plan to reach any state the agent faces during its experience. This work presents a new architecture implementing an open-ended learning system able to synthesize its experience from scratch into a PPDDL representation and update it over time. Without a predefined set of goals and tasks, the system integrates intrinsic motivations to explore the environment in a self-directed way, exploiting the high-level knowledge acquired during its experience. The system explores the environment and iteratively (a) discovers options, (b) explores the environment using options, (c) abstracts the knowledge collected, and (d) plans. This paper proposes an alternative approach to implementing open-ended learning architectures that exploits low-level and high-level representations to extend its knowledge in a virtuous loop.
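The four-phase loop can be written as a small driver whose components are passed in as callables; these are placeholders for the architecture's modules, purely to fix ideas:

```python
from typing import Any, Callable, List, Tuple

def open_ended_loop(env: Any,
                    discover_options: Callable,
                    explore: Callable,
                    abstract_to_ppddl: Callable,
                    plan_and_execute: Callable,
                    n_iterations: int = 10) -> Tuple[Any, List]:
    """Schematic driver for the loop described above: (a) discover options,
    (b) explore with them, (c) abstract experience into a PPDDL model, (d) plan.
    All callables are placeholders, not the paper's actual components."""
    options: List = []
    experience: List = []
    domain = None
    for _ in range(n_iterations):
        options += discover_options(experience)      # (a) discover options
        experience += explore(env, options)          # (b) intrinsically motivated exploration
        domain = abstract_to_ppddl(experience)       # (c) abstract collected knowledge
        plan_and_execute(env, domain, options)       # (d) plan with the symbolic model
    return domain, options
```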
https://arxiv.org/abs/2409.11756
Human-in-the-loop reinforcement learning integrates human expertise to accelerate agent learning and provide critical guidance and feedback in complex fields. However, many existing approaches focus on single-agent tasks and require continuous human involvement during the training process, significantly increasing the human workload and limiting scalability. In this paper, we propose HARP (Human-Assisted Regrouping with Permutation Invariant Critic), a multi-agent reinforcement learning framework designed for group-oriented tasks. HARP integrates automatic agent regrouping with strategic human assistance during deployment, enabling non-experts to offer effective guidance with minimal intervention. During training, agents dynamically adjust their groupings to optimize collaborative task completion. When deployed, they actively seek human assistance and utilize the Permutation Invariant Group Critic to evaluate and refine human-proposed groupings, allowing non-expert users to contribute valuable suggestions. In multiple collaboration scenarios, our approach is able to leverage limited guidance from non-experts and enhance performance. The project can be found at this https URL.
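Permutation-invariant critics are commonly built by pooling per-agent embeddings before the value head; the sketch below shows that standard construction (HARP's exact architecture may differ):

```python
import torch
import torch.nn as nn

class PermutationInvariantGroupCritic(nn.Module):
    """Scores a grouping of agents independently of agent ordering by encoding each
    agent's (observation, action, group-id) features and mean-pooling the embeddings
    before the value head (a common construction; not necessarily HARP's exact one)."""
    def __init__(self, per_agent_dim: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(per_agent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, per_agent_inputs: torch.Tensor) -> torch.Tensor:
        # per_agent_inputs: (batch, n_agents, per_agent_dim); mean-pooling over the
        # agent axis makes the output invariant to any permutation of the agents.
        pooled = self.encoder(per_agent_inputs).mean(dim=1)
        return self.value_head(pooled)
```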
https://arxiv.org/abs/2409.11741
In this paper, we study format biases in reinforcement learning from human feedback (RLHF). We observe that many widely-used preference models, including human evaluators, GPT-4, and top-ranking models on the RewardBench benchmark, exhibit strong biases towards specific format patterns, such as lists, links, bold text, and emojis. Furthermore, large language models (LLMs) can exploit these biases to achieve higher rankings on popular benchmarks like AlpacaEval and LMSYS Chatbot Arena. One notable example of this is verbosity bias, where current preference models favor longer responses that appear more comprehensive, even when their quality is equal to or lower than shorter, competing responses. However, format biases beyond verbosity remain largely underexplored in the literature. In this work, we extend the study of biases in preference learning beyond the commonly recognized length bias, offering a comprehensive analysis of a wider range of format biases. Additionally, we show that with a small amount of biased data (less than 1%), we can inject significant bias into the reward model. Moreover, these format biases can also be easily exploited by downstream alignment algorithms, such as best-of-n sampling and online iterative DPO, as it is usually easier to manipulate the format than to improve the quality of responses. Our findings emphasize the need to disentangle format and content both for designing alignment algorithms and evaluating models.
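The bias-injection experiment can be pictured as a tiny data-augmentation step: reformat the chosen response of a small fraction of preference pairs with the target pattern before RM training. The snippet below is a hypothetical illustration of that idea, not the authors' pipeline:

```python
import random
from typing import Dict, List

def add_format_bias(pairs: List[Dict[str, str]],
                    fraction: float = 0.01,
                    seed: int = 0) -> List[Dict[str, str]]:
    """Inject a format bias into a preference dataset: in a small fraction of pairs,
    wrap the *chosen* response in the target format (bullets + bold) regardless of
    content, so the reward model learns to favor the pattern. Purely illustrative
    of the paper's <1% biased-data finding."""
    rng = random.Random(seed)
    biased = []
    for pair in pairs:
        pair = dict(pair)
        if rng.random() < fraction:
            lines = pair["chosen"].splitlines() or [pair["chosen"]]
            pair["chosen"] = "\n".join(f"- **{line.strip()}**" for line in lines if line.strip())
        biased.append(pair)
    return biased
```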
https://arxiv.org/abs/2409.11704
Autonomous driving technology has witnessed rapid advancements, with foundation models improving interactivity and user experiences. However, current autonomous vehicles (AVs) face significant limitations in delivering command-based driving styles. Most existing methods either rely on predefined driving styles that require expert input or use data-driven techniques like Inverse Reinforcement Learning to extract styles from driving data. These approaches, though effective in some cases, face challenges: difficulty obtaining specific driving data for style matching (e.g., in Robotaxis), inability to align driving style metrics with user preferences, and limitations to pre-existing styles, restricting customization and generalization to new commands. This paper introduces Words2Wheels, a framework that automatically generates customized driving policies based on natural language user commands. Words2Wheels employs a Style-Customized Reward Function to generate a Style-Customized Driving Policy without relying on prior driving data. By leveraging large language models and a Driving Style Database, the framework efficiently retrieves, adapts, and generalizes driving styles. A Statistical Evaluation module ensures alignment with user preferences. Experimental results demonstrate that Words2Wheels outperforms existing methods in accuracy, generalization, and adaptability, offering a novel solution for customized AV driving behavior. Code and demo available at this https URL.
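One simple way to ground the notion of a style-customized reward function: have a language model emit weights over interpretable driving features and use the weighted sum as the RL reward. The feature names and the `llm_weights` interface below are hypothetical, and Words2Wheels additionally retrieves and adapts styles from its Driving Style Database:

```python
from typing import Callable, Dict

DRIVING_FEATURES = ("target_speed_tracking", "headway_margin", "jerk_penalty", "lane_keeping")

def style_customized_reward(command: str,
                            llm_weights: Callable[[str, tuple], Dict[str, float]]) -> Callable:
    """Build a reward from a natural-language style command by asking an LLM for
    weights over interpretable features (hypothetical interface, not the paper's
    actual prompt or database-retrieval step)."""
    weights = llm_weights(command, DRIVING_FEATURES)   # e.g. {"jerk_penalty": 0.4, ...}
    def reward(features: Dict[str, float]) -> float:
        return float(sum(weights.get(k, 0.0) * features.get(k, 0.0) for k in DRIVING_FEATURES))
    return reward
```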
https://arxiv.org/abs/2409.11694
Proton pencil beam scanning (PBS) treatment planning for head and neck (H&N) cancers is a time-consuming and experience-demanding task where a large number of planning objectives are involved. Deep reinforcement learning (DRL) has recently been introduced to the planning processes of intensity-modulated radiation therapy and brachytherapy for prostate, lung, and cervical cancers. However, existing approaches are built upon the Q-learning framework and weighted linear combinations of clinical metrics, suffering from poor scalability and flexibility and only capable of adjusting a limited number of planning objectives in discrete action spaces. We propose an automatic treatment planning model using the proximal policy optimization (PPO) algorithm and a dose distribution-based reward function for proton PBS treatment planning of H&N cancers. Specifically, a set of empirical rules is used to create auxiliary planning structures from target volumes and organs-at-risk (OARs), along with their associated planning objectives. These planning objectives are fed into an in-house optimization engine to generate the spot monitor unit (MU) values. A decision-making policy network trained using PPO is developed to iteratively adjust the involved planning objective parameters in a continuous action space and refine the PBS treatment plans using a novel dose distribution-based reward function. Proton H&N treatment plans generated by the model show improved OAR sparing with equal or superior target coverage when compared with human-generated plans. Moreover, additional experiments on liver cancer demonstrate that the proposed method can be successfully generalized to other treatment sites. To the best of our knowledge, this is the first DRL-based automatic treatment planning model capable of achieving human-level performance for H&N cancers.
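Schematically, each planning episode alternates a continuous adjustment of the objective parameters by the PPO policy, a re-optimization of spot MUs by the dose engine, and a dose-distribution-based reward; the callables below are stand-ins for the components named in the abstract, not the in-house optimizer:

```python
import numpy as np
from typing import Callable

def planning_episode(init_objectives: np.ndarray,
                     policy: Callable[[np.ndarray], np.ndarray],
                     optimize_plan: Callable[[np.ndarray], np.ndarray],
                     dose_reward: Callable[[np.ndarray], float],
                     n_steps: int = 20):
    """Iteratively adjust planning-objective parameters (continuous actions),
    re-optimize the spot MUs, and score the resulting dose distribution.
    All callables are placeholders for the described components."""
    objectives = init_objectives.copy()
    trajectory = []
    for _ in range(n_steps):
        delta = policy(objectives)             # continuous adjustment of objective parameters
        objectives = objectives + delta
        dose = optimize_plan(objectives)       # optimization engine -> spot MUs -> dose
        reward = dose_reward(dose)             # dose-distribution-based reward
        trajectory.append((objectives.copy(), delta, reward))
    return trajectory
```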
https://arxiv.org/abs/2409.11576
Preference tuning is a crucial process for aligning deep generative models with human preferences. This survey offers a thorough overview of recent advancements in preference tuning and the integration of human feedback. The paper is organized into three main sections: 1) introduction and preliminaries: an introduction to reinforcement learning frameworks, preference tuning tasks, models, and datasets across various modalities: language, speech, and vision, as well as different policy approaches, 2) in-depth examination of each preference tuning approach: a detailed analysis of the methods used in preference tuning, and 3) applications, discussion, and future directions: an exploration of the applications of preference tuning in downstream tasks, including evaluation methods for different modalities, and an outlook on future research directions. Our objective is to present the latest methodologies in preference tuning and model alignment, enhancing the understanding of this field for researchers and practitioners. We hope to encourage further engagement and innovation in this area.
https://arxiv.org/abs/2409.11564
A team of multiple robots seamlessly and safely working in human-filled public environments requires adaptive task allocation and socially-aware navigation that account for dynamic human behavior. Current approaches struggle with highly dynamic pedestrian movement and the need for flexible task allocation. We propose Hyper-SAMARL, a hypergraph-based system for multi-robot task allocation and socially-aware navigation, leveraging multi-agent reinforcement learning (MARL). Hyper-SAMARL models the environmental dynamics between robots, humans, and points of interest (POIs) using a hypergraph, enabling adaptive task assignment and socially-compliant navigation through a hypergraph diffusion mechanism. Our framework, trained with MARL, effectively captures interactions between robots and humans, adapting tasks based on real-time changes in human activity. Experimental results demonstrate that Hyper-SAMARL outperforms baseline models in terms of social navigation, task completion efficiency, and adaptability in various simulated scenarios.
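For readers unfamiliar with hypergraph diffusion, a widely used normalized operator on a hypergraph with incidence matrix H (here, robots, humans, and POIs as nodes and candidate task groupings as hyperedges) is shown below; Hyper-SAMARL's specific diffusion mechanism may differ:

```latex
X^{(l+1)} = \sigma\!\left( D_v^{-1/2}\, H\, W\, D_e^{-1}\, H^{\top} D_v^{-1/2}\, X^{(l)}\, \Theta^{(l)} \right)
```

where $D_v$ and $D_e$ are the vertex and hyperedge degree matrices, $W$ holds the hyperedge weights, and $\Theta^{(l)}$ is a learnable projection.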
https://arxiv.org/abs/2409.11561
Embodied vision-based real-world systems, such as mobile robots, require a careful balance between energy consumption, compute latency, and safety constraints to optimize operation across dynamic tasks and contexts. As local computation tends to be restricted, offloading the computation, i.e., to a remote server, can save local resources while providing access to high-quality predictions from powerful and large models. However, the resulting communication and latency overhead has led to limited usability of cloud models in dynamic, safety-critical, real-time settings. To effectively address this trade-off, we introduce UniLCD, a novel hybrid inference framework for enabling flexible local-cloud collaboration. By efficiently optimizing a flexible routing module via reinforcement learning and a suitable multi-task objective, UniLCD is specifically designed to support the multiple constraints of safety-critical end-to-end mobile systems. We validate the proposed approach using a challenging, crowded navigation task requiring frequent and timely switching between local and cloud operations. UniLCD demonstrates improved overall performance and efficiency, by over 35% compared to state-of-the-art baselines based on various split computing and early exit strategies.
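The routing decision itself is compact: a learned router picks local or cloud execution at each step, and the energy/latency cost of that choice is folded into the multi-task RL reward. A schematic with illustrative cost constants and placeholder callables:

```python
from typing import Callable, Tuple
import numpy as np

def routed_step(obs: np.ndarray,
                router: Callable[[np.ndarray], int],
                local_model: Callable[[np.ndarray], np.ndarray],
                cloud_model: Callable[[np.ndarray], np.ndarray],
                energy_cost: Tuple[float, float] = (1.0, 0.1),
                latency_cost: Tuple[float, float] = (0.01, 0.2)):
    """Route one inference step to the local model (0) or the cloud (1), in the spirit
    of UniLCD's learned routing module; the cost constants are illustrative only."""
    choice = router(obs)                                     # 0 = local, 1 = offload to cloud
    action = local_model(obs) if choice == 0 else cloud_model(obs)
    step_cost = energy_cost[choice] + latency_cost[choice]   # folded into the RL reward
    return action, choice, step_cost
```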
https://arxiv.org/abs/2409.11403
This work proposes an approach that integrates reinforcement learning and model predictive control (MPC) to efficiently solve finite-horizon optimal control problems in mixed-logical dynamical systems. Optimization-based control of such systems with discrete and continuous decision variables entails the online solution of mixed-integer quadratic or linear programs, which suffer from the curse of dimensionality. Our approach aims at mitigating this issue by effectively decoupling the decision on the discrete variables and the decision on the continuous variables. Moreover, to mitigate the combinatorial growth in the number of possible actions due to the prediction horizon, we conceive the definition of decoupled Q-functions to make the learning problem more tractable. The use of reinforcement learning reduces the online optimization problem of the MPC controller from a mixed-integer linear (quadratic) program to a linear (quadratic) program, greatly reducing the computational time. Simulation experiments for a microgrid, based on real-world data, demonstrate that the proposed method significantly reduces the online computation time of the MPC approach and that it generates policies with small optimality gaps and high feasibility rates.
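The decoupling can be pictured as follows: a learned (decoupled) Q-function fixes the discrete decision variables, after which the remaining receding-horizon problem is an ordinary LP/QP in the continuous inputs. The CVXPY sketch below uses toy linear dynamics, costs, and a simplistic role for the binaries, purely to illustrate the structure:

```python
import numpy as np
import cvxpy as cp
from typing import Callable, Sequence

def mpc_step_with_learned_discretes(x0: np.ndarray,
                                    choose_discretes: Callable[[np.ndarray, int], Sequence[int]],
                                    A: np.ndarray, B: np.ndarray,
                                    horizon: int = 10) -> np.ndarray:
    """Schematic hybrid scheme: an RL-trained (decoupled) Q-function picks one binary
    per stage, turning the remaining MPC problem into a plain QP over the continuous
    inputs. Dynamics, costs, and the role of the binaries are illustrative placeholders."""
    delta = np.asarray(choose_discretes(x0, horizon), dtype=float)  # fixed binaries from RL
    n, m = A.shape[0], B.shape[1]
    u = cp.Variable((m, horizon))
    x = cp.Variable((n, horizon + 1))
    cost, constraints = 0, [x[:, 0] == x0]
    for k in range(horizon):
        constraints += [x[:, k + 1] == A @ x[:, k] + B @ u[:, k],
                        cp.abs(u[:, k]) <= 1 + delta[k]]   # binaries relax/tighten bounds (toy stand-in)
        cost += cp.sum_squares(x[:, k + 1]) + 0.1 * cp.sum_squares(u[:, k])
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return u.value[:, 0]
```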
https://arxiv.org/abs/2409.11267
Network slicing in 5G and the future 6G networks will enable the creation of multiple virtualized networks on a shared physical infrastructure. This innovative approach enables the provision of tailored networks to accommodate specific business types or industry users, thus delivering more customized and efficient services. However, the shared memory and cache in network slicing introduce security vulnerabilities that have yet to be fully addressed. In this paper, we introduce a reinforcement learning-based side-channel cache attack framework specifically designed for network slicing environments. Unlike traditional cache attack methods, our framework leverages reinforcement learning to dynamically identify and exploit cache locations storing sensitive information, such as authentication keys and user registration data. We assume that one slice network is compromised and demonstrate how the attacker can induce another shared slice to send registration requests, thereby estimating the cache locations of critical data. By formulating the cache timing channel attack as a reinforcement learning-driven guessing game between the attack slice and the victim slice, our model efficiently explores possible actions to pinpoint memory blocks containing sensitive information. Experimental results showcase the superiority of our approach, achieving a success rate of approximately 95\% to 98\% in accurately identifying the storage locations of sensitive data. This high level of accuracy underscores the potential risks in shared network slicing environments and highlights the need for robust security measures to safeguard against such advanced side-channel attacks.
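The reinforcement-learning-driven guessing game can be abstracted as a toy environment: the action is which cache set to prime and probe after inducing the victim slice to process a registration request, the observation is the measured reload timing, and the reward is high when the probe reveals an eviction. The class below is purely illustrative and touches no real cache or network slice:

```python
import numpy as np

class CacheGuessingGame:
    """Toy abstraction of the prime+probe guessing game between an attacker slice and
    a victim slice: the attacker picks a cache set to probe; reward is high when the
    probed set is the one holding the sensitive registration data. Purely illustrative."""
    def __init__(self, n_sets: int = 64, seed: int = 0):
        self.n_sets = n_sets
        self.rng = np.random.default_rng(seed)
        self.secret_set = int(self.rng.integers(n_sets))   # unknown location of sensitive data

    def step(self, probed_set: int):
        slow = probed_set == self.secret_set   # eviction by the victim -> slow reload time
        observation = 1.0 if slow else 0.0     # normalized timing measurement
        reward = 1.0 if slow else -0.01        # small penalty encourages efficient search
        return observation, reward
```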
https://arxiv.org/abs/2409.11258