Acquiring physically plausible motor skills across diverse and unconventional morphologies, including humanoid robots, quadrupeds, and animals, is essential for advancing character simulation and robotics. Traditional methods, such as reinforcement learning (RL), are task- and body-specific, require extensive reward function engineering, and do not generalize well. Imitation learning offers an alternative but relies heavily on high-quality expert demonstrations, which are difficult to obtain for non-human morphologies. Video diffusion models, on the other hand, can generate realistic videos of diverse morphologies, from humans to ants. Leveraging this capability, we propose a data-independent approach for skill acquisition that learns 3D motor skills from generated 2D videos and generalizes to unconventional and non-human forms. Specifically, we guide the imitation learning process with vision transformers that compare videos via pair-wise distances between their embeddings. Alongside this video-encoding distance, we use the computed similarity between segmented video frames as a guidance reward. We validate our method on locomotion tasks involving unique body configurations. On humanoid robot locomotion tasks, we demonstrate that 'No-data Imitation Learning' (NIL) outperforms baselines trained on 3D motion-capture data. Our results highlight the potential of generative video models for physically plausible skill learning across diverse morphologies, effectively replacing data collection with data generation for imitation learning.
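To make the guidance signal concrete, here is a minimal sketch of how a video-comparison reward of the kind described above could be assembled from the two ingredients the abstract names: a pair-wise distance between per-frame vision-transformer embeddings and a similarity between segmented frames. The function names, weights, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a NIL-style guidance reward: combine a ViT-embedding
# distance between the rendered rollout and the generated reference video
# with a per-frame segmentation-mask similarity (IoU). All names, weights,
# and shapes are illustrative assumptions.
import numpy as np

def embedding_distance(rollout_emb: np.ndarray, reference_emb: np.ndarray) -> float:
    """Mean pair-wise cosine distance between per-frame video embeddings (T, D)."""
    a = rollout_emb / np.linalg.norm(rollout_emb, axis=1, keepdims=True)
    b = reference_emb / np.linalg.norm(reference_emb, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))

def mask_iou(rollout_masks: np.ndarray, reference_masks: np.ndarray) -> float:
    """Mean IoU between binary segmentation masks of shape (T, H, W)."""
    inter = np.logical_and(rollout_masks, reference_masks).sum(axis=(1, 2))
    union = np.logical_or(rollout_masks, reference_masks).sum(axis=(1, 2))
    return float(np.mean(inter / np.maximum(union, 1)))

def guidance_reward(rollout_emb, reference_emb, rollout_masks, reference_masks,
                    w_embed: float = 1.0, w_mask: float = 1.0) -> float:
    """Higher is better: small embedding distance, large mask overlap."""
    return -w_embed * embedding_distance(rollout_emb, reference_emb) \
           + w_mask * mask_iou(rollout_masks, reference_masks)
```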
https://arxiv.org/abs/2503.10626
Large Language Models have demonstrated remarkable reasoning capability on complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to effectively analyze and reason about visual content, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textual representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset, which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages that covers exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.
https://arxiv.org/abs/2503.10615
Robot navigation in complex environments necessitates controllers that are adaptive and safe. Traditional controllers like Regulated Pure Pursuit, the Dynamic Window Approach, and Model-Predictive Path Integral control, while reliable, struggle to adapt to dynamic conditions. Reinforcement Learning offers adaptability but lacks formal safety guarantees. To address this, we propose a path tracking controller leveraging the Simplex architecture. It combines a Reinforcement Learning controller, which provides adaptiveness and performance, with a high-assurance controller that provides safety and stability. Our contribution is twofold. First, we discuss general stability and safety considerations for designing controllers using the Simplex architecture. Second, we present a Simplex-based path tracking controller. Our simulation results, supported by preliminary in-field tests, demonstrate the controller's effectiveness in maintaining safety while achieving performance comparable to state-of-the-art methods.
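For readers unfamiliar with the Simplex architecture, the switching idea can be summarized in a few lines: the learned controller acts by default, and a monitor falls back to the high-assurance controller whenever the proposed command cannot be certified safe. The sketch below is a generic illustration under assumed interfaces (`rl_policy`, `high_assurance_controller`, `is_safe`), not the paper's controller.

```python
# Illustrative Simplex-style switching logic: the RL policy drives the robot
# as long as a monitor can certify its proposed command; otherwise the
# high-assurance controller takes over.

def simplex_step(state, rl_policy, high_assurance_controller, is_safe):
    """Return the command to apply at the current state.

    rl_policy / high_assurance_controller: callables state -> command
    is_safe: callable (state, command) -> bool, e.g. checks that the
             one-step-ahead prediction stays inside a control-invariant set.
    """
    candidate = rl_policy(state)
    if is_safe(state, candidate):
        return candidate                       # adaptive, high-performance path
    return high_assurance_controller(state)    # certified fallback
```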
https://arxiv.org/abs/2503.10559
Existing quadrupedal locomotion learning paradigms usually rely on extensive domain randomization to alleviate the sim2real gap and enhance robustness, training policies over a wide range of environment parameters and sensor noise so that they perform reliably under uncertainty. However, since optimal performance under ideal conditions often conflicts with the need to handle worst-case scenarios, there is a trade-off between optimality and robustness. This trade-off forces the learned policy to prioritize stability in diverse and challenging conditions over efficiency and accuracy in ideal ones, leading to overly conservative behaviors that sacrifice peak performance. In this paper, we propose a two-stage framework that mitigates this trade-off by integrating policy learning with imagined transitions. This framework enhances the conventional reinforcement learning (RL) approach by incorporating imagined transitions as demonstrative inputs. These imagined transitions are derived from an optimal policy and a dynamics model operating within an idealized setting. Our findings indicate that this approach significantly mitigates the negative impact that domain randomization has on existing RL algorithms, leading to accelerated training, reduced tracking errors within the distribution, and enhanced robustness outside the distribution.
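One simple way to picture "imagined transitions as demonstrative inputs" is an auxiliary imitation term on rollouts produced by the idealized dynamics model and its optimal policy, added to the usual RL actor loss. The sketch below is a hypothetical instantiation of that idea; the paper's exact loss composition and interfaces may differ.

```python
import torch

def actor_loss_with_imagined_demos(policy, rl_loss, imagined_states,
                                   imagined_actions, bc_weight=0.5):
    """Augment a standard RL actor loss with a behavior-cloning term on
    imagined transitions generated under idealized (noise-free) dynamics.

    policy: callable mapping states -> predicted actions (deterministic head)
    rl_loss: scalar tensor from the underlying RL algorithm (e.g. PPO/SAC)
    imagined_states/actions: tensors sampled from the idealized rollout buffer
    """
    bc_loss = torch.mean((policy(imagined_states) - imagined_actions) ** 2)
    return rl_loss + bc_weight * bc_loss
```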
https://arxiv.org/abs/2503.10484
This paper presents our work on the Light-R1 series, with models, data, and code all released. We first focus on training long-COT models from scratch, specifically starting from models that initially lack long-COT capabilities. Using a curriculum training recipe consisting of two-stage SFT and semi-on-policy DPO, we train our model Light-R1-32B from Qwen2.5-32B-Instruct, resulting in superior math performance compared to DeepSeek-R1-Distill-Qwen-32B. Despite being trained exclusively on math data, Light-R1-32B shows strong generalization across other domains. In the subsequent phase of this work, we highlight the significant benefit of the 3k dataset constructed for the second SFT stage in enhancing other models. By fine-tuning DeepSeek-R1-Distilled models with this dataset, we obtain new SOTA models at 7B and 14B, while the 32B model, Light-R1-32B-DS, performs comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying reinforcement learning, specifically GRPO, to long-COT models to further improve reasoning performance. We successfully train our final Light-R1-14B-DS with RL, achieving SOTA performance among 14B-parameter models in math. With AIME24 and AIME25 scores of 74.0 and 60.2 respectively, Light-R1-14B-DS surpasses even many 32B models and DeepSeek-R1-Distill-Llama-70B. Its RL training also exhibits the expected behavior of a simultaneous increase in response length and reward score. The Light-R1 series validates training long-COT models from scratch, showcases the craft of SFT data curation, and releases SOTA models trained with RL.
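As background on the RL stage, GRPO replaces a learned value baseline with group-relative advantages: several responses are sampled per prompt and each reward is normalized by the group's mean and standard deviation. The snippet below shows this standard formulation, which is assumed here and is not necessarily Light-R1's exact training code.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages in the GRPO style: each sampled response is
    scored against the other responses for the same prompt, removing the need
    for a learned value baseline.

    group_rewards: shape (num_responses,), rewards for one prompt's group.
    """
    mean, std = group_rewards.mean(), group_rewards.std()
    return (group_rewards - mean) / (std + eps)

# Example: four sampled solutions for one math problem, reward = correctness.
print(grpo_advantages(np.array([1.0, 0.0, 0.0, 1.0])))   # ~[1, -1, -1, 1]
```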
https://arxiv.org/abs/2503.10460
Generating human-like and adaptive trajectories is essential for autonomous driving in dynamic environments. While generative models have shown promise in synthesizing feasible trajectories, they often fail to capture the nuanced variability of human driving styles due to dataset biases and distributional shifts. To address this, we introduce TrajHF, a human-feedback-driven fine-tuning framework for generative trajectory models, designed to align motion planning with diverse driving preferences. TrajHF incorporates a multi-conditional denoiser and reinforcement learning with human feedback to refine multi-modal trajectory generation beyond conventional imitation learning. This enables better alignment with human driving preferences while maintaining safety and feasibility constraints. TrajHF achieves a PDMS of 93.95 on the NavSim benchmark, significantly exceeding other methods, and sets a new paradigm for personalized and adaptable trajectory generation in autonomous driving.
https://arxiv.org/abs/2503.10434
In motion simulation, motion cueing algorithms (MCAs) are used for trajectory planning of the motion simulator platform (MSP), where workspace limitations prevent direct reproduction of reference trajectories. Strategies such as motion washout, which return the platform to its center, are crucial in these settings. For serial robotic MSPs with highly nonlinear workspaces, it is essential to maximize the efficient utilization of the MSP's kinematic and dynamic capabilities. Traditional approaches, including classical washout filtering and linear model predictive control, fail to consider platform-specific, nonlinear properties, while nonlinear model predictive control, though comprehensive, imposes high computational demands that hinder real-time, pilot-in-the-loop application without further simplification. To overcome these limitations, we introduce a novel approach using deep reinforcement learning (DRL) for motion cueing, demonstrated here for the first time in a 6-degree-of-freedom (DOF) setting with full consideration of the MSP's kinematic nonlinearities. Previous work by the authors successfully demonstrated the application of DRL to a simplified 2-DOF setup, which did not consider kinematic or dynamic constraints. This approach has been extended to all 6 DOF by incorporating a complete kinematic model of the MSP into the algorithm, a crucial step for enabling its application on a real motion simulator. The training of the DRL-MCA is based on Proximal Policy Optimization in an actor-critic implementation combined with automated hyperparameter optimization. After detailing the necessary training framework and the algorithm itself, we provide a comprehensive validation, demonstrating that the DRL-MCA achieves competitive performance against established algorithms. Moreover, it generates feasible trajectories by respecting all system constraints and meets all real-time requirements with low...
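Since the DRL-MCA is trained with Proximal Policy Optimization, the core update can be summarized by PPO's clipped surrogate objective; the sketch below is the textbook form, independent of the motion-cueing specifics.

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate: the ratio of new to old action probabilities is
    clipped so a single update cannot move the policy too far.
    All inputs are 1-D tensors over a batch of transitions."""
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))
```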
https://arxiv.org/abs/2503.10419
In safe reinforcement learning, an agent needs to balance exploratory actions against safety constraints. Following this paradigm, domain transfer approaches learn a prior Q-function from related environments to prevent unsafe actions. However, because of the large number of false positives, some safe actions are never executed, leading to inadequate exploration in sparse-reward environments. In this work, we aim to learn an efficient state representation to balance exploration and safety-preferring actions in a sparse-reward environment. First, the image input is mapped to a latent representation by an auto-encoder. A contrastive learning objective is then employed to distinguish safe and unsafe states. During learning, the latent distance is used to construct an additional safety check, which allows the agent to bias its exploration away from unsafe states it visits. To verify the effectiveness of our method, experiments are carried out in three navigation-based MiniGrid environments. The results highlight that our method explores the environment better while maintaining a good balance between safety and efficiency.
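A minimal sketch of the two mechanisms described above, under assumed interfaces: a triplet-style contrastive loss that separates safe and unsafe latent states, and a latent-distance safety check that flags states whose encodings fall close to known unsafe prototypes. The margin, threshold, and prototype bookkeeping are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def safety_contrastive_loss(anchor_z, safe_z, unsafe_z, margin: float = 1.0):
    """Triplet-style objective: pull a safe anchor toward other safe latents
    and push it away from unsafe latents (one possible instantiation of the
    contrastive objective described in the abstract)."""
    d_safe = F.pairwise_distance(anchor_z, safe_z)
    d_unsafe = F.pairwise_distance(anchor_z, unsafe_z)
    return torch.mean(F.relu(d_safe - d_unsafe + margin))

def is_latent_unsafe(z, unsafe_prototypes, threshold: float = 0.5):
    """Latent-distance safety check: a state is flagged unsafe when its
    encoding falls within `threshold` of any stored unsafe prototype."""
    dists = torch.cdist(z, unsafe_prototypes)      # (batch, num_prototypes)
    return dists.min(dim=1).values < threshold     # boolean flag per state
```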
https://arxiv.org/abs/2503.10318
Many online advertising platforms provide advertisers with auto-bidding services to enhance their advertising performance. However, most existing auto-bidding algorithms fail to accurately capture the auto-bidding problem formulation that the platform truly faces, let alone solve it. Actually, we argue that the platform should try to help optimize each advertiser's performance to the greatest extent, which makes the ε-Nash Equilibrium (ε-NE) a necessary solution concept, while maximizing the social welfare of all the advertisers for the platform's long-term value. Based on this, we introduce Nash-Equilibrium Constrained Bidding (NCB), a new formulation of the auto-bidding problem from the platform's perspective. Specifically, it aims to maximize the social welfare of all advertisers under the ε-NE constraint. However, the NCB problem presents significant challenges due to its constrained bi-level structure and the typically large number of advertisers involved. To address these challenges, we propose a Bi-level Policy Gradient (BPG) framework with theoretical guarantees. Notably, its computational complexity is independent of the number of advertisers, and the associated gradients are straightforward to compute. Extensive simulated and real-world experiments validate the effectiveness of the BPG framework.
https://arxiv.org/abs/2503.10304
AI alignment, the challenge of ensuring AI systems act in accordance with human values, has emerged as a critical problem in the development of systems such as foundation models and recommender systems. Still, the current dominant approach, reinforcement learning with human feedback (RLHF), faces known theoretical limitations in aggregating diverse human preferences. Social choice theory provides a framework for aggregating preferences, but it was not developed for the multidimensional applications typical of AI. Leveraging insights from a recently published urn process, this work introduces a preference aggregation strategy that adapts to the user's context and inherits the good properties of the maximal lottery, a Condorcet-consistent solution concept.
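As background, a maximal lottery is the maximin strategy of the symmetric zero-sum game whose payoff matrix holds the pairwise preference margins, so it can be computed with a small linear program. The sketch below shows that generic computation; it is not the urn-process algorithm this work builds on.

```python
# The maximal lottery over alternatives is a lottery p with p @ M >= 0, where
# M[i, j] = P(i beats j) - P(j beats i) is the pairwise margin matrix.
# This is generic social-choice background, not the paper's method.
import numpy as np
from scipy.optimize import linprog

def maximal_lottery(margin: np.ndarray) -> np.ndarray:
    """Find p >= 0 with sum(p) = 1 and p @ margin >= 0 (component-wise)."""
    n = margin.shape[0]
    res = linprog(
        c=np.zeros(n),                        # pure feasibility problem
        A_ub=-margin.T, b_ub=np.zeros(n),     # enforce p @ margin >= 0
        A_eq=np.ones((1, n)), b_eq=[1.0],
        bounds=[(0, 1)] * n,
    )
    return res.x

# Rock-paper-scissors-like preference cycle: the maximal lottery is uniform.
M = np.array([[0, 1, -1], [-1, 0, 1], [1, -1, 0]], dtype=float)
print(maximal_lottery(M))   # ~[1/3, 1/3, 1/3]
```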
https://arxiv.org/abs/2503.10215
We propose PRISM, a novel framework designed to overcome the limitations of 2D-based Preference-Based Reinforcement Learning (PBRL) by unifying 3D point cloud modeling and future-aware preference refinement. At its core, PRISM adopts a 3D Point Cloud-Language Model (3D-PC-LLM) to mitigate occlusion and viewpoint biases, ensuring more stable and spatially consistent preference signals. Additionally, PRISM leverages Chain-of-Thought (CoT) reasoning to incorporate long-horizon considerations, thereby preventing the short-sighted feedback often seen in static preference comparisons. In contrast to conventional PBRL techniques, this integration of 3D perception and future-oriented reasoning leads to significant gains in preference agreement rates, faster policy convergence, and robust generalization across unseen robotic environments. Our empirical results, spanning tasks such as robotic manipulation and autonomous navigation, highlight PRISM's potential for real-world applications where precise spatial understanding and reliable long-term decision-making are critical. By bridging 3D geometric awareness with CoT-driven preference modeling, PRISM establishes a comprehensive foundation for scalable, human-aligned reinforcement learning.
https://arxiv.org/abs/2503.10177
The sim-to-real gap remains a critical challenge in robotics, hindering the deployment of algorithms trained in simulation to real-world systems. This paper introduces a novel Real-Sim-Real (RSR) loop framework leveraging differentiable simulation to address this gap by iteratively refining simulation parameters, aligning them with real-world conditions, and enabling robust and efficient policy transfer. A key contribution of our work is the design of an informative cost function that encourages the collection of diverse and representative real-world data, minimizing bias and maximizing the utility of each data point for simulation refinement. This cost function integrates seamlessly into existing reinforcement learning algorithms (e.g., PPO, SAC) and ensures a balanced exploration of critical regions in the real domain. Furthermore, our approach is implemented on the versatile Mujoco MJX platform, and our framework is compatible with a wide range of robotic systems. Experimental results on several robotic manipulation tasks demonstrate that our method significantly reduces the sim-to-real gap, achieving high task performance and generalizability across diverse scenarios of both explicit and implicit environmental uncertainties.
https://arxiv.org/abs/2503.10118
Reinforcement Learning (RL) algorithms for safety alignment of Large Language Models (LLMs), such as Direct Preference Optimization (DPO), encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy, which requires significant computational resources. In this paper, we hypothesize that during off-policy training, while the ranking order of the outputs generated by the policy changes, their overall distribution remains relatively stable. This stability allows the sampling process from the target policy to be transformed into a re-ranking of preference data. Building on this hypothesis, we propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals, which are then used to calculate label confidence for preference reordering. Extensive experimental results and theoretical analysis demonstrate that the proposed method effectively addresses the distribution shift issue, remarkably enhancing safety performance while reducing computational overhead by about 300x.
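A hypothetical sketch of the re-ranking step described above: score both responses of each preference pair with the model's intrinsic safety judgment, flip pairs whose implied ranking disagrees, and attach a label confidence derived from the score margin. The `safety_score` interface and the sigmoid confidence formula are assumptions, not the paper's exact procedure.

```python
import math

def reorder_preferences(pairs, safety_score):
    """pairs: list of (chosen, rejected) response strings.
    safety_score: callable mapping a response to a scalar safety reward,
    e.g. derived from the model's own judgment logits (assumed interface).

    Returns re-ranked pairs plus a per-pair label confidence in (0.5, 1.0]."""
    reordered = []
    for chosen, rejected in pairs:
        r_c, r_r = safety_score(chosen), safety_score(rejected)
        confidence = 1.0 / (1.0 + math.exp(-abs(r_c - r_r)))  # sigmoid of margin
        if r_r > r_c:                                  # intrinsic judgment disagrees:
            chosen, rejected = rejected, chosen        # flip the preference label
        reordered.append((chosen, rejected, confidence))
    return reordered
```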
https://arxiv.org/abs/2503.10093
Multi-agent systems (MAS) have shown great potential in executing complex tasks, but coordination and safety remain significant challenges. Multi-Agent Reinforcement Learning (MARL) offers a promising framework for agent collaboration, but it faces difficulties in handling complex tasks and designing reward functions. The introduction of Large Language Models (LLMs) has brought stronger reasoning and cognitive abilities to MAS, but existing LLM-based systems struggle to respond quickly and accurately in dynamic environments. To address these challenges, we propose LLM-based Graph Collaboration MARL (LGC-MARL), a framework that efficiently combines LLMs and MARL. This framework decomposes complex tasks into executable subtasks and achieves efficient collaboration among multiple agents through graph-based coordination. Specifically, LGC-MARL consists of two main components: an LLM planner and a graph-based collaboration meta policy. The LLM planner transforms complex task instructions into a series of executable subtasks, evaluates the rationality of these subtasks using a critic model, and generates an action dependency graph. The graph-based collaboration meta policy facilitates communication and collaboration among agents based on the action dependency graph, and adapts to new task environments through meta-learning. Experimental results on the AI2-THOR simulation platform demonstrate the superior performance and scalability of LGC-MARL in completing various complex tasks.
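To illustrate what consuming an action dependency graph might look like, the sketch below topologically batches subtasks so that all subtasks whose prerequisites are complete can be dispatched to agents in parallel. The graph format and task names are illustrative; the paper's planner output and graph-based meta policy are more involved.

```python
from collections import deque

def dispatch_subtasks(dependencies):
    """dependencies: dict mapping each subtask to the subtasks it depends on,
    e.g. an action dependency graph emitted by an LLM planner (illustrative).
    Yields batches of subtasks whose prerequisites are complete, so agents
    can execute each batch in parallel."""
    indegree = {t: len(deps) for t, deps in dependencies.items()}
    dependents = {t: [] for t in dependencies}
    for t, deps in dependencies.items():
        for d in deps:
            dependents[d].append(t)
    ready = deque(t for t, k in indegree.items() if k == 0)
    while ready:
        batch = list(ready)
        ready.clear()
        yield batch                          # hand this batch to idle agents
        for t in batch:
            for nxt in dependents[t]:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    ready.append(nxt)

plan = {"find_knife": [], "find_apple": [], "slice_apple": ["find_knife", "find_apple"]}
print(list(dispatch_subtasks(plan)))  # [['find_knife', 'find_apple'], ['slice_apple']]
```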
https://arxiv.org/abs/2503.10049
In recent years, quadruped robotics has advanced significantly, particularly in perception and motion control via reinforcement learning, enabling complex motions in challenging environments. Visual sensors like depth cameras enhance stability and robustness but face limitations, such as low operating frequencies relative to joint control and sensitivity to lighting, which hinder outdoor deployment. Additionally, deep neural networks in sensor and control systems increase computational demands. To address these issues, we introduce spiking neural networks (SNNs) and event cameras to perform a challenging quadruped parkour task. Event cameras capture dynamic visual data, while SNNs efficiently process spike sequences, mimicking biological perception. Experimental results demonstrate that this approach significantly outperforms traditional models, achieving excellent parkour performance with just 11.7% of the energy consumption of an artificial neural network (ANN)-based model, yielding an 88.3% energy reduction. By integrating event cameras with SNNs, our work advances robotic reinforcement learning and opens new possibilities for applications in demanding environments.
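As a reference point for how SNNs process event streams, the sketch below implements a basic leaky integrate-and-fire (LIF) layer over a sequence of event frames; it is a generic building block, not the paper's network.

```python
import numpy as np

def lif_layer(spike_frames, weights, decay=0.9, threshold=1.0):
    """Leaky integrate-and-fire dynamics over a sequence of event frames.

    spike_frames: (T, n_in) binary event counts per timestep
    weights:      (n_in, n_out) synaptic weights
    Returns the (T, n_out) output spike train."""
    T, n_out = spike_frames.shape[0], weights.shape[1]
    v = np.zeros(n_out)                              # membrane potential
    out = np.zeros((T, n_out))
    for t in range(T):
        v = decay * v + spike_frames[t] @ weights    # leak + integrate input current
        out[t] = (v >= threshold).astype(float)      # fire when threshold crossed
        v = np.where(out[t] > 0, 0.0, v)             # hard reset after a spike
    return out
```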
https://arxiv.org/abs/2503.09985
Reinforcement learning (RL)-based large language models (LLMs), such as ChatGPT, DeepSeek, and Grok-3, have gained significant attention for their exceptional capabilities in natural language processing and multimodal data understanding. Meanwhile, the rapid expansion of information services has driven the growing need for intelligent, efficient, and adaptable wireless networks. Wireless networks require the empowerment of RL-based LLMs, while these models also benefit from wireless networks to broaden their application scenarios. Specifically, RL-based LLMs can enhance wireless communication systems through intelligent resource allocation, adaptive network optimization, and real-time decision-making. Conversely, wireless networks provide a vital infrastructure for the efficient training, deployment, and distributed inference of RL-based LLMs, especially in decentralized and edge computing environments. This mutual empowerment highlights the need for a deeper exploration of the interplay between these two domains. We first review recent advancements in wireless communications, highlighting the associated challenges and potential solutions. We then discuss the progress of RL-based LLMs, focusing on key technologies for LLM training, challenges, and potential solutions. Subsequently, we explore the mutual empowerment between these two fields, highlighting key motivations, open challenges, and potential solutions. Finally, we provide insights into future directions, applications, and their societal impact to further explore this intersection, paving the way for next-generation intelligent communication systems. Overall, this survey provides a comprehensive overview of the relationship between RL-based LLMs and wireless networks, offering a vision in which these domains empower each other to drive innovation.
https://arxiv.org/abs/2503.09956
Ensuring Large Language Models (LLMs) align with diverse human preferences while preserving privacy and fairness remains a challenge. Existing methods, such as Reinforcement Learning from Human Feedback (RLHF), rely on centralized data collection, making them computationally expensive and privacy-invasive. We introduce PluralLLM, a federated learning-based approach that enables multiple user groups to collaboratively train a transformer-based preference predictor without sharing sensitive data, which can also serve as a reward model for aligning LLMs. Our method leverages Federated Averaging (FedAvg) to aggregate preference updates efficiently, achieving 46% faster convergence, a 4% improvement in alignment scores, and nearly the same group fairness measure as centralized training. Evaluated on a Q/A preference alignment task, PluralLLM demonstrates that federated preference learning offers a scalable and privacy-preserving alternative for aligning LLMs with diverse human values.
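The FedAvg step at the heart of this approach is a data-size-weighted average of the client updates to the shared preference predictor. The sketch below shows that aggregation in its simplest form, omitting secure aggregation and other deployment details.

```python
import torch

def fedavg(client_state_dicts, client_num_samples):
    """Weighted FedAvg of client model parameters.

    client_state_dicts: list of state_dicts from each user group's local
    preference-predictor update; client_num_samples: matching sample counts."""
    total = float(sum(client_num_samples))
    global_state = {}
    for name in client_state_dicts[0]:
        global_state[name] = sum(
            (n / total) * sd[name].float()
            for sd, n in zip(client_state_dicts, client_num_samples)
        )
    return global_state  # load into the shared transformer with load_state_dict
```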
https://arxiv.org/abs/2503.09925
Recent advances in deep learning and Transformers have driven major breakthroughs in robotics by employing techniques such as imitation learning, reinforcement learning, and LLM-based multimodal perception and decision-making. However, conventional deep learning and Transformer models often struggle to process data with inherent symmetries and invariances, typically relying on large datasets or extensive data augmentation. Equivariant neural networks overcome these limitations by explicitly integrating symmetry and invariance into their architectures, leading to improved efficiency and generalization. This tutorial survey reviews a wide range of equivariant deep learning and control methods for robotics, from classic to state-of-the-art, with a focus on SE(3)-equivariant models that leverage the natural 3D rotational and translational symmetries in visual robotic manipulation and control design. Using unified mathematical notation, we begin by reviewing key concepts from group theory, along with matrix Lie groups and Lie algebras. We then introduce foundational group-equivariant neural network design and show how the group-equivariance can be obtained through their structure. Next, we discuss the applications of SE(3)-equivariant neural networks in robotics in terms of imitation learning and reinforcement learning. The SE(3)-equivariant control design is also reviewed from the perspective of geometric control. Finally, we highlight the challenges and future directions of equivariant methods in developing more robust, sample-efficient, and multi-modal real-world robotic systems.
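A toy way to see where equivariance comes from is group averaging: symmetrizing an arbitrary map over a finite rotation group yields a map that commutes with the group action by construction. The snippet below demonstrates this for planar C4 rotations; the SE(3)-equivariant architectures surveyed above instead build the constraint into the layer weights, but the numerical check makes the property concrete.

```python
import numpy as np

def c4_equivariant(f, image):
    """Symmetrize an arbitrary image-to-image map f over the C4 rotation group:
    f_G(x) = (1/|G|) * sum_g  g^{-1} f(g x).  The result satisfies
    f_G(rot90(x)) = rot90(f_G(x)) by construction."""
    outputs = [np.rot90(f(np.rot90(image, k)), -k) for k in range(4)]
    return np.mean(outputs, axis=0)

# Check equivariance numerically for an arbitrary, non-equivariant map f.
f = lambda x: x * np.arange(x.shape[1])      # deliberately breaks symmetry
x = np.random.rand(8, 8)
lhs = c4_equivariant(f, np.rot90(x))
rhs = np.rot90(c4_equivariant(f, x))
print(np.allclose(lhs, rhs))                 # True
```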
https://arxiv.org/abs/2503.09829
Recent advances in robotics and large language models (LLMs) have sparked growing interest in human-robot collaboration and embodied intelligence. To enable the broader deployment of robots in human-populated environments, socially-aware robot navigation (SAN) has become a key research area. While deep reinforcement learning approaches that integrate human-robot interaction (HRI) with path planning have demonstrated strong benchmark performance, they often struggle to adapt to new scenarios and environments. LLMs offer a promising avenue for zero-shot navigation through commonsense inference. However, most existing LLM-based frameworks rely on centralized decision-making, lack robust verification mechanisms, and face inconsistencies in translating macro-actions into precise low-level control signals. To address these challenges, we propose SAMALM, a decentralized multi-agent LLM actor-critic framework for multi-robot social navigation. In this framework, a set of parallel LLM actors, each reflecting distinct robot personalities or configurations, directly generate control signals. These actions undergo a two-tier verification process via a global critic that evaluates group-level behaviors and individual critics that assess each robot's context. An entropy-based score fusion mechanism further enhances self-verification and re-query, improving both robustness and coordination. Experimental results confirm that SAMALM effectively balances local autonomy with global oversight, yielding socially compliant behaviors and strong adaptability across diverse multi-robot scenarios. More details and videos about this work are available at: this https URL.
https://arxiv.org/abs/2503.09758
In Amazon robotic warehouses, the destination-to-chute mapping problem is crucial for efficient package sorting. Often, however, this problem is complicated by uncertain and dynamic package induction rates, which can lead to increased package recirculation. To tackle this challenge, we introduce a Distributionally Robust Multi-Agent Reinforcement Learning (DRMARL) framework that learns a destination-to-chute mapping policy that is resilient to adversarial variations in induction rates. Specifically, DRMARL relies on group distributionally robust optimization (DRO) to learn a policy that performs well not only on average but also on each individual subpopulation of induction rates within the group that capture, for example, different seasonality or operation modes of the system. This approach is then combined with a novel contextual bandit-based predictor of the worst-case induction distribution for each state-action pair, significantly reducing the cost of exploration and thereby increasing the learning efficiency and scalability of our framework. Extensive simulations demonstrate that DRMARL achieves robust chute mapping in the presence of varying induction distributions, reducing package recirculation by an average of 80% in the simulation scenario.
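The group DRO idea underlying DRMARL is to optimize the worst-case expected loss over subpopulations (for example, different seasonalities or operation modes) rather than the average. The sketch below shows the generic worst-group objective; the paper's contextual-bandit predictor of worst-case induction distributions is not modeled here.

```python
import torch

def group_dro_loss(per_sample_losses, group_ids, num_groups):
    """Worst-group objective used in group DRO: average the loss within each
    subpopulation (e.g. an induction-rate regime), then optimize the maximum
    over groups instead of the overall mean."""
    group_losses = []
    for g in range(num_groups):
        mask = (group_ids == g)
        if mask.any():
            group_losses.append(per_sample_losses[mask].mean())
    return torch.stack(group_losses).max()
```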
https://arxiv.org/abs/2503.09755