Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. Researchers have moved beyond simple autoregressive token generation by introducing the concept of "thought" -- a sequence of tokens representing intermediate steps in the reasoning process. This innovative paradigm enables LLMs to mimic complex human reasoning processes, such as tree search and reflective thinking. Recently, an emerging trend of learning to reason has applied reinforcement learning (RL) to train LLMs to master reasoning processes. This approach enables the automatic generation of high-quality reasoning trajectories through trial-and-error search algorithms, significantly expanding LLMs' reasoning capacity by providing substantially more training data. Furthermore, recent studies demonstrate that encouraging LLMs to "think" with more tokens during test-time inference can further boost reasoning accuracy significantly. Together, train-time and test-time scaling chart a new research frontier -- a path toward Large Reasoning Models. The introduction of OpenAI's o1 series marks a significant milestone in this research direction. In this survey, we present a comprehensive review of recent progress in LLM reasoning. We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling. We also analyze popular open-source projects aimed at building large reasoning models, and conclude with open challenges and future research directions.
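The test-time scaling idea above can be made concrete with a self-consistency-style sketch: sample several independent reasoning traces and aggregate their final answers by majority vote. This is a minimal illustration of one test-time scaling strategy, not a method from the survey; the generate and extract_answer functions are hypothetical placeholders for an actual LLM call and answer parser.

    from collections import Counter

    def generate(prompt: str, temperature: float = 0.8) -> str:
        """Hypothetical stand-in for sampling one chain-of-thought trace from an LLM."""
        raise NotImplementedError  # replace with a real model call

    def extract_answer(trace: str) -> str:
        """Assumes the final answer is on the last line of the trace."""
        return trace.strip().splitlines()[-1]

    def self_consistency(prompt: str, n_samples: int = 16) -> str:
        # Spend more inference-time compute by sampling many traces,
        # then return the most frequent final answer (majority vote).
        answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
        return Counter(answers).most_common(1)[0][0]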
https://arxiv.org/abs/2501.09686
Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligning LLMs with human preferences, it is susceptible to spurious correlations in reward modeling. Consequently, it often introduces biases, such as length bias, sycophancy, conceptual bias, and discrimination, that hinder the model's ability to capture true causal relationships. To address this, we propose a novel causal reward modeling approach that integrates causal inference to mitigate these spurious correlations. Our method enforces counterfactual invariance, ensuring reward predictions remain consistent when irrelevant variables are altered. Through experiments on both synthetic and real-world datasets, we show that our approach mitigates various types of spurious correlations effectively, resulting in more reliable and fair alignment of LLMs with human preferences. As a drop-in enhancement to the existing RLHF workflow, our causal reward modeling provides a practical way to improve the trustworthiness and fairness of LLM finetuning.
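As a minimal sketch of the spirit of this approach (assuming, for illustration, that the spurious variable is observable response length), a preference-based reward loss can be augmented with a penalty that discourages the reward margin from correlating with that variable. The penalty form and its weight below are illustrative assumptions, not the paper's exact estimator.

    import torch
    import torch.nn.functional as F

    def causal_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor,
                           len_chosen: torch.Tensor, len_rejected: torch.Tensor,
                           lam: float = 0.1) -> torch.Tensor:
        """Bradley-Terry preference loss plus an illustrative invariance penalty."""
        margin = r_chosen - r_rejected              # reward margin per preference pair
        bt_loss = -F.logsigmoid(margin).mean()      # standard pairwise preference loss

        # Penalize correlation between the reward margin and the length difference,
        # a stand-in for enforcing invariance to an irrelevant (spurious) variable.
        len_diff = (len_chosen - len_rejected).float()
        m = margin - margin.mean()
        d = len_diff - len_diff.mean()
        corr = (m * d).mean() / (m.std() * d.std() + 1e-8)
        return bt_loss + lam * corr.pow(2)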
https://arxiv.org/abs/2501.09620
Object detection plays a crucial role in smart video analysis, with applications ranging from autonomous driving and security to smart cities. However, achieving real-time object detection on edge devices presents significant challenges due to their limited computational resources and the high demands of deep neural network (DNN)-based detection models, particularly when processing high-resolution video. Conventional strategies, such as input down-sampling and network up-scaling, often compromise detection accuracy for faster performance or lead to higher inference latency. To address these issues, this paper introduces RE-POSE, a Reinforcement Learning (RL)-Driven Partitioning and Edge Offloading framework designed to optimize the accuracy-latency trade-off in resource-constrained edge environments. Our approach features an RL-Based Dynamic Clustering Algorithm (RL-DCA) that partitions video frames into non-uniform blocks based on object distribution and the computational characteristics of DNNs. Furthermore, a parallel edge offloading scheme is implemented to distribute these blocks across multiple edge servers for concurrent processing. Experimental evaluations show that RE-POSE significantly enhances detection accuracy and reduces inference latency, surpassing existing methods.
https://arxiv.org/abs/2501.09465
Agent-based models (ABMs) are valuable for modelling complex, potentially out-of-equilibrium scenarios. However, ABMs have long suffered from the Lucas critique, which states that agent behaviour should adapt to environmental changes. Furthermore, the environment itself often adapts to these behavioural changes, creating a complex bi-level adaptation problem. Recent progress integrating multi-agent reinforcement learning into ABMs introduces adaptive agent behaviour, beginning to address the first part of this critique; however, these approaches are still relatively ad hoc, lack a general formulation, and do not tackle the second aspect of simultaneously adapting environment-level characteristics in addition to agent behaviours. In this work, we develop a generic two-layer framework for ADaptive AGEnt based modelling (ADAGE) to address these problems. This framework formalises the bi-level problem as a Stackelberg game with conditional behavioural policies, providing a consolidated framework for adaptive agent-based modelling based on solving a coupled set of non-linear equations. We demonstrate how this generic approach encapsulates several common (previously viewed as distinct) ABM tasks, such as policy design, calibration, scenario generation, and robust behavioural learning, under one unified framework. We provide example simulations on multiple complex economic and financial environments, showing the strength of the novel framework under these canonical settings and addressing long-standing critiques of traditional ABMs.
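Under generic notation (assumed here, not taken verbatim from the paper), the bi-level adaptation problem can be written as a Stackelberg game in which the environment level acts as the leader and the agents best-respond as followers:

    \begin{aligned}
    \theta^{*} &\in \arg\max_{\theta}\; F\bigl(\theta,\ \pi^{*}(\theta)\bigr)
      && \text{(environment-level / leader objective)} \\
    \text{s.t.}\quad \pi^{*}(\theta) &\in \arg\max_{\pi}\;
      \mathbb{E}_{\tau \sim p(\cdot \mid \theta, \pi)}\Bigl[\textstyle\sum_{t} r(s_t, a_t; \theta)\Bigr]
      && \text{(agent-level / follower best response)}
    \end{aligned}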
https://arxiv.org/abs/2501.09429
Building autonomous mobile robots (AMRs) with optimized efficiency and adaptive capabilities, able to respond to changing task demands and dynamic environments, is a strongly desired goal for advancing construction robotics. Such robots can play a critical role in enabling automation, reducing operational carbon footprints, and supporting modular construction processes. Inspired by the adaptive autonomy of living organisms, we introduce interoception, which centers on the robot's internal state representation, as a foundation for developing self-reflection and conscious learning to enable continual learning and adaptability in robotic agents. In this paper, we factorize internal state variables and mathematical properties as "cognitive dissonance" in shared control paradigms, where human interventions occasionally occur. We offer a new perspective on how interoception can help build adaptive motion planning in AMRs by integrating the legacy of heuristic costs from grid/graph-based algorithms with recent advances in neuroscience and reinforcement learning. Declarative and procedural knowledge extracted from human semantic inputs is encoded into a hypergraph model that overlaps with the spatial configuration of the onsite layout for path planning. In addition, we design a velocity-replay module using an encoder-decoder architecture with few-shot learning to enable robots to replicate velocity profiles in contextualized scenarios for multi-robot synchronization and handover collaboration. These "cached" knowledge representations are demonstrated in simulated environments for multi-robot motion planning and stacking tasks. The insights from this study pave the way toward artificial general intelligence in AMRs, fostering their progression from complexity to competence in construction automation.
https://arxiv.org/abs/2501.09290
A key challenge in training Large Language Models (LLMs) is properly aligning them with human preferences. Reinforcement Learning with Human Feedback (RLHF) uses pairwise comparisons from human annotators to train reward functions and has emerged as a popular alignment method. However, input datasets in RLHF are not necessarily balanced in the types of questions and answers that are included. Therefore, we want RLHF algorithms to perform well even when the set of alternatives is not uniformly distributed. Drawing on insights from social choice theory, we introduce robustness to approximate clones, a desirable property of RLHF algorithms which requires that adding near-duplicate alternatives does not significantly change the learned reward function. We first demonstrate that the standard RLHF algorithm based on regularized maximum likelihood estimation (MLE) fails to satisfy this property. We then propose the weighted MLE, a new RLHF algorithm that modifies the standard regularized MLE by weighting alternatives based on their similarity to other alternatives. This new algorithm guarantees robustness to approximate clones while preserving desirable theoretical properties.
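As an illustration of the general idea (with notation and a specific weighting rule assumed here rather than taken from the paper), a weighted Bradley-Terry log-likelihood can down-weight alternatives that have many near-duplicates, for instance by counting how many alternatives exceed a similarity threshold:

    \hat{r} \in \arg\max_{r}\;
    \sum_{(y \succ y') \in \mathcal{D}} w(y)\, w(y')\, \log \sigma\bigl(r(y) - r(y')\bigr)
    \;-\; \lambda \lVert r \rVert^{2},
    \qquad
    w(y) = \frac{1}{\bigl|\{\, z \in \mathcal{A} : \operatorname{sim}(y, z) \ge \tau \,\}\bigr|}.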
https://arxiv.org/abs/2501.09254
In reinforcement learning, the value function is typically trained to solve the Bellman equation, which connects the current value to future values. This temporal dependency hints that the value function may contain implicit information about the environment's transition dynamics. By rearranging the Bellman equation, we show that a converged value function encodes a model of the underlying dynamics of the environment. We build on this insight to propose a simple method for inferring dynamics models directly from the value function, potentially mitigating the need for explicit model learning. Furthermore, we explore the challenges of next-state identifiability, discussing conditions under which the inferred dynamics model is well-defined. Our work provides a theoretical foundation for leveraging value functions in dynamics modeling and opens a new avenue for bridging model-free and model-based reinforcement learning.
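One standard way to write the rearrangement the abstract refers to: for a policy \pi with converged value functions, the Bellman equation identifies the expected next-state value implied by the environment's transition kernel,

    Q^{\pi}(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\bigl[V^{\pi}(s')\bigr]
    \quad\Longrightarrow\quad
    \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\bigl[V^{\pi}(s')\bigr]
    = \frac{Q^{\pi}(s,a) - r(s,a)}{\gamma},

so, under deterministic dynamics and an invertible value function, the next state itself can in principle be recovered, which is where the identifiability conditions discussed in the paper come in.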
https://arxiv.org/abs/2501.09081
The average-reward formulation of reinforcement learning (RL) has drawn increased interest in recent years due to its ability to solve temporally-extended problems without discounting. Independently, RL algorithms have benefited from entropy-regularization: an approach used to make the optimal policy stochastic, thereby more robust to noise. Despite the distinct benefits of the two approaches, the combination of entropy regularization with an average-reward objective is not well-studied in the literature and there has been limited development of algorithms for this setting. To address this gap in the field, we develop algorithms for solving entropy-regularized average-reward RL problems with function approximation. We experimentally validate our method, comparing it with existing algorithms on standard benchmarks for RL.
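For reference, the objective combining the two ingredients can be written as the long-run average of reward plus a scaled policy-entropy bonus; the notation below is a standard formulation assumed for illustration:

    \rho_{\tau}(\pi) \;=\; \lim_{T \to \infty} \frac{1}{T}\,
    \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1}
    \Bigl( r(s_t, a_t) + \tau\, \mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr) \Bigr)\right],
    \qquad
    \mathcal{H}\bigl(\pi(\cdot \mid s)\bigr) = -\sum_{a} \pi(a \mid s)\, \log \pi(a \mid s).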
https://arxiv.org/abs/2501.09080
Urban air mobility (UAM) is a transformative system that operates various small aerial vehicles in urban environments to reshape urban transportation. However, integrating UAM into existing urban environments presents a variety of complex challenges. Recent analyses of UAM's operational constraints highlight aircraft noise and system safety as key hurdles to UAM system implementation. Future UAM air traffic management schemes must ensure that the system is both quiet and safe. We propose a multi-agent reinforcement learning approach to manage UAM traffic, aiming at both vertical separation assurance and noise mitigation. Through extensive training, the reinforcement learning agent learns to balance the two primary objectives by employing altitude adjustments in a multi-layer UAM network. The results reveal the tradeoffs among noise impact, traffic congestion, and separation. Overall, our findings demonstrate the potential of reinforcement learning in mitigating UAM's noise impact while maintaining safe separation using altitude adjustments.
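A minimal sketch of how the two objectives might be folded into a single scalar reward per agent; the penalty forms, the separation threshold, and the weights are hypothetical choices for illustration, not the paper's reward design.

    def uam_reward(noise_db_on_ground: float,
                   min_vertical_sep_m: float,
                   required_sep_m: float = 150.0,
                   w_noise: float = 0.01,
                   w_sep: float = 1.0) -> float:
        """Trade off noise impact against vertical separation assurance."""
        noise_penalty = w_noise * noise_db_on_ground              # quieter flight -> higher reward
        sep_violation = max(0.0, required_sep_m - min_vertical_sep_m)
        sep_penalty = w_sep * sep_violation                       # penalize loss of separation
        return -(noise_penalty + sep_penalty)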
https://arxiv.org/abs/2501.08941
Offline Reinforcement Learning (RL) faces a critical challenge of extrapolation errors caused by out-of-distribution (OOD) actions. The Implicit Q-Learning (IQL) algorithm employs expectile regression to achieve in-sample learning, effectively mitigating the risks associated with OOD actions. However, the fixed hyperparameter in policy evaluation and the density-based policy improvement method limit its overall efficiency. In this paper, we propose Proj-IQL, a projective IQL algorithm enhanced with a support constraint. In the policy evaluation phase, Proj-IQL generalizes the one-step approach to a multi-step approach through vector projection, while maintaining the in-sample learning and expectile regression framework. In the policy improvement phase, Proj-IQL introduces a support constraint that is more aligned with the policy evaluation approach. Furthermore, we theoretically demonstrate that Proj-IQL guarantees monotonic policy improvement and enjoys a progressively more rigorous criterion for superior actions. Empirical results demonstrate that Proj-IQL achieves state-of-the-art performance on D4RL benchmarks, especially in challenging navigation domains.
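For context, the expectile-regression value objective from standard IQL, which Proj-IQL builds on, fits V to the \tau-expectile of Q over in-sample actions; the projection-based multi-step policy evaluation is the paper's contribution and is not reproduced here:

    L_V(\psi) \;=\; \mathbb{E}_{(s,a) \sim \mathcal{D}}
    \Bigl[\, L_2^{\tau}\bigl(Q_{\hat{\theta}}(s,a) - V_{\psi}(s)\bigr) \Bigr],
    \qquad
    L_2^{\tau}(u) \;=\; \bigl|\tau - \mathbb{1}(u < 0)\bigr|\, u^{2}.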
https://arxiv.org/abs/2501.08907
Accurate and resilient object detection for structural damage is important for ensuring the continued use of civil infrastructure. However, achieving robustness in object detectors remains a persistent challenge, limiting their ability to generalize effectively. This study proposes DetectorX, a robust framework for structural damage detection coupled with a micro drone. DetectorX addresses the challenges of object detector robustness by incorporating two innovative modules: a stem block and a spiral pooling technique. The stem block introduces a dynamic visual modality by leveraging the outputs of two Deep Convolutional Neural Network (DCNN) models. The framework employs the proposed event-based reward reinforcement learning to constrain the actions of the parent and child DCNN models so that they lead to a reward, inducing two dynamic visual modalities alongside the Red, Green, and Blue (RGB) data. This enhancement significantly augments DetectorX's perception and adaptability in diverse environmental situations. Further, the spiral pooling technique, an online image augmentation method, strengthens the framework by enriching feature representations through the concatenation of spiraled and average/max-pooled features. In three extensive experiments, (1) a comparative study and (2) a robustness study, both using the Pacific Earthquake Engineering Research Hub ImageNet dataset, and (3) a field experiment, DetectorX performed satisfactorily across varying metrics, including precision (0.88), recall (0.84), average precision (0.91), mean average precision (0.76), and mean average recall (0.73), compared to competing detectors including You Only Look Once X-medium (YOLOX-m) and others. The study's findings indicate that DetectorX can provide satisfactory results and demonstrate resilience in challenging environments.
https://arxiv.org/abs/2501.08807
We propose a novel cooperative multi-agent reinforcement learning (MARL) approach for networked agents. In contrast to previous methods that rely on complete state information or joint observations, our agents must learn how to reach shared objectives under partial observability. During training, they collect individual rewards and approximate a team value function through local communication, resulting in cooperative behavior. To describe our problem, we introduce the networked dynamic partially observable Markov game framework, where agents communicate over a switching-topology communication network. Our distributed method, DNA-MARL, uses a consensus mechanism for local communication and gradient descent for local computation. DNA-MARL broadens the range of possible applications of networked agents, being well-suited for real-world domains that impose privacy constraints and where messages may not reach their recipients. We evaluate DNA-MARL across benchmark MARL scenarios. Our results highlight the superior performance of DNA-MARL over previous methods.
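A minimal sketch of a consensus step of the kind used for local communication, assuming each agent averages its local estimate with those of the neighbours it actually heard from on the current (possibly switching) topology; the uniform averaging weights are an illustrative choice, not DNA-MARL's exact update.

    import numpy as np

    def consensus_step(values: np.ndarray, adjacency: np.ndarray) -> np.ndarray:
        """One round of neighbour averaging.

        values:    shape (n_agents,), each agent's local estimate (e.g. of the team value).
        adjacency: shape (n_agents, n_agents), 1 if agent i received agent j's message, else 0.
        """
        n = len(values)
        mixed = np.empty(n)
        for i in range(n):
            neighbours = np.flatnonzero(adjacency[i])   # agents i heard from this round
            group = np.append(neighbours, i)            # always include own estimate
            mixed[i] = values[group].mean()             # uniform consensus weights
        return mixed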
https://arxiv.org/abs/2501.08778
A key challenge in Deep Reinforcement Learning is sample efficiency, especially in real-world applications where collecting environment interactions is expensive or risky. Recent off-policy algorithms improve sample efficiency by increasing the Update-To-Data (UTD) ratio and performing more gradient updates per environment interaction. While this improves sample efficiency, it significantly increases computational cost due to the higher number of gradient updates required. In this paper we propose a sample-efficient method to improve computational efficiency by separating training into distinct learning phases in order to exploit gradient updates more effectively. Our approach builds on top of the Dropout Q-Functions (DroQ) algorithm and alternates between an online, low UTD ratio training phase, and an offline stabilization phase. During the stabilization phase, we fine-tune the Q-functions without collecting new environment interactions. This process improves the effectiveness of the replay buffer and reduces computational overhead. Our experimental results on continuous control problems show that our method achieves results comparable to state-of-the-art, high UTD ratio algorithms while requiring 56% fewer gradient updates and 50% less training time than DroQ. Our approach offers an effective and computationally economical solution while maintaining the same sample efficiency as the more costly, high UTD ratio state-of-the-art.
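A schematic of the alternating structure described above: a low-UTD online phase that collects data, followed by an offline stabilization phase that fine-tunes the Q-functions on the existing replay buffer. All interfaces, phase lengths, and the gym-style step signature are placeholder assumptions, not the authors' implementation.

    def train(agent, env, replay_buffer,
              online_steps=10_000, offline_updates=5_000, utd_ratio=1, cycles=10):
        """Alternate online collection (low UTD) with offline Q-function stabilization."""
        obs = env.reset()
        for _ in range(cycles):
            # Online phase: interact with the environment, few gradient updates per step.
            for _ in range(online_steps):
                action = agent.act(obs)
                next_obs, reward, done, _ = env.step(action)
                replay_buffer.add(obs, action, reward, next_obs, done)
                for _ in range(utd_ratio):
                    agent.update_q(replay_buffer.sample())
                obs = env.reset() if done else next_obs

            # Stabilization phase: no new interactions, only Q-function fine-tuning.
            for _ in range(offline_updates):
                agent.update_q(replay_buffer.sample())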
https://arxiv.org/abs/2501.08669
This paper summarizes in depth the state of the art of aerial swarms, covering both classical and new reinforcement-learning-based approaches for their management. Then, it proposes a hybrid AI system, integrating deep reinforcement learning in a multi-agent centralized swarm architecture. The proposed system is tailored to perform surveillance of a specific area, searching and tracking ground targets, for security and law enforcement applications. The swarm is governed by a central swarm controller responsible for distributing different search and tracking tasks among the cooperating UAVs. Each UAV agent is then controlled by a collection of cooperative sub-agents, whose behaviors have been trained using different deep reinforcement learning models, tailored for the different task types proposed by the swarm controller. More specifically, proximal policy optimization (PPO) algorithms were used to train the agents' behavior. In addition, several metrics to assess the performance of the swarm in this application were defined. The results obtained through simulation show that our system searches the operation area effectively, acquires the targets in a reasonable time, and is capable of tracking them continuously and consistently.
https://arxiv.org/abs/2501.08655
Generative AI systems like foundation models (FMs) must align well with human values to ensure their behavior is helpful and trustworthy. While Reinforcement Learning from Human Feedback (RLHF) has shown promise for optimizing model performance using human judgments, existing RLHF pipelines predominantly rely on immediate feedback, which can fail to accurately reflect the downstream impact of an interaction on users' utility. We demonstrate that feedback based on evaluators' foresight estimates of downstream consequences systematically induces Goodhart's Law dynamics, incentivizing misaligned behaviors like sycophancy and deception and ultimately degrading user outcomes. To alleviate this, we propose decoupling evaluation from prediction by refocusing RLHF on hindsight feedback. Our theoretical analysis reveals that conditioning evaluator feedback on downstream observations mitigates misalignment and improves expected human utility, even when these observations are simulated by the AI system itself. To leverage this insight in a practical alignment algorithm, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which first simulates plausible consequences and then elicits feedback to assess what behaviors were genuinely beneficial in hindsight. We apply RLHS to two widely-employed online and offline preference optimization methods -- Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) -- and show empirically that misalignment is significantly reduced with both methods. Through an online human user study, we show that RLHS consistently outperforms RLHF in helping users achieve their goals and earns higher satisfaction ratings, despite being trained solely with simulated hindsight feedback. These results underscore the importance of focusing on long-term consequences, even simulated ones, to mitigate misalignment in RLHF.
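A schematic of the hindsight loop described above: consequences are simulated first, and evaluator feedback is elicited only after the (simulated) outcome is observed, before any preference-optimization update. The component names are placeholders for pieces the abstract describes only at a high level.

    def rlhs_round(policy, simulator, evaluator, optimizer, prompts):
        """One schematic round of Reinforcement Learning from Hindsight Simulation."""
        feedback = []
        for prompt in prompts:
            response = policy.respond(prompt)
            # Simulate plausible downstream consequences of the user acting on the response.
            outcome = simulator.rollout(prompt, response)
            # Hindsight feedback: the evaluator rates the response given the observed outcome.
            rating = evaluator.rate(prompt, response, outcome)
            feedback.append((prompt, response, rating))
        # Any preference-based optimizer (e.g. PPO- or DPO-style) can consume this feedback.
        optimizer.update(policy, feedback)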
https://arxiv.org/abs/2501.08617
As REST APIs have become widespread in modern web services, comprehensive testing of these APIs has become increasingly crucial. Due to the vast search space consisting of operations, parameters, and parameter values along with their complex dependencies and constraints, current testing tools suffer from low code coverage, leading to suboptimal fault detection. To address this limitation, we present a novel tool, AutoRestTest, which integrates the Semantic Operation Dependency Graph (SODG) with Multi-Agent Reinforcement Learning (MARL) and large language models (LLMs) for effective REST API testing. AutoRestTest determines operation-dependent parameters using the SODG and employs five specialized agents (operation, parameter, value, dependency, and header) to identify dependencies of operations and generate operation sequences, parameter combinations, and values. AutoRestTest provides a command-line interface and continuous telemetry on successful operation count, unique server errors detected, and time elapsed. Upon completion, AutoRestTest generates a detailed report highlighting errors detected and operations exercised. In this paper, we introduce our tool and present preliminary results.
https://arxiv.org/abs/2501.08600
In this paper, we propose an Adaptive Neuro-Symbolic Learning Framework for digital twin technology called "ANSR-DT". Our approach combines pattern recognition algorithms with reinforcement learning and symbolic reasoning to enable real-time learning and adaptive intelligence. This integration enhances the understanding of the environment and promotes continuous learning, leading to better and more effective decision-making in real-time for applications that require human-machine collaboration. We evaluated the ANSR-DT framework for its ability to learn and adapt to dynamic patterns, observing significant improvements in decision accuracy, reliability, and interpretability when compared to existing state-of-the-art methods. However, challenges still exist in extracting and integrating symbolic rules in complex environments, which limits the full potential of our framework in heterogeneous settings. Moreover, our ongoing research aims to address this issue in the future by ensuring seamless integration of neural models at large. In addition, our open-source implementation promotes reproducibility and encourages future research to build on our foundational work.
https://arxiv.org/abs/2501.08561
Procedural Content Generation (PCG) is widely used to create scalable and diverse environments in games. However, existing methods, such as the Wave Function Collapse (WFC) algorithm, are often limited to static scenarios and lack the adaptability required for dynamic, narrative-driven applications, particularly in augmented reality (AR) games. This paper presents a reinforcement learning-enhanced WFC framework designed for mobile AR environments. By integrating environment-specific rules and dynamic tile weight adjustments informed by reinforcement learning (RL), the proposed method generates maps that are both contextually coherent and responsive to gameplay needs. Comparative evaluations and user studies demonstrate that the framework achieves superior map quality and delivers immersive experiences, making it well-suited for narrative-driven AR games. Additionally, the method holds promise for broader applications in education, simulation training, and immersive extended reality (XR) experiences, where dynamic and adaptive environments are critical.
https://arxiv.org/abs/2501.08552
With the development of deep learning, the Dynamic Portfolio Optimization (DPO) problem has received a lot of attention in recent years, not only in the field of finance but also in the field of deep learning. Some advanced research in recent years has proposed applying Deep Reinforcement Learning (DRL) to the DPO problem, which has been shown to be more advantageous than supervised learning for this problem. However, certain issues remain unsolved: 1) DRL algorithms usually suffer from slow learning speed and high sample complexity, which is especially problematic when dealing with complex financial data; 2) researchers use DRL simply to obtain high returns but pay little attention to risk control and trading strategy, which affects the stability of model returns. To address these issues, in this study we revamped the intrinsic structure of the model based on the Deep Deterministic Policy Gradient (DDPG) and proposed the Augmented DDPG model. Besides, we also proposed an innovative risk control strategy based on Quantum Price Levels (QPLs) derived from Quantum Finance Theory (QFT). Our experimental results revealed that our model has better profitability as well as risk control ability with lower sample complexity in the DPO problem compared to the baseline models.
https://arxiv.org/abs/2501.08528
Imitation learning from human demonstrations enables robots to perform complex manipulation tasks and has recently witnessed huge success. However, these techniques often struggle to adapt behavior to new preferences or changes in the environment. To address these limitations, we propose Fine-tuning Diffusion Policy with Human Preference (FDPP). FDPP learns a reward function through preference-based learning. This reward is then used to fine-tune the pre-trained policy with reinforcement learning (RL), resulting in alignment of pre-trained policy with new human preferences while still solving the original task. Our experiments across various robotic tasks and preferences demonstrate that FDPP effectively customizes policy behavior without compromising performance. Additionally, we show that incorporating Kullback-Leibler (KL) regularization during fine-tuning prevents over-fitting and helps maintain the competencies of the initial policy.
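The KL-regularized fine-tuning step described above is commonly written as reward maximization under the learned reward \hat{r}_{\psi}, with a penalty keeping the fine-tuned policy close to the pre-trained policy; the notation is generic, not necessarily FDPP's exact objective:

    \max_{\pi_{\phi}}\;
    \mathbb{E}_{\tau \sim \pi_{\phi}}\Bigl[\textstyle\sum_{t} \hat{r}_{\psi}(s_t, a_t)\Bigr]
    \;-\; \beta\, \mathbb{E}_{s}\Bigl[
    D_{\mathrm{KL}}\bigl(\pi_{\phi}(\cdot \mid s)\,\big\|\,\pi_{\mathrm{pre}}(\cdot \mid s)\bigr)\Bigr].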
https://arxiv.org/abs/2501.08259