Exploration efficiency poses a significant challenge in goal-conditioned reinforcement learning (GCRL) tasks, particularly those with long horizons and sparse rewards. A primary limitation to exploration efficiency is the agent's inability to leverage environmental structural patterns. In this study, we introduce a novel framework, GEASD, designed to capture these patterns through an adaptive skill distribution during the learning process. This distribution optimizes the local entropy of achieved goals within a contextual horizon, enhancing goal-spreading behaviors and facilitating deep exploration in states containing familiar structural patterns. Our experiments reveal marked improvements in exploration efficiency using the adaptive skill distribution compared to a uniform skill distribution. Additionally, the learned skill distribution demonstrates robust generalization capabilities, achieving substantial exploration progress in unseen tasks containing similar local structures.
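To make the core idea concrete, here is a minimal sketch of an entropy-driven skill distribution, assuming achieved goals are discretized into histogram bins and the distribution is a softmax over per-skill goal entropies. The paper's actual objective optimizes local entropy within a contextual horizon; the buffers and temperature here are illustrative only.

```python
import numpy as np

def empirical_goal_entropy(goals, bins=10):
    """Shannon entropy of achieved goals, discretized into bins.
    `goals` is an (N, d) array of achieved-goal coordinates collected
    from recent rollouts (standing in for the contextual horizon)."""
    hist, _ = np.histogramdd(goals, bins=bins)
    p = hist.flatten() / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def adapt_skill_distribution(skill_goal_buffers, temperature=1.0):
    """Weight each skill by the entropy of the goals it achieved, so
    skills that spread goals more broadly are sampled more often.
    A softmax heuristic; GEASD's optimization differs in detail."""
    entropies = np.array([empirical_goal_entropy(g) for g in skill_goal_buffers])
    logits = entropies / temperature
    w = np.exp(logits - logits.max())
    return w / w.sum()

# Usage: one buffer of achieved goals per skill, from recent episodes.
rng = np.random.default_rng(0)
buffers = [rng.normal(0, s, size=(256, 2)) for s in (0.1, 0.5, 1.0)]
print(adapt_skill_distribution(buffers))  # widest-spreading skill gets most mass
```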
https://arxiv.org/abs/2404.12999
Recent advancements in LLMs have shown their significant potential in tasks like text summarization and generation. Yet, they often encounter difficulty while solving complex physics problems that require arithmetic calculation and a good understanding of concepts. Moreover, many physics problems include images that contain important details required to understand the problem's context. We propose an LMM-based chatbot to answer multimodal physics MCQs. For domain adaptation, we utilize the MM-PhyQA dataset, comprising Indian high-school-level multimodal physics problems. To improve the LMM's performance, we experiment with two techniques: image captioning and Reinforcement Learning from Human Feedback (RLHF). In image captioning, we add a detailed explanation of the diagram in each image, minimizing hallucinations and image-processing errors. For RLHF, we adopt a ranking-based approach that incorporates human feedback into the learning process to enhance the models' human-like problem-solving abilities, improving problem-solving skills, truthfulness, and reasoning, reducing hallucinations in the answers, and yielding higher quality than vanilla supervised fine-tuned models. We employ the open-source LLaVA model to answer multimodal physics MCQs and compare its performance with and without RLHF.
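The image-captioning technique amounts to injecting a textual diagram description into the model's input, so the LMM does not have to rely solely on its visual encoder. A minimal sketch with an illustrative prompt layout (not the MM-PhyQA schema):

```python
def build_mcq_prompt(question, options, diagram_caption=None):
    """Assemble a text prompt for a multimodal physics MCQ, with an
    optional human-written diagram description injected as text."""
    parts = []
    if diagram_caption:
        parts.append(f"Diagram description: {diagram_caption}")
    parts.append(f"Question: {question}")
    parts.extend(f"({label}) {text}" for label, text in options.items())
    parts.append("Answer with the letter of the correct option.")
    return "\n".join(parts)

prompt = build_mcq_prompt(
    "A block slides down the incline shown. What is its acceleration?",
    {"A": "g sin(theta)", "B": "g cos(theta)", "C": "g tan(theta)", "D": "g"},
    diagram_caption="A frictionless incline at angle theta with a block on it.",
)
print(prompt)
```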
https://arxiv.org/abs/2404.12926
Visual Reinforcement Learning is a popular and powerful framework that takes full advantage of the Deep Learning breakthrough. However, it is also known that variations in the input (e.g., different colors of the panorama due to the season of the year) or the task (e.g., changing the speed limit a car must respect) can require complete retraining of the agents. In this work, we leverage recent developments in unifying latent representations to demonstrate that it is possible to combine the components of an agent, rather than retrain it from scratch. We build upon the recent relative representations framework and adapt it for Visual RL. This allows us to create completely new agents capable of handling environment-task combinations never seen during training. Our work paves the way toward a more accessible and flexible use of reinforcement learning.
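For readers unfamiliar with the relative representations framework the paper builds on, the core trick can be sketched in a few lines: each latent is re-expressed as its cosine similarities to a shared set of anchor samples, making independently trained encoders comparable so their components can be stitched. How the paper adapts this to Visual RL differs in detail.

```python
import numpy as np

def relative_projection(z, anchors):
    """Map absolute latents to relative coordinates: cosine similarity
    of each latent to a fixed set of anchor latents. Encoders trained
    separately can then share a policy head, since both are expressed
    in the same anchor-relative frame."""
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=-1, keepdims=True)
    return z @ a.T  # shape: (batch, num_anchors)

rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 64))   # from an encoder trained on one variation
anchors = rng.normal(size=(16, 64))  # the same anchor *inputs*, encoded by that encoder
rel = relative_projection(latents, anchors)
print(rel.shape)  # (4, 16): anchor-relative representation fed to the reused head
```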
https://arxiv.org/abs/2404.12917
The sim-to-real gap poses a significant challenge in RL-based multi-agent exploration due to scene quantization and action discretization. Existing platforms suffer from inefficient sampling and a lack of diversity in Multi-Agent Reinforcement Learning (MARL) algorithms across different scenarios, restricting their widespread application. To fill these gaps, we propose MAexp, a generic platform for multi-agent exploration that integrates a broad range of state-of-the-art MARL algorithms and representative scenarios. Moreover, we employ point clouds to represent our exploration scenarios, leading to high-fidelity environment mapping and a sampling speed approximately 40 times faster than existing platforms. Furthermore, equipped with an attention-based Multi-Agent Target Generator and a Single-Agent Motion Planner, MAexp can work with arbitrary numbers of agents and accommodate various types of robots. Extensive experiments are conducted to establish the first benchmark featuring several high-performance MARL algorithms across typical scenarios for robots with continuous actions, highlighting the distinct strengths of each algorithm in different scenarios.
https://arxiv.org/abs/2404.12824
Representation rank is an important concept for understanding the role of Neural Networks (NNs) in Deep Reinforcement Learning (DRL); it measures the expressive capacity of value networks. Existing studies focus on unboundedly maximizing this rank; however, that approach introduces overly complex models into learning, thus undermining performance. Hence, fine-tuning representation rank presents a challenging and crucial optimization problem. To address this issue, we find a guiding principle for adaptive control of the representation rank. We employ the Bellman equation as a theoretical foundation and derive an upper bound on the cosine similarity between the value network's representations of consecutive state-action pairs. We then leverage this upper bound to propose a novel regularizer, namely the BEllman Equation-based automatic rank Regularizer (BEER). This regularizer adaptively regularizes the representation rank, thus improving the DRL agent's performance. We first validate the effectiveness of automatic rank control in illustrative experiments. Then, we scale BEER up to complex continuous control tasks by combining it with the deterministic policy gradient method. Across 12 challenging DeepMind control tasks, BEER outperforms the baselines by a large margin. Besides, BEER demonstrates significant advantages in Q-value approximation. Our code is available at this https URL.
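A minimal sketch of what a BEER-style regularizer could look like, assuming penultimate-layer features of the value network and a given bound `delta`; in the paper, the bound is derived from the Bellman equation rather than fixed by hand.

```python
import torch
import torch.nn.functional as F

def beer_style_penalty(phi_sa, phi_next_sa, delta):
    """Penalize the cosine similarity of consecutive state-action
    representations when it exceeds an upper bound `delta`.
    `phi_sa`, `phi_next_sa`: (batch, dim) penultimate-layer features
    of the value network for (s_t, a_t) and (s_{t+1}, a_{t+1})."""
    cos = F.cosine_similarity(phi_sa, phi_next_sa, dim=-1)
    return F.relu(cos - delta).mean()  # zero penalty while the bound holds

phi_t  = torch.randn(32, 256, requires_grad=True)
phi_t1 = torch.randn(32, 256)
loss_td = torch.tensor(0.0)  # placeholder for the usual TD loss
loss = loss_td + 0.1 * beer_style_penalty(phi_t, phi_t1, delta=0.9)
loss.backward()
```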
https://arxiv.org/abs/2404.12754
In multi-agent reinforcement learning, decentralized execution is a common approach, yet it suffers from the redundant computation problem. This occurs when multiple agents redundantly perform the same or similar computation due to overlapping observations. To address this issue, this study introduces a novel method referred to as the locally centralized team transformer (LCTT). LCTT establishes a locally centralized execution framework where selected agents serve as leaders, issuing instructions, while the remaining agents, designated as workers, act on these instructions without activating their policy networks. For LCTT, we propose the team transformer (T-Trans) architecture, which allows leaders to provide specific instructions to each worker, and a leadership shift mechanism that allows agents to autonomously decide their roles as leaders or workers. Our experimental results demonstrate that the proposed method effectively reduces redundant computation, does not decrease reward levels, and leads to faster learning convergence.
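A toy sketch of the leader/worker split, with a hand-written scoring heuristic standing in for LCTT's learned leadership shift mechanism; the point is that workers skip their policy networks entirely.

```python
import numpy as np

def assign_roles(observations, score_fn, num_leaders=1):
    """Leadership shift, crudely sketched: each agent scores its own
    observation and the top scorers become leaders; everyone else is a
    worker. The actual LCTT mechanism is learned, not this heuristic."""
    scores = np.array([score_fn(o) for o in observations])
    leaders = set(np.argsort(scores)[-num_leaders:])
    return ["leader" if i in leaders else "worker" for i in range(len(observations))]

def act(agent_id, role, obs, instructions, policy):
    # Workers execute the instruction addressed to them and skip their
    # policy network entirely -- this is where redundant compute is saved.
    if role == "worker" and agent_id in instructions:
        return instructions[agent_id]
    return policy(obs)  # leaders (and uninstructed agents) run their policy

roles = assign_roles([np.ones(4), np.zeros(4)], score_fn=np.sum)
print(roles)  # ['leader', 'worker']
```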
https://arxiv.org/abs/2404.13096
Virtual network embedding (VNE) is an essential resource allocation task in network virtualization, aiming to map virtual network requests (VNRs) onto physical infrastructure. Reinforcement learning (RL) has recently emerged as a promising solution to this problem. However, existing RL-based VNE methods are limited by a unidirectional action design and a one-size-fits-all training strategy, resulting in restricted searchability and generalizability. In this paper, we propose a FLexible And Generalizable RL framework for VNE, named FlagVNE. Specifically, we design a bidirectional-action-based Markov decision process model that enables the joint selection of virtual and physical nodes, thus improving the exploration flexibility of the solution space. To tackle the expansive and dynamic action space, we design a hierarchical decoder to generate adaptive action probability distributions and ensure high training efficiency. Furthermore, to overcome the generalization issue for varying VNR sizes, we propose a meta-RL-based training method with a curriculum scheduling strategy, facilitating specialized policy training for each VNR size. Finally, extensive experimental results show the effectiveness of FlagVNE across multiple key metrics. Our code is available at GitHub (this https URL).
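The bidirectional action can be pictured as two-level sampling: a distribution over virtual nodes, then a conditional distribution over physical nodes. A sketch under that assumption, not FlagVNE's exact hierarchical decoder:

```python
import numpy as np

def sample_bidirectional_action(virtual_logits, physical_logits_fn, rng):
    """Hierarchical decoding of a joint (virtual node, physical node)
    action: first a distribution over unplaced virtual nodes, then a
    distribution over physical nodes *conditioned on* the chosen virtual
    node. This keeps the joint action space tractable while letting the
    agent search in both directions."""
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    p_v = softmax(virtual_logits)
    v = rng.choice(len(p_v), p=p_v)
    p_p = softmax(physical_logits_fn(v))  # conditional head for node v
    p = rng.choice(len(p_p), p=p_p)
    return v, p

rng = np.random.default_rng(0)
v, p = sample_bidirectional_action(
    np.array([0.2, 1.5, -0.3]),             # 3 virtual nodes left to place
    lambda v: np.ones(8) * 0.1 + (v == 1),  # hypothetical conditional scores
    rng,
)
print(v, p)
```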
https://arxiv.org/abs/2404.12633
Advanced biological intelligence learns efficiently from an information-rich stream of stimulus information, even when feedback on behaviour quality is sparse or absent. Such learning exploits implicit assumptions about task domains. We refer to such learning as Domain-Adapted Learning (DAL). In contrast, AI learning algorithms rely on explicit externally provided measures of behaviour quality to acquire fit behaviour. This imposes an information bottleneck that precludes learning from diverse non-reward stimulus information, limiting learning efficiency. We consider the question of how biological evolution circumvents this bottleneck to produce DAL. We propose that species first evolve the ability to learn from reward signals, providing inefficient (bottlenecked) but broad adaptivity. From there, integration of non-reward information into the learning process can proceed via gradual accumulation of biases induced by such information on specific task domains. This scenario provides a biologically plausible pathway towards bottleneck-free, domain-adapted learning. Focusing on the second phase of this scenario, we set up a population of NNs with reward-driven learning modelled as Reinforcement Learning (A2C), and allow evolution to improve learning efficiency by integrating non-reward information into the learning process using a neuromodulatory update mechanism. On a navigation task in continuous 2D space, evolved DAL agents show a 300-fold increase in learning speed compared to pure RL agents. Evolution is found to eliminate reliance on reward information altogether, allowing DAL agents to learn from non-reward information exclusively, using local neuromodulation-based connection weight updates only.
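The kind of update mechanism being evolved can be sketched as a Hebbian term gated by a neuromodulatory signal derived from non-reward information; in the paper, the modulator is itself an evolved network, so this shows only the mechanism class.

```python
import numpy as np

def neuromodulated_update(W, pre, post, modulator, lr=1e-2):
    """Local, reward-free weight update: a Hebbian term (outer product
    of post- and pre-synaptic activity) gated by a neuromodulatory
    signal computed from non-reward stimulus information."""
    return W + lr * modulator * np.outer(post, pre)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 4))
pre, post = rng.normal(size=4), rng.normal(size=8)
modulator = 0.7  # e.g., a function of distance-to-landmark cues, not reward
W = neuromodulated_update(W, pre, post, modulator)
print(W.shape)
```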
https://arxiv.org/abs/2404.12631
With the flourishing development of intelligent warehousing systems, the technology of Automated Guided Vehicles (AGVs) has experienced rapid growth. Within intelligent warehousing environments, an AGV is required to safely and rapidly plan an optimal path in complex and dynamic environments. Most research has applied deep reinforcement learning to address this challenge. However, in environments with sparse extrinsic rewards, these algorithms often converge slowly, learn inefficiently, or fail to reach the target. Random Network Distillation (RND), as an exploration enhancement, can effectively improve the performance of proximal policy optimization by providing additional intrinsic rewards to the AGV agent in sparse-reward environments. Moreover, most current research continues to use 2D grid mazes as experimental environments; these environments have insufficient complexity and limited action sets. To address this limitation, we present simulation environments for AGV path planning with continuous actions and positions, bringing them closer to realistic physical scenarios. Based on our experiments and comprehensive analysis of the proposed method, the results demonstrate that our method enables the AGV to complete path planning tasks with continuous actions more rapidly in our environments. A video of part of our experiments can be found at this https URL.
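Random Network Distillation itself is standard and easy to sketch: the intrinsic reward is the prediction error of a trained network imitating a frozen, randomly initialized one, so novel states are poorly predicted and hence rewarded. A minimal version (the paper's PPO integration and network sizes will differ):

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random Network Distillation: intrinsic reward is the error of a
    trained predictor imitating a fixed, randomly initialized target."""
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)  # target stays random forever

    def intrinsic_reward(self, obs):
        err = (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)
        return err  # doubles as the predictor's training loss

rnd = RND(obs_dim=10)
obs = torch.randn(5, 10)
r_int = rnd.intrinsic_reward(obs)     # add to the extrinsic reward
loss = r_int.mean(); loss.backward()  # train the predictor only
```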
https://arxiv.org/abs/2404.12594
The widespread use of knowledge graphs in various fields has brought about a challenge in effectively integrating and updating information within them. When it comes to incorporating contexts, conventional methods often rely on rules or basic machine learning models, which may not fully grasp the complexity and fluidity of context information. This research proposes an approach based on reinforcement learning (RL), specifically utilizing Deep Q-Networks (DQN), to enhance the process of integrating contexts into knowledge graphs. By treating the state of the knowledge graph as the environment state, defining actions as operations for integrating contexts, and using a reward function to gauge the improvement in knowledge graph quality after integration, this method aims to automatically develop strategies for optimal context integration. Our DQN model uses neural networks as function approximators, continually updating Q-values to estimate the action-value function, thus enabling effective integration of intricate and dynamic context information. Initial experimental findings show that our RL method outperforms existing techniques in achieving precise context integration across various standard knowledge graph datasets, highlighting the potential and effectiveness of reinforcement learning in enhancing and managing knowledge graphs.
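Stripped of the knowledge-graph specifics, the learning loop is a standard DQN update. A minimal sketch, assuming the graph state is already embedded as a fixed-size vector and actions index candidate integration operations (both assumptions, since the abstract does not specify encodings):

```python
import torch
import torch.nn as nn

# Q-network and target network over an assumed 64-dim graph embedding
# and 16 candidate context-integration operations; the reward is the
# measured quality improvement after applying the chosen operation.
q_net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 16))
target_net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 16))
target_net.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_step(s, a, r, s_next, gamma=0.99):
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values
    loss = nn.functional.smooth_l1_loss(q, target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

batch = (torch.randn(32, 64), torch.randint(0, 16, (32,)),
         torch.randn(32), torch.randn(32, 64))
print(dqn_step(*batch))
```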
https://arxiv.org/abs/2404.12587
The widespread adoption of electric vehicles (EVs) poses several challenges to power distribution networks and smart grid infrastructure due to the possibility of significantly increased electricity demand, especially during peak hours. Furthermore, when EVs participate in demand-side management programs, charging expenses can be reduced by using optimal charging control policies that fully utilize real-time pricing schemes. However, devising optimal charging methods and control strategies for EVs is challenging due to various stochastic and uncertain environmental factors. Currently, most EV charging controllers operate based on a centralized model. In this paper, we introduce a novel distributed and cooperative charging strategy based on a Multi-Agent Reinforcement Learning (MARL) framework. Our method builds upon the Deep Deterministic Policy Gradient (DDPG) algorithm for a group of EVs in a residential community, where all EVs are connected to a shared transformer. This method, referred to as CTDE-DDPG, adopts a Centralized Training Decentralized Execution (CTDE) approach to establish cooperation between agents during the training phase, while ensuring distributed and privacy-preserving operation during execution. We theoretically examine the performance of centralized and decentralized critics for the DDPG-based MARL implementation and demonstrate their trade-offs. Furthermore, we numerically explore the efficiency, scalability, and performance of centralized and decentralized critics. Our theoretical and numerical results indicate that, despite higher policy-gradient variance and training complexity, the CTDE-DDPG framework significantly improves charging efficiency, reducing total variation by approximately 36% and charging cost by around 9.1% on average...
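The CTDE pattern can be sketched as a critic that consumes every agent's observation and action during training, while each actor maps only its own observation to a charging rate and is all that runs at execution time. Dimensions and the toy actor below are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CentralizedCritic(nn.Module):
    """Joint Q-function over all EV agents' observations and actions
    (e.g., per-household demand, price, state of charge). Used only
    during training; execution needs only the per-agent actors."""
    def __init__(self, n_agents, obs_dim, act_dim):
        super().__init__()
        self.q = nn.Sequential(
            nn.Linear(n_agents * (obs_dim + act_dim), 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, all_obs, all_acts):  # (B, n, obs), (B, n, act)
        x = torch.cat([all_obs.flatten(1), all_acts.flatten(1)], dim=-1)
        return self.q(x)                   # joint Q-value, (B, 1)

actor = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
critic = CentralizedCritic(n_agents=4, obs_dim=8, act_dim=1)
obs = torch.randn(32, 4, 8)
acts = torch.stack([actor(obs[:, i]) for i in range(4)], dim=1)
print(critic(obs, acts).shape)  # torch.Size([32, 1])
```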
https://arxiv.org/abs/2404.12520
There have been growing discussions on estimating and subsequently reducing the operational carbon footprint of enterprise data centers. The design and intelligent control of data centers have an important impact on their carbon footprint. In this paper, we showcase PyDCM, a Python library that enables extremely fast prototyping of data center designs and applies reinforcement-learning-enabled control to evaluate key sustainability metrics, including carbon footprint and energy consumption, and to observe temperature hotspots. We demonstrate these capabilities of PyDCM and compare them to existing work on modeling data centers in EnergyPlus. PyDCM can also be used as a standalone Gymnasium environment for demonstrating sustainability-focused data center control.
https://arxiv.org/abs/2404.12498
Model-free control strategies such as reinforcement learning have shown the ability to learn control strategies without requiring an accurate model or simulator of the world. While this is appealing due to the lack of modeling requirements, such methods can be sample inefficient, making them impractical in many real-world domains. On the other hand, model-based control techniques leveraging accurate simulators can circumvent these challenges and use a large amount of cheap simulation data to learn controllers that can effectively transfer to the real world. The challenge with such model-based techniques is the requirement for an extremely accurate simulation, which demands the specification of both appropriate simulation assets and physical parameters. This requires considerable human effort to design for every environment being considered. In this work, we propose a learning system that can leverage a small amount of real-world data to autonomously refine a simulation model and then plan an accurate control strategy that can be deployed in the real world. Our approach critically relies on utilizing an initial (possibly inaccurate) simulator to design effective exploration policies that, when deployed in the real world, collect high-quality data. We demonstrate the efficacy of this paradigm in identifying articulation, mass, and other physical parameters in several challenging robotic manipulation tasks, and illustrate that only a small amount of real-world data can allow for effective sim-to-real transfer. Project website at this https URL
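The simulation-refinement step can be sketched as fitting physical parameters so that simulated rollouts match real ones. Random search over candidates stands in for whatever estimator the system actually uses, and the `simulate` interface is hypothetical.

```python
import numpy as np

def refine_sim_params(simulate, real_trajs, candidates):
    """Pick the physical parameters under which the simulator best
    reproduces real-world rollouts, by trajectory-matching error.
    `simulate(params, actions)` is an assumed simulator interface."""
    def discrepancy(params):
        return np.mean([np.sum((simulate(params, t["actions"]) - t["states"])**2)
                        for t in real_trajs])
    return min(candidates, key=discrepancy)

# Toy example: identify a mass-like gain in x' = a / mass.
def simulate(params, actions):
    return np.cumsum(actions / params["mass"])

rng = np.random.default_rng(0)
actions = rng.normal(size=50)
real = [{"actions": actions, "states": np.cumsum(actions / 2.0)}]  # true mass = 2
cands = [{"mass": m} for m in np.linspace(0.5, 4.0, 36)]
print(refine_sim_params(simulate, real, cands))  # ~ {'mass': 2.0}
```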
https://arxiv.org/abs/2404.12308
Language models (LMs) trained on vast quantities of text data can acquire sophisticated skills such as generating summaries, answering questions or generating code. However, they also manifest behaviors that violate human preferences, e.g., they can generate offensive content, falsehoods or perpetuate social biases. In this thesis, I explore several approaches to aligning LMs with human preferences. First, I argue that aligning LMs can be seen as Bayesian inference: conditioning a prior (base, pretrained LM) on evidence about human preferences (Chapter 2). Conditioning on human preferences can be implemented in numerous ways. In Chapter 3, I investigate the relation between two approaches to finetuning pretrained LMs using feedback given by a scoring function: reinforcement learning from human feedback (RLHF) and distribution matching. I show that RLHF can be seen as a special case of distribution matching, but distribution matching is strictly more general. In Chapter 4, I show how to extend distribution matching to conditional language models. Finally, in Chapter 5, I explore a different route: conditioning an LM on human preferences already during pretraining. I show that involving human feedback from the very start tends to be more effective than using it only during supervised finetuning. Overall, these results highlight the room for alignment techniques different from and complementary to RLHF.
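The Chapter 2-3 connection can be stated in one standard formula: the optimum of the KL-regularized RLHF objective is the pretrained prior reweighted by exponentiated reward, i.e., a Bayesian posterior obtained by conditioning the prior π₀ on preference evidence encoded by r(x) (β is the KL coefficient):

```latex
\max_{\pi} \; \mathbb{E}_{x \sim \pi}\left[ r(x) \right]
  - \beta \, \mathrm{KL}\!\left( \pi \,\|\, \pi_0 \right)
\quad\Longrightarrow\quad
\pi^{*}(x) = \frac{1}{Z}\, \pi_0(x)\, \exp\!\left( r(x)/\beta \right),
\qquad Z = \sum_{x} \pi_0(x)\, \exp\!\left( r(x)/\beta \right).
```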
https://arxiv.org/abs/2404.12150
The effectiveness of traffic light control has been significantly improved by current reinforcement learning-based approaches via better cooperation among multiple traffic lights. However, a persistent issue remains: how to obtain a multi-agent traffic signal control algorithm with remarkable transferability across diverse cities? In this paper, we propose a Transformer on Transformer (TonT) model for cross-city meta multi-agent traffic signal control, named X-Light: we input the full Markov Decision Process trajectories, the Lower Transformer aggregates the states, actions, and rewards among the target intersection and its neighbors within a city, and the Upper Transformer learns general decision trajectories across different cities. This dual-level approach bolsters the model's robust generalization and transferability. Notably, when transferring directly to unseen scenarios, ours surpasses all baseline methods by +7.91% on average, and even +16.3% in some cases, yielding the best results.
https://arxiv.org/abs/2404.12090
Traditional trajectory planning methods for autonomous vehicles have several limitations. Heuristics and explicit simple rules make trajectories lack generality and complex motion. One approach to resolving these limitations of traditional trajectory planning methods is trajectory planning using reinforcement learning. However, reinforcement learning suffers from learning instability, and prior works on trajectory planning using reinforcement learning did not consider uncertainties. In this paper, we propose a trajectory planning method for autonomous vehicles using reinforcement learning. The proposed method includes an iterative reward prediction method that stabilizes the learning process, and an uncertainty propagation method that makes the reinforcement learning agent aware of uncertainties. The proposed method is evaluated in the CARLA simulator. Compared to the baseline method, we reduce the collision rate by 60.17% and increase the average reward by a factor of 30.82.
https://arxiv.org/abs/2404.12079
In this review paper, we delve into the realm of Large Language Models (LLMs), covering their foundational principles, diverse applications, and nuanced training processes. The article sheds light on the mechanics of in-context learning and a spectrum of fine-tuning approaches, with a special focus on methods that optimize efficiency in parameter usage. Additionally, it explores how LLMs can be more closely aligned with human preferences through innovative reinforcement learning frameworks and other novel methods that incorporate human feedback. The article also examines the emerging technique of retrieval augmented generation, integrating external knowledge into LLMs. The ethical dimensions of LLM deployment are discussed, underscoring the need for mindful and responsible application. Concluding with a perspective on future research trajectories, this review offers a succinct yet comprehensive overview of the current state and emerging trends in the evolving landscape of LLMs, serving as an insightful guide for both researchers and practitioners in artificial intelligence.
https://arxiv.org/abs/2404.11973
Multi-object transport using multi-robot systems has the potential for diverse practical applications such as delivery services, owing to its efficient individual and scalable cooperative transport. However, allocating transportation tasks for objects with unknown weights remains challenging. Moreover, the presence of infeasible tasks (untransportable objects) can lead to robot stoppage (deadlock). This paper proposes a framework for dynamic task allocation that involves storing task experiences for each task in a manner that is scalable with respect to the number of robots. First, these experiences are broadcast from the cloud server to the entire robot system. Subsequently, each robot learns an exclusion level for each task based on those task experiences, enabling it to exclude infeasible tasks and reset its task priorities. Finally, individual transportation, cooperative transportation, and the temporary exclusion of tasks considered infeasible are achieved. The scalability and versatility of the proposed method were confirmed through numerical experiments with increased numbers of robots and objects, including objects with unlearned weights. The effectiveness of the temporary deadlock avoidance was also confirmed by introducing additional robots within an episode. The proposed method enables the implementation of task allocation strategies that are feasible for different numbers of robots and various transport tasks without prior consideration of feasibility.
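A counter-based sketch of exclusion levels, with a fixed threshold and decay standing in for the learned levels described above; the decay is what makes the exclusion temporary, so tasks can be retried once conditions change (say, after reinforcements arrive).

```python
from collections import defaultdict

class ExclusionTable:
    """Task exclusion, crudely sketched: every robot receives broadcast
    task outcomes and raises an exclusion level for tasks that keep
    failing (e.g., an object no coalition has managed to move). Tasks
    above threshold are temporarily skipped, breaking deadlocks; decay
    lets robots retry them later. The paper learns these levels."""
    def __init__(self, threshold=3, decay=0.9):
        self.level = defaultdict(float)
        self.threshold, self.decay = threshold, decay

    def update(self, task_id, succeeded):
        self.level[task_id] = 0.0 if succeeded else self.level[task_id] + 1.0

    def feasible(self, tasks):
        for t in tasks:
            self.level[t] *= self.decay  # exclusion fades over time
        return [t for t in tasks if self.level[t] < self.threshold]

table = ExclusionTable()
for _ in range(4):
    table.update("heavy_crate", succeeded=False)  # broadcast failures
print(table.feasible(["heavy_crate", "small_box"]))  # ['small_box']
```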
https://arxiv.org/abs/2404.11817
The rapid evolution of text-to-image diffusion models has opened the door to generative AI, enabling the translation of textual descriptions into visually compelling images of remarkable quality. However, a persistent challenge within this domain is the optimization of prompts to effectively convey abstract concepts as concrete objects. For example, text encoders can hardly express "peace", while they can easily illustrate olive branches and white doves. This paper introduces a novel approach named Prompt Optimizer for Abstract Concepts (POAC), specifically designed to enhance the performance of text-to-image diffusion models in interpreting and generating images from abstract concepts. We propose a Prompt Language Model (PLM), which is initialized from a pre-trained language model and then fine-tuned with a curated dataset of abstract concept prompts. The dataset is created with GPT-4 to extend each abstract concept into a scene and concrete objects. Our framework employs a Reinforcement Learning (RL)-based optimization strategy, focusing on the alignment between the images generated by a stable diffusion model and the optimized prompts. Through extensive experiments, we demonstrate that our proposed POAC significantly improves the accuracy and aesthetic quality of generated images, particularly in the description of abstract concepts and alignment with optimized prompts. We also present a comprehensive analysis of our model's performance across diffusion models under different settings, showcasing its versatility and effectiveness in enhancing abstract concept representation.
https://arxiv.org/abs/2404.11589
Temporal logics, such as linear temporal logic (LTL), offer a precise means of specifying tasks for (deep) reinforcement learning (RL) agents. In our work, we consider the setting where the task is specified by an LTL objective and there is an additional scalar reward that we need to optimize. Previous works either focus on learning an LTL-satisfying policy alone or are restricted to finite state spaces. We make two contributions: First, we introduce an RL-friendly approach to this setting by formulating the problem as a single optimization objective. Our formulation guarantees that an optimal policy will be reward-maximal among the set of policies that maximize the likelihood of satisfying the LTL specification. Second, we address a sparsity issue that often arises for LTL-guided deep RL policies by introducing Cycle Experience Replay (CyclER), a technique that automatically guides RL agents towards satisfying an LTL specification. Our experiments demonstrate the efficacy of CyclER in finding performant deep RL policies in both continuous and discrete experimental domains.
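One way to picture this setting is a product of the MDP with an automaton for the LTL formula, with the automaton's acceptance folded into the reward. The sketch below uses a crude linear scalarization and a toy "eventually reach goal" automaton; the paper's single-objective formulation and CyclER's cycle-based guidance are more refined than this, so treat it only as an illustration of the product construction.

```python
def product_step(env_step, automaton_step, state, q, action, lam=0.5):
    """One step in the product of the MDP and an LTL automaton.
    `automaton_step(q, labels) -> (q', accepting)` is assumed given;
    the scalarized reward mixes the environment reward with a bonus
    for reaching accepting automaton states."""
    next_state, env_reward, done, labels = env_step(state, action)
    q_next, accepting = automaton_step(q, labels)
    reward = env_reward + lam * float(accepting)
    return (next_state, q_next), reward, done

# Toy: "eventually reach goal" as a 2-state automaton over label sets.
def automaton_step(q, labels):
    if q == 0 and "goal" in labels:
        return 1, True
    return q, q == 1

def env_step(state, action):
    s = state + action
    return s, -0.01, s >= 3, ({"goal"} if s >= 3 else set())

(s, q), r, done = product_step(env_step, automaton_step, 2, 0, 1)
print(s, q, r, done)  # 3 1 0.49 True
```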
https://arxiv.org/abs/2404.11578