Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not efficiently learn optimal decision-making agents in multi-step, goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL). Specifically, our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM to efficiently explore intermediate reasoning steps that lead to the final text-based action. Next, the open-ended text output is parsed into an executable action to interact with the environment and obtain goal-directed task rewards. Finally, our framework uses these task rewards to fine-tune the entire VLM with RL. Empirically, we demonstrate that our framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7B models to outperform commercial models such as GPT-4V and Gemini. Furthermore, we find that CoT reasoning is a crucial component for performance improvement: removing it results in a significant decrease in the overall performance of our method.
https://arxiv.org/abs/2405.10292
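A minimal sketch of the loop described above (task prompt, CoT generation, parsing the open-ended text into an executable action, environment reward). The `Action: <name>` output convention, the regex parser, and the duck-typed `env`/`policy` objects are illustrative assumptions, not the paper's exact interface:

```python
import re

def parse_action(generated_text):
    """Extract the final text-based action from a CoT response.

    Assumes (hypothetically) the VLM is prompted to end its
    chain-of-thought with a line like 'Action: <name>'.
    """
    match = re.search(r"Action:\s*(\w+)", generated_text)
    return match.group(1) if match else None

def rollout(env, policy, task_description):
    """One episode: CoT generation -> parsed action -> env reward.

    The collected trajectory would then be fed to an RL update
    (e.g. PPO) over the entire VLM's parameters.
    """
    obs = env.reset()
    trajectory = []
    done = False
    while not done:
        cot_text = policy.generate(task_description, obs)  # CoT + action
        action = parse_action(cot_text)
        obs, reward, done = env.step(action)
        trajectory.append((cot_text, action, reward))
    return trajectory
```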
As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: this https URL.
https://arxiv.org/abs/2405.10255
The recent success of large language models (LLMs) has attracted widespread interest in developing role-playing conversational agents personalized to the characteristics and styles of different speakers, to enhance their abilities to perform both general and special-purpose dialogue tasks. However, the ability to personalize generated utterances to speakers, whether by humans or LLMs, has not been well studied. To bridge this gap, our study introduces a novel evaluation challenge: speaker verification in agent-generated conversations, which aims to verify whether two sets of utterances originate from the same speaker. To this end, we assemble a large dataset collection encompassing thousands of speakers and their utterances. We also develop and evaluate speaker verification models under various experimental setups. We further utilize these speaker verification models to evaluate the personalization abilities of LLM-based role-playing models. Comprehensive experiments suggest that current role-playing models fail to accurately mimic speakers, primarily due to their inherent linguistic characteristics.
https://arxiv.org/abs/2405.10150
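A minimal sketch of the verification task itself, assuming utterances have already been embedded; averaging each set into a single speaker vector and the 0.7 cosine threshold are illustrative choices, not the paper's models:

```python
import numpy as np

def same_speaker(utt_embs_a, utt_embs_b, threshold=0.7):
    """Decide whether two sets of utterance embeddings share a speaker.

    A common (here hypothetical) setup: average the per-utterance
    embeddings into one speaker vector per set, then threshold the
    cosine similarity between the two vectors.
    """
    a = np.mean(utt_embs_a, axis=0)
    b = np.mean(utt_embs_b, axis=0)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos >= threshold, cos
```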
In autonomous driving, accurately interpreting the movements of other road users and leveraging this knowledge to forecast future trajectories is crucial. This is typically achieved through the integration of map data and the tracked trajectories of various agents. Numerous methodologies combine this information into a single embedding per agent, which is then used to predict future behavior. However, these approaches have a notable drawback: they may lose exact location information during the encoding process. Although the encoding still includes general map information, the generation of valid and consistent trajectories is not guaranteed, and the predicted trajectories can stray from the actual lanes. This paper introduces a new refinement module designed to project the predicted trajectories back onto the actual map, rectifying these discrepancies and leading to more consistent predictions. This versatile module can be readily incorporated into a wide range of architectures. Additionally, we propose a novel scene encoder that handles all relations between agents and their environment in a single unified heterogeneous graph attention network. By analyzing the attention values on the different edges of this graph, we gain unique insights into the neural network's inner workings, leading to more explainable predictions.
https://arxiv.org/abs/2405.10134
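The refinement idea can be illustrated with a deliberately simple projection that snaps each predicted waypoint to the nearest lane-centerline point; a real module would blend this correction with the raw prediction rather than replace it outright:

```python
import numpy as np

def project_to_lane(trajectory, lane_centerline):
    """Snap predicted waypoints onto the nearest lane centerline points.

    A simplified stand-in for the refinement module described above:
    for each predicted 2D point, find the closest centerline sample
    and use it as the refined waypoint.
    """
    refined = []
    for point in trajectory:
        dists = np.linalg.norm(lane_centerline - point, axis=1)
        refined.append(lane_centerline[np.argmin(dists)])
    return np.array(refined)
```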
Using Unmanned Aerial Vehicles (UAVs) in search and rescue (SAR) operations to navigate challenging terrain while maintaining reliable communication with the cellular network is a promising approach. This paper proposes a novel technique employing a multi-Q-learning reinforcement learning algorithm to optimize UAV connectivity in such scenarios. We introduce a Strategic Planning Agent for efficient path planning and collision awareness, and a Real-time Adaptive Agent to maintain an optimal connection with the cellular base station. The agents are trained in a simulated environment using multi-Q-learning, encouraging them to learn from experience and adjust their decision-making to diverse terrain complexities and communication scenarios. Evaluation results reveal the significance of the approach, highlighting successful navigation in environments with varying obstacle densities and the ability to maintain optimal connectivity using different frequency bands. This work paves the way for enhanced UAV autonomy and improved communication reliability in search and rescue operations.
https://arxiv.org/abs/2405.10042
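Both agents rest on the tabular Q-learning update; the sketch below shows that shared core, with the state encodings (terrain and obstacles for the planning agent, per-band link quality for the adaptive agent) left as assumptions:

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    `q` maps state -> action -> value. In the setup described above,
    each agent would run this same update over its own state space;
    the concrete state/action encodings are illustrative assumptions.
    """
    best_next = max(q[next_state].values()) if q[next_state] else 0.0
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
```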
As natural language generation (NLG) models have become prevalent, systematically assessing the quality of machine-generated texts has become increasingly important. Recent studies introduce LLM-based evaluators that operate as reference-free metrics, demonstrating their capability to adeptly handle novel tasks. However, these models generally rely on a single-agent approach, which, we argue, imposes an inherent limit on their performance, because an LLM agent's responses carry biases, including preferences for certain text structures or content. In this work, we propose DEBATE, an NLG evaluation framework based on a multi-agent scoring system augmented with the concept of a Devil's Advocate. Within the framework, one agent is instructed to criticize the other agents' arguments, potentially resolving the biases in the LLM agents' answers. DEBATE substantially outperforms the previous state-of-the-art methods on two meta-evaluation benchmarks for NLG evaluation, SummEval and TopicalChat. We also show that the extensiveness of the debates among agents and the persona of an agent can influence the performance of the evaluators.
https://arxiv.org/abs/2405.09935
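The Devil's Advocate mechanism can be sketched as a critique-then-rescore round; the callable interfaces below are hypothetical stand-ins for prompted LLM agents, not DEBATE's actual prompts:

```python
def debate_score(text, scorers, devils_advocate, rounds=1):
    """Multi-agent scoring with a Devil's Advocate, as a minimal sketch.

    `scorers` are callables (text, critique) -> score; `devils_advocate`
    is a callable (text, scores) -> critique. Each round, the advocate
    criticizes the current scores and the scorers re-score in light of
    the critique; the final scores are averaged.
    """
    critique = None
    scores = [s(text, critique) for s in scorers]
    for _ in range(rounds):
        critique = devils_advocate(text, scores)
        scores = [s(text, critique) for s in scorers]
    return sum(scores) / len(scores)
```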
Large language models (LLMs) have recently gained significant attention in scientific discovery for their extensive knowledge and advanced reasoning capabilities. However, they encounter challenges in effectively simulating observational feedback and grounding it with language to propel advancements in physical scientific discovery. Conversely, human scientists undertake scientific discovery by formulating hypotheses, conducting experiments, and revising theories through observational analysis. Inspired by this, we propose to enhance the knowledge-driven, abstract reasoning abilities of LLMs with the computational strength of simulations. We introduce the Scientific Generative Agent (SGA), a bilevel optimization framework: LLMs act as knowledgeable and versatile thinkers, proposing scientific hypotheses and reasoning about discrete components, such as physics equations or molecular structures; meanwhile, simulations function as experimental platforms, providing observational feedback and optimizing continuous parts, such as physical parameters, via differentiability. We conduct extensive experiments to demonstrate our framework's efficacy in constitutive law discovery and molecular design, unveiling novel solutions that differ from conventional human expectations yet remain coherent upon analysis.
https://arxiv.org/abs/2405.09783
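The bilevel structure can be sketched with plain functions standing in for both levels: candidate loss functions play the role of LLM-proposed discrete hypotheses, and finite-difference gradient descent stands in for the simulator's differentiable optimization of continuous parameters:

```python
def inner_optimize(loss_fn, theta, lr=0.1, steps=100, h=1e-5):
    """Inner level: gradient-descend the continuous parameter of one
    hypothesis (finite differences stand in for differentiability)."""
    for _ in range(steps):
        grad = (loss_fn(theta + h) - loss_fn(theta - h)) / (2 * h)
        theta -= lr * grad
    return theta, loss_fn(theta)

def bilevel_search(hypotheses, theta0=0.0):
    """Outer level: iterate over discrete candidates (here plain
    (name, loss_fn) pairs in place of LLM proposals) and return the
    best (name, theta, loss) triple after inner optimization."""
    best = None
    for name, loss_fn in hypotheses:
        theta, loss = inner_optimize(loss_fn, theta0)
        if best is None or loss < best[2]:
            best = (name, theta, loss)
    return best
```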
This paper proposes a method to combine reinforcement learning (RL) and imitation learning (IL) using a dynamic, performance-based modulation of the learning signals. The proposed method combines RL with behavioral cloning (IL), or with corrective feedback in the action space (interactive IL, IIL), by dynamically weighting the losses to be optimized, taking into account the backpropagated gradients used to update the policy as well as the agent's estimated performance. In this manner, the RL and IL/IIL losses are combined by equalizing their impact on the policy's updates, while modulating said impact such that IL signals are prioritized at the beginning of the learning process and, as the agent's performance improves, the RL signals become progressively more relevant, allowing for a smooth transition from pure IL/IIL to pure RL. The proposed method is used to learn local planning policies for mobile robots, synthesizing IL/IIL signals online by means of a scripted policy. An extensive evaluation of the proposed method on this task is performed in simulations, and it is empirically shown that it outperforms pure RL in terms of sample efficiency (achieving the same level of performance in the training environment with approximately 4 times fewer experiences), while consistently producing local planning policies with better performance metrics (achieving an average success rate of 0.959 in an evaluation environment, outperforming pure RL by 12.5% and pure IL by 13.9%). Furthermore, the obtained local planning policies are successfully deployed in the real world without any major fine-tuning. The proposed method can extend existing RL algorithms and is applicable to other problems for which generating IL/IIL signals online is feasible. A video summarizing some of the real-world experiments that were conducted can be found at this https URL.
https://arxiv.org/abs/2405.09760
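The weighting rule can be sketched as follows, with gradient norms equalizing each signal's impact on the update and an estimated performance in [0, 1] shifting priority from IL to RL; this is an illustrative reading of the scheme, not the paper's exact formula:

```python
def combined_loss_weights(perf, g_rl_norm, g_il_norm, eps=1e-8):
    """Performance-modulated weights for the RL and IL/IIL losses.

    Dividing by each loss's gradient norm equalizes their impact on
    the policy update; the estimated performance `perf` in [0, 1]
    then modulates that impact so IL dominates early (perf near 0)
    and RL dominates late (perf near 1).
    """
    w_rl = perf / (g_rl_norm + eps)
    w_il = (1.0 - perf) / (g_il_norm + eps)
    return w_rl, w_il
```

The total loss would then be `w_rl * rl_loss + w_il * il_loss`, yielding a smooth transition from pure IL/IIL to pure RL as `perf` grows.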
The ability to build and leverage world models is essential for a general-purpose AI agent. Testing such capabilities is hard, in part because the building blocks of world models are ill-defined. We present Elements of World Knowledge (EWOK), a framework for evaluating world modeling in language models by testing their ability to use knowledge of a concept to match a target text with a plausible or implausible context. EWOK targets specific concepts from multiple knowledge domains known to be vital for world modeling in humans, ranging from social interactions (help/hinder) to spatial relations (left/right). Both contexts and targets are minimal pairs. Objects, agents, and locations in the items can be flexibly filled in, enabling easy generation of multiple controlled datasets. We then introduce EWOK-CORE-1.0, a dataset of 4,374 items covering 11 world knowledge domains. We evaluate 20 open-weights large language models (1.3B--70B parameters) across a battery of evaluation paradigms, along with a human norming study comprising 12,480 measurements. The overall performance of all tested models is worse than human performance, with results varying drastically across domains. These data highlight simple cases where even large models fail and present rich avenues for targeted research on LLM world modeling capabilities.
https://arxiv.org/abs/2405.09605
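The evaluation protocol reduces to a simple comparison per minimal pair; `logprob` below is a stand-in for the model's log P(target | context), and the item fields are an assumed schema:

```python
def ewok_accuracy(items, logprob):
    """Score a model on context/target minimal pairs.

    An item is counted correct when the target text is more likely
    under its plausible context than under its implausible one,
    mirroring the matching task described above.
    """
    correct = 0
    for item in items:
        good = logprob(item["plausible_context"], item["target"])
        bad = logprob(item["implausible_context"], item["target"])
        correct += good > bad
    return correct / len(items)
```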
To safely navigate intricate real-world scenarios, autonomous vehicles must be able to adapt to diverse road conditions and anticipate future events. World-model (WM) based reinforcement learning (RL) has emerged as a promising approach by learning and predicting the complex dynamics of various environments. Nevertheless, to the best of our knowledge, there does not exist an accessible platform for training and testing such algorithms in sophisticated driving environments. To fill this void, we introduce CarDreamer, the first open-source learning platform designed specifically for developing WM-based autonomous driving algorithms. It comprises three key components: 1) World model backbone: CarDreamer integrates state-of-the-art WMs, which simplifies the reproduction of RL algorithms. The backbone is decoupled from the rest of the platform and communicates via the standard Gym interface, so that users can easily integrate and test their own algorithms. 2) Built-in tasks: CarDreamer offers a comprehensive set of highly configurable driving tasks that are compatible with the Gym interface and equipped with empirically optimized reward functions. 3) Task development suite: this suite streamlines the creation of driving tasks, enabling easy definition of traffic flows and vehicle routes, along with automatic collection of multi-modal observation data. A visualization server allows users to trace real-time agent driving videos and performance metrics through a browser. Furthermore, we conduct extensive experiments using the built-in tasks to evaluate the performance and potential of WMs in autonomous driving. Thanks to the richness and flexibility of CarDreamer, we also systematically study the impact of observation modality, observability, and sharing of vehicle intentions on AV safety and efficiency. All code and documents are accessible at this https URL.
https://arxiv.org/abs/2405.09111
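The decoupled Gym-style task interface can be sketched as a skeleton; the observation fields and toy reward below are illustrative, not CarDreamer's actual task definitions:

```python
class DrivingTask:
    """Minimal Gym-style task skeleton mirroring the decoupled design
    described above: the WM backbone (or any user algorithm) only sees
    reset/step, so tasks plugging into this interface are interchangeable.
    """

    def __init__(self, max_steps=100):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return {"camera": None, "speed": 0.0}  # multi-modal observation

    def step(self, action):
        self.t += 1
        obs = {"camera": None, "speed": float(action)}
        reward = 1.0 - abs(action - 0.5)  # toy reward: hold a target speed
        done = self.t >= self.max_steps
        return obs, reward, done, {}
```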
"How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory of mind task. In this paper, we examine two major approaches enabled by recent large vision language models: 1) image captioning followed by a language-only LLM, and 2) vision language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
这个人如何感觉?在现实生活中,对一个人情感的明显认知仍是一个在计算机视觉中尚未解决的任务。仅仅依靠面部表情是不够的:身体姿势、上下文知识以及常识推理都参与了人类完成这个情感理论思维任务的方式。在本文中,我们研究了两种由最近的大型视觉语言模型推动的主要方法:1)图像标题 followed by a language-only LLM,2)在零散和微调设置下的视觉语言模型。我们在情感在上下文中(EMOTIC)数据集上评估这些方法,并证明了即使是对于小型数据集,经过微调的视觉语言模型也显著优于传统基线。本工作的结果旨在帮助机器人和代理在未来的情感敏感决策和交互中发挥作用。
https://arxiv.org/abs/2405.08992
Transformer-based long-context generative models power emerging AI applications such as hour-long video understanding and project-level coding agents. Deploying long-context transformers (e.g., 100K to 10M tokens) is prohibitively expensive compared to short-context (e.g., 4K tokens) model variants. Reducing the cost of long-context transformers has become a pressing research and engineering challenge starting from 2024. This work describes a concurrent programming framework for quantitatively analyzing the efficiency challenges of serving multiple long-context requests under a limited GPU high-bandwidth memory (HBM) budget. We give a detailed analysis of how all additional computational costs, compared to 4K context, trace back to one single source: the large size of the KV cache. We use a 34B GPT-3.5-level model with 50K context on A100 NVLink as a running example, and describe how its large KV cache causes four types of deployment challenges: (1) prefilling long inputs takes much more compute time and GPU memory than short inputs; (2) after prefilling, the large KV cache residing in GPU HBM substantially restricts the number of concurrent users that can be served; (3) during decoding, repeatedly reading the KV cache from HBM to the SMs largely increases latency; (4) when the KV cache overflows memory, swapping it from HBM to DDR causes significant context-switching latency. We use this framework to analyze existing works and identify possibilities of combining them into end-to-end systems. Overall, this work offers a foundational framework for analyzing long-context transformer deployment and identifies directions towards making 1M-context inference as cheap as 4K.
https://arxiv.org/abs/2405.08944
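The single-source claim is easy to make concrete with back-of-the-envelope arithmetic; the 48-layer, 56-head, 128-dim configuration below is an illustrative guess at a 34B-class multi-head-attention model, not the paper's exact architecture:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-request KV cache size: keys and values (the factor of 2)
    for every layer, head, and token, in fp16 (2 bytes per element)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative ~34B-class config: 48 layers, 56 heads of dimension 128.
per_token = kv_cache_bytes(48, 56, 128, 1)           # ~1.3 MiB per token
per_request = kv_cache_bytes(48, 56, 128, 50_000)    # ~64 GiB at 50K context
```

At roughly 64 GiB for a single 50K-token request, one user nearly fills an A100's 80 GB of HBM, which is exactly the concurrency restriction in challenge (2).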
This paper addresses the critical need to refine robot motions that, despite achieving high visual similarity through human-to-humanoid retargeting methods, fall short of practical execution in the physical realm. Existing techniques in the graphics community often prioritize visual fidelity over physics-based feasibility, posing a significant challenge for deploying bipedal systems in practical applications. Our research introduces a constrained reinforcement learning algorithm to produce physics-based, high-quality motion imitation on legged humanoid robots that enhances motion resemblance while successfully following the reference human trajectory. We name our framework I-CTRL. By reformulating the motion imitation problem as a constrained refinement over non-physics-based retargeted motions, our framework excels at motion imitation with simple and unique rewards that generalize across four robots. Moreover, our framework can follow large-scale motion datasets with a single RL agent. The proposed approach signifies a crucial step forward in advancing the control of bipedal robots, emphasizing the importance of aligning visual and physical realism for successful motion imitation.
https://arxiv.org/abs/2405.08726
Autonomous intersection management (AIM) poses significant challenges due to the intricate nature of real-world traffic scenarios and the need for a highly expensive centralised server in charge of simultaneously controlling all the vehicles. This study addresses such issues by proposing a novel distributed approach to AIM utilizing multi-agent reinforcement learning (MARL). We show that by leveraging the 3D surround view technology for advanced assistance systems, autonomous vehicles can accurately navigate intersection scenarios without needing any centralised controller. The contributions of this paper thus include a MARL-based algorithm for the autonomous management of a 4-way intersection and also the introduction of a new strategy called prioritised scenario replay for improved training efficacy. We validate our approach as an innovative alternative to conventional centralised AIM techniques, ensuring the full reproducibility of our results. Specifically, experiments conducted in virtual environments using the SMARTS platform highlight its superiority over benchmarks across various metrics.
https://arxiv.org/abs/2405.08655
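The prioritised-scenario-replay strategy can be sketched as a buffer that samples training scenarios in proportion to their recent failure rate, so that hard intersection configurations are rehearsed more often; the failure-rate weighting below is an assumed simplification of the paper's strategy:

```python
import random

class PrioritisedScenarioReplay:
    """Sample scenarios in proportion to how often the agents fail them.

    `stats` holds [failures, trials] per scenario, initialised to 1/1 so
    that unseen scenarios keep a nonzero sampling weight; `eps` keeps
    mastered scenarios from disappearing entirely.
    """

    def __init__(self, scenarios, eps=0.05):
        self.stats = {s: [1, 1] for s in scenarios}
        self.eps = eps

    def record(self, scenario, success):
        f, n = self.stats[scenario]
        self.stats[scenario] = [f + (not success), n + 1]

    def sample(self):
        scenarios = list(self.stats)
        weights = [f / n + self.eps for f, n in self.stats.values()]
        return random.choices(scenarios, weights=weights)[0]
```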
In the training process of Deep Reinforcement Learning (DRL), agents require repeated interactions with the environment. As training volume and model complexity increase, enhancing the data utilization and explainability of DRL training remains a challenging problem. This paper addresses these challenges by focusing on the temporal correlations within the time dimension of time series. We propose a novel approach to segment multivariate time series into meaningful subsequences and represent the time series based on these subsequences. Furthermore, the subsequences are employed for causal inference to identify the fundamental causal factors that significantly impact training outcomes. We design a module to provide feedback on this causality during DRL training. Several experiments demonstrate the feasibility of our approach in common environments, confirming its ability to enhance the effectiveness of DRL training and to impart a certain level of explainability to the training process. Additionally, we extend our approach with the prioritized experience replay algorithm, and experimental results demonstrate its continued effectiveness.
https://arxiv.org/abs/2405.08380
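The segmentation step can be illustrated with a univariate change-point toy; the method described above handles multivariate series and feeds the resulting subsequences into causal inference, which is not sketched here:

```python
def segment_series(values, threshold=1.0):
    """Split a series into subsequences at points where the jump
    between consecutive values exceeds `threshold`: a deliberately
    simple change-point stand-in for the segmentation above."""
    segments, current = [], [values[0]]
    for prev, cur in zip(values, values[1:]):
        if abs(cur - prev) > threshold:
            segments.append(current)
            current = []
        current.append(cur)
    segments.append(current)
    return segments
```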
This paper presents Seal-Tools, a new tool-learning dataset that contains self-instruct, API-like tools. Seal-Tools not only offers a large number of tools but also includes instances that demonstrate their practical application. Seeking to generate data at a large scale while ensuring reliability, we propose a self-instruct method to generate tools and instances, allowing precise control over the process. Moreover, Seal-Tools contains hard instances that call multiple tools to complete the job, some of which are nested tool calls. For precise and comprehensive evaluation, we use strict format control and design three metrics from different dimensions. Therefore, Seal-Tools can serve as a new benchmark to evaluate the tool-calling ability of LLMs. Finally, we evaluate several prevalent LLMs and our finetuned model on Seal-Tools. The results show that current systems are far from perfect. The code, data, and experiment results are available at this https URL.
https://arxiv.org/abs/2405.08355
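The strict format control and multi-dimensional metrics can be sketched as a scorer over a hypothetical JSON tool-call schema; the field names and the three dimensions below (format validity, tool-name match, exact argument match) are illustrative, not the paper's exact metrics:

```python
import json

def score_tool_call(model_output, gold):
    """Strict-format scoring sketch: the model output must parse as
    JSON, then is compared to the gold call at three granularities."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return {"format": 0, "tool": 0, "args": 0}
    return {
        "format": 1,
        "tool": int(call.get("tool") == gold["tool"]),
        "args": int(call.get("arguments") == gold["arguments"]),
    }
```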
We offer philosophical motivations for a method we call Virtual World Cognitive Science (VW CogSci), in which researchers use virtual embodied agents that are embedded in virtual worlds to explore questions in the field of Cognitive Science. We focus on questions about mental and linguistic representation and the ways that such computational modeling can add rigor to philosophical thought experiments, as well as the terminology used in the scientific study of such representations. We find that this method forces researchers to take a god's-eye view when describing dynamical relationships between entities in minds and entities in an environment in a way that eliminates the need for problematic talk of belief and concept types, such as the belief that cats are silly, and the concept CAT, while preserving belief and concept tokens in individual cognizers' minds. We conclude with some further key advantages of VW CogSci for the scientific study of mental and linguistic representation and for Cognitive Science more broadly.
https://arxiv.org/abs/2405.08304
ChatGPT is a conversational agent built on a large language model. Trained on a significant portion of human output, ChatGPT can mimic people to a degree. As such, we need to consider what social identities ChatGPT simulates (or can be designed to simulate). In this study, we explored the case of identity simulation through Japanese first-person pronouns, which are tightly connected to social identities in intersectional ways, i.e., intersectional pronouns. We conducted a controlled online experiment where people from two regions in Japan (Kanto and Kinki) witnessed interactions with ChatGPT using ten sets of first-person pronouns. We discovered that pronouns alone can evoke perceptions of social identities in ChatGPT at the intersections of gender, age, region, and formality, with caveats. This work highlights the importance of pronoun use for social identity simulation, provides a language-based methodology for culturally-sensitive persona development, and advances the potential of intersectional identities in intelligent agents.
https://arxiv.org/abs/2405.08238
Although Federated Learning (FL) is promising for knowledge sharing among heterogeneous Artificial Intelligence of Things (AIoT) devices, their training performance and energy efficiency are severely restricted in practical battery-driven scenarios due to the "wooden barrel effect" caused by the mismatch between homogeneous model paradigms and heterogeneous device capabilities. As a result, owing to the various differences among devices, it is hard for existing FL methods to conduct training effectively in energy-constrained scenarios, such as under device battery constraints. To tackle these issues, we propose an energy-aware FL framework named DR-FL, which considers the energy constraints of both clients and heterogeneous deep learning models to enable energy-efficient FL. Unlike vanilla FL, DR-FL adopts our proposed Multi-Agent Reinforcement Learning (MARL)-based dual-selection method, which allows participating devices to contribute to the global model effectively and adaptively based on their computing capabilities and energy capacities. Experiments on various well-known datasets show that DR-FL can not only maximize knowledge sharing among heterogeneous models under the energy constraints of large-scale AIoT systems but also improve the model performance of each involved heterogeneous device.
https://arxiv.org/abs/2405.08183
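The dual-selection idea (matching participating devices to model variants under energy constraints) can be sketched as a greedy heuristic; DR-FL itself learns this policy with MARL, so the scoring rule and fields below are assumptions:

```python
def dual_select(clients, model_sizes, round_budget):
    """Pair each client with the largest model variant its remaining
    energy budget (and the per-round budget) can carry, and skip
    clients whose budget covers no variant. A greedy stand-in for
    the learned MARL dual-selection policy described above."""
    plan = {}
    for name, budget in clients.items():
        feasible = [m for m, cost in model_sizes.items()
                    if cost <= min(budget, round_budget)]
        if feasible:
            plan[name] = max(feasible, key=model_sizes.get)
    return plan
```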
Mixed-integer quadratic programs (MIQPs) are a versatile way of formulating vehicle decision making and motion planning problems, where the prediction model is a hybrid dynamical system that involves both discrete and continuous decision variables. However, even the most advanced MIQP solvers can hardly account for the challenging requirements of automotive embedded platforms. Thus, we use machine learning to simplify and hence speed up optimization. Our work builds on recent ideas for solving MIQPs in real-time by training a neural network to predict the optimal values of integer variables and solving the remaining problem by online quadratic programming. Specifically, we propose a recurrent permutation equivariant deep set that is particularly suited for imitating MIQPs that involve many obstacles, which is often the major source of computational burden in motion planning problems. Our framework comprises also a feasibility projector that corrects infeasible predictions of integer variables and considerably increases the likelihood of computing a collision-free trajectory. We evaluate the performance, safety and real-time feasibility of decision-making for autonomous driving using the proposed approach on realistic multi-lane traffic scenarios with interactive agents in SUMO simulations.
https://arxiv.org/abs/2405.08122
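The feasibility projector can be illustrated on a toy binary prediction: keep the network's most confident integer assignments while enforcing a cardinality constraint standing in for the obstacle constraints; the subsequent online QP over the remaining continuous variables is not sketched here:

```python
def feasibility_projector(binary_pred, max_active):
    """Repair a predicted binary assignment so that at most
    `max_active` entries are set to 1, keeping the entries the
    network was most confident about. `binary_pred` holds predicted
    probabilities in [0, 1]; entries below 0.5 stay at 0."""
    order = sorted(range(len(binary_pred)), key=lambda i: -binary_pred[i])
    fixed = [0] * len(binary_pred)
    for i in order[:max_active]:
        if binary_pred[i] >= 0.5:
            fixed[i] = 1
    return fixed
```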