While neural implicit representations have gained popularity in multi-view 3D reconstruction, previous work struggles to yield physically plausible results, limiting their applications in physics-demanding domains like embodied AI and robotics. The lack of plausibility originates from both the absence of physics modeling in existing pipelines and their inability to recover intricate geometrical structures. In this paper, we introduce PhyRecon, the first approach to harness both differentiable rendering and differentiable physics simulation to learn implicit surface representations. Our framework introduces a novel differentiable particle-based physical simulator seamlessly integrated with the neural implicit representation. At its core is an efficient transformation between the SDF-based implicit representation and explicit surface points, performed by our proposed algorithm, Surface Points Marching Cubes (SP-MC), which enables differentiable learning with both rendering and physical losses. Moreover, we model both rendering and physical uncertainty to identify and compensate for inconsistent and inaccurate monocular geometric priors. The physical uncertainty additionally enables physics-guided pixel sampling to enhance the learning of slender structures. By amalgamating these techniques, our model facilitates efficient joint modeling of appearance, geometry, and physics. Extensive experiments demonstrate that PhyRecon significantly outperforms all state-of-the-art methods in reconstruction quality. Our reconstructions also exhibit superior physical stability, verified in Isaac Gym, with at least a 40% improvement across all datasets, opening broader avenues for future physics-based applications.
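The abstract does not include code; as a rough illustration of the SDF-to-surface-point conversion that SP-MC performs, here is a minimal Python sketch, assuming a PyTorch `sdf_net` mapping 3D points to signed distances. The projection step shown (marching-cubes vertices snapped onto the zero level set via SDF gradients) is a generic technique for making extracted surface points differentiable, not the authors' exact algorithm.

```python
import torch
from skimage.measure import marching_cubes

def extract_surface_points(sdf_net, resolution=64, bound=1.0):
    """Sketch of an SP-MC-style conversion: dense SDF grid -> explicit,
    differentiable surface points. `sdf_net` is assumed to map (N, 3) -> (N, 1)."""
    # 1) Evaluate the SDF on a dense grid (no gradients needed here).
    xs = torch.linspace(-bound, bound, resolution)
    grid = torch.stack(torch.meshgrid(xs, xs, xs, indexing="ij"), dim=-1).reshape(-1, 3)
    with torch.no_grad():
        sdf = sdf_net(grid).reshape(resolution, resolution, resolution)

    # 2) Non-differentiable marching cubes gives approximate surface vertices.
    verts, _, _, _ = marching_cubes(sdf.cpu().numpy(), level=0.0)
    verts = torch.tensor(verts, dtype=torch.float32)
    verts = verts / (resolution - 1) * 2 * bound - bound  # voxel -> world coords

    # 3) Differentiable projection onto the zero level set:
    #    p' = p - sdf(p) * normalize(grad sdf(p)), which re-attaches the surface
    #    points to the network so physical losses can backpropagate into it.
    verts.requires_grad_(True)
    d = sdf_net(verts)
    (grad,) = torch.autograd.grad(d.sum(), verts, create_graph=True)
    normal = grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)
    return verts - d * normal
```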
https://arxiv.org/abs/2404.16666
Semi-supervised action recognition aims to improve spatio-temporal reasoning with a few labeled samples in conjunction with a large amount of unlabeled data. Despite recent advancements, existing powerful methods are still prone to making ambiguous predictions when labeled data is scarce, manifesting as difficulty in distinguishing different actions with similar spatio-temporal information. In this paper, we approach this problem by empowering the model with two capabilities, namely discriminative spatial modeling and temporal structure modeling, for learning discriminative spatio-temporal representations. Specifically, we propose an Adaptive Contrastive Learning (ACL) strategy. It assesses the confidence of all unlabeled samples via the class prototypes of the labeled data, and adaptively selects positive-negative samples from a pseudo-labeled sample bank to construct contrastive learning. Additionally, we introduce a Multi-scale Temporal Learning (MTL) strategy. It highlights informative semantics from long-term clips and integrates them into the short-term clip while suppressing noisy information. These two techniques are then integrated into a unified framework to encourage the model to make accurate predictions. Extensive experiments on UCF101, HMDB51, and Kinetics400 show the superiority of our method over prior state-of-the-art approaches.
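As a hedged illustration of the prototype-based confidence scoring and positive/negative selection that ACL describes, here is a minimal PyTorch sketch; the function name, thresholding scheme, and InfoNCE form are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def acl_loss(feat_u, prototypes, bank_feats, bank_labels, tau=0.1, conf_thresh=0.8):
    """Illustrative ACL-style step: score an unlabeled clip against class
    prototypes, pseudo-label it if confident, and contrast it against
    same/other-class features from a pseudo-labeled sample bank."""
    feat_u = F.normalize(feat_u, dim=-1)             # (D,)
    prototypes = F.normalize(prototypes, dim=-1)     # (C, D) class prototypes
    probs = F.softmax(prototypes @ feat_u / tau, dim=0)
    conf, pseudo = probs.max(dim=0)
    if conf < conf_thresh:                           # skip low-confidence samples
        return feat_u.new_zeros(())

    bank_feats = F.normalize(bank_feats, dim=-1)     # (N, D) bank features
    sims = bank_feats @ feat_u / tau                 # (N,)
    pos = sims[bank_labels == pseudo]                # same pseudo-class -> positives
    neg = sims[bank_labels != pseudo]                # other classes -> negatives
    # InfoNCE over the adaptively selected positive/negative sets.
    return -torch.log(pos.exp().sum() / (pos.exp().sum() + neg.exp().sum()))
```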
https://arxiv.org/abs/2404.16416
Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises 31,325 meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results for 30 LVLMs, such as the proprietary GPT-4V and GeminiProVision and the open-source InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
https://arxiv.org/abs/2404.16006
Embodied reasoning systems integrate robotic hardware and cognitive processes to perform complex tasks, typically in response to a natural language query about a specific physical environment. This usually involves changing the belief about the scene or physically interacting with and changing the scene (e.g., 'Sort the objects from lightest to heaviest'). To facilitate the development of such systems, we introduce a new simulation environment that uses the MuJoCo physics engine and the high-quality renderer Blender to provide realistic visual observations that are also accurate to the physical state of the scene. Together with the simulator, we propose a new benchmark composed of 10 classes of multi-step reasoning scenarios that require simultaneous visual and physical measurements. Finally, we develop a new modular Closed Loop Interactive Reasoning (CLIER) approach that takes into account measurements of non-visual object properties, changes in the scene caused by external disturbances, and uncertain outcomes of robotic actions. We extensively evaluate our reasoning approach in simulation and in real-world manipulation tasks, with success rates above 76% and 64%, respectively.
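As a hedged sketch of the closed-loop pattern the abstract describes (observe, plan, act, re-observe), here is a minimal Python skeleton; the callables, belief format, and loop structure are assumptions standing in for CLIER's modules, not the paper's interface.

```python
def clier_loop(query, perceive, plan_action, execute, max_steps=20):
    """Illustrative closed-loop interactive reasoning cycle: maintain a
    belief about the scene, act, and re-observe until the query can be
    answered or the step budget runs out."""
    belief = perceive(None)                    # initial visual + physical estimate
    for _ in range(max_steps):
        action, answer = plan_action(query, belief)
        if answer is not None:                 # enough evidence to answer the query
            return answer
        outcome = execute(action)              # may fail or be externally disturbed
        belief = perceive(outcome)             # update belief from new observations
    return None
```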
https://arxiv.org/abs/2404.15194
Natural language interfaces to embodied AI are becoming more ubiquitous in our daily lives. This opens further opportunities for language-based interaction with embodied agents, such as a user instructing an agent to execute some task in a specific location. For example, "put the bowls back in the cupboard next to the fridge" or "meet me at the intersection under the red sign." As such, we need methods that interface between natural language and map representations of the environment. To this end, we explore the question of whether we can use an open-set natural language query to identify a scene represented by a 3D scene graph. We define this task as "language-based scene-retrieval" and it is closely related to "coarse-localization," but we are instead searching for a match from a collection of disjoint scenes and not necessarily a large-scale continuous map. Therefore, we present Text2SceneGraphMatcher, a "scene-retrieval" pipeline that learns joint embeddings between text descriptions and scene graphs to determine if they are matched. The code, trained models, and datasets will be made public.
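As a hedged illustration of the scene-retrieval step (not the paper's training pipeline), here is a minimal PyTorch sketch that ranks candidate scene graphs against a text query in a shared embedding space; `text_encoder` and `graph_encoder` stand in for the learned joint encoders and are assumptions.

```python
import torch
import torch.nn.functional as F

def retrieve_scene(text_query, scenes, text_encoder, graph_encoder, k=1):
    """Illustrative language-based scene retrieval: embed the query and
    every candidate scene graph, then rank by cosine similarity."""
    q = F.normalize(text_encoder(text_query), dim=-1)                         # (D,)
    g = F.normalize(torch.stack([graph_encoder(s) for s in scenes]), dim=-1)  # (N, D)
    scores = g @ q                    # cosine similarity per disjoint scene
    return scores.topk(k).indices     # indices of the best-matching scenes
```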
https://arxiv.org/abs/2404.14565
Embodied Instruction Following (EIF) is the task of executing natural language instructions by navigating and interacting with objects in 3D environments. One of the primary challenges in EIF is compositional task planning, which is often addressed with supervised or in-context learning on labeled data. To this end, we introduce the Socratic Planner, the first zero-shot planning method that infers without the need for any training data. The Socratic Planner first decomposes the instruction into substructural information about the task through self-questioning and answering, translating it into a high-level plan, i.e., a sequence of subgoals. Subgoals are executed sequentially, with our visually grounded re-planning mechanism adjusting plans dynamically through dense visual feedback. We also introduce an evaluation metric for high-level plans, RelaxedHLP, for a more comprehensive evaluation. Experiments demonstrate the effectiveness of the Socratic Planner, achieving competitive performance on both zero-shot and few-shot task planning in the ALFRED benchmark, particularly excelling in tasks requiring higher-dimensional inference. Additionally, precise adjustments to the plans were achieved by incorporating environmental visual information.
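As a hedged sketch of the self-questioning decomposition described above, here is a minimal Python loop; the prompt wording, the fixed round count, and the `llm` text-in/text-out callable are all assumptions for illustration, not the authors' prompts.

```python
def socratic_plan(instruction, llm):
    """Illustrative zero-shot decomposition: the model asks itself questions
    about the instruction, answers them, and then emits a high-level plan
    as an ordered list of subgoals."""
    qa_log = []
    for _ in range(3):  # a few rounds of self-questioning and answering
        question = llm(f"Instruction: {instruction}\n"
                       f"Known so far: {qa_log}\n"
                       "Ask one question about an unstated precondition or object.")
        answer = llm(f"Answer concisely from commonsense knowledge: {question}")
        qa_log.append((question, answer))
    plan = llm(f"Instruction: {instruction}\nFacts: {qa_log}\n"
               "Output an ordered list of subgoals, one per line.")
    return [line.strip() for line in plan.splitlines() if line.strip()]
```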
https://arxiv.org/abs/2404.15190
Social robots, owing to their embodied physical presence in human spaces and their ability to directly interact with users and their environment, have great potential to support children in various activities in education, healthcare, and daily life. Child-Robot Interaction (CRI), like any domain involving children, inevitably faces the major challenge of designing generalized strategies that work with unique, turbulent, and very diverse individuals. Addressing this challenging endeavor requires combining the robot-centered perspective, i.e. what robots technically can and are best positioned to do, with the child-centered perspective, i.e. what children may gain from the robot and how the robot should act to best support them in reaching the goals of the interaction. This article aims to help researchers bridge the two perspectives and proposes to address the development of CRI scenarios with insights from child psychology and child development theories. To that end, we review the outcomes of CRI studies, outline common trends and challenges, and identify two key factors from child psychology that impact child-robot interactions, especially in a long-term perspective: developmental stage and individual characteristics. For both, we discuss prospective experiment designs that support building naturally engaging and sustainable interactions.
https://arxiv.org/abs/2404.13432
Embodied agents operating in complex and uncertain environments face considerable challenges. While some advanced agents handle complex manipulation tasks with proficiency, their success often hinges on extensive training data to develop their capabilities. In contrast, humans typically rely on recalling past experiences and analogous situations to solve new problems. Aiming to emulate this human approach in robotics, we introduce the Retrieval-Augmented Embodied Agent (RAEA). This innovative system equips robots with a form of shared memory, significantly enhancing their performance. Our approach integrates a policy retriever, allowing robots to access relevant strategies from an external policy memory bank based on multi-modal inputs. Additionally, a policy generator is employed to assimilate these strategies into the learning process, enabling robots to formulate effective responses to tasks. Extensive testing of RAEA in both simulated and real-world scenarios demonstrates its superior performance over traditional methods, representing a major leap forward in robotic technology.
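As a hedged sketch of the policy-retrieval step described above (not RAEA's actual API), here is a minimal PyTorch nearest-neighbor lookup against an external policy memory bank; all tensor shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F

def retrieve_policies(obs_embedding, memory_keys, memory_policies, k=4):
    """Illustrative policy retrieval: match the current multi-modal
    observation embedding against the keys of an external policy memory
    bank and return the top-k stored strategies for the policy generator."""
    q = F.normalize(obs_embedding, dim=-1)      # (D,) current observation
    keys = F.normalize(memory_keys, dim=-1)     # (N, D) memory bank keys
    idx = (keys @ q).topk(k).indices            # nearest entries by cosine
    return [memory_policies[i] for i in idx]    # strategies to condition on
```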
https://arxiv.org/abs/2404.11699
In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only partial egocentric views of the world. To effectively plan in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics conditioned on an arbitrary number of agents' actions given only partial egocentric visual observations of the world. To address this issue of partial observability, we first train generative models to estimate the overall world state given partial egocentric observations. To enable accurate simulation of multiple sets of actions on this world state, we then propose to learn a compositional world model for multi-agent cooperation by factorizing the naturally composable joint actions of multiple agents and compositionally generating the video. By leveraging this compositional world model, in combination with Vision Language Models to infer the actions of other agents, we can use a tree search procedure to integrate these modules and facilitate online cooperative planning. To evaluate the efficacy of our methods, we create two challenging embodied multi-agent long-horizon cooperation tasks using the ThreeDWorld simulator and conduct experiments with 2-4 agents. The results show our compositional world model is effective and the framework enables the embodied agents to cooperate efficiently with different agents across various tasks and an arbitrary number of agents, showing the promising future of our proposed framework. More videos can be found at this https URL.
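As a very loose sketch of the factorization idea (each agent's action affects the latent world state independently before the effects are composed), here is a minimal PyTorch fragment; `effect_net`, `composer`, and the additive composition are assumptions, not the paper's architecture.

```python
import torch

def compositional_step(state, actions, effect_net, composer):
    """Illustrative factorized world-model step: predict each agent's
    effect on the latent state independently, then compose the effects
    into the next state, so the model scales to any number of agents."""
    effects = [effect_net(state, a) for a in actions]    # one pass per agent action
    return composer(state, torch.stack(effects).sum(0))  # compose into next state
```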
https://arxiv.org/abs/2404.10775
Autonomous embodied agents live on an Internet of multimedia websites. Can they hop around multimodal websites to complete complex user tasks? Existing benchmarks fail to assess them in a realistic, evolving environment for their embodiment across websites. To answer this question, we present MMInA, a multihop and multimodal benchmark to evaluate the embodied agents for compositional Internet tasks, with several appealing properties: 1) Evolving real-world multimodal websites. Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,050 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to autonomously extract multimodal information from web pages as observations; 2) Multihop web browsing. Our dataset features naturally compositional tasks that require information from or actions on multiple websites to solve, to assess long-range reasoning capabilities on web tasks; 3) Holistic evaluation. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks. We experiment with both standalone (multimodal) language models and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We identify that agents are more likely to fail on the early hops when solving tasks of more hops, which results in lower task success rates. To address this issue, we propose a simple memory augmentation approach that replays past action trajectories for reflection. Our method significantly improved both the single-hop and multihop web browsing abilities of agents. See our code and data at this https URL
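As a hedged sketch of the trajectory-replay memory augmentation, here is a minimal Python prompt builder; the prompt format and the `url`/`action` step fields are assumptions for illustration, not the benchmark's actual interface.

```python
def build_prompt(task, page_obs, trajectory):
    """Illustrative memory augmentation: replay the agent's past actions
    (and the hop where each was taken) into the next prompt so the model
    can reflect on earlier hops before choosing its next action."""
    history = "\n".join(
        f"[hop {i}] on {step['url']}: {step['action']}"
        for i, step in enumerate(trajectory)
    )
    return (f"Task: {task}\n"
            f"Previous actions:\n{history or '(none)'}\n"
            f"Current page observation:\n{page_obs}\n"
            "Next action:")
```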
https://arxiv.org/abs/2404.09992
Embodied visual tracking is the task of following a target object in dynamic 3D environments using an agent's egocentric vision. This is a vital and challenging skill for embodied agents. However, existing methods suffer from inefficient training and poor generalization. In this paper, we propose a novel framework that combines visual foundation models (VFM) and offline reinforcement learning (offline RL) to empower embodied visual tracking. We use a pre-trained VFM, such as "Tracking Anything", to extract semantic segmentation masks with text prompts. We then train a recurrent policy network with offline RL, e.g., Conservative Q-Learning, to learn from the collected demonstrations without online agent-environment interactions. To further improve the robustness and generalization of the policy network, we also introduce a mask re-targeting mechanism and a multi-level data collection strategy. In this way, we can train a robust tracker within an hour on a consumer-level GPU, e.g., an Nvidia RTX 3090. Such efficiency is unprecedented for RL-based visual tracking methods. We evaluate our tracker in several high-fidelity environments with challenging situations, such as distraction and occlusion. The results show that our agent outperforms state-of-the-art methods in sample efficiency, robustness to distractors, and generalization to unseen scenarios and targets. We also demonstrate the transferability of the learned tracker from the virtual world to real-world scenarios.
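Since the abstract names Conservative Q-Learning as the offline RL algorithm, here is a minimal sketch of the standard CQL(H)-style objective, assuming a `q_net(obs, act)` critic and actions normalized to [-1, 1]; this is the textbook conservative penalty, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, obs, act, target_q, num_random=8, alpha=1.0):
    """Standard CQL-style objective sketch: a TD term plus a conservative
    penalty that pushes Q down on out-of-distribution actions (here,
    uniform random ones) and up on dataset actions."""
    q_data = q_net(obs, act)                                    # (B, 1)
    td = F.mse_loss(q_data, target_q)                           # Bellman regression

    B, A = act.shape
    rand_act = torch.rand(B, num_random, A, device=act.device) * 2 - 1
    q_rand = torch.stack(
        [q_net(obs, rand_act[:, i]) for i in range(num_random)], dim=1
    )                                                           # (B, R, 1)
    penalty = (torch.logsumexp(q_rand.squeeze(-1), dim=1) - q_data.squeeze(-1)).mean()
    return td + alpha * penalty
```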
https://arxiv.org/abs/2404.09857
Human beings construct a perception of space by integrating sparse observations into massively interconnected synapses and neurons, offering superior parallelism and efficiency. Replicating this capability in AI finds wide applications in medical imaging, AR/VR, and embodied AI, where input data is often sparse and computing resources are limited. However, traditional signal reconstruction methods on digital computers face both software and hardware challenges. On the software front, difficulties arise from storage inefficiencies in conventional explicit signal representations. Hardware obstacles include the von Neumann bottleneck, which limits data transfer between the CPU and memory, and the limitations of CMOS circuits in supporting parallel processing. We propose a systematic approach with software-hardware co-optimization for signal reconstruction from sparse inputs. Software-wise, we employ a neural field to implicitly represent signals via neural networks, further compressed using low-rank decomposition and structured pruning. Hardware-wise, we design a resistive memory-based computing-in-memory (CIM) platform, featuring a Gaussian Encoder (GE) and an MLP Processing Engine (PE). The GE harnesses the intrinsic stochasticity of resistive memory for efficient input encoding, while the PE achieves precise weight mapping through a Hardware-Aware Quantization (HAQ) circuit. We demonstrate the system's efficacy on a 40nm 256Kb resistive memory-based in-memory computing macro, achieving substantial improvements in energy efficiency and parallelism without compromising reconstruction quality in tasks like 3D CT sparse reconstruction, novel view synthesis, and novel view synthesis for dynamic scenes. This work advances AI-driven signal restoration technology and paves the way for future efficient and robust medical AI and 3D vision applications.
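As a hedged illustration of the software-side low-rank decomposition, here is a minimal PyTorch sketch that compresses one MLP weight matrix via truncated SVD; this mirrors the general technique named in the abstract, not the paper's exact procedure.

```python
import torch

def low_rank_factorize(weight, rank):
    """Illustrative low-rank compression of one MLP layer: replace a
    (out, in) weight with U_r @ V_r, keeping the top singular directions,
    so weight ≈ U_r @ V_r with far fewer parameters."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    U_r = U[:, :rank] * S[:rank].sqrt()              # (out, rank)
    V_r = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]   # (rank, in)
    return U_r, V_r

# A layer y = W x then becomes y = U_r @ (V_r @ x): two thin matmuls whose
# parameter count drops from out*in to rank*(out + in).
```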
https://arxiv.org/abs/2404.09613
With recent developments in Embodied Artificial Intelligence (EAI) research, there has been a growing demand for high-quality, large-scale interactive scene generation. While prior methods in scene synthesis have prioritized the naturalness and realism of the generated scenes, the physical plausibility and interactivity of scenes have been largely left unexplored. To address this disparity, we introduce PhyScene, a novel method dedicated to generating interactive 3D scenes characterized by realistic layouts, articulated objects, and rich physical interactivity tailored for embodied agents. Based on a conditional diffusion model for capturing scene layouts, we devise novel physics- and interactivity-based guidance mechanisms that integrate constraints from object collision, room layout, and object reachability. Through extensive experiments, we demonstrate that PhyScene effectively leverages these guidance functions for physically interactable scene synthesis, outperforming existing state-of-the-art scene synthesis methods by a large margin. Our findings suggest that the scenes generated by PhyScene hold considerable potential for facilitating diverse skill acquisition among agents within interactive environments, thereby catalyzing further advancements in embodied AI research. Project website: this http URL.
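As a loose, hedged sketch of physics- and interactivity-based guidance in the spirit described above, here is a minimal PyTorch fragment that nudges a diffusion model's clean-scene estimate with gradients of differentiable penalties; the penalty functions, update form, and names are assumptions for illustration, not PhyScene's actual sampler.

```python
import torch

def guided_denoise_step(x_t, t, denoiser, guidance_fns, scale=1.0):
    """Illustrative guided reverse-diffusion step: evaluate differentiable
    penalties (e.g., object collision, layout violation, unreachability)
    on the predicted clean scene and descend their gradient. Each fn in
    `guidance_fns` is assumed to return a scalar penalty."""
    x_t = x_t.detach().requires_grad_(True)
    x0_pred = denoiser(x_t, t)                       # model's clean-scene estimate
    cost = sum(fn(x0_pred) for fn in guidance_fns)   # physics/interactivity penalties
    (grad,) = torch.autograd.grad(cost, x_t)
    return x0_pred.detach() - scale * grad           # guided estimate for next step
```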
https://arxiv.org/abs/2404.09465
Recent trends have shown that autonomous agents, such as Autonomous Ground Vehicles (AGVs), Unmanned Aerial Vehicles (UAVs), and mobile robots, effectively improve human productivity in solving diverse tasks. However, since these agents are typically powered by portable batteries, they require extremely low power/energy consumption to operate over a long lifespan. To address this challenge, neuromorphic computing has emerged as a promising solution, where bio-inspired Spiking Neural Networks (SNNs) use spikes from event-based cameras or data-conversion pre-processing to perform sparse computations efficiently. However, studies of SNN deployments for autonomous agents are still at an early stage, and the optimization stages for enabling efficient embodied SNN deployments have not been defined systematically. Toward this, we propose a novel framework called SNN4Agents, consisting of a set of optimization techniques for designing energy-efficient embodied SNNs targeting autonomous agent applications. SNN4Agents employs weight quantization, timestep reduction, and attention window reduction to jointly improve energy efficiency, reduce the memory footprint, and optimize processing latency while maintaining high accuracy. In the evaluation, we investigate use cases of event-based car recognition and explore the trade-offs among accuracy, latency, memory, and energy consumption. The experimental results show that our proposed framework can maintain high accuracy (i.e., 84.12%) with 68.75% memory saving, a 3.58x speed-up, and a 4.03x energy-efficiency improvement over the state-of-the-art work on the NCARS dataset, thereby enabling energy-efficient embodied SNN deployments for autonomous agents.
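As a hedged illustration of the weight-quantization technique named above, here is a minimal PyTorch sketch of symmetric post-training quantization; the exact bit-widths and scheme used by SNN4Agents are not reproduced here, and the helper name is an assumption.

```python
import torch

def quantize_weights(w, bits=8):
    """Illustrative symmetric weight quantization: map float weights to an
    integer grid of 2^bits levels and back. SNN4Agents combines this kind
    of quantization with fewer simulation timesteps and a smaller
    attention window; the exact scheme here is an assumption."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp((w / scale).round(), -qmax, qmax) * scale

# Timestep reduction is then simply running the spiking network for fewer
# simulation steps, e.g., accumulating spike outputs over T=10 instead of T=20.
```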
https://arxiv.org/abs/2404.09331
Home robots intend to make their users' lives easier. Our work assists in this goal by enabling robots to inform their users of dangerous or unsanitary anomalies in the home. Examples of these anomalies include the user leaving their milk out, forgetting to turn off the stove, or leaving poison accessible to children. To move towards enabling home robots with these abilities, we have created a new dataset, which we call SafetyDetect. The SafetyDetect dataset consists of 1000 anomalous home scenes, each of which contains unsafe or unsanitary situations for an agent to detect. Our approach utilizes large language models (LLMs) alongside both a graph representation of the scene and the relationships between the objects in the scene. Our key insight is that this connected scene graph and the object relationships it encodes enable the LLM to better reason about the scene, especially as it relates to detecting dangerous or unsanitary situations. Our most promising approach utilizes GPT-4 and pursues a categorization technique where object relations from the scene graph are classified as normal, dangerous, unsanitary, or dangerous for children. This method correctly identifies over 90% of anomalous scenarios in the SafetyDetect dataset. Additionally, we conduct real-world experiments on a ClearPath TurtleBot, where we generate a scene graph from visuals of the real-world scene and run our approach without modification. This setup resulted in little performance loss. The SafetyDetect dataset and code will be released to the public upon this paper's publication.
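As a hedged sketch of the relation-categorization technique (the paper uses GPT-4; the prompt wording and `llm` callable here are assumptions), here is a minimal Python loop over scene-graph edges:

```python
def classify_relations(scene_graph_edges, llm):
    """Illustrative relation classification: ask an LLM to label each
    scene-graph edge with one of the four categories from the paper."""
    labels = {}
    for subj, rel, obj in scene_graph_edges:   # e.g., ("milk", "on", "counter")
        prompt = (f"In a home scene, the object relation is: {subj} {rel} {obj}. "
                  "Classify it as exactly one of: normal, dangerous, "
                  "unsanitary, dangerous for children.")
        labels[(subj, rel, obj)] = llm(prompt).strip().lower()
    return labels
```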
https://arxiv.org/abs/2404.08827
This paper introduces a novel zero-shot motion planning method that allows users to quickly design smooth robot motions in Cartesian space. A Bézier curve-based Cartesian plan is transformed into a joint-space trajectory by our neuro-inspired inverse kinematics (IK) method CycleIK, for which we enable platform independence by scaling it to arbitrary robot designs. The motion planner is evaluated on the physical hardware of the two humanoid robots NICO and NICOL in a human-in-the-loop grasping scenario. Our method is deployed with an embodied agent that has a large language model (LLM) at its core. We generalize the embodied agent, which was introduced for NICOL, to also be embodied by NICO. The agent can execute a discrete set of physical actions and allows the user to verbally instruct various different robots. We contribute a grasping primitive to its action space that allows for precise manipulation of household objects. The new CycleIK method is compared to popular numerical IK solvers and state-of-the-art neural IK methods in simulation and is shown to be competitive with or outperform all evaluated methods when the algorithm runtime is very short. The grasping primitive is evaluated on both the NICOL and NICO robots, with reported grasp success rates of 72% and 82%, respectively.
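As a worked illustration of the Cartesian-plan side (the standard cubic Bézier formula, not CycleIK itself), here is a minimal Python sketch that samples waypoints which an IK solver would then map to joint space; the function name and point shapes are assumptions.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=50):
    """Sample the cubic Bézier curve
    B(t) = (1-t)^3 p0 + 3(1-t)^2 t p1 + 3(1-t) t^2 p2 + t^3 p3
    at n points. p0..p3 are (3,) Cartesian control points; each sampled
    waypoint would be handed to an IK solver (CycleIK in the paper) to
    obtain a smooth joint-space trajectory."""
    t = np.linspace(0.0, 1.0, n)[:, None]   # (n, 1) for broadcasting
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)  # (n, 3) waypoints
```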
https://arxiv.org/abs/2404.08825
The generalization of end-to-end deep reinforcement learning (DRL) for object-goal visual navigation is a long-standing challenge, since object classes and placements vary in new test environments. Learning domain-independent visual representations is critical for enabling a trained DRL agent to generalize to unseen scenes and objects. In this letter, a target-directed attention network (TDANet) is proposed to learn an end-to-end object-goal visual navigation policy with zero-shot ability. TDANet features a novel target attention (TA) module that learns both the spatial and semantic relationships among objects to help TDANet focus on the observed objects most relevant to the target. With its Siamese architecture (SA) design, TDANet distinguishes the difference between the current and target states and generates a domain-independent visual representation. To evaluate the navigation performance of TDANet, extensive experiments are conducted in the AI2-THOR embodied AI environment. The simulation results demonstrate a strong generalization ability of TDANet to unseen scenes and target objects, with a higher navigation success rate (SR) and success weighted by path length (SPL) than other state-of-the-art models.
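As a hedged sketch of a TA-style weighting (a generic attention pooling over detected objects, not the paper's exact module), here is a minimal PyTorch fragment; all dimensions and names are assumptions.

```python
import torch
import torch.nn.functional as F

def target_attention(obj_feats, target_embedding):
    """Illustrative target attention: score each observed object's feature
    against the target's semantic embedding and pool by attention weight,
    so the navigation policy attends to the most target-relevant objects."""
    scores = obj_feats @ target_embedding               # (N, D) @ (D,) -> (N,)
    weights = F.softmax(scores, dim=0)                  # attention over objects
    return (weights.unsqueeze(-1) * obj_feats).sum(0)   # target-conditioned summary
```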
https://arxiv.org/abs/2404.08353
Our goal is to build embodied agents that can learn inductively generalizable spatial concepts in a continual manner, e.g., constructing a tower of a given height. Existing work suffers from certain limitations: (a) approaches such as (Liang et al., 2023) and their multi-modal extensions rely heavily on prior knowledge and are not grounded in the demonstrations; (b) purely neural approaches such as (Liu et al., 2023) lack the ability to generalize. A key challenge is to achieve a fine balance between symbolic representations, which have the capability to generalize, and neural representations, which are physically grounded. In response, we propose a neuro-symbolic approach that expresses inductive concepts as symbolic compositions over grounded neural concepts. Our key insight is to decompose the concept learning problem into the following steps: 1) Sketch: obtain a programmatic representation for the given instruction; 2) Plan: perform model-based RL over the sequence of grounded neural action concepts to learn a grounded plan; 3) Generalize: abstract out a generic (lifted) Python program to facilitate generalizability. Continual learning is achieved by interspersing the learning of grounded neural concepts with higher-level symbolic constructs. Our experiments demonstrate that our approach significantly outperforms existing baselines in its ability to learn novel concepts and generalize inductively.
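Since the Generalize step abstracts out lifted Python programs, here is a hedged example of what such a program could look like for the paper's tower example; the `pick`/`place` primitives stand in for grounded neural action concepts and are assumptions, not the authors' output.

```python
def build_tower(height, pick, place, blocks):
    """Illustrative 'lifted' program: the inductive concept 'tower of
    height n' expressed as a loop over grounded action primitives, so the
    same program generalizes to any requested height."""
    base = None
    for level in range(height):
        block = blocks[level]
        pick(block)
        place(block, on=base)   # first block goes on the table (base=None)
        base = block
    return base                 # top block of the finished tower
```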
https://arxiv.org/abs/2404.07774
Language models trained on internet-scale data sets have shown an impressive ability to solve problems in Natural Language Processing and Computer Vision. However, experience is showing that these models are frequently brittle in unexpected ways, and require significant scaffolding to ensure that they operate correctly in the larger systems that comprise "language-model agents." In this paper, we argue that behavior trees provide a unifying framework for combining language models with classical AI and traditional programming. We introduce Dendron, a Python library for programming language model agents using behavior trees. We demonstrate the approach embodied by Dendron in three case studies: building a chat agent, a camera-based infrastructure inspection agent for use on a mobile robot or vehicle, and an agent that has been built to satisfy safety constraints that it did not receive through instruction tuning or RLHF.
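As a hedged, from-scratch sketch of the pattern the paper advocates (behavior trees whose nodes wrap a language model), here is a minimal Python example; this is deliberately not Dendron's actual API, and the `llm` yes/no callable is an assumption.

```python
class LLMConditionNode:
    """Behavior-tree condition node backed by a language model: ticks to
    SUCCESS or FAILURE depending on the model's yes/no answer, so classical
    tree logic can gate and scaffold LLM behavior."""
    def __init__(self, llm, question):
        self.llm, self.question = llm, question

    def tick(self, blackboard):
        answer = self.llm(f"{self.question}\nContext: {blackboard}\nAnswer yes or no.")
        return "SUCCESS" if answer.strip().lower().startswith("yes") else "FAILURE"

class Sequence:
    """Classic sequence node: ticks children in order and fails fast,
    which is how traditional programming constraints wrap the LLM nodes."""
    def __init__(self, *children):
        self.children = children

    def tick(self, blackboard):
        for child in self.children:
            if child.tick(blackboard) != "SUCCESS":
                return "FAILURE"
        return "SUCCESS"
```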
https://arxiv.org/abs/2404.07439
There is a growing interest in applying large language models (LLMs) in robotic tasks, due to their remarkable reasoning ability and extensive knowledge learned from vast training corpora. Grounding LLMs in the physical world remains an open challenge as they can only process textual input. Recent advancements in large vision-language models (LVLMs) have enabled a more comprehensive understanding of the physical world by incorporating visual input, which provides richer contextual information than language alone. In this work, we proposed a novel paradigm that leveraged GPT-4V(ision), the state-of-the-art LVLM by OpenAI, to enable embodied agents to perceive liquid objects via image-based environmental feedback. Specifically, we exploited the physical understanding of GPT-4V to interpret the visual representation (e.g., time-series plot) of non-visual feedback (e.g., F/T sensor data), indirectly enabling multimodal perception beyond vision and language using images as proxies. We evaluated our method using 10 common household liquids with containers of various geometry and material. Without any training or fine-tuning, we demonstrated that our method can enable the robot to indirectly perceive the physical response of liquids and estimate their viscosity. We also showed that by jointly reasoning over the visual and physical attributes learned through interactions, our method could recognize liquid objects in the absence of strong visual cues (e.g., container labels with legible text or symbols), increasing the accuracy from 69.0% -- achieved by the best-performing vision-only variant -- to 86.0%.
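As a hedged sketch of the proxy-image idea (rendering non-visual F/T sensor readings as a time-series plot that an LVLM can inspect), here is a minimal Python fragment; the plot styling and the downstream GPT-4V prompt are assumptions, not the paper's pipeline.

```python
import io
import matplotlib.pyplot as plt

def ft_series_to_image(timestamps, forces):
    """Illustrative proxy-image step: render force/torque readings as a
    time-series plot so a vision-language model can 'read' the non-visual
    signal (e.g., to compare oscillation damping across liquids)."""
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot(timestamps, forces)
    ax.set_xlabel("time (s)")
    ax.set_ylabel("force (N)")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()   # PNG bytes to attach to the LVLM query
```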
https://arxiv.org/abs/2404.06904