The enhancement of generalization in robots by large vision-language models (LVLMs) is increasingly evident. Therefore, the embodied cognitive abilities of LVLMs based on egocentric videos are of great interest. However, current datasets for embodied video question answering lack comprehensive and systematic evaluation frameworks. Critical embodied cognitive issues, such as robotic self-cognition, dynamic scene perception, and hallucination, are rarely addressed. To tackle these challenges, we propose ECBench, a high-quality benchmark designed to systematically evaluate the embodied cognitive abilities of LVLMs. ECBench features a diverse range of scene video sources, open and varied question formats, and 30 dimensions of embodied cognition. To ensure quality, balance, and high visual dependence, ECBench uses class-independent meticulous human annotation and multi-round question screening strategies. Additionally, we introduce ECEval, a comprehensive evaluation system that ensures the fairness and rationality of the indicators. Utilizing ECBench, we conduct extensive evaluations of proprietary, open-source, and task-specific LVLMs. ECBench is pivotal in advancing the embodied cognitive capabilities of LVLMs, laying a solid foundation for developing reliable core models for embodied agents. All data and code are available at this https URL.
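A minimal sketch of how such a benchmark might be consumed: iterate over video-QA items, query an LVLM, and aggregate scores per cognitive dimension. The item schema, the `model(video, question)` interface, and the token-F1 scorer are illustrative assumptions, not the actual ECEval protocol.

```python
# Minimal sketch of a benchmark evaluation loop in the spirit of ECBench/ECEval.
# The item schema, model interface, and scoring rule here are illustrative
# assumptions, not the paper's actual protocol.
import json

def score_answer(prediction: str, reference: str) -> float:
    """Toy scorer: token-level F1 between prediction and reference."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = set(pred) & set(ref)
    if not pred or not ref or not common:
        return 0.0
    p, r = len(common) / len(pred), len(common) / len(ref)
    return 2 * p * r / (p + r)

def evaluate(model, items):
    """`model(video_path, question) -> str` is a placeholder LVLM interface."""
    per_dim = {}
    for item in items:  # each item: {"video", "question", "answer", "dimension"}
        pred = model(item["video"], item["question"])
        per_dim.setdefault(item["dimension"], []).append(
            score_answer(pred, item["answer"]))
    return {d: sum(s) / len(s) for d, s in per_dim.items()}

if __name__ == "__main__":
    items = json.loads('[{"video": "demo.mp4", "question": "Where is the agent?", '
                       '"answer": "near the kitchen table", "dimension": "self-cognition"}]')
    print(evaluate(lambda v, q: "near the kitchen table", items))
```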
https://arxiv.org/abs/2501.05031
Multimodal Vision Language Models (VLMs) have emerged as a transformative technology at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and beat classical single modality vision models on zero-shot classification. Despite their rapid advancements in research and growing popularity in applications, a comprehensive survey of existing studies on VLMs is notably lacking, particularly for researchers aiming to leverage VLMs in their specific domains. To this end, we provide a systematic overview of VLMs in the following aspects: model information of the major VLMs developed over the past five years (2019-2024); the main architectures and training methods of these VLMs; summary and categorization of the popular benchmarks and evaluation metrics of VLMs; the applications of VLMs including embodied agents, robotics, and video generation; the challenges and issues faced by current VLMs such as hallucination, fairness, and safety. Detailed collections including papers and model repository links are listed in this https URL.
https://arxiv.org/abs/2501.02189
We introduce EnerVerse, a comprehensive framework for embodied future space generation specifically designed for robotic manipulation tasks. EnerVerse seamlessly integrates convolutional and bidirectional attention mechanisms for inner-chunk space modeling, ensuring low-level consistency and continuity. Recognizing the inherent redundancy in video data, we propose a sparse memory context combined with a chunkwise unidirectional generative paradigm to enable the generation of infinitely long sequences. To further augment robotic capabilities, we introduce the Free Anchor View (FAV) space, which provides flexible perspectives to enhance observation and analysis. The FAV space mitigates motion modeling ambiguity, removes physical constraints in confined environments, and significantly improves the robot's generalization and adaptability across various tasks and settings. To address the prohibitive costs and labor intensity of acquiring multi-camera observations, we present a data engine pipeline that integrates a generative model with 4D Gaussian Splatting (4DGS). This pipeline leverages the generative model's robust generalization capabilities and the spatial constraints provided by 4DGS, enabling an iterative enhancement of data quality and diversity, thus creating a data flywheel effect that effectively narrows the sim-to-real gap. Finally, our experiments demonstrate that the embodied future space generation prior substantially enhances policy predictive capabilities, resulting in improved overall performance, particularly in long-range robotic manipulation tasks.
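The chunkwise unidirectional paradigm with a sparse memory context can be illustrated with a toy rollout loop; the placeholder generator and the stride-based memory selection below are assumptions meant only to show the control flow, not EnerVerse's actual architecture.

```python
# Sketch of chunkwise unidirectional generation with a sparse memory context,
# in the spirit of EnerVerse. The generator is a stand-in; the stride-based
# memory selection is an assumption used only to illustrate the control flow.
import numpy as np

def generate_chunk(memory_frames, chunk_len=8, hw=(4, 4)):
    """Placeholder generator: returns `chunk_len` fake frames conditioned on memory."""
    context = np.mean(memory_frames, axis=0) if memory_frames else np.zeros(hw)
    return [context + 0.01 * np.random.randn(*hw) for _ in range(chunk_len)]

def rollout(num_chunks=5, memory_stride=4):
    frames, memory = [], []
    for _ in range(num_chunks):
        chunk = generate_chunk(memory)          # unidirectional: only past context
        frames.extend(chunk)
        memory = frames[::memory_stride][-8:]   # sparse memory: subsample, keep recent
    return frames

print(len(rollout()))  # 40 frames generated chunk by chunk
```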
https://arxiv.org/abs/2501.01895
In recent years, 2D Vision-Language Models (VLMs) have made significant strides in image-text understanding tasks. However, their performance in 3D spatial comprehension, which is critical for embodied intelligence, remains limited. Recent advances have leveraged 3D point clouds and multi-view images as inputs, yielding promising results. However, we propose exploring a purely vision-based solution inspired by human perception, which relies solely on visual cues for 3D spatial understanding. This paper empirically investigates the limitations of VLMs in 3D spatial knowledge, revealing that their primary shortcoming lies in the lack of global-local correspondence between the scene and individual frames. To address this, we introduce GPT4Scene, a novel visual prompting paradigm in VLM training and inference that helps build the global-local relationship, significantly improving the 3D spatial understanding of indoor scenes. Specifically, GPT4Scene constructs a 3D Bird's Eye View (BEV) image from the video and marks consistent object IDs across both the frames and the BEV image. The model then takes the concatenated BEV image and marked video frames as input. In zero-shot evaluations, GPT4Scene improves performance over closed-source VLMs like GPT-4o. Additionally, we prepare a processed video dataset consisting of 165K text annotations to fine-tune open-source VLMs, achieving state-of-the-art performance on all 3D understanding tasks. Surprisingly, after training with the GPT4Scene paradigm, VLMs consistently improve during inference, even without visual prompting and the BEV image as explicit correspondence. This demonstrates that the proposed paradigm helps VLMs develop an intrinsic ability to understand 3D scenes, which paves the way for a noninvasive approach to extending pre-trained VLMs for 3D scene understanding.
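A hedged sketch of the prompting idea: render the same object IDs onto both the BEV image and the sampled frames, then feed the concatenated images plus an instruction to a VLM. The helper names, the marker rendering, and the message format are assumptions for illustration.

```python
# Sketch of assembling a GPT4Scene-style visual prompt: a BEV image plus sampled
# video frames, with consistent object-ID markers drawn on both. Function names,
# marker drawing, and the message format are assumptions for illustration.
from PIL import Image, ImageDraw

def mark_ids(image, detections):
    """Draw the same object IDs used in the BEV onto a frame (toy renderer)."""
    draw = ImageDraw.Draw(image)
    for obj_id, (x, y) in detections:
        draw.text((x, y), str(obj_id), fill="red")
    return image

def build_prompt(bev_path, frame_paths, detections_per_image, question):
    images = []
    for path, dets in zip([bev_path] + frame_paths, detections_per_image):
        images.append(mark_ids(Image.open(path).convert("RGB"), dets))
    text = ("The first image is a bird's-eye view of the scene; the rest are video "
            "frames. Numbers mark the same object across all images. " + question)
    return {"images": images, "text": text}  # feed to any VLM chat API
```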
https://arxiv.org/abs/2501.01428
3D visual grounding (3DVG) involves localizing entities in a 3D scene referred to by natural language text. Such models are useful for embodied AI and scene retrieval applications, which involve searching for objects or patterns using natural language descriptions. While recent works have focused on LLM-based scaling of 3DVG datasets, these datasets do not capture the full range of potential prompts which could be specified in the English language. To ensure that we are scaling up and testing against a useful and representative set of prompts, we propose a framework for linguistically analyzing 3DVG prompts and introduce Visual Grounding with Diverse Language in 3D (ViGiL3D), a diagnostic dataset for evaluating visual grounding methods against a diverse set of language patterns. We evaluate existing open-vocabulary 3DVG methods to demonstrate that these methods are not yet proficient in understanding and identifying the targets of more challenging, out-of-distribution prompts, toward real-world applications.
https://arxiv.org/abs/2501.01366
Open-vocabulary panoptic reconstruction offers comprehensive scene understanding, enabling advances in embodied robotics and photorealistic simulation. In this paper, we propose PanopticRecon++, an end-to-end method that formulates panoptic reconstruction through a novel cross-attention perspective. This perspective models the relationship between 3D instances (as queries) and the scene's 3D embedding field (as keys) through their attention map. Unlike existing methods that separate the optimization of queries and keys or overlook spatial proximity, PanopticRecon++ introduces learnable 3D Gaussians as instance queries. This formulation injects 3D spatial priors to preserve proximity while maintaining end-to-end optimizability. Moreover, this query formulation facilitates the alignment of 2D open-vocabulary instance IDs across frames by leveraging optimal linear assignment with instance masks rendered from the queries. Additionally, we ensure semantic-instance segmentation consistency by fusing query-based instance segmentation probabilities with semantic probabilities in a novel panoptic head supervised by a panoptic loss. During training, the number of instance query tokens dynamically adapts to match the number of objects. PanopticRecon++ shows competitive 3D and 2D segmentation and reconstruction performance on both simulation and real-world datasets, and demonstrates a use case as a robot simulator. Our project website is at: this https URL
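The query-key view can be sketched as an attention map between learnable Gaussian instance queries and a field of 3D point embeddings, with a distance term injecting the spatial prior. Dimensions and the exact score used below are assumptions, not the paper's implementation.

```python
# Minimal sketch of the query-key attention view of panoptic reconstruction:
# learnable 3D Gaussian centers act as instance queries attended against a field
# of 3D point embeddings. Dimensions and the distance-biased score are assumptions.
import torch

def instance_attention(query_centers, query_feats, point_xyz, point_feats, sigma=0.5):
    """
    query_centers: (Q, 3) learnable Gaussian means, query_feats: (Q, C)
    point_xyz: (N, 3) sampled scene points, point_feats: (N, C) embedding field
    Returns per-point soft instance assignment (N, Q).
    """
    sim = point_feats @ query_feats.t()                    # (N, Q) feature affinity
    dist2 = torch.cdist(point_xyz, query_centers).pow(2)   # (N, Q) spatial proximity
    logits = sim - dist2 / (2 * sigma ** 2)                # Gaussian spatial prior
    return logits.softmax(dim=-1)

Q, N, C = 8, 1024, 32
assign = instance_attention(torch.randn(Q, 3), torch.randn(Q, C),
                            torch.randn(N, 3), torch.randn(N, C))
print(assign.shape)  # torch.Size([1024, 8])
```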
https://arxiv.org/abs/2501.01119
Efforts towards endowing robots with the ability to speak have benefited from recent advancements in NLP, in particular large language models. However, as powerful as current models have become, they still operate on sentence or multi-sentence level input, not on the word-by-word input that humans operate on, affecting the degree of responsiveness that they offer, which is critical in situations where humans interact with robots using speech. In this paper, we review the literature on interactive systems that operate incrementally (i.e., at the word level or below it). We motivate the need for incremental systems and survey incremental modeling of important aspects of dialogue such as speech recognition and language generation. The primary focus is on the part of the system that makes decisions, known as the dialogue manager. We find that there is very little research on incremental dialogue management, offer some requirements for practical incremental dialogue management, and discuss the implications of incremental dialogue for embodied, robotic platforms.
https://arxiv.org/abs/2501.00953
This paper investigates the problem of understanding dynamic 3D scenes from egocentric observations, a key challenge in robotics and embodied AI. Unlike prior studies that explored this as long-form video understanding and utilized egocentric video only, we instead propose an LLM-based agent, Embodied VideoAgent, which constructs scene memory from both egocentric video and embodied sensory inputs (e.g. depth and pose sensing). We further introduce a VLM-based approach to automatically update the memory when actions or activities over objects are perceived. Embodied VideoAgent attains significant advantages over counterparts in challenging reasoning and planning tasks in 3D scenes, achieving gains of 4.9% on Ego4D-VQ3D, 5.8% on OpenEQA, and 11.7% on EnvQA. We have also demonstrated its potential in various embodied AI tasks including generating embodied interactions and perception for robot manipulation. The code and demo will be made public.
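A small sketch of an object-centric scene memory that is populated from egocentric observations (position from depth and pose) and updated when a VLM reports an action over an object; the entry fields and the update hook are illustrative assumptions.

```python
# Sketch of a scene memory keyed by object, built from egocentric frames plus
# depth/pose, and updated when a VLM reports an action over an object. The
# update hook and the entry fields are illustrative assumptions.
import time
from dataclasses import dataclass, field

@dataclass
class ObjectEntry:
    name: str
    position: tuple          # 3D position from depth + camera pose
    state: str = "idle"
    last_seen: float = field(default_factory=time.time)

class SceneMemory:
    def __init__(self):
        self.objects = {}

    def observe(self, obj_id, name, position):
        self.objects[obj_id] = ObjectEntry(name, position)

    def update_from_vlm(self, obj_id, vlm_action: str):
        """Called when the VLM perceives an action/activity over an object."""
        if obj_id in self.objects:
            self.objects[obj_id].state = vlm_action
            self.objects[obj_id].last_seen = time.time()

memory = SceneMemory()
memory.observe("cup_1", "cup", (1.2, 0.4, 0.9))
memory.update_from_vlm("cup_1", "picked up by the user")
print(memory.objects["cup_1"].state)
```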
https://arxiv.org/abs/2501.00358
We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-language model. Designed for deployment on portable devices such as smartphones and wearable cameras, Vinci operates in an "always on" mode, continuously observing the environment to deliver seamless interaction and assistance. Users can wake up the system and engage in natural conversations to ask questions or seek assistance, with responses delivered through audio for hands-free convenience. With its ability to process long video streams in real-time, Vinci can answer user queries about current observations and historical context while also providing task planning based on past interactions. To further enhance usability, Vinci integrates a video generation module that creates step-by-step visual demonstrations for tasks that require detailed guidance. We hope that Vinci can establish a robust framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. We release the complete implementation for the development of the device in conjunction with a demo web platform to test uploaded videos at this https URL.
https://arxiv.org/abs/2412.21080
We introduce UnrealZoo, a rich collection of photo-realistic 3D virtual worlds built on Unreal Engine, designed to reflect the complexity and variability of the open worlds. Additionally, we offer a variety of playable entities for embodied AI agents. Based on UnrealCV, we provide a suite of easy-to-use Python APIs and tools for various potential applications, such as data collection, environment augmentation, distributed training, and benchmarking. We optimize the rendering and communication efficiency of UnrealCV to support advanced applications, such as multi-agent interaction. Our experiments benchmark agents in various complex scenes, focusing on visual navigation and tracking, which are fundamental capabilities for embodied visual intelligence. The results yield valuable insights into the advantages of diverse training environments for reinforcement learning (RL) agents and the challenges faced by current embodied vision agents, including those based on RL and large vision-language models (VLMs), in open worlds. These challenges involve latency in closed-loop control in dynamic scenes and reasoning about 3D spatial structures in unstructured terrain.
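Since UnrealZoo builds on UnrealCV, a typical interaction looks like the snippet below, which connects to a running environment, reads the camera pose, and saves one RGB observation; the camera index and command strings are assumptions based on standard UnrealCV usage rather than UnrealZoo's own higher-level APIs.

```python
# Minimal sketch of driving an Unreal Engine scene through the UnrealCV Python
# client (UnrealZoo builds its tooling on top of UnrealCV). The camera index and
# command strings follow common UnrealCV usage and are assumptions about a given
# scene; UnrealZoo's own higher-level APIs are not shown here.
from unrealcv import client

client.connect()                  # default endpoint: localhost:9000
if client.isconnected():
    location = client.request("vget /camera/0/location")   # "x y z" string
    png_bytes = client.request("vget /camera/0/lit png")    # encoded RGB observation
    with open("obs.png", "wb") as f:
        f.write(png_bytes)
    print("camera at:", location)
client.disconnect()
```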
https://arxiv.org/abs/2412.20977
With the rapid development of eXtended Reality (XR), egocentric spatial shooting and display technologies have further enhanced immersion and engagement for users. Assessing the quality of experience (QoE) of egocentric spatial videos is crucial to ensure a high-quality viewing experience. However, the corresponding research is still lacking. In this paper, we use the embodied experience to highlight this more immersive experience and study the new problem, i.e., embodied perceptual quality assessment for egocentric spatial videos. Specifically, we introduce the first Egocentric Spatial Video Quality Assessment Database (ESVQAD), which comprises 600 egocentric spatial videos and their mean opinion scores (MOSs). Furthermore, we propose a novel multi-dimensional binocular feature fusion model, termed ESVQAnet, which integrates binocular spatial, motion, and semantic features to predict the perceptual quality. Experimental results demonstrate that ESVQAnet outperforms 16 state-of-the-art VQA models on the embodied perceptual quality assessment task and exhibits strong generalization capability on traditional VQA tasks. The database and code will be released upon publication.
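The fusion idea can be sketched as a small regressor over concatenated binocular spatial, motion, and semantic features; the feature dimensions and head below are assumptions, not the actual ESVQAnet architecture.

```python
# Toy sketch of a multi-dimensional feature-fusion quality predictor in the
# spirit of ESVQAnet: binocular spatial, motion, and semantic features are
# concatenated and regressed to a MOS. All dimensions and layers are assumptions.
import torch
import torch.nn as nn

class FusionQA(nn.Module):
    def __init__(self, d_spatial=256, d_motion=128, d_semantic=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * d_spatial + d_motion + d_semantic, 256),  # 2x: left/right eye
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, spatial_l, spatial_r, motion, semantic):
        fused = torch.cat([spatial_l, spatial_r, motion, semantic], dim=-1)
        return self.head(fused).squeeze(-1)  # predicted MOS per video

model = FusionQA()
mos = model(torch.randn(4, 256), torch.randn(4, 256),
            torch.randn(4, 128), torch.randn(4, 512))
print(mos.shape)  # torch.Size([4])
```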
https://arxiv.org/abs/2412.20423
Although robotic imitation learning (RIL) is promising for embodied intelligent robots, existing RIL approaches rely on computationally intensive multi-model trajectory predictions, resulting in slow execution and limited real-time responsiveness. In contrast, the human subconscious constantly processes and stores vast amounts of information from experience, perception, and learning, allowing people to perform complex actions such as riding a bike without consciously thinking about each step. Inspired by this phenomenon in action neurology, we introduce subconscious robotic imitation learning (SRIL), wherein cognitive offloading is combined with historical action chunking to reduce delays caused by model inference, thereby accelerating task execution. This process is further enhanced by a subconscious downsampling and pattern-augmented learning policy, wherein intent-rich information is handled with quantized sampling techniques to improve manipulation efficiency. Experimental results demonstrate that execution speeds of SRIL are 100% to 200% faster than SOTA policies on comprehensive dual-arm tasks, with consistently higher success rates.
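The latency-hiding intuition behind action chunking with cognitive offloading can be shown with a toy loop that executes a cached chunk while the next chunk is inferred in a background thread; the policy stub, chunk size, and timings are placeholders, not the SRIL method itself.

```python
# Sketch of executing cached action chunks while the next chunk is inferred in a
# background thread, illustrating how chunked execution hides model latency.
# The policy, chunk size, and timing are placeholders, not the SRIL method itself.
import queue, threading, time

def slow_policy(obs, chunk_size=8):
    time.sleep(0.2)                      # stand-in for model inference latency
    return [f"action_{obs}_{i}" for i in range(chunk_size)]

def run(num_chunks=3):
    chunks = queue.Queue()
    chunks.put(slow_policy(0))           # bootstrap with one synchronous call
    for step in range(1, num_chunks + 1):
        worker = threading.Thread(target=lambda s=step: chunks.put(slow_policy(s)))
        worker.start()                   # infer the next chunk in the background
        for action in chunks.get():      # meanwhile, execute the cached chunk
            print("execute", action)
            time.sleep(0.01)             # stand-in for actuation time
        worker.join()

run()
```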
https://arxiv.org/abs/2412.20368
Learning to move is a primary goal for animals and robots, where ensuring safety is often important when optimizing control policies on embodied systems. For complex tasks such as human or humanoid robot control, the high-dimensional parameter space adds complexity to the safe optimization effort. Current safe exploration algorithms are inefficient and may even become infeasible in large high-dimensional input spaces. Furthermore, existing high-dimensional constrained optimization methods neglect safety in the search process. In this paper, we propose High-dimensional Safe Bayesian Optimization with local optimistic exploration (HdSafeBO), a novel approach designed to handle high-dimensional sampling problems under probabilistic safety constraints. We introduce a local optimistic strategy to efficiently and safely optimize the objective function, providing a probabilistic safety guarantee and a cumulative safety violation bound. Through the use of isometric embedding, HdSafeBO addresses problems ranging from a few hundred to several thousand dimensions while maintaining safety guarantees. To our knowledge, HdSafeBO is the first algorithm capable of optimizing the control of high-dimensional musculoskeletal systems with high safety probability. We also demonstrate the real-world applicability of HdSafeBO through its use in the safe online optimization of neural-stimulation-induced human motion control.
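For intuition, a generic safe Bayesian optimization loop is sketched below: one GP models the objective and another the safety signal, and only candidates predicted safe with high probability are considered. This is a simplified illustration; it omits HdSafeBO's local optimistic exploration and isometric embedding.

```python
# Simplified safe Bayesian optimization loop: a GP models the objective, and
# candidates are kept only if a GP on the safety signal predicts constraint
# satisfaction with high probability. This is a generic illustration, not the
# HdSafeBO algorithm (no local optimistic exploration or isometric embedding).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
objective = lambda x: -np.sum((x - 0.6) ** 2, axis=-1)      # maximize
safety = lambda x: 1.0 - np.sum(np.abs(x - 0.5), axis=-1)   # safe if >= 0

X = rng.uniform(0.4, 0.6, size=(5, 4))                      # safe seed set
for _ in range(20):
    y, s = objective(X), safety(X)
    gp_y, gp_s = GaussianProcessRegressor().fit(X, y), GaussianProcessRegressor().fit(X, s)
    cand = rng.uniform(0, 1, size=(256, 4))
    mu_s, std_s = gp_s.predict(cand, return_std=True)
    safe = cand[mu_s - 2 * std_s >= 0]                       # high-probability safe set
    if len(safe) == 0:
        continue
    mu_y, std_y = gp_y.predict(safe, return_std=True)
    X = np.vstack([X, safe[np.argmax(mu_y + 2 * std_y)]])    # optimistic acquisition

print("best observed:", objective(X).max())
```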
https://arxiv.org/abs/2412.20350
This work focuses on building a task planner for Embodied Instruction Following (EIF) using Large Language Models (LLMs). Previous works typically train a planner to imitate expert trajectories, treating this as a supervised task. While these methods achieve competitive performance, they often lack sufficient robustness. When a suboptimal action is taken, the planner may encounter an out-of-distribution state, which can lead to task failure. In contrast, we frame the task as a Partially Observable Markov Decision Process (POMDP) and aim to develop a robust planner under a few-shot assumption. Thus, we propose a closed-loop planner with an adaptation module and a novel hindsight method, aiming to use as much information as possible to assist the planner. Our experiments on the ALFRED dataset indicate that our planner achieves competitive performance under a few-shot assumption. For the first time, our few-shot agent's performance approaches and even surpasses that of the full-shot supervised agent.
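The closed-loop idea can be sketched as replanning from the latest observation after every action, keeping a history of what actually happened; the `llm_plan` stub and the toy environment below are placeholder assumptions, not the paper's planner.

```python
# Sketch of a closed-loop planning loop: after each action the planner sees the
# new observation and may replan, instead of committing to one open-loop plan.
# The `llm_plan` stub and the environment interface are placeholder assumptions.
def llm_plan(goal, observation, history):
    """Stand-in for an LLM call that returns the next action given feedback."""
    return "Stop" if "apple in sink" in observation else "PutObject apple sink"

def run_episode(env, goal, max_steps=10):
    history = []
    obs = env["observe"]()
    for _ in range(max_steps):
        action = llm_plan(goal, obs, history)          # replan from the latest state
        if action == "Stop":
            return True
        obs = env["step"](action)                      # execute and observe outcome
        history.append((action, obs))                  # hindsight: keep what happened
    return False

state = {"done": False}
env = {"observe": lambda: "apple in sink" if state["done"] else "apple on table",
       "step": lambda a: (state.update(done=True) or "apple in sink")}
print(run_episode(env, "put the apple in the sink"))
```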
https://arxiv.org/abs/2412.19562
Video procedure planning, i.e., planning a sequence of action steps given the video frames of start and goal states, is an essential ability for embodied AI. Recent works utilize Large Language Models (LLMs) to generate enriched action step description texts to guide action step decoding. Although LLMs are introduced, these methods decode the action steps into a closed set of one-hot vectors, limiting the model's capability of generalizing to new steps or tasks. Additionally, fixed action step descriptions based on world-level commonsense may contain noise in specific instances of visual states. In this paper, we propose PlanLLM, a cross-modal joint learning framework with LLMs for video procedure planning. We propose an LLM-Enhanced Planning module which fully uses the generalization ability of LLMs to produce free-form planning output and to enhance action step decoding. We also propose a Mutual Information Maximization module to connect world-level commonsense of step descriptions and sample-specific information of visual states, enabling LLMs to employ their reasoning ability to generate step sequences. With the assistance of LLMs, our method can handle both closed-set and open-vocabulary procedure planning tasks. Our PlanLLM achieves superior performance on three benchmarks, demonstrating the effectiveness of our designs.
https://arxiv.org/abs/2412.19139
Image quality assessment (IQA) of user-generated content (UGC) is a critical technique for human quality of experience (QoE). However, for robot-generated content (RGC), will its image quality be consistent with the Moravec paradox and run counter to human common sense? Human subjective scoring is based more on the attractiveness of the image, whereas embodied agents are required to perceive and interact with the environment and ultimately perform specific tasks; visual images as inputs directly influence downstream tasks. In this paper, we first propose an embodied image quality assessment (EIQA) framework. We establish assessment metrics for input images based on the downstream tasks of the robot. In addition, we construct an Embodied Preference Database (EPD) containing 5,000 reference and distorted image annotations. The performance of mainstream IQA algorithms on the EPD dataset is then verified. The experiments demonstrate that quality assessment of embodied images differs from that of humans. We sincerely hope that the EPD can contribute to the development of embodied AI by focusing on image quality assessment. The benchmark is available at this https URL.
https://arxiv.org/abs/2412.18774
In the rapidly evolving landscape of GameFi, a fusion of gaming and decentralized finance (DeFi), there exists a critical need to enhance player engagement and economic interaction within gaming ecosystems. Our GameFi ecosystem aims to fundamentally transform this landscape by integrating advanced embodied AI agents into GameFi platforms. These AI agents, developed using cutting-edge large language models (LLMs), such as GPT-4 and Claude AI, are capable of proactive, adaptive, and contextually rich interactions with players. By going beyond traditional scripted responses, these agents become integral participants in the game's narrative and economic systems, directly influencing player strategies and in-game economies. We address the limitations of current GameFi platforms, which often lack immersive AI interactions and mechanisms for community engagement or creator monetization. Through the deep integration of AI agents with blockchain technology, we establish a consensus-driven, decentralized GameFi ecosystem. This ecosystem empowers creators to monetize their contributions and fosters democratic collaboration among players and creators. Furthermore, by embedding DeFi mechanisms into the gaming experience, we enhance economic participation and provide new opportunities for financial interactions within the game. Our approach enhances player immersion and retention and advances the GameFi ecosystem by bridging traditional gaming with Web3 technologies. By integrating sophisticated AI and DeFi elements, we contribute to the development of more engaging, economically robust, and community-centric gaming environments. This project represents a significant advancement in the state-of-the-art in GameFi, offering insights and methodologies that can be applied throughout the gaming industry.
https://arxiv.org/abs/2412.18601
Human-scene interaction (HSI) generation is crucial for applications in embodied AI, virtual reality, and robotics. While existing methods can synthesize realistic human motions in 3D scenes and generate plausible human-object interactions, they heavily rely on datasets containing paired 3D scene and motion capture data, which are expensive and time-consuming to collect across diverse environments and interactions. We present ZeroHSI, a novel approach that enables zero-shot 4D human-scene interaction synthesis by integrating video generation and neural human rendering. Our key insight is to leverage the rich motion priors learned by state-of-the-art video generation models, which have been trained on vast amounts of natural human movements and interactions, and to use differentiable rendering to reconstruct human-scene interactions. ZeroHSI can synthesize realistic human motions in both static scenes and environments with dynamic objects, without requiring any ground-truth motion data. We evaluate ZeroHSI on a curated dataset of various indoor and outdoor scenes with different interaction prompts, demonstrating its ability to generate diverse and contextually appropriate human-scene interactions.
https://arxiv.org/abs/2412.18600
A 3D scene graph represents a compact scene model, storing information about the objects and the semantic relationships between them, making its use promising for robotic tasks. When interacting with a user, an embodied intelligent agent should be capable of responding to various queries about the scene formulated in natural language. Large Language Models (LLMs) are beneficial solutions for user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for creating learnable representations of 3D scenes have demonstrated the potential to improve the quality of LLMs responses by adapting to the 3D world. However, the existing methods do not explicitly utilize information about the semantic relationships between objects, limiting themselves to information about their coordinates. In this work, we propose a method 3DGraphLLM for constructing a learnable representation of a 3D scene graph. The learnable representation is used as input for LLMs to perform 3D vision-language tasks. In our experiments on popular ScanRefer, RIORefer, Multi3DRefer, ScanQA, Sqa3D, and Scan2cap datasets, we demonstrate the advantage of this approach over baseline methods that do not use information about the semantic relationships between objects. The code is publicly available at this https URL.
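For intuition, a scene graph can be flattened into object and relation tokens before being handed to an LLM; 3DGraphLLM learns graph-aware embeddings rather than plain text, so the text-only serialization below is a simplified assumption.

```python
# Sketch of flattening a 3D scene graph (objects plus semantic relations) into a
# textual prompt for an LLM. 3DGraphLLM uses learnable graph embeddings rather
# than plain text; this text-only serialization is a simplified assumption.
scene_graph = {
    "objects": {1: "sofa", 2: "coffee table", 3: "lamp"},
    "relations": [(2, "in front of", 1), (3, "to the left of", 1)],
}

def serialize(graph):
    lines = [f"<obj{i}> {name}" for i, name in graph["objects"].items()]
    lines += [f"<obj{s}> {rel} <obj{o}>" for s, rel, o in graph["relations"]]
    return "\n".join(lines)

prompt = ("Scene graph:\n" + serialize(scene_graph) +
          "\nQuestion: Which object is in front of the sofa?")
print(prompt)
```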
https://arxiv.org/abs/2412.18450
Humans naturally rely on floor plans to navigate in unfamiliar environments, as they are readily available, reliable, and provide rich geometrical guidance. However, existing visual navigation settings overlook this valuable prior knowledge, leading to limited efficiency and accuracy. To eliminate this gap, we introduce a novel navigation task: Floor Plan Visual Navigation (FloNa), the first attempt to incorporate floor plans into embodied visual navigation. While the floor plan offers significant advantages, two key challenges emerge: (1) handling the spatial inconsistency between the floor plan and the actual scene layout for collision-free navigation, and (2) aligning observed images with the floor plan sketch despite their distinct modalities. To address these challenges, we propose FloDiff, a novel diffusion policy framework incorporating a localization module to facilitate alignment between the current observation and the floor plan. We further collect 20k navigation episodes across 117 scenes in the iGibson simulator to support the training and evaluation. Extensive experiments demonstrate the effectiveness and efficiency of our framework in unfamiliar scenes using floor plan knowledge. Project website: this https URL.
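A skeleton of how a diffusion policy conditioned on a floor plan and the current observation might sample an action: concatenate the two condition features (which a localization module would align) and iteratively denoise. The untrained denoiser, shapes, and update rule are assumptions, not FloDiff's implementation.

```python
# Skeleton of the inference loop of a diffusion policy conditioned on a floor
# plan and the current observation, with a localization feature aligning the two.
# The denoiser is an untrained stand-in; shapes and step count are assumptions.
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    def __init__(self, act_dim=2, cond_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(act_dim + cond_dim + 1, 128),
                                 nn.ReLU(), nn.Linear(128, act_dim))

    def forward(self, noisy_action, cond, t):
        t_feat = torch.full((noisy_action.shape[0], 1), float(t))
        return self.net(torch.cat([noisy_action, cond, t_feat], dim=-1))

def sample_action(denoiser, plan_feat, obs_feat, steps=16):
    cond = torch.cat([plan_feat, obs_feat], dim=-1)   # localization module would align these
    action = torch.randn(cond.shape[0], 2)            # start from Gaussian noise
    for t in reversed(range(steps)):
        action = action - denoiser(action, cond, t) / steps   # simplified update rule
    return action

denoiser = Denoiser()
print(sample_action(denoiser, torch.randn(1, 32), torch.randn(1, 32)).shape)
```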
https://arxiv.org/abs/2412.18335