Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground-truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or cross-embodiment datasets. On a suite of six simulated and real environments, we show that representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives and pretrained representations. Gains from using DynaMo hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP, and nearest neighbors. Finally, we ablate key components of DynaMo and measure their impact on downstream policy performance. Robot videos are best viewed at this https URL.
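To make the training recipe concrete, here is a minimal, illustrative sketch of jointly learning an encoder, a latent inverse dynamics model, and a forward dynamics model that predicts the next frame embedding. The network sizes, the stop-gradient target, and the optimizer settings are assumptions for exposition, not DynaMo's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynaMoSketch(nn.Module):
    """Joint latent inverse/forward dynamics pretraining (illustrative only)."""
    def __init__(self, obs_dim=3 * 64 * 64, emb_dim=128, latent_action_dim=16):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, emb_dim)
        # Inverse dynamics: infer a latent "action" from consecutive embeddings.
        self.inverse = nn.Linear(2 * emb_dim, latent_action_dim)
        # Forward dynamics: predict the next embedding from (embedding, latent action).
        self.forward_model = nn.Linear(emb_dim + latent_action_dim, emb_dim)

    def loss(self, frames):                      # frames: (B, T, C, H, W)
        B, T = frames.shape[:2]
        z = self.encoder(frames.reshape(B * T, -1)).reshape(B, T, -1)
        z_t, z_next = z[:, :-1], z[:, 1:]
        a_latent = self.inverse(torch.cat([z_t, z_next], dim=-1))
        z_pred = self.forward_model(torch.cat([z_t, a_latent], dim=-1))
        # Predict the next frame in latent space; no augmentations or contrastive pairs.
        return F.mse_loss(z_pred, z_next.detach())

model = DynaMoSketch()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
frames = torch.randn(8, 6, 3, 64, 64)            # dummy demonstration clip
opt.zero_grad(); model.loss(frames).backward(); opt.step()
```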
https://arxiv.org/abs/2409.12192
Object manipulation is an essential capability that sets apart embodied agents engaging with the world, especially in robotics. The ability to predict outcomes of interactions with objects is paramount in this setting. While model-based control methods have started to be employed for tackling manipulation tasks, they have faced challenges in accurately manipulating objects. Analyzing the causes of this limitation, we trace the underperformance to the way current world models represent crucial positional information, especially the target's goal specification for object-positioning tasks. We introduce a general approach that empowers world model-based agents to effectively solve object-positioning tasks. We propose two variants of this approach for generative world models: position-conditioned (PCP) and latent-conditioned (LCP) policy learning. In particular, LCP employs object-centric latent representations that explicitly capture object positional information for goal specification. This naturally leads to the emergence of multimodal capabilities, enabling the specification of goals through spatial coordinates or a visual goal. Our methods are rigorously evaluated across several manipulation environments, showing favorable performance compared to current model-based control approaches.
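A toy sketch of the goal conditioning described above, focusing on the latent-conditioned variant: a single policy accepts a goal given either as spatial coordinates or as a goal image because both are mapped into a shared object-centric latent. All module names and shapes below are hypothetical.

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """Toy LCP-style goal conditioning (illustrative assumptions throughout)."""
    def __init__(self, obs_dim=32, goal_latent_dim=16, act_dim=4):
        super().__init__()
        self.coord_encoder = nn.Linear(3, goal_latent_dim)        # (x, y, z) goal position
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, goal_latent_dim))
        self.policy = nn.Sequential(nn.Linear(obs_dim + goal_latent_dim, 64),
                                    nn.ReLU(), nn.Linear(64, act_dim))

    def act(self, obs, goal_coords=None, goal_image=None):
        # The same policy handles coordinate goals or visual goals via a shared latent.
        g = self.coord_encoder(goal_coords) if goal_coords is not None else self.image_encoder(goal_image)
        return self.policy(torch.cat([obs, g], dim=-1))

policy = GoalConditionedPolicy()
obs = torch.randn(1, 32)
print(policy.act(obs, goal_coords=torch.tensor([[0.1, 0.2, 0.05]])).shape)   # torch.Size([1, 4])
print(policy.act(obs, goal_image=torch.randn(1, 3, 32, 32)).shape)
```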
https://arxiv.org/abs/2409.12005
Embodied vision-based real-world systems, such as mobile robots, require a careful balance between energy consumption, compute latency, and safety constraints to optimize operation across dynamic tasks and contexts. As local computation tends to be restricted, offloading the computation, i.e., to a remote server, can save local resources while providing access to high-quality predictions from powerful and large models. However, the resulting communication and latency overhead has led to limited usability of cloud models in dynamic, safety-critical, real-time settings. To effectively address this trade-off, we introduce UniLCD, a novel hybrid inference framework for enabling flexible local-cloud collaboration. By efficiently optimizing a flexible routing module via reinforcement learning and a suitable multi-task objective, UniLCD is specifically designed to support the multiple constraints of safety-critical end-to-end mobile systems. We validate the proposed approach using a challenging, crowded navigation task requiring frequent and timely switching between local and cloud operations. UniLCD demonstrates improved overall performance and efficiency, by over 35% compared to state-of-the-art baselines based on various split computing and early exit strategies.
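The routing idea can be sketched as a small policy that, per observation, chooses between local and cloud inference and is trained with a reward that trades off safety, latency, and energy. The reward weights, the REINFORCE-style update, and all numbers below are illustrative assumptions rather than UniLCD's actual objective.

```python
import torch
import torch.nn as nn

class Router(nn.Module):
    """Tiny routing policy deciding local vs. cloud inference (illustrative sketch)."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.head = nn.Linear(feat_dim, 2)       # logits: [run locally, offload to cloud]

    def forward(self, features):
        return torch.distributions.Categorical(logits=self.head(features))

def step_reward(safe, latency_ms, energy_mj, w_safety=1.0, w_latency=0.01, w_energy=0.001):
    # Hypothetical multi-task objective: reward safety, penalize latency and energy.
    return w_safety * float(safe) - w_latency * latency_ms - w_energy * energy_mj

router = Router()
opt = torch.optim.Adam(router.parameters(), lr=1e-3)
features = torch.randn(1, 64)                    # embedding of the current observation
dist = router(features)
choice = dist.sample()                           # 0 = local model, 1 = cloud model
reward = step_reward(safe=True, latency_ms=120.0 if choice.item() == 1 else 15.0, energy_mj=3.0)
loss = -dist.log_prob(choice) * reward           # REINFORCE-style update on the toy reward
opt.zero_grad(); loss.backward(); opt.step()
```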
https://arxiv.org/abs/2409.11403
Embodied Everyday Task is a popular task in the embodied AI community, requiring agents to make a sequence of actions based on natural language instructions and visual observations. Traditional learning-based approaches face two challenges. First, natural language instructions often lack explicit task planning. Second, extensive training is required to equip models with knowledge of the task environment. Previous works based on Large Language Models (LLMs) either suffer from poor performance due to the lack of task-specific knowledge or rely on ground truth as few-shot samples. To address these limitations, we propose a novel approach called Progressive Retrieval Augmented Generation (P-RAG), which not only effectively leverages the powerful language processing capabilities of LLMs but also progressively accumulates task-specific knowledge without ground truth. Compared to conventional RAG methods, which retrieve relevant information from the database in a one-shot manner to assist generation, P-RAG introduces an iterative approach that progressively updates the database. In each iteration, P-RAG retrieves the latest database and obtains historical information from previous interactions as experiential references for the current interaction. Moreover, we introduce a more granular retrieval scheme that not only retrieves similar tasks but also incorporates retrieval of similar situations to provide more valuable reference experiences. Extensive experiments reveal that P-RAG achieves competitive results without utilizing ground truth and can even further improve performance through self-iterations.
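The progressive retrieval loop can be illustrated with a self-contained toy: a database of past interactions grows after every episode, and each step retrieves entries scored jointly by task and situation similarity. The environment, the "LLM planner", and the similarity measure are placeholders; only the iterative update pattern mirrors the abstract.

```python
import random
from difflib import SequenceMatcher

database = []                                    # grows across iterations, no ground truth

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def retrieve(task, situation, k=3):
    # Granular retrieval: score entries by task similarity AND situation similarity.
    scored = sorted(database,
                    key=lambda e: similarity(e["task"], task) + similarity(e["situation"], situation),
                    reverse=True)
    return scored[:k]

def toy_llm_plan(task, situation, references):
    # Placeholder planner: reuse the most relevant past action, otherwise explore.
    return references[0]["action"] if references else random.choice(["open", "pick", "move"])

def episode(task, iteration):
    situation, success = "start", False
    for _ in range(5):                           # toy rollout
        action = toy_llm_plan(task, situation, retrieve(task, situation))
        situation = f"{situation}->{action}"
        success = action == "pick"               # pretend "pick" completes the task
        # Progressive update: store the interaction so later iterations can reuse it.
        database.append({"task": task, "situation": situation, "action": action,
                         "success": success, "iteration": iteration})
        if success:
            break
    return success

print([episode("put the apple in the fridge", it) for it in range(3)])
```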
https://arxiv.org/abs/2409.11279
Controlling hands in the high-dimensional action space has been a longstanding challenge, yet humans naturally perform dexterous tasks with ease. In this paper, we draw inspiration from human embodied cognition and reconsider dexterous hands as learnable systems. Specifically, we introduce MoDex, a framework which employs a neural hand model to capture the dynamical characteristics of hand movements. Based on the model, a bidirectional planning method is developed, which demonstrates efficiency in both training and inference. The method is further integrated with a large language model to generate various gestures such as "Scissorshand" and "Rock&Roll". Moreover, we show that decomposing the system dynamics into a pretrained hand model and an external model improves data efficiency, as supported by both theoretical analysis and empirical experiments. Additional visualization results are available at this https URL.
https://arxiv.org/abs/2409.10983
Co-speech gestures are fundamental for communication. The advent of recent deep learning techniques has facilitated the creation of lifelike, synchronous co-speech gestures for Embodied Conversational Agents. "In-the-wild" datasets, aggregating video content from platforms like YouTube via human pose detection technologies, provide a feasible solution by offering 2D skeletal sequences aligned with speech. Concurrent developments in lifting models enable the conversion of these 2D sequences into 3D gesture databases. However, it is important to note that the 3D poses estimated from the 2D extracted poses are, in essence, approximations of the ground-truth, which remains in the 2D domain. This distinction raises questions about the impact of gesture representation dimensionality on the quality of generated motions - a topic that, to our knowledge, remains largely unexplored. Our study examines the effect of using either 2D or 3D joint coordinates as training data on the performance of speech-to-gesture deep generative models. We employ a lifting model for converting generated 2D pose sequences into 3D and assess how gestures created directly in 3D stack up against those initially generated in 2D and then converted to 3D. We perform an objective evaluation using widely used metrics in the gesture generation field as well as a user study to qualitatively evaluate the different approaches.
https://arxiv.org/abs/2409.10357
Image-goal navigation enables a robot to reach the location where a target image was captured, using visual cues for guidance. However, current methods either rely heavily on data and computationally expensive learning-based approaches or lack efficiency in complex environments due to insufficient exploration strategies. To address these limitations, we propose Bayesian Embodied Image-goal Navigation using Gaussian Splatting (BEINGS), a novel method that formulates ImageNav as an optimal control problem within a model predictive control framework. BEINGS leverages 3D Gaussian Splatting as a scene prior to predict future observations, enabling efficient, real-time navigation decisions grounded in the robot's sensory experiences. By integrating Bayesian updates, our method dynamically refines the robot's strategy without requiring extensive prior experience or data. Our algorithm is validated through extensive simulations and physical experiments, showcasing its potential for embodied robot systems in visually complex scenarios.
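A toy version of the control loop reads as follows: candidate action sequences are rolled out against a scene prior (a trivial 2D "renderer" stands in for 3D Gaussian Splatting), scored by how closely the predicted view matches the goal image, reweighted in a Bayesian fashion, and only the first action of the best sequence is executed. Everything here is a simplified stand-in for the actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

def render(position, size=32):
    # Stand-in for a 3DGS render: a blurry blob centered at the agent's (x, y) position.
    xs, ys = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    return np.exp(-((xs - position[0]) ** 2 + (ys - position[1]) ** 2) / 20.0)

goal_image = render(np.array([25.0, 7.0]))       # image captured at the target location
state = np.array([5.0, 20.0])

for step in range(30):
    candidates = rng.normal(0.0, 1.5, size=(64, 3, 2))          # 64 sequences of 3 moves
    likelihoods = np.empty(64)
    for i, seq in enumerate(candidates):
        predicted = render(state + seq.sum(axis=0))             # predicted future observation
        likelihoods[i] = np.exp(-np.sum((predicted - goal_image) ** 2))
    posterior = likelihoods / likelihoods.sum()                 # Bayesian-style reweighting
    best = candidates[np.argmax(posterior)]
    state = state + best[0]                                     # execute only the first action
print("final position:", state, "goal:", [25.0, 7.0])
```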
https://arxiv.org/abs/2409.10216
The deployment of embodied navigation agents in safety-critical environments raises concerns about their vulnerability to adversarial attacks on deep neural networks. However, current attack methods often lack practicality due to challenges in transitioning from the digital to the physical world, while existing physical attacks for object detection fail to achieve both multi-view effectiveness and naturalness. To address this, we propose a practical attack method for embodied navigation by attaching adversarial patches with learnable textures and opacity to objects. Specifically, to ensure effectiveness across varying viewpoints, we employ a multi-view optimization strategy based on object-aware sampling, which uses feedback from the navigation model to optimize the patch's texture. To make the patch inconspicuous to human observers, we introduce a two-stage opacity optimization mechanism, where opacity is refined after texture optimization. Experimental results show our adversarial patches reduce navigation success rates by about 40%, outperforming previous methods in practicality, effectiveness, and naturalness. Code is available at: [this https URL].
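The two-stage optimization described above can be sketched as follows: a patch texture is first optimized across randomly perturbed views to suppress a frozen surrogate "navigation model" output, and the patch opacity is then refined with an added visibility penalty. The surrogate model, viewpoint sampling, and loss weights are illustrative assumptions, not the paper's pipeline.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
nav_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 1))   # frozen surrogate
for p in nav_model.parameters():
    p.requires_grad_(False)

texture = torch.rand(3, 16, 16, requires_grad=True)
opacity = torch.full((1, 16, 16), 0.8, requires_grad=True)
background = torch.rand(3, 16, 16)                                   # object surface under the patch

def composite(tex, alpha, view_scale):
    # Multi-view stand-in: brightness-perturbed view of the patch alpha-blended on the object.
    patched = alpha * tex + (1 - alpha) * background
    return (view_scale * patched).clamp(0, 1).unsqueeze(0)

# Stage 1: optimize the texture to suppress the navigation model's "success" score.
opt_tex = torch.optim.Adam([texture], lr=0.05)
for _ in range(100):
    view_scale = 0.8 + 0.4 * torch.rand(1)                           # random viewpoint/lighting
    loss = nav_model(composite(texture, opacity.detach(), view_scale)).mean()
    opt_tex.zero_grad(); loss.backward(); opt_tex.step()
    texture.data.clamp_(0, 1)

# Stage 2: refine opacity so the patch stays effective yet inconspicuous.
opt_alpha = torch.optim.Adam([opacity], lr=0.05)
for _ in range(100):
    view_scale = 0.8 + 0.4 * torch.rand(1)
    attack = nav_model(composite(texture.detach(), opacity, view_scale)).mean()
    loss = attack + 0.1 * opacity.mean()                             # trade attack strength vs. visibility
    opt_alpha.zero_grad(); loss.backward(); opt_alpha.step()
    opacity.data.clamp_(0, 1)
```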
https://arxiv.org/abs/2409.10071
Large language models (LLMs) have shown significant potential in guiding embodied agents to execute language instructions across a range of tasks, including robotic manipulation and navigation. However, existing methods are primarily designed for static environments and do not leverage the agent's own experiences to refine its initial plans. Given that real-world environments are inherently stochastic, initial plans based solely on LLMs' general knowledge may fail to achieve their objectives, unlike in static scenarios. To address this limitation, this study introduces the Experience-and-Emotion Map (E2Map), which integrates not only LLM knowledge but also the agent's real-world experiences, drawing inspiration from human emotional responses. The proposed methodology enables one-shot behavior adjustments by updating the E2Map based on the agent's experiences. Our evaluation in stochastic navigation environments, including both simulations and real-world scenarios, demonstrates that the proposed method significantly enhances performance in stochastic environments compared to existing LLM-based approaches. Code and supplementary materials are available at this https URL.
https://arxiv.org/abs/2409.10027
Recent developments in language models have created new opportunities in air traffic control studies. The current focus is primarily on text and language-based use cases. However, these language models may offer a higher potential impact in the air traffic control domain, thanks to their ability to interact with air traffic environments in an embodied agent form. They also provide a language-like reasoning capability to explain their decisions, the lack of which has been a significant roadblock for the implementation of automatic air traffic control. This paper investigates the application of a language model-based agent with function-calling and learning capabilities to resolve air traffic conflicts without human intervention. The main components of this research are foundational large language models, tools that allow the agent to interact with the simulator, and a new concept, the experience library. An innovative part of this research, the experience library, is a vector database that stores synthesized knowledge that agents have learned from interactions with the simulations and language models. To evaluate the performance of our language model-based agent, both open-source and closed-source models were tested. The results of our study reveal significant differences in performance across various configurations of the language model-based agents. The best-performing configuration was able to solve all but one of the 120 imminent conflict scenarios, including scenarios with up to four aircraft at the same time. Most importantly, the agents are able to provide human-level text explanations of traffic situations and conflict resolution strategies.
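A minimal sketch of the experience-library idea: lessons synthesized from past interactions are stored as vectors, the most relevant ones are retrieved for the current conflict, and they are prepended to the prompt sent to the function-calling LLM. The embedding function, example lessons, and prompt format below are placeholders, not the paper's implementation.

```python
import numpy as np

def embed(text, dim=64):
    # Toy bag-of-words hashing embedder standing in for a real text-embedding model.
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

class ExperienceLibrary:
    def __init__(self):
        self.entries = []                        # (embedding, lesson) pairs

    def add(self, situation, lesson):
        self.entries.append((embed(situation), lesson))

    def query(self, situation, k=2):
        q = embed(situation)
        ranked = sorted(self.entries, key=lambda e: -float(e[0] @ q))
        return [lesson for _, lesson in ranked[:k]]

library = ExperienceLibrary()
library.add("two aircraft converging at FL350 head-on",
            "issue an immediate heading change of at least 30 degrees to one aircraft")
library.add("crossing traffic with 5 NM separation decreasing",
            "climb one aircraft by 1000 ft and monitor the closure rate")

situation = "head-on conflict between two aircraft at FL350, 60 seconds to loss of separation"
prompt = ("You are an air traffic control agent with simulator function calls.\n"
          "Relevant past experience:\n- " + "\n- ".join(library.query(situation)) +
          f"\nCurrent situation: {situation}\nPropose and explain a resolution.")
print(prompt)                                    # would be sent to the LLM together with its tool schema
```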
https://arxiv.org/abs/2409.09717
Despite significant progress in robotics and embodied AI in recent years, deploying robots for long-horizon tasks remains a great challenge. The majority of prior art adheres to an open-loop philosophy and lacks real-time feedback, leading to error accumulation and poor robustness. A handful of approaches have endeavored to establish feedback mechanisms leveraging pixel-level differences or pre-trained visual representations, yet their efficacy and adaptability have been found to be constrained. Inspired by classic closed-loop control systems, we propose CLOVER, a closed-loop visuomotor control framework that incorporates feedback mechanisms to improve adaptive robotic control. CLOVER consists of a text-conditioned video diffusion model for generating visual plans as reference inputs, a measurable embedding space for accurate error quantification, and a feedback-driven controller that refines actions from feedback and initiates replanning as needed. Our framework exhibits notable advancement in real-world robotic tasks and achieves state-of-the-art results on the CALVIN benchmark, improving by 8% over previous open-loop counterparts. Code and checkpoints are maintained at this https URL.
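The closed-loop skeleton can be sketched as: a generated visual plan supplies sub-goal embeddings, execution is monitored by an error measured in that embedding space, and replanning is triggered when the error grows instead of shrinking. The "planner", encoder, and controller below are trivial stand-ins chosen only to show the feedback structure.

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_plan(start_emb, goal_emb, horizon=5):   # stand-in for the video diffusion planner
    return np.linspace(start_emb, goal_emb, num=horizon + 1)[1:]

def encode(state):                                   # stand-in for the measurable embedding space
    return state

goal = rng.normal(size=8)
state = np.zeros(8)
plan, step_idx = generate_plan(encode(state), goal), 0
for t in range(40):
    subgoal = plan[step_idx]
    error = np.linalg.norm(encode(state) - subgoal)
    if error < 0.2:                                  # sub-goal reached: advance along the plan
        step_idx = min(step_idx + 1, len(plan) - 1)
        continue
    action = 0.3 * (subgoal - encode(state)) + 0.05 * rng.normal(size=8)
    state = state + action                           # feedback-driven correction
    if np.linalg.norm(encode(state) - subgoal) > error + 0.5:
        plan, step_idx = generate_plan(encode(state), goal), 0   # measurable error grew: replan
print("final distance to goal:", float(np.linalg.norm(state - goal)))
```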
https://arxiv.org/abs/2409.09016
Mobile manipulation typically entails a base for mobility, an arm for accurate manipulation, and a camera for perception. It is necessary to follow the principle of Distant Mobility, Close Grasping (DMCG) in holistic control. We propose Embodied Holistic Control for Mobile Manipulation (EHC-MM) with the embodied function sig(w): by formulating the DMCG principle as a Quadratic Programming (QP) problem, sig(w) dynamically balances the robot's emphasis between movement and manipulation with consideration of the robot's state and environment. In addition, we propose Monitor-Position-Based Servoing (MPBS) with sig(w), enabling tracking of the target during operation. This approach allows coordinated control between the robot's base, arm, and camera. Through extensive simulations and real-world experiments, our approach significantly improves both the success rate and efficiency of mobile manipulation tasks, achieving a 95.6% success rate in real-world scenarios and a 52.8% increase in time efficiency.
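The DMCG idea can be illustrated with the unconstrained core of such a QP: a single weight sig(w) blends a base-tracking objective with an arm-tracking objective when solving for velocities, shifting emphasis from mobility to manipulation as the robot approaches the target. The Jacobians, error terms, and sigmoid schedule below are invented purely for illustration.

```python
import numpy as np

def sig(distance_to_target, d0=1.0, k=4.0):
    # Far from the target -> emphasize the base; close -> emphasize the arm (assumed schedule).
    return 1.0 / (1.0 + np.exp(-k * (distance_to_target - d0)))

def solve_velocity(J_base, J_arm, e_base, e_arm, w, damping=1e-3):
    # Weighted least squares (the unconstrained core of the QP):
    #   min_v  w ||J_base v - e_base||^2 + (1 - w) ||J_arm v - e_arm||^2
    H = w * J_base.T @ J_base + (1 - w) * J_arm.T @ J_arm + damping * np.eye(J_base.shape[1])
    g = w * J_base.T @ e_base + (1 - w) * J_arm.T @ e_arm
    return np.linalg.solve(H, g)

rng = np.random.default_rng(0)
J_base, J_arm = rng.normal(size=(2, 9)), rng.normal(size=(6, 9))   # toy Jacobians
e_base, e_arm = rng.normal(size=2), rng.normal(size=6)             # toy tracking errors
for distance in (3.0, 0.3):                       # far from the target vs. within grasping range
    w = sig(distance)
    v = solve_velocity(J_base, J_arm, e_base, e_arm, w)
    print(f"distance={distance:.1f} -> sig(w)={w:.2f}, velocity norm={np.linalg.norm(v):.2f}")
```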
https://arxiv.org/abs/2409.08527
Interactive artificial intelligence in the motion control field is an interesting topic, especially when universal knowledge is adaptive to multiple tasks and universal environments. Despite increasing efforts in the field of Reinforcement Learning (RL) aided by transformers, most approaches are limited by an offline training pipeline, which restricts exploration and generalization abilities. To address this limitation, we propose the Online Decision MetaMorphFormer (ODM) framework, which aims to achieve self-awareness, environment recognition, and action planning through a unified model architecture. Motivated by cognitive and behavioral psychology, an ODM agent is able to learn from others, recognize the world, and practice based on its own experience. ODM can also be applied to any arbitrary agent with a multi-joint body, located in different environments, and trained with different types of tasks using large-scale pre-trained datasets. Through the use of pre-trained datasets, ODM can quickly warm up and learn the necessary knowledge to perform the desired task, while the target environment continues to reinforce the universal policy. Extensive online experiments as well as few-shot and zero-shot environmental tests are used to verify ODM's performance and generalization ability. The results of our study contribute to the study of general artificial intelligence in embodied and cognitive fields. Code, results, and video examples can be found on the website \url{this https URL}.
https://arxiv.org/abs/2409.07341
This paper proposes a Few-shot Learning (FSL) approach for detecting presentation attacks on ID cards deployed in a remote verification system, and its extension to new countries. Our research analyses the performance of Prototypical Networks across documents from Spain and Chile as a baseline and measures how well generalisation extends to new ID card countries such as Argentina and Costa Rica, specifically targeting the challenge of screen-display presentation attacks. By leveraging convolutional architectures and the meta-learning principles embodied in Prototypical Networks, we have crafted a model that demonstrates high efficacy with few-shot examples. This research reveals that competitive performance can be achieved with as few as five unique identities and under 100 images per newly added country. This opens new insight into generalised Presentation Attack Detection on ID cards against unknown attacks.
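A compact sketch of prototypical-network inference for this setting: class prototypes are the mean embeddings of a handful of support examples (e.g., from five identities of a new country), and query ID-card images are classified by the nearest prototype. The feature extractor here is a stand-in for the convolutional backbone the paper would meta-train episodically.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))     # placeholder feature extractor

# Support set: a few labelled examples per class from a newly added country.
support_bona = torch.randn(5, 3, 32, 32)
support_attack = torch.randn(5, 3, 32, 32)
prototypes = torch.stack([embed(support_bona).mean(dim=0),
                          embed(support_attack).mean(dim=0)])       # one prototype per class

query = torch.randn(8, 3, 32, 32)                                   # new ID-card images
dists = torch.cdist(embed(query), prototypes)                       # Euclidean distance to prototypes
pred = dists.argmin(dim=1)                                          # 0 = bona fide, 1 = screen-replay attack
print(pred.tolist())
```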
https://arxiv.org/abs/2409.06842
Embodied navigation requires robots to understand and interact with the environment based on given tasks. Vision-Language Navigation (VLN) is an embodied navigation task in which a robot navigates within previously seen and unseen environments, based on linguistic instructions and visual inputs. VLN agents need access to both local and global action spaces; the former for immediate decision making and the latter for recovering from navigational mistakes. Prior VLN agents rely only on instruction-viewpoint alignment for local and global decision making and back-track to a previously visited viewpoint if the instruction and the current viewpoint mismatch. These methods are prone to mistakes, due to the complexity of the instructions and partial observability of the environment. We posit that back-tracking is sub-optimal and that an agent aware of its mistakes can recover efficiently. For optimal recovery, exploration should be extended to unexplored viewpoints (or frontiers). The optimal frontier is a recently observed but unexplored viewpoint that aligns with the instruction and is novel. We introduce a memory-based and mistake-aware path planning strategy for VLN agents, called \textit{StratXplore}, that presents global and local action planning to select the optimal frontier for path correction. The proposed method collects all past actions and viewpoint features during navigation and then selects the optimal frontier suitable for recovery. Experimental results show this simple yet effective strategy improves the success rate on two VLN datasets with different task complexities.
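The frontier-selection idea can be sketched as a simple score combining instruction alignment with novelty relative to remembered viewpoint features; the embeddings, weights, and similarity measure below are illustrative placeholders rather than the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

instruction_emb = rng.normal(size=32)
memory = [rng.normal(size=32) for _ in range(6)]        # features of already visited viewpoints
frontiers = [rng.normal(size=32) for _ in range(4)]     # observed but unexplored viewpoints

def frontier_score(f, alpha=1.0, beta=0.5):
    alignment = cosine(f, instruction_emb)               # agrees with the instruction
    novelty = 1.0 - max(cosine(f, m) for m in memory)    # far from anything already visited
    return alpha * alignment + beta * novelty

best = max(range(len(frontiers)), key=lambda i: frontier_score(frontiers[i]))
print("selected frontier for recovery:", best)
```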
https://arxiv.org/abs/2409.05593
This article introduces a novel heuristic for Task and Motion Planning (TAMP) named Interpretable Responsibility Sharing (IRS), which enhances planning efficiency in domestic robots by leveraging human-constructed environments and inherent biases. Utilizing auxiliary objects (e.g., trays and pitchers), which are commonly found in household settings, IRS systematically incorporates these elements to simplify and optimize task execution. The heuristic is rooted in the novel concept of Responsibility Sharing (RS), where auxiliary objects share the task's responsibility with the embodied agent, dividing complex tasks into manageable sub-problems. This division not only reflects human usage patterns but also aids robots in navigating and manipulating within human spaces more effectively. By integrating Optimized Rule Synthesis (ORS) for decision-making, IRS ensures that the use of auxiliary objects is both strategic and context-aware, thereby improving the interpretability and effectiveness of robotic planning. Experiments conducted across various household tasks demonstrate that IRS significantly outperforms traditional methods by reducing the effort required in task execution and enhancing the overall decision-making process. This approach not only aligns with human intuitive methods but also offers a scalable solution adaptable to diverse domestic environments. Code is available at this https URL.
https://arxiv.org/abs/2409.05586
Embodied AI aims to develop robots that can \textit{understand} and execute human language instructions, as well as communicate in natural languages. On this front, we study the task of generating highly detailed navigational instructions for the embodied robots to follow. Although recent studies have demonstrated significant leaps in the generation of step-by-step instructions from sequences of images, the generated instructions lack variety in terms of their referral to objects and landmarks. Existing speaker models learn strategies to evade the evaluation metrics and obtain higher scores even for low-quality sentences. In this work, we propose SAS (Spatially-Aware Speaker), an instruction generator or \textit{Speaker} model that utilises both structural and semantic knowledge of the environment to produce richer instructions. For training, we employ a reward learning method in an adversarial setting to avoid systematic bias introduced by language evaluation metrics. Empirically, our method outperforms existing instruction generation models, evaluated using standard metrics. Our code is available at \url{this https URL}.
https://arxiv.org/abs/2409.05583
Autonomous navigation for an embodied agent guided by natural language instructions remains a formidable challenge in vision-and-language navigation (VLN). Despite remarkable recent progress in learning fine-grained and multifarious visual representations, the tendency to overfit to the training environments leads to unsatisfactory generalization performance. In this work, we present a versatile Multi-Branch Architecture (MBA) aimed at exploring and exploiting diverse visual inputs. Specifically, we introduce three distinct visual variants: ground-truth depth images, visual inputs integrated with incongruent views, and inputs infused with random noise, to enrich the diversity of visual input representations and prevent overfitting to the original RGB observations. To adaptively fuse these varied inputs, the proposed MBA extends a base agent model into a multi-branch variant, where each branch processes a different visual input. Surprisingly, even random noise can further enhance navigation performance in unseen environments. Extensive experiments conducted on three VLN benchmarks (R2R, REVERIE, SOON) demonstrate that our proposed method equals or even surpasses state-of-the-art results. The source code will be publicly available.
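A minimal multi-branch sketch: three visual variants (e.g., RGB-derived features, depth-derived features, and pure noise) pass through separate branches whose outputs are fused with adaptive weights before the action head. Dimensions, the gating rule, and the branch design are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiBranchAgent(nn.Module):
    def __init__(self, feat_dim=128, n_branches=3, n_actions=6):
        super().__init__()
        self.branches = nn.ModuleList([nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())
                                       for _ in range(n_branches)])
        self.gate = nn.Linear(n_branches * 64, n_branches)      # adaptive fusion weights
        self.head = nn.Linear(64, n_actions)

    def forward(self, variants):                 # list of (B, feat_dim) visual inputs
        feats = [branch(x) for branch, x in zip(self.branches, variants)]
        weights = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)
        fused = sum(weights[:, i:i + 1] * feats[i] for i in range(len(feats)))
        return self.head(fused)

agent = MultiBranchAgent()
rgb_feat = torch.randn(2, 128)
depth_feat = torch.randn(2, 128)
noise_feat = torch.randn(2, 128)                 # even random noise is a legitimate branch input
print(agent([rgb_feat, depth_feat, noise_feat]).shape)   # torch.Size([2, 6])
```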
https://arxiv.org/abs/2409.05552
Tactile sensation plays a crucial role in the development of multi-modal large models and embodied intelligence. To collect tactile data at as low a cost as possible, a series of studies have attempted to generate tactile images by vision-to-touch image translation. However, compared to the text modality, visual-modality-driven tactile generation cannot accurately depict human tactile sensation. In this work, we analyze the characteristics of tactile images in detail from two granularities: object-level (tactile texture, tactile shape) and sensor-level (gel status). We model these granularities of information through text descriptions and propose a fine-grained Text-to-Touch generation method (TextToucher) to generate high-quality tactile samples. Specifically, we introduce a multimodal large language model to build the text sentences about object-level tactile information and employ a set of learnable text prompts to represent the sensor-level tactile information. To better guide the tactile generation process with the built text information, we fuse the dual grains of text information and explore various dual-grain text conditioning methods within the diffusion transformer architecture. Furthermore, we propose a Contrastive Text-Touch Pre-training (CTTP) metric to precisely evaluate the quality of text-driven generated tactile data. Extensive experiments demonstrate the superiority of our TextToucher method. The source codes will be available at \url{this https URL}.
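The dual-grain conditioning can be sketched as combining an object-level text embedding (standing in for the multimodal LLM's description) with learnable sensor-level prompt tokens into a single conditioning vector for the generator; the fusion rule and dimensions below are deliberate simplifications, not the paper's diffusion-transformer design.

```python
import torch
import torch.nn as nn

class DualGrainConditioner(nn.Module):
    def __init__(self, text_dim=256, n_sensor_prompts=4, cond_dim=128):
        super().__init__()
        self.object_proj = nn.Linear(text_dim, cond_dim)             # object-level grain
        self.sensor_prompts = nn.Parameter(torch.randn(n_sensor_prompts, cond_dim))  # sensor-level grain
        self.mix = nn.Linear(2 * cond_dim, cond_dim)

    def forward(self, object_text_emb, gel_state_idx):
        obj = self.object_proj(object_text_emb)
        sensor = self.sensor_prompts[gel_state_idx]                  # learnable gel-status prompt
        return self.mix(torch.cat([obj, sensor], dim=-1))

conditioner = DualGrainConditioner()
object_text_emb = torch.randn(2, 256)             # e.g., embedding of "rough woven fabric, ridged surface"
gel_state_idx = torch.tensor([0, 2])              # which gel-status prompt applies to each sample
cond = conditioner(object_text_emb, gel_state_idx)
print(cond.shape)                                 # (2, 128) conditioning vector for the generator
```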
https://arxiv.org/abs/2409.05427
This article studies the commonsense object affordance concept for enabling close-to-human task planning and task optimization of embodied robotic agents in urban environments. The focus of object affordance is on reasoning about how to effectively identify an object's inherent utility during task execution, which in this work is enabled through the analysis of contextual relations over the sparse information of 3D scene graphs. The proposed framework develops a Correlation Information (CECI) model to learn probability distributions using a Graph Convolutional Network, allowing the commonsense affordance of individual members of a semantic class to be extracted. The overall framework was experimentally validated in a real-world indoor environment, showcasing the ability of the method to match human commonsense. For a video of the article showcasing the experimental demonstration, please refer to the following link: this https URL
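A minimal graph-convolution sketch of the idea: node features from a sparse scene graph are aggregated over contextual edges, and each object node receives a distribution over commonsense affordance labels. The features, adjacency, and label set are toy placeholders rather than the paper's CECI model.

```python
import torch
import torch.nn as nn

class TinyGCN(nn.Module):
    def __init__(self, in_dim=16, hidden=32, n_affordances=5):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden)
        self.lin2 = nn.Linear(hidden, n_affordances)

    def forward(self, x, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        h = torch.relu(self.lin1(adj @ x / deg))            # mean aggregation over neighbours
        return torch.softmax(self.lin2(adj @ h / deg), dim=-1)

n_objects = 6                                               # nodes of the 3D scene graph
x = torch.randn(n_objects, 16)                              # semantic/contextual node features
adj = torch.eye(n_objects)
adj[0, 1] = adj[1, 0] = 1.0                                 # e.g., "cup" is contextually related to "table"
probs = TinyGCN()(x, adj)
print(probs.shape, probs.sum(dim=-1))                       # per-object affordance distribution
```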
https://arxiv.org/abs/2409.05392