Abstract
Embodied visual tracking means following a target object in dynamic 3D environments using an agent's egocentric vision. This is a vital and challenging skill for embodied agents. However, existing methods suffer from inefficient training and poor generalization. In this paper, we propose a novel framework that combines visual foundation models (VFM) and offline reinforcement learning (offline RL) to empower embodied visual tracking. We use a pre-trained VFM, such as "Tracking Anything", to extract semantic segmentation masks with text prompts. We then train a recurrent policy network with offline RL, e.g., Conservative Q-Learning, to learn from the collected demonstrations without online agent-environment interaction. To further improve the robustness and generalization of the policy network, we also introduce a mask re-targeting mechanism and a multi-level data collection strategy. In this way, we can train a robust tracker within an hour on a consumer-level GPU, e.g., an Nvidia RTX 3090. Such efficiency is unprecedented for RL-based visual tracking methods. We evaluate our tracker in several high-fidelity environments with challenging situations, such as distraction and occlusion. The results show that our agent outperforms state-of-the-art methods in sample efficiency, robustness to distractors, and generalization to unseen scenarios and targets. We also demonstrate the transferability of the learned tracker from the virtual world to real-world scenarios.
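The abstract names Conservative Q-Learning as the offline RL algorithm. As a minimal illustration of its core idea, the sketch below implements one tabular CQL update on a single logged transition: the usual TD step plus a conservative penalty that pushes down Q-values across all actions (via the softmax gradient of a logsumexp term) while pushing the dataset action back up. All names and hyperparameters here are illustrative only; the paper's actual tracker is a recurrent network trained on VFM segmentation masks, not a tabular agent.

```python
import math

def cql_update(Q, s, a, r, s_next, alpha=1.0, gamma=0.99, lr=0.1):
    """One tabular conservative Q-learning step on a logged transition.

    Q maps a state to a list of per-action values. This is a toy sketch
    of the CQL objective, not the paper's implementation.
    """
    # TD target computed purely from the offline transition (no env rollout).
    target = r + gamma * max(Q[s_next])
    td_grad = Q[s][a] - target
    # Conservative penalty: the gradient of logsumexp over actions is the
    # softmax, which pushes all Q(s, .) down; the logged action a is then
    # pushed back up, so only out-of-distribution actions stay suppressed.
    m = max(Q[s])                            # for numerical stability
    exps = [math.exp(q - m) for q in Q[s]]
    z = sum(exps)
    for i in range(len(Q[s])):
        grad = alpha * exps[i] / z           # from the logsumexp term
        if i == a:
            grad += td_grad - alpha          # TD error + data-action term
        Q[s][i] -= lr * grad
    return Q
```

Repeating this update on demonstration data drives the logged action's value above the unlogged alternatives, which is the conservatism that lets the policy learn safely without online interaction.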
Abstract (translated)
Embodied visual tracking means following a target object in a dynamic 3D environment using an agent's egocentric vision. This is a vital and challenging skill for embodied agents. However, existing methods suffer from inefficient training and poor generalization. In this paper, we propose a novel framework that combines visual foundation models (VFM) and offline reinforcement learning (offline RL) to empower embodied visual tracking. We use a pre-trained VFM, such as "Tracking Anything", to extract semantic segmentation masks with text prompts. We then train a recurrent policy network with offline RL, e.g., Conservative Q-Learning, to learn from the collected demonstrations without online agent-environment interaction. To further improve the robustness and generalization of the policy network, we also introduce a mask re-targeting mechanism and a multi-level data collection strategy. In this way, we can train a robust tracker within an hour on a consumer-level GPU, e.g., an Nvidia RTX 3090. Such efficiency is unprecedented for RL-based visual tracking methods. We evaluate our tracker in challenging environments with situations such as distraction and occlusion. The results show that our agent outperforms state-of-the-art methods in sample efficiency, robustness to distractors, and generalization to unseen scenarios and targets. We also demonstrate the transferability of the learned tracker from the virtual world to real-world scenarios.
URL
https://arxiv.org/abs/2404.09857