Abstract
In this article, we explore the potential of zero-shot Large Multimodal Models (LMMs) in the domain of drone perception. We focus on person detection and action recognition tasks and evaluate two prominent LMMs, YOLO-World and GPT-4V(ision), using a publicly available dataset captured from aerial views. Traditional deep learning approaches rely heavily on large, high-quality training datasets. However, in certain robotic settings, acquiring such datasets can be resource-intensive or impractical within a reasonable timeframe. The flexibility of prompt-based LMMs and their exceptional generalization capabilities have the potential to revolutionize robotics applications in these scenarios. Our findings suggest that YOLO-World demonstrates good detection performance. GPT-4V struggles to classify action classes accurately but delivers promising results in filtering out unwanted region proposals and in providing a general description of the scene. This research represents an initial step in leveraging LMMs for drone perception and establishes a foundation for future investigations in this area.
URL
https://arxiv.org/abs/2404.01571