Abstract
Foundation vision-language models are becoming increasingly relevant to robotics because they can provide richer semantic perception than narrow task-specific pipelines. However, their practical adoption in robot software stacks still depends on reproducible middleware integrations rather than on model quality alone. Florence-2 is especially attractive in this regard because it unifies captioning, optical character recognition, open-vocabulary detection, grounding and related vision-language tasks within a comparatively manageable model size. This article presents a ROS 2 wrapper for Florence-2 that exposes the model through three complementary interaction modes: continuous topic-driven processing, synchronous service calls and asynchronous actions. The wrapper is designed for local execution and supports both native installation and Docker container deployment. It also combines generic JSON outputs with standard ROS 2 message bindings for detection-oriented tasks. A functional validation is reported together with a throughput study on several GPUs, showing that local deployment is feasible on consumer-grade hardware. The repository is publicly available.
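To illustrate how a generic JSON output might be bound to a detection-oriented message, the following minimal sketch converts a Florence-2-style object detection result into center/size boxes. It rests on several assumptions: the input is assumed to follow the dictionary layout that Florence-2 post-processing typically returns for the `<OD>` task (`bboxes` as `[x1, y1, x2, y2]` plus `labels`), the `Box2D` dataclass is a hypothetical plain-Python stand-in for a center/size bounding box such as `vision_msgs/BoundingBox2D`, and the function name `boxes_from_florence` is illustrative. None of this is taken from the wrapper's actual code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Box2D:
    """Hypothetical stand-in for a center/size bounding box message."""
    label: str
    center_x: float
    center_y: float
    size_x: float
    size_y: float


def boxes_from_florence(result: Dict, task: str = "<OD>") -> List[Box2D]:
    """Convert corner-format boxes (x1, y1, x2, y2) to center/size boxes.

    Assumes `result` looks like {task: {"bboxes": [...], "labels": [...]}}.
    A missing task key yields an empty list rather than an error.
    """
    payload = result.get(task, {})
    bboxes = payload.get("bboxes", [])
    labels = payload.get("labels", [])
    boxes = []
    for (x1, y1, x2, y2), label in zip(bboxes, labels):
        boxes.append(
            Box2D(
                label=label,
                center_x=(x1 + x2) / 2.0,
                center_y=(y1 + y2) / 2.0,
                size_x=x2 - x1,
                size_y=y2 - y1,
            )
        )
    return boxes


# Illustrative input only; not real model output.
sample = {"<OD>": {"bboxes": [[10.0, 20.0, 110.0, 220.0]], "labels": ["mug"]}}
boxes = boxes_from_florence(sample)
```

In a ROS 2 node, the fields of each `Box2D` would be copied into the corresponding message type before publishing, while the original JSON could still be published as a string for tasks that have no standard message.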
URL
https://arxiv.org/abs/2604.01179