Abstract
Image Compression for Machines (ICM) has emerged as a pivotal research direction in visual data compression. However, with the rapid evolution of machine intelligence, the target of compression has shifted from task-specific virtual models to embodied agents operating in real-world environments. To address the communication constraints of Embodied AI in multi-agent systems and ensure real-time task execution, this paper introduces, for the first time, the scientific problem of Embodied Image Compression. We establish a standardized benchmark, EmbodiedComp, to facilitate systematic evaluation under ultra-low bitrate conditions in a closed-loop setting. Through extensive empirical studies in both simulated and real-world settings, we demonstrate that existing Vision-Language-Action models (VLAs) fail to reliably perform even simple manipulation tasks when their input images are compressed below the Embodied bitrate threshold. We anticipate that EmbodiedComp will catalyze the development of domain-specific compression tailored for embodied agents, thereby accelerating the deployment of Embodied AI in the real world.
URL
https://arxiv.org/abs/2512.11612