Abstract
The advent of edge computing has made real-time intelligent video analytics feasible. Previous works, built on traditional model architectures (e.g., CNNs, RNNs), employ various strategies to filter out non-region-of-interest content to minimize bandwidth and computation consumption, but they perform poorly in adverse environments. Recently, transformer-based visual foundation models have shown strong performance in adverse environments thanks to their remarkable generalization capability. However, they demand substantial computational power, which limits their use in real-time intelligent video analytics. In this paper, we show that visual foundation models such as the Vision Transformer (ViT) also admit a dedicated acceleration mechanism for video analytics. To this end, we introduce Arena, an end-to-end edge-assisted video inference acceleration system based on ViT. We exploit ViT's amenability to token pruning by offloading and feeding only Patches-of-Interest (PoIs) to the downstream models. Additionally, we employ probability-based patch sampling, a simple yet efficient mechanism for determining PoIs from the probable locations of objects in subsequent frames. Through extensive evaluations on public datasets, we find that Arena boosts inference speeds by up to $1.58\times$ and $1.82\times$ on average while consuming only 54% and 34% of the bandwidth, respectively, all while maintaining high inference accuracy.
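The core idea of PoI-based token pruning can be sketched in a few lines. The snippet below is an illustrative assumption, not Arena's actual implementation: the helper names (`select_pois`, `prune_tokens`) are hypothetical, and a fixed probability threshold stands in for the paper's probability-based patch sampling. It shows how keeping only high-probability patches shrinks the token sequence a ViT backbone must process.

```python
import numpy as np

def select_pois(prob_map, threshold=0.5):
    """Pick Patches-of-Interest (PoIs): indices of patches whose
    estimated object probability meets the threshold.
    `prob_map` holds one probability per patch (illustrative)."""
    return np.flatnonzero(prob_map.reshape(-1) >= threshold)

def prune_tokens(patch_tokens, poi_indices):
    """Keep only PoI tokens before the ViT backbone, so the
    transformer attends over a shorter sequence."""
    return patch_tokens[poi_indices]

# Toy example: a 4x4 grid of patches, each an 8-dim embedding.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
# Hypothetical probability map: objects expected in the top-left corner.
prob_map = np.zeros((4, 4))
prob_map[:2, :2] = 0.9
pois = select_pois(prob_map)
pruned = prune_tokens(tokens, pois)
print(pruned.shape)  # (4, 8): only 4 of 16 patch tokens remain
```

Only the pruned tokens would be offloaded to the edge server, which is where both the bandwidth savings and the inference speedup come from.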
URL
https://arxiv.org/abs/2404.09245