Paper Reading AI Learner

Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples

2024-04-25 12:11:38
Kuofeng Gao, Jindong Gu, Yang Bai, Shu-Tao Xia, Philip Torr, Wei Liu, Zhifeng Li

Abstract

Despite the exceptional performance of multi-modal large language models (MLLMs), their deployment requires substantial computational resources. If malicious users induce high energy consumption and long latency (energy-latency cost), the resulting load can exhaust computational resources and degrade service availability. In this paper, we investigate this vulnerability for MLLMs, particularly image-based and video-based ones, and aim to induce a high energy-latency cost during inference by crafting an imperceptible perturbation. We find that the energy-latency cost can be manipulated by maximizing the length of generated sequences, which motivates us to propose verbose samples, including verbose images and videos. Concretely, two modality-non-specific losses are proposed: a loss that delays the end-of-sequence (EOS) token and an uncertainty loss that increases the uncertainty over each generated token. In addition, improving diversity increases the complexity of the generated content and thereby encourages longer responses, which inspires the following modality-specific losses. For verbose images, a token diversity loss is proposed to promote diverse hidden states. For verbose videos, a frame feature diversity loss is proposed to increase the feature diversity among frames. To balance these losses, we propose a temporal weight adjustment algorithm. Experiments demonstrate that our verbose samples can substantially extend the length of generated sequences.
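The losses described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's released code: the function names, the NumPy formulation, and the use of mean pairwise cosine similarity as the diversity term are all assumptions made for clarity.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over the vocabulary axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def eos_delay_loss(logits, eos_id):
    """Mean EOS probability across generated positions.

    Minimizing this term pushes the EOS token later in decoding,
    which lengthens the generated sequence.
    """
    return softmax(logits)[:, eos_id].mean()

def uncertainty_loss(logits):
    """Negative mean entropy of the per-token output distributions.

    Minimizing this term raises the uncertainty over each generated
    token, discouraging short, confident completions.
    """
    probs = softmax(logits)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return -entropy.mean()

def feature_diversity_loss(features):
    """Mean pairwise cosine similarity between feature rows.

    An illustrative stand-in for the paper's modality-specific terms:
    for verbose images the rows would be token hidden states, for
    verbose videos the rows would be per-frame features. Minimizing
    it pushes the rows apart, i.e. increases diversity.
    """
    normed = features / np.linalg.norm(features, axis=-1, keepdims=True)
    sim = normed @ normed.T
    n = sim.shape[0]
    return (sim.sum() - np.trace(sim)) / (n * (n - 1))
```

In an attack loop, a weighted sum of these terms would be differentiated with respect to the input perturbation; the temporal weight adjustment mentioned in the abstract would then rebalance the weights across optimization steps.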

URL

https://arxiv.org/abs/2404.16557

PDF

https://arxiv.org/pdf/2404.16557.pdf
