Paper Reading AI Learner

Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples

2024-04-25 12:11:38
Kuofeng Gao, Jindong Gu, Yang Bai, Shu-Tao Xia, Philip Torr, Wei Liu, Zhifeng Li

Abstract

Despite the exceptional performance of multi-modal large language models (MLLMs), their deployment requires substantial computational resources. If malicious users induce high energy consumption and long latency (energy-latency cost), the resulting load can exhaust computational resources and degrade service availability. In this paper, we investigate this vulnerability for MLLMs, particularly image-based and video-based ones, and aim to induce a high energy-latency cost during inference by crafting an imperceptible perturbation. We find that the energy-latency cost can be manipulated by maximizing the length of generated sequences, which motivates us to propose verbose samples, including verbose images and videos. Concretely, two modality-non-specific losses are proposed: a loss that delays the end-of-sequence (EOS) token and an uncertainty loss that increases the uncertainty over each generated token. In addition, improving diversity increases the complexity of the generated content and thereby encourages longer responses, which inspires the following modality-specific losses. For verbose images, a token diversity loss is proposed to promote diverse hidden states. For verbose videos, a frame feature diversity loss is proposed to increase the feature diversity among frames. To balance these losses, we propose a temporal weight adjustment algorithm. Experiments demonstrate that our verbose samples can substantially extend the length of generated sequences.
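The losses described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's released code: the function names, the NumPy formulation, and the use of mean pairwise cosine similarity as the diversity term are all assumptions made for clarity.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over the vocabulary axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def eos_delay_loss(logits, eos_id):
    """Mean EOS probability across generated positions.

    Minimizing this term pushes the EOS token later in decoding,
    which lengthens the generated sequence.
    """
    return softmax(logits)[:, eos_id].mean()

def uncertainty_loss(logits):
    """Negative mean entropy of the per-token output distributions.

    Minimizing this term raises the uncertainty over each generated
    token, discouraging short, confident completions.
    """
    probs = softmax(logits)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return -entropy.mean()

def feature_diversity_loss(features):
    """Mean pairwise cosine similarity between feature rows.

    An illustrative stand-in for the paper's modality-specific terms:
    for verbose images the rows would be token hidden states, for
    verbose videos the rows would be per-frame features. Minimizing
    it pushes the rows apart, i.e. increases diversity.
    """
    normed = features / np.linalg.norm(features, axis=-1, keepdims=True)
    sim = normed @ normed.T
    n = sim.shape[0]
    return (sim.sum() - np.trace(sim)) / (n * (n - 1))
```

In an attack loop, a weighted sum of these terms would be differentiated with respect to the input perturbation; the temporal weight adjustment mentioned in the abstract would then rebalance the weights across optimization steps.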

URL

https://arxiv.org/abs/2404.16557

PDF

https://arxiv.org/pdf/2404.16557.pdf
