Deep Neural Networks (DNNs) are known to be vulnerable to adversarial examples. Further, these adversarial examples are found to be transferable from the source network in which they are crafted to a black-box target network. As the trend of using deep learning on embedded devices grows, it becomes relevant to study the transferability properties of adversarial examples among compressed networks. In this paper, we consider quantization as a network compression technique and evaluate the performance of transfer-based attacks when the source and target networks are quantized at different bitwidths. We explore how algorithm-specific properties affect transferability by considering various adversarial example generation algorithms. Furthermore, we examine transferability in a more realistic scenario where the source and target networks may differ in bitwidth and other model-related properties like capacity and architecture. We find that although quantization reduces transferability, certain attack types demonstrate an ability to enhance it. Additionally, the average transferability of adversarial examples among quantized versions of a network can be used to estimate the transferability to quantized target networks with varying capacity and architecture.
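To make the evaluation protocol concrete, here is a minimal sketch (assuming PyTorch; `source_model` and `quantized_target` are hypothetical placeholders, and FGSM stands in for the paper's various generation algorithms) of crafting adversarial examples on a source network and measuring their transfer rate to a quantized target:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """Craft FGSM adversarial examples on the white-box source network."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

@torch.no_grad()
def transfer_rate(target, x_adv, y):
    """Fraction of adversarial examples that also fool the black-box target."""
    return (target(x_adv).argmax(dim=1) != y).float().mean().item()

# x_adv = fgsm(source_model, x, y)                  # craft on the fp32 source
# print(transfer_rate(quantized_target, x_adv, y))  # evaluate on, e.g., a 4-bit copy
```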
https://arxiv.org/abs/2405.09598
Neural audio coding has emerged as a vibrant research direction, promising good audio quality at very low bitrates unachievable by classical coding techniques. Here, end-to-end trainable autoencoder-like models represent the state of the art: a discrete representation must be learned in the bottleneck of the autoencoder that allows for efficient transmission of the input audio signal. This discrete representation is typically generated by applying a quantizer to the output of the neural encoder. In almost all state-of-the-art neural audio coding approaches, this quantizer is realized as a Vector Quantizer (VQ), and considerable effort has been spent alleviating the drawbacks of this quantization technique when used together with a neural audio coder. In this paper, we propose simple alternatives to VQ, which are based on projected Scalar Quantization (SQ). These quantization techniques do not need any additional losses, scheduling parameters, or codebook storage, thereby simplifying the training of neural audio codecs. Furthermore, we propose a new causal network architecture for neural speech coding that shows good performance at very low computational complexity.
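As an illustration of why SQ needs no codebook or auxiliary loss, here is a minimal sketch of a projected scalar quantizer with a straight-through estimator (the tanh projection and grid size are assumptions for illustration, not the paper's exact design):

```python
import torch

def scalar_quantize(z, levels=16):
    """Round each bottleneck dimension to one of `levels` uniform values in [-1, 1]."""
    z = torch.tanh(z)                                   # project to a bounded range
    zq = torch.round((z + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1
    return z + (zq - z).detach()                        # straight-through gradient
```

Because the grid is fixed, there is nothing to learn in the quantizer itself, which is what removes the codebook losses and scheduling that VQ-based codecs typically need.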
https://arxiv.org/abs/2405.08417
Generative Artificial Intelligence (GAI) is taking the world by storm with its unparalleled content creation ability. Large Language Models (LLMs) are at the forefront of this movement. However, the significant resource demands of LLMs often require cloud hosting, which raises issues regarding privacy, latency, and usage limitations. Although edge intelligence has long been utilized to solve these challenges by enabling real-time AI computation on ubiquitous edge resources close to data sources, most research has focused on traditional AI models and has left a gap in addressing the unique characteristics of LLM inference, such as considerable model size, auto-regressive processes, and self-attention mechanisms. In this paper, we present an edge intelligence optimization problem tailored for LLM inference. Specifically, with the deployment of the batching technique and model quantization on resource-limited edge devices, we formulate an inference model for transformer decoder-based LLMs. Furthermore, our approach aims to maximize the inference throughput via batch scheduling and joint allocation of communication and computation resources, while also considering edge resource constraints and varying user requirements for latency and accuracy. To address this NP-hard problem, we develop an optimal Depth-First Tree-Searching algorithm with online tree-Pruning (DFTSP) that operates within a feasible time complexity. Simulation results indicate that DFTSP surpasses other batching benchmarks in throughput across diverse user settings and quantization techniques, and it reduces time complexity by over 45% compared to the brute-force searching method.
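The abstract does not spell DFTSP out, but its skeleton is a classic depth-first branch-and-bound; the sketch below shows the generic pattern, with `options`, `score`, `bound`, and `feasible` as stand-ins for the paper's throughput model and resource constraints:

```python
def dfs_schedule(requests, options, score, bound, feasible):
    """Depth-first search over per-request batching choices with online pruning."""
    best_score, best_assign = float("-inf"), None

    def dfs(i, assign):
        nonlocal best_score, best_assign
        if i == len(requests):                    # leaf: full assignment scored
            s = score(assign)
            if s > best_score:
                best_score, best_assign = s, list(assign)
            return
        if bound(assign) <= best_score:           # optimistic bound can't beat
            return                                # the incumbent -> prune subtree
        for choice in options(requests[i], assign):
            if feasible(assign + [choice]):       # respect edge resource limits
                dfs(i + 1, assign + [choice])

    dfs(0, [])
    return best_score, best_assign
```

The pruning step is what cuts the brute-force search: whole subtrees are skipped whenever an optimistic throughput bound cannot beat the best schedule found so far.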
https://arxiv.org/abs/2405.07140
Large Language Models (LLMs) have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges. This paper explores the potential of quantization to mitigate these challenges. We systematically study the combined application of two well-known post-training techniques, SmoothQuant and GPTQ, and provide a comprehensive analysis of their interactions and implications for advancing LLM quantization. We enhance the versatility of both techniques by enabling quantization to microscaling (MX) formats, expanding their applicability beyond their initial fixed-point format targets. We show that by applying GPTQ and SmoothQuant, and employing MX formats for quantizing models, we can achieve a significant reduction in the size of OPT models by up to 4x and LLaMA models by up to 3x with a negligible perplexity increase of 1-3%.
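For reference, the core of SmoothQuant is a per-channel rescaling that migrates activation outliers into the weights before post-training quantization; a minimal sketch, where the shapes and the default alpha follow the commonly published formulation and `x_absmax` is assumed to come from calibration data:

```python
import torch

def smooth(x_absmax, weight, alpha=0.5):
    """Return per-input-channel scales s and the smoothed weight W' = s * W.

    Shapes assume Y = X @ W with X: (tokens, d_in), W: (d_in, d_out);
    x_absmax: (d_in,) per-channel activation abs-max from calibration.
    """
    w_absmax = weight.abs().amax(dim=1)                     # per d_in channel
    s = (x_absmax ** alpha) / (w_absmax ** (1 - alpha))
    s = s.clamp(min=1e-5)
    return s, weight * s.unsqueeze(1)

# At inference, activations are divided by s (foldable into the previous op),
# so (X / s) @ (s * W) reproduces X @ W while flattening X's outliers.
```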
https://arxiv.org/abs/2405.07135
Large language models (LLMs) have emerged, offering general problem-solving capabilities within a single model. However, model sizes have increased dramatically, to billions of parameters, to enable such broad problem-solving capabilities. In addition, due to the dominance of matrix-matrix and matrix-vector multiplications in LLMs, the compute-to-model-size ratio is significantly lower than that of CNNs. This shift pushes LLMs from a computation-bound regime to a memory-bound regime. Therefore, optimizing the memory footprint and traffic is an important optimization direction for LLMs today. Model compression methods such as quantization and parameter pruning have been actively explored for achieving memory footprint and traffic optimization. However, the accuracy-efficiency trade-off of rank pruning for LLMs is not yet well understood. Therefore, we characterize the accuracy-efficiency trade-off of a low-rank decomposition method, specifically Tucker decomposition, on recent language models, including an open-source LLM, Llama 2. We formalize the low-rank decomposition design space and show that it is enormous (e.g., O($2^{37}$) for Llama2-7B). To navigate such a vast design space, we perform thorough case studies of accuracy-efficiency trade-offs using six widely used LLM benchmarks on BERT and Llama 2 models. Our results show that we can achieve a 9% model size reduction with minimal accuracy drops, which range from 4%p to 10%p depending on the difficulty of the benchmark, without any retraining to recover accuracy after decomposition. The results show that low-rank decomposition can be a promising direction for LLM-based applications that require real-time service at scale (e.g., AI agent assist and real-time coding assistant), where latency is as important as model accuracy.
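A minimal sketch of the kind of low-rank Tucker factorization studied here, assuming the `tensorly` library; the matrix size and ranks below are arbitrary illustrations, not the paper's chosen design points:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

W = np.random.randn(1024, 1024)               # stand-in for a transformer weight
core, factors = tucker(tl.tensor(W), rank=[128, 128])
W_approx = tl.tucker_to_tensor((core, factors))

params_before = W.size
params_after = core.size + sum(f.size for f in factors)
print(f"compression: {params_before / params_after:.1f}x, "
      f"rel. error: {np.linalg.norm(W - W_approx) / np.linalg.norm(W):.3f}")
```

The design-space explosion the paper describes comes from choosing which layers to decompose and which ranks to assign to each, not from any single factorization.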
https://arxiv.org/abs/2405.06626
Large language models (LLMs) can now handle longer sequences of tokens, enabling complex tasks like book understanding and generating lengthy novels. However, the key-value (KV) cache required for LLMs consumes substantial memory as context length increases, becoming the bottleneck for deployment. In this paper, we present a strategy called SKVQ, which stands for sliding-window KV cache quantization, to address the issue of extremely low-bitwidth KV cache quantization. To achieve this, SKVQ rearranges the channels of the KV cache in order to improve the similarity of channels in quantization groups, and applies clipped dynamic quantization at the group level. Additionally, SKVQ ensures that the most recent window tokens in the KV cache are preserved with high precision. This helps maintain the accuracy of a small but important portion of the KV cache. SKVQ achieves high compression ratios while maintaining accuracy. Our evaluation on LLMs demonstrates that SKVQ surpasses previous quantization approaches, allowing for quantization of the KV cache to 2-bit keys and 1.5-bit values with minimal loss of accuracy. With SKVQ, it is possible to process context lengths of up to 1M on an 80GB-memory GPU for a 7B model and to achieve up to 7x faster decoding.
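A hedged sketch of the two ingredients named in the abstract, group-level clipped dynamic quantization plus a full-precision recent window; the group size, clip ratio, and tensor layout are illustrative assumptions, and the channel reordering step is omitted:

```python
import torch

def quantize_kv(kv, window=32, bits=2, group=64, clip=0.95):
    """kv: (heads, seq, dim); returns a dequantized-old + full-precision-recent cache."""
    old, recent = kv[:, :-window], kv[:, -window:]
    g = old.reshape(*old.shape[:-1], -1, group)           # group channels
    qmax = 2 ** (bits - 1) - 1
    hi = clip * g.abs().amax(dim=-1, keepdim=True)        # clipped dynamic range
    scale = (hi / qmax).clamp(min=1e-8)
    q = (g / scale).round().clamp(-(qmax + 1), qmax)      # e.g. 2-bit codes
    deq = (q * scale).reshape(old.shape)
    return torch.cat([deq, recent], dim=1)                # recent window untouched
```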
https://arxiv.org/abs/2405.06219
Deep neural networks (DNNs) have been widely used in many artificial intelligence (AI) tasks. However, deploying them brings significant challenges due to the huge cost of memory, energy, and computation. To address these challenges, researchers have developed various model compression techniques such as model quantization and model pruning. Recently, there has been a surge in research on compression methods that achieve model efficiency while retaining performance. Furthermore, more and more works focus on customizing DNN hardware accelerators to better leverage model compression techniques. In addition to efficiency, preserving security and privacy is critical for deploying DNNs. However, the vast and diverse body of related works can be overwhelming. This inspires us to conduct a comprehensive survey on recent research toward the goal of high-performance, cost-efficient, and safe deployment of DNNs. Our survey first covers the mainstream model compression techniques such as model quantization, model pruning, knowledge distillation, and optimizations of non-linear operations. We then introduce recent advances in designing hardware accelerators that can adapt to efficient model compression approaches. Additionally, we discuss how homomorphic encryption can be integrated to secure DNN deployment. Finally, we discuss several issues, such as hardware evaluation, generalization, and integration of various compression approaches. Overall, we aim to provide a big picture of efficient DNNs, from algorithm to hardware accelerators and security perspectives.
https://arxiv.org/abs/2405.06038
Recent advancements in large language models (LLMs) are propelling us toward artificial general intelligence, thanks to their remarkable emergent abilities and reasoning capabilities. However, the substantial computational and memory requirements of LLMs limit their widespread adoption. Quantization, a key compression technique, offers a viable solution to mitigate these demands by compressing and accelerating LLMs, albeit with potential risks to model accuracy. Numerous studies have aimed to minimize the accuracy loss associated with quantization. However, the quantization configurations in these studies vary and may not be optimized for hardware compatibility. In this paper, we focus on identifying the most effective practices for quantizing LLMs, with the goal of balancing performance with computational efficiency. For a fair analysis, we develop a quantization toolkit, LLMC, and design four crucial principles considering inference efficiency, quantized accuracy, calibration cost, and modularization. By benchmarking on various models and datasets with over 500 experiments, three takeaways corresponding to calibration data, quantization algorithms, and quantization schemes are derived. Finally, a best-practice LLM PTQ pipeline is constructed. All the benchmark results and the toolkit can be found at this https URL.
https://arxiv.org/abs/2405.06001
Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented in the QServe inference library, which achieves measured speedups. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight, in the QoQ algorithm, we introduce progressive quantization that allows low dequantization overhead in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100 and 1.4x on L40S, and of Qwen1.5-72B by 2.4x on A100 and 3.5x on L40S, compared to TensorRT-LLM. Remarkably, QServe on the L40S GPU can achieve even higher throughput than TensorRT-LLM on the A100. Thus, QServe effectively reduces the dollar cost of LLM serving by 3x. Code is available at this https URL.
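To illustrate "progressive quantization", here is a hedged two-level sketch in the spirit of QoQ, not its exact kernel: weights are first mapped to an INT8 grid per output channel and then to INT4 per group relative to that grid, so runtime dequantization is a cheap INT4-to-INT8 step rather than INT4-to-FP16; the group size is an illustrative assumption:

```python
import torch

def progressive_quant(w, group=128):
    """w: (out_features, in_features); returns the dequantized reference weight."""
    s8 = (w.abs().amax(dim=1, keepdim=True) / 127).clamp(min=1e-8)
    w8 = (w / s8).round().clamp(-128, 127)            # level 1: per-channel INT8 grid
    g = w8.reshape(w8.shape[0], -1, group)
    s4 = (g.abs().amax(dim=-1, keepdim=True) / 7).clamp(min=1e-8)
    w4 = (g / s4).round().clamp(-8, 7)                # level 2: stored 4-bit codes
    w8_hat = (w4 * s4).reshape(w8.shape)              # cheap dequant back to ~INT8
    return w8_hat * s8                                # fp reference for accuracy checks
```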
https://arxiv.org/abs/2405.04532
Motivated by the huge success of Transformers in the field of natural language processing (NLP), Vision Transformers (ViTs) have been rapidly developed and achieved remarkable performance in various computer vision tasks. However, their huge model sizes and intensive computations hinder ViTs' deployment on embedded devices, calling for effective model compression methods, such as quantization. Unfortunately, due to the existence of hardware-unfriendly and quantization-sensitive non-linear operations, particularly Softmax, it is non-trivial to completely quantize all operations in ViTs, yielding either significant accuracy drops or non-negligible hardware costs. In response to challenges associated with standard ViTs, we focus our attention on the quantization and acceleration of efficient ViTs, which not only eliminate the troublesome Softmax but also integrate linear attention with low computational complexity, and propose Trio-ViT accordingly. Specifically, at the algorithm level, we develop a tailored post-training quantization engine that takes the unique activation distributions of Softmax-free efficient ViTs into full consideration, aiming to boost quantization accuracy. Furthermore, at the hardware level, we build an accelerator dedicated to the specific Convolution-Transformer hybrid architecture of efficient ViTs, thereby enhancing hardware efficiency. Extensive experimental results consistently prove the effectiveness of our Trio-ViT framework. Particularly, we can gain up to 7.2x and 14.6x FPS under comparable accuracy over state-of-the-art ViT accelerators, as well as 5.9x and 2.0x DSP efficiency. Codes will be released publicly upon acceptance.
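For context, the Softmax-free linear attention that efficient ViTs build on can be written in a few lines; the ReLU feature map below is one common choice and an assumption here, not necessarily Trio-ViT's:

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (batch, tokens, dim); Softmax-free attention in O(N * d^2)."""
    q, k = torch.relu(q), torch.relu(k)                  # non-negative feature map
    kv = torch.einsum("bnd,bne->bde", k, v)              # (dim, dim) global summary
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)     # normalized output
```

Because the (dim x dim) summary replaces the (tokens x tokens) attention matrix, cost scales as O(N·d²) instead of the O(N²·d) of softmax attention, which is also what removes the hardware-unfriendly Softmax.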
https://arxiv.org/abs/2405.03882
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity. We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method and sparse pretraining of those models on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset. We exhibit training acceleration due to sparsity on Cerebras CS-3 chips that closely matches theoretical scaling. In addition, we establish inference acceleration of up to 3x on CPUs by utilizing Neural Magic's DeepSparse engine and 1.7x on GPUs through Neural Magic's nm-vllm engine. The above gains are realized via sparsity alone, thus enabling further gains through additional use of quantization. Specifically, we show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x. We demonstrate these results across diverse, challenging tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization to prove their generality. This work paves the way for rapidly creating smaller and faster LLMs without sacrificing accuracy.
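SparseGPT itself uses second-order information, so the sketch below is not the paper's method; it is a minimal one-shot magnitude-pruning baseline that shows what "70% unstructured sparsity" means for a weight tensor:

```python
import torch

def prune_magnitude(w, sparsity=0.7):
    """Zero out the smallest-magnitude `sparsity` fraction of weights."""
    k = max(1, int(w.numel() * sparsity))
    threshold = w.abs().flatten().kthvalue(k).values
    mask = w.abs() > threshold
    # The mask is kept so the sparsity pattern survives subsequent fine-tuning
    # (sparse pretraining, as in the paper, re-trains with the mask fixed).
    return w * mask, mask
```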
https://arxiv.org/abs/2405.03594
Scale is often credited as one of the factors that cause an increase in the performance of LLMs, resulting in models with billions and trillions of parameters. One of the limitations of such large models is the high computational requirements that limit their usage, deployment, and debugging in resource-constrained scenarios. Two commonly used alternatives to bypass these limitations are to use smaller versions of LLMs (e.g. Llama 7B instead of Llama 70B) and to lower the memory requirements by using quantization. While these approaches effectively address the limitation of resources, their impact on model performance needs thorough examination. In this study, we perform a comprehensive evaluation to investigate the effect of model scale and quantization on performance. We experiment with two major families of open-source instruct models ranging from 7 billion to 70 billion parameters. Our extensive zero-shot experiments across various tasks including natural language understanding, reasoning, misinformation detection, and hallucination reveal that larger models generally outperform their smaller counterparts, suggesting that scale remains an important factor in enhancing performance. We found that larger models show exceptional resilience to precision reduction and can maintain high accuracy even at 4-bit quantization for numerous tasks, and that they serve as a better solution than using smaller models at high precision under similar memory requirements.
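The memory side of that conclusion is simple arithmetic; a quick sketch counting weights only (ignoring activations, KV cache, and quantization-scale overhead):

```python
# Weight memory = parameter count * bits / 8 bytes.
for params_b, bits in [(7, 16), (13, 16), (70, 4)]:
    print(f"{params_b}B params @ {bits}-bit ~ {params_b * bits / 8:.0f} GB of weights")
# 7B @ 16-bit ~ 14 GB, 13B @ 16-bit ~ 26 GB, 70B @ 4-bit ~ 35 GB
```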
https://arxiv.org/abs/2405.03146
Segment Anything Model (SAM) has achieved impressive performance in many computer vision tasks. However, as a large-scale model, its immense memory and computation costs hinder its practical deployment. In this paper, we propose a post-training quantization (PTQ) framework for the Segment Anything Model, namely PTQ4SAM. First, we investigate the inherent bottleneck of SAM quantization attributed to the bimodal distribution in post-Key-Linear activations. We analyze its characteristics from both per-tensor and per-channel perspectives, and propose a Bimodal Integration strategy, which utilizes a mathematically equivalent sign operation to transform the bimodal distribution into a relatively easy-to-quantize normal distribution offline. Second, SAM encompasses diverse attention mechanisms (i.e., self-attention and two-way cross-attention), resulting in substantial variations in the post-Softmax distributions. Therefore, we introduce an Adaptive Granularity Quantization for Softmax by searching for the optimal power-of-two base, which is hardware-friendly. Extensive experimental results across various vision tasks (instance segmentation, semantic segmentation, and object detection), datasets, and model variants show the superiority of PTQ4SAM. For example, when quantizing SAM-L to 6-bit, we achieve nearly lossless accuracy for instance segmentation (about a 0.5% drop) with a theoretical 3.9x acceleration. The code is available at this https URL.
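A heavily hedged sketch of the sign-folding idea behind Bimodal Integration (the mean-based detection rule and the weight layout are illustrative assumptions; the paper derives the equivalent transformation for SAM's attention specifically):

```python
import torch

def fold_channel_signs(key_proj_weight, calib_acts):
    """key_proj_weight: (channels, in_features); calib_acts: (tokens, channels)."""
    flip = calib_acts.mean(dim=0) < 0           # channels dominated by the
    sign = torch.where(flip, -1.0, 1.0)         # negative mode get flipped
    # Multiplying row c of the producing weight by sign[c] flips that output
    # channel offline; mathematical equivalence holds as long as the consuming
    # op (here, the attention product) absorbs the same per-channel sign.
    return key_proj_weight * sign.unsqueeze(1), sign
```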
https://arxiv.org/abs/2405.03144
Large language models (LLMs) have recently achieved state-of-the-art performance across various tasks, yet due to their large computational requirements, they struggle with strict latency and power demands. Deep neural network (DNN) quantization has traditionally addressed these limitations by converting models to low-precision integer formats. Yet recently alternative formats, such as Normal Float (NF4), have been shown to consistently increase model accuracy, albeit at the cost of increased chip area. In this work, we first conduct a large-scale analysis of LLM weights and activations across 30 networks to conclude most distributions follow a Student's t-distribution. We then derive a new theoretically optimal format, Student Float (SF4), with respect to this distribution, that improves over NF4 across modern LLMs, for example increasing the average accuracy on LLaMA2-7B by 0.76% across tasks. Using this format as a high-accuracy reference, we then propose augmenting E2M1 with two variants of supernormal support for higher model accuracy. Finally, we explore the quality and performance frontier across 11 datatypes, including non-traditional formats like Additive-Powers-of-Two (APoT), by evaluating their model accuracy and hardware complexity. We discover a Pareto curve composed of INT4, E2M1, and E2M1 with supernormal support, which offers a continuous tradeoff between model accuracy and chip area. For example, E2M1 with supernormal support increases the accuracy of Phi-2 by up to 2.19% with 1.22% area overhead, enabling more LLM-based applications to be run at four bits.
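The generic recipe behind lookup formats like NF4, and by extension SF4, is to place the 16 code levels at quantiles of an assumed weight distribution; a sketch using a Student's t-distribution per the paper's observation (the tail cutoffs and degrees of freedom are assumptions, and SF4's exact construction may differ):

```python
import torch
from scipy.stats import t as student_t

def t_float4_levels(df=5.0):
    """16 code levels at Student's-t quantiles, normalized to [-1, 1]."""
    probs = torch.linspace(0.02, 0.98, 16)          # avoid the infinite tails
    levels = torch.tensor(student_t.ppf(probs.numpy(), df)).float()
    return levels / levels.abs().max()

def quantize_to_levels(w, levels):
    """Round each weight to its nearest code level (absmax-scaled)."""
    scale = w.abs().max()
    idx = (w / scale).unsqueeze(-1).sub(levels).abs().argmin(dim=-1)
    return levels[idx] * scale
```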
https://arxiv.org/abs/2405.03103
Motion diffusion models have recently proven successful for text-driven human motion generation. Despite their excellent generation performance, they are challenging to run in real time due to the multi-step sampling mechanism, which involves tens or hundreds of repeated function evaluations. To this end, we investigate Motion Latent Consistency Training (MLCT) for motion generation to alleviate the computation and time consumption of iterative inference. It applies diffusion pipelines to low-dimensional motion latent spaces to mitigate the computational burden of each function evaluation. Explaining the diffusion process with probabilistic flow ordinary differential equation (PF-ODE) theory, MLCT enables inference in extremely few steps from the prior distribution to the motion latent representation distribution by maintaining consistency of the outputs over the trajectory of the PF-ODE. In particular, we introduce a quantization constraint to optimize motion latent representations that are bounded, regular, and well-reconstructed compared to traditional variational constraints. Furthermore, we propose a conditional PF-ODE trajectory simulation method, which improves conditional generation performance with minimal additional training cost. Extensive experiments on two human motion generation benchmarks show that the proposed model achieves state-of-the-art performance with less than 10% of the time cost.
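For orientation, a generic consistency-training objective of the kind MLCT builds on (not the paper's exact loss, and omitting its quantization constraint): outputs at adjacent points of the same PF-ODE trajectory are pulled together via an EMA target, which is what later permits few-step inference:

```python
import torch
import torch.nn.functional as F

def consistency_loss(f, f_ema, z, t_hi, t_lo, ode_step):
    """z: noisy latent at time t_hi; ode_step integrates one PF-ODE step down to t_lo."""
    z_lo = ode_step(z, t_hi, t_lo)             # adjacent point on the same trajectory
    target = f_ema(z_lo, t_lo).detach()        # stop-gradient EMA teacher target
    return F.mse_loss(f(z, t_hi), target)      # enforce self-consistency along the ODE
```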
https://arxiv.org/abs/2405.02791
We establish the fundamental limits in the approximation of Lipschitz functions by deep ReLU neural networks with finite-precision weights. Specifically, three regimes, namely under-, over-, and proper quantization, in terms of minimax approximation error behavior as a function of network weight precision, are identified. This is accomplished by deriving nonasymptotic tight lower and upper bounds on the minimax approximation error. Notably, in the proper-quantization regime, neural networks exhibit memory-optimality in the approximation of Lipschitz functions. Deep networks have an inherent advantage over shallow networks in achieving memory-optimality. We also develop the notion of depth-precision tradeoff, showing that networks with high-precision weights can be converted into functionally equivalent deeper networks with low-precision weights, while preserving memory-optimality. This idea is reminiscent of sigma-delta analog-to-digital conversion, where oversampling rate is traded for resolution in the quantization of signal samples. We improve upon the best-known ReLU network approximation results for Lipschitz functions and describe a refinement of the bit extraction technique which could be of independent general interest.
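In standard notation (assumed here for illustration), the quantity these bounds concern is the minimax error over 1-Lipschitz functions achievable by ReLU networks with W weights stored at precision b:

```latex
\[
  \mathcal{E}(W, b) \;=\; \sup_{f \in \operatorname{Lip}_1([0,1]^d)}
  \; \inf_{\Phi \in \mathcal{N}_{W,b}}
  \; \| f - \Phi \|_{L^\infty([0,1]^d)}
\]
```

where $\mathcal{N}_{W,b}$ denotes ReLU networks with at most $W$ weights, each representable in $b$ bits; the three regimes describe how $\mathcal{E}(W, b)$ behaves as $b$ is varied relative to $W$.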
https://arxiv.org/abs/2405.01952
Detection of changes in heterogeneous remote sensing images is vital, especially in response to emergencies like earthquakes and floods. Current homogeneous-transformation-based change detection (CD) methods often suffer from high computation and memory costs, which are not friendly to edge-computation devices such as onboard CD devices on satellites. To address this issue, this paper proposes a new lightweight CD method for heterogeneous remote sensing images that employs the online all-integer pruning (OAIP) training strategy to efficiently fine-tune the CD network using the current test data. The proposed CD network consists of two visual geometry group (VGG) subnetworks as the backbone architecture. In the OAIP-based training process, all weights, gradients, and intermediate data are quantized to integers to speed up training and reduce memory usage, where a per-layer block exponentiation scaling scheme is utilized to reduce the computation errors of network parameters caused by quantization. Second, an adaptive filter-level pruning method based on the L1-norm criterion is employed to further lighten the fine-tuning process of the CD network. Experimental results show that the proposed OAIP-based method attains detection performance similar to state-of-the-art CD methods, but with significantly reduced computational complexity and memory usage.
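The L1-norm criterion mentioned for the filter-level pruning step is a standard recipe; a minimal sketch (the keep ratio is an illustrative assumption, and OAIP's adaptive threshold selection is not reproduced):

```python
import torch

def l1_filter_mask(conv_weight, keep_ratio=0.8):
    """conv_weight: (out_ch, in_ch, kH, kW); returns a per-filter keep mask."""
    scores = conv_weight.abs().sum(dim=(1, 2, 3))        # L1 norm per filter
    n_keep = max(1, int(len(scores) * keep_ratio))
    keep = torch.zeros_like(scores, dtype=torch.bool)
    keep[scores.topk(n_keep).indices] = True             # keep the strongest filters
    return keep
```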
https://arxiv.org/abs/2405.01920
Recent transformer-based ASR models have achieved word error rates (WER) below 4%, surpassing human annotator accuracy, yet they demand extensive server resources, contributing to significant carbon footprints. The traditional server-based architecture of ASR also presents privacy concerns, alongside reliability and latency issues due to network dependencies. In contrast, on-device (edge) ASR enhances privacy, boosts performance, and promotes sustainability by effectively balancing energy use and accuracy for specific applications. This study examines the effects of quantization, memory demands, and energy consumption on the inference performance of various ASR models on the NVIDIA Jetson Orin Nano. By analyzing WER and transcription speed across models using FP32, FP16, and INT8 quantization on clean and noisy datasets, we highlight the crucial trade-offs between accuracy, speed, quantization, energy efficiency, and memory needs. We found that changing precision from FP32 to FP16 halves the energy consumption for audio transcription across different models, with minimal performance degradation. A larger model size and parameter count neither guarantee better resilience to noise nor predict the energy consumption for a given transcription load. These findings, along with several others, offer novel insights for optimizing ASR systems within energy- and memory-limited environments, crucial for the development of efficient on-device ASR solutions. The code and input data needed to reproduce the results in this article are open source and available at [this https URL].
https://arxiv.org/abs/2405.01004
Whisper is a multitask and multilingual speech model covering 99 languages. It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we examine its limitations, demonstrating the presence of speaker-related (gender, age) and model-related (resourcefulness and model size) bias. Despite that, we show that only model-related biases are amplified by quantization, impacting low-resource languages and smaller models more. Searching for a better compression approach, we propose DistilWhisper, an approach that is able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2. This dual approach allows us to effectively boost ASR performance while keeping the robustness inherited from the multitask and multilingual pre-training. Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters, boosting performance in the targeted languages for both in- and out-of-domain test sets, while introducing only a negligible parameter overhead at inference.
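The knowledge-distillation half of the recipe follows the usual teacher-student pattern (whisper-large-v2 as teacher, whisper-small plus language experts as student); a generic sketch in which the temperature and weighting are illustrative assumptions, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft KL term against the teacher with the hard CE term."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T                                   # rescale gradients for temperature
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * kd + (1 - alpha) * ce
```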
https://arxiv.org/abs/2405.00966
Machine learning applications on extremely low-power devices, commonly referred to as tiny machine learning (TinyML), promise a smarter and more connected world. However, the advancement of current TinyML research is hindered by the limited size and quality of pertinent datasets. To address this challenge, we introduce Wake Vision, a large-scale, diverse dataset tailored for person detection -- the canonical task for TinyML visual sensing. Wake Vision comprises over 6 million images, a hundredfold increase over the previous standard, and has undergone thorough quality filtering. Using Wake Vision for training results in a 2.41% increase in accuracy compared to the established benchmark. Alongside the dataset, we provide a collection of five detailed benchmark sets that assess model performance on specific segments of the test data, such as varying lighting conditions, distances from the camera, and demographic characteristics of subjects. These novel fine-grained benchmarks facilitate the evaluation of model quality in challenging real-world scenarios that are often ignored when focusing solely on overall accuracy. Through an evaluation of a MobileNetV2 TinyML model on the benchmarks, we show that input resolution plays a more crucial role than model width in detecting distant subjects, and that the impact of quantization on model robustness is minimal, thanks to the dataset quality. These findings underscore the importance of a detailed evaluation to identify essential factors for model development. The dataset, benchmark suite, code, and models are publicly available under the CC-BY 4.0 license, enabling their use in commercial use cases.
https://arxiv.org/abs/2405.00892