The co-design of neural network architectures, quantization precisions, and hardware accelerators offers a promising approach to achieving an optimal balance between performance and efficiency, particularly for model deployment on resource-constrained edge devices. In this work, we propose the JAQ Framework, which jointly optimizes these three critical dimensions. However, effectively automating the design process across the vast search space of these three dimensions poses significant challenges, especially when pursuing extremely low-bit quantization. Specifically, the primary challenges include: (1) Memory overhead on the software side: Low-precision quantization-aware training can lead to significant memory usage, since large intermediate features and latent weights must be stored for back-propagation, potentially causing memory exhaustion. (2) Time-consuming search on the hardware side: The discrete nature of hardware parameters and the complex interplay between compiler optimizations and individual operators make the accelerator search time-consuming. To address these issues, JAQ mitigates the memory overhead through a channel-wise sparse quantization (CSQ) scheme, selectively applying quantization to the most sensitive components of the model during optimization. Additionally, JAQ designs BatchTile, which employs a hardware generation network to encode all possible tiling modes, thereby speeding up the search for the optimal compiler mapping strategy. Extensive experiments demonstrate the effectiveness of JAQ, achieving approximately 7% higher Top-1 accuracy on ImageNet compared to previous methods and reducing the hardware search time per iteration to 0.15 seconds.
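As a rough illustration of the channel-wise sparse quantization idea, the sketch below fake-quantizes only a chosen fraction of the most sensitive output channels and leaves the rest in full precision; the sensitivity scores, fraction, and function names are illustrative assumptions, not JAQ's actual implementation.

```python
import numpy as np

def fake_quant(x, bits):
    """Uniform symmetric fake-quantization of a tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax + 1e-12
    return np.round(x / scale).clip(-qmax, qmax) * scale

def channelwise_sparse_quant(weight, sensitivity, bits=4, frac=0.25):
    """Quantize only the most sensitive fraction of output channels.

    weight:      (out_channels, in_features) array
    sensitivity: per-channel sensitivity scores (higher = quantize first)
    frac:        fraction of channels to fake-quantize this step
    """
    out = weight.copy()
    k = max(1, int(frac * weight.shape[0]))
    chosen = np.argsort(sensitivity)[-k:]          # top-k sensitive channels
    for c in chosen:
        out[c] = fake_quant(weight[c], bits)       # other channels stay full precision
    return out

w = np.random.randn(64, 128).astype(np.float32)
s = np.random.rand(64)                             # placeholder sensitivity scores
w_q = channelwise_sparse_quant(w, s, bits=3)
```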
https://arxiv.org/abs/2501.05339
We propose an efficient knowledge transfer approach for model-based reinforcement learning, addressing the challenge of deploying large world models in resource-constrained environments. Our method distills a high-capacity multi-task agent (317M parameters) into a compact 1M parameter model, achieving state-of-the-art performance on the MT30 benchmark with a normalized score of 28.45, a substantial improvement over the original 1M parameter model's score of 18.93. This demonstrates the ability of our distillation technique to consolidate complex multi-task knowledge effectively. Additionally, we apply FP16 post-training quantization, reducing the model size by 50% while maintaining performance. Our work bridges the gap between the power of large models and practical deployment constraints, offering a scalable solution for efficient and accessible multi-task reinforcement learning in robotics and other resource-limited domains.
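A minimal numpy sketch of the distill-then-quantize recipe described above: a small student is trained to match a large teacher's outputs, then its weights are cast to FP16 for a 2x storage reduction. The linear "world models", shapes, and learning rate are stand-ins, not the paper's actual agent architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "teacher" (large) and "student" (small) linear world models.
teacher_W = rng.standard_normal((512, 64)).astype(np.float32)
student_W = rng.standard_normal((32, 64)).astype(np.float32) * 0.01
head_W    = rng.standard_normal((512, 32)).astype(np.float32) * 0.01  # maps student -> teacher dim

lr = 1e-3
for step in range(200):
    x = rng.standard_normal((128, 64)).astype(np.float32)   # batch of latent states
    t = x @ teacher_W.T                                      # teacher targets
    h = x @ student_W.T
    y = h @ head_W.T
    err = y - t                                              # distillation (MSE) error
    # manual gradients of 0.5 * mean squared error
    g_head = err.T @ h / len(x)
    g_stud = (err @ head_W).T @ x / len(x)
    head_W -= lr * g_head
    student_W -= lr * g_stud

# FP16 post-training quantization: halve storage while keeping the same values at inference.
student_fp16 = student_W.astype(np.float16)
print(student_W.nbytes, "->", student_fp16.nbytes, "bytes")
```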
https://arxiv.org/abs/2501.05329
Although Application-Specific Integrated Circuits (ASICs) have proven to be a relevant choice for integrating inference at the edge, they are often limited in terms of applicability. In this paper, we demonstrate that an ASIC neural network accelerator dedicated to image processing can be applied to multiple tasks of different levels: image classification and compression, while requiring very limited hardware. The key component is a reconfigurable, mixed-precision (3b/2b/1b) encoder that takes advantage of proper weight and activation quantizations combined with convolutional layer structural pruning to lower hardware-related constraints (memory and computing). We introduce an automatic adaptation of linear symmetric quantizer scaling factors to perform quantized levels equalization, aiming at stabilizing quinary and ternary weights training. In addition, the proposed layer-shared Bit-Shift Normalization significantly simplifies the implementation of the hardware-expensive Batch Normalization. For a specific configuration in which the encoder design only requires 1Mb, the classification accuracy reaches 87.5% on CIFAR-10. Besides, we also show that this quantized encoder can be used to compress images patch-by-patch while the reconstruction can be performed remotely, by a dedicated full-frame decoder. This solution typically enables an end-to-end compression almost without any block artifacts, outperforming patch-based state-of-the-art techniques employing a patch-constant bitrate.
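The layer-shared Bit-Shift Normalization can be pictured as replacing Batch Normalization's per-channel multiply with a single power-of-two scale per layer, implemented as an arithmetic shift. The sketch below is a hedged approximation with made-up statistics, not the paper's hardware design.

```python
import numpy as np

def bitshift_norm(x, shift):
    """Scale activations by 2**(-shift) via an arithmetic right shift on
    integer tensors, as a hardware-friendly stand-in for BatchNorm scaling."""
    return x >> shift if np.issubdtype(x.dtype, np.integer) else x * 2.0 ** (-shift)

def choose_layer_shift(x):
    """Pick one shared power-of-two that roughly matches the layer's spread."""
    return int(np.clip(np.round(np.log2(x.std() + 1e-12)), 0, 15))

acts = np.random.randint(-512, 512, size=(8, 16, 32, 32))   # integer accumulator outputs
s = choose_layer_shift(acts.astype(np.float32))
normed = bitshift_norm(acts, s)                              # one shift shared by the whole layer
```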
https://arxiv.org/abs/2501.05097
Extracted discrete tokens provide efficient and domain-adaptable speech features. Their application to disordered speech, which exhibits articulation imprecision and a large mismatch against normal voice, remains unexplored. To improve their phonetic discrimination, which is weakened during unsupervised K-means or vector quantization of continuous features, this paper proposes novel phone-purity guided (PPG) discrete tokens for dysarthric speech recognition. Phonetic label supervision is used to regularize the maximum likelihood and reconstruction error costs used in standard K-means and VAE-VQ based discrete token extraction. Experiments conducted on the UASpeech corpus suggest that the proposed PPG discrete token features extracted from HuBERT consistently outperform hybrid TDNN and End-to-End (E2E) Conformer systems using non-PPG based K-means or VAE-VQ tokens across varying codebook sizes, with statistically significant word error rate (WER) reductions of up to 0.99% and 1.77% absolute (3.21% and 4.82% relative) respectively on the UASpeech test set of 16 dysarthric speakers. The lowest WER of 23.25% was obtained by combining systems using different token features. Consistent improvements on the phone purity metric were also achieved. T-SNE visualization further demonstrates that sharper decision boundaries are produced between K-means/VAE-VQ clusters after introducing phone-purity guidance.
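One plausible reading of phone-purity guidance for K-means is to penalize assigning a frame to a cluster whose dominant phone label disagrees with the frame's label. The sketch below implements that reading only; it is not the paper's exact regularized objective (which also covers VAE-VQ), and the data, dimensions, and penalty weight are placeholders.

```python
import numpy as np

def ppg_kmeans(feats, phones, k=8, lam=0.5, iters=10, seed=0):
    """K-means whose assignment cost is distance plus a penalty for joining a
    cluster whose dominant phone label differs (a hedged approximation of
    phone-purity guidance)."""
    rng = np.random.default_rng(seed)
    cent = feats[rng.choice(len(feats), k, replace=False)]
    cluster_phone = rng.integers(phones.max() + 1, size=k)
    for _ in range(iters):
        d = ((feats[:, None, :] - cent[None]) ** 2).sum(-1)
        penalty = lam * (phones[:, None] != cluster_phone[None, :])
        assign = (d + penalty).argmin(1)
        for c in range(k):
            m = assign == c
            if m.any():
                cent[c] = feats[m].mean(0)
                vals, cnt = np.unique(phones[m], return_counts=True)
                cluster_phone[c] = vals[cnt.argmax()]    # majority phone per cluster
    return cent, assign

X = np.random.randn(500, 16).astype(np.float32)
labels = np.random.randint(0, 40, size=500)              # fake phone labels
centroids, tokens = ppg_kmeans(X, labels)
```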
https://arxiv.org/abs/2501.04379
Despite the widespread use of text-to-image diffusion models across various tasks, their computational and memory demands limit practical applications. To mitigate this issue, quantization of diffusion models has been explored. It reduces memory usage and computational costs by compressing weights and activations into lower-bit formats. However, existing methods often struggle to preserve both image quality and text-image alignment, particularly in lower-bit (<8-bit) quantization. In this paper, we analyze the challenges associated with quantizing text-to-image diffusion models from a distributional perspective. Our analysis reveals that activation outliers play a crucial role in determining image quality. Additionally, we identify distinctive patterns in cross-attention scores, which significantly affect text-image alignment. To address these challenges, we propose Distribution-aware Group Quantization (DGQ), a method that identifies and adaptively handles pixel-wise and channel-wise outliers to preserve image quality. Furthermore, DGQ applies prompt-specific logarithmic quantization scales to maintain text-image alignment. Our method demonstrates remarkable performance on datasets such as MS-COCO and PartiPrompts. We are the first to successfully achieve low-bit quantization of text-to-image diffusion models without requiring additional fine-tuning of weight quantization parameters.
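A hedged sketch of the two mechanisms named above: channel-wise outliers get their own quantization scale so they do not inflate the shared scale of the remaining channels, and attention scores are quantized on a logarithmic (power-of-two) grid. Thresholds and group shapes are illustrative, not the released DGQ configuration.

```python
import numpy as np

def group_quant_with_outliers(act, bits=8, z=4.0):
    """Per-tensor 8-bit quantization where outlier channels are split into
    their own group with per-channel scales."""
    qmax = 2 ** (bits - 1) - 1
    absmax = np.abs(act).max(axis=0)                       # per-channel max
    outlier = absmax > z * np.median(absmax)               # crude outlier test
    shared = np.max(absmax[~outlier], initial=1e-6)        # scale for "normal" channels
    scales = np.maximum(np.where(outlier, absmax, shared), 1e-12) / qmax
    q = np.clip(np.round(act / scales), -qmax, qmax)
    return q.astype(np.int8), scales, outlier

def log2_quant(scores, bits=4):
    """Power-of-two quantization of (positive) attention scores."""
    lo = -(2 ** bits - 1)
    e = np.clip(np.round(np.log2(np.maximum(scores, 1e-12))), lo, 0)
    return 2.0 ** e

A = np.random.randn(64, 32).astype(np.float32)
A[:, 3] *= 50                                              # inject an outlier channel
q, s, mask = group_quant_with_outliers(A)
probs = np.random.dirichlet(np.ones(16), size=4)           # fake attention rows
probs_q = log2_quant(probs)
```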
https://arxiv.org/abs/2501.04304
To enhance perception in autonomous vehicles (AVs), recent efforts are concentrating on 3D object detectors, which deliver more comprehensive predictions than traditional 2D object detectors, at the cost of increased memory footprint and computational resource usage. We present a novel framework called UPAQ, which leverages semi-structured pattern pruning and quantization to improve the efficiency of LiDAR point-cloud and camera-based 3D object detectors on resource-constrained embedded AV platforms. Experimental results on the Jetson Orin Nano embedded platform indicate that UPAQ achieves up to 5.62x and 5.13x model compression rates, up to 1.97x and 1.86x boost in inference speed, and up to 2.07x and 1.87x reduction in energy consumption compared to state-of-the-art model compression frameworks, on the Pointpillar and SMOKE models respectively.
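Semi-structured pattern pruning can be illustrated as choosing, for every small group of weights, the predefined sparsity pattern that preserves the most magnitude. The 2-out-of-4 patterns below are an assumed example, not UPAQ's actual pattern set.

```python
import numpy as np

# Predefined 2:4-style patterns over groups of 4 weights (semi-structured sparsity).
PATTERNS = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1],
                     [0, 1, 1, 0], [0, 1, 0, 1], [0, 0, 1, 1]], dtype=np.float32)

def pattern_prune(w):
    """Keep, in every group of 4 weights, the predefined pattern that preserves
    the most magnitude (a hedged stand-in for UPAQ's pruning step)."""
    flat = w.reshape(-1, 4)
    kept = np.abs(flat) @ PATTERNS.T            # magnitude preserved by each pattern
    best = kept.argmax(1)
    return (flat * PATTERNS[best]).reshape(w.shape)

w = np.random.randn(64, 64).astype(np.float32)  # weight count divisible by 4
w_pruned = pattern_prune(w)
```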
https://arxiv.org/abs/2501.04213
This paper presents a novel mixed-precision quantization approach for speech foundation models that tightly integrates mixed-precision learning and quantized model parameter estimation into one single model compression stage. Experiments conducted on the LibriSpeech dataset with fine-tuned wav2vec2.0-base and HuBERT-large models suggest the resulting mixed-precision quantized models increased the lossless compression ratio by factors of up to 1.7x and 1.9x over the respective uniform-precision and two-stage mixed-precision quantized baselines that perform precision learning and model parameter quantization in separate and disjoint stages, while incurring no statistically significant word error rate (WER) increase over the 32-bit full-precision models. The system compression time of the wav2vec2.0-base and HuBERT-large models is reduced by up to 1.9 and 1.5 times over the two-stage mixed-precision baselines, while both produce lower WERs. The best-performing 3.5-bit mixed-precision quantized HuBERT-large model produces a lossless compression ratio of 8.6x over the 32-bit full-precision system.
https://arxiv.org/abs/2501.03643
Large language models (LLMs) have demonstrated remarkable performance across various machine learning tasks, quickly becoming one of the most prevalent AI workloads. Yet the substantial memory requirement of LLMs significantly hinders their deployment for end users. Post-training quantization (PTQ) serves as one of the most hardware-efficient methods to mitigate the memory and computational demands of LLMs. Although the traditional integer (INT) datatype has received widespread adoption in PTQ methods, floating-point (FP) quantization has emerged as a viable alternative thanks to its effectiveness in fitting LLM numerical distributions. However, the FP datatype in sign-magnitude binary representation contains both positive and negative zero, which constrains its representation capability, particularly under low precision (3 and 4 bits). In this paper, we extend the basic FP datatype to perform Redundant Zero Remapping (RaZeR), which remaps the negative zero FP encoding to a set of pre-defined special values to maximally utilize FP quantization encodings and to better fit LLM numerical distributions. Through careful selection of special values, RaZeR outperforms conventional asymmetric INT quantization while achieving high computational efficiency. We demonstrate that RaZeR can be seamlessly integrated with quantization algorithms for both weights and KV-cache, including advanced methods with clipping and transformations, and consistently achieve better model accuracy. Additionally, we implement a fast GEMV kernel with fused dequantization that efficiently converts the 4-bit RaZeR value to FP16 through novel bit-level manipulation. On modern GPUs, our evaluation shows that RaZeR improves the GEMV speed by up to 7.56$\times$ compared to the FP16 implementation, while achieving up to 2.72$\times$ speedup in the LLM decoding throughput.
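The remapping itself is simple to picture: a 4-bit sign-magnitude floating-point code wastes one pattern on negative zero, so that pattern can instead decode to a special value chosen per group to minimize error. The E2M1 levels and candidate special values below are illustrative choices, not RaZeR's calibrated set.

```python
import numpy as np

# Nominal FP4 (E2M1) magnitude levels; the sign bit gives the negative half.
FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def razer_codebook(special):
    """16-entry decode table in which the redundant negative-zero code is
    remapped to a chosen special value."""
    neg = -FP4_POS.copy()
    neg[0] = special                       # the '-0' encoding now carries `special`
    return np.concatenate([FP4_POS, neg])

def quantize(x, codebook, scale):
    idx = np.abs(x[:, None] / scale - codebook[None]).argmin(1)
    return codebook[idx] * scale

w = np.random.randn(256).astype(np.float32)
scale = np.abs(w).max() / 6.0
candidates = [-8.0, -5.0, 2.5, 5.0]        # illustrative special-value options
errs = [np.mean((quantize(w, razer_codebook(s), scale) - w) ** 2) for s in candidates]
special = candidates[int(np.argmin(errs))]  # pick the value that minimizes group error
w_q = quantize(w, razer_codebook(special), scale)
```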
https://arxiv.org/abs/2501.04052
Deep neural networks must store millions to billions of weights in memory after training, which makes such memory-intensive models challenging to deploy on embedded devices. Weight sharing is a popular compression approach that uses fewer weight values and shares them across specific connections in the network. In this paper, we propose a multi-objective evolutionary algorithm (MOEA) based compression framework that is independent of neural network architecture, dimension, task, and dataset. We use uniformly sized bins to quantize network weights into a single codebook (lookup table) for efficient weight representation. Using MOEA, we search for Pareto-optimal $k$ bins by optimizing two objectives. Then, we apply an iterative merge technique to non-dominated Pareto-frontier solutions, combining neighboring bins without degrading performance to decrease the number of bins and increase the compression ratio. Our approach is model- and layer-independent, meaning the weights are mixed in the clusters from any layer, and the uniform quantization method used in this work has $O(N)$ complexity, in contrast to non-uniform quantization methods such as k-means with $O(Nkt)$ complexity. In addition, we use the cluster centers as the shared weight values instead of retraining shared weights, which is computationally expensive. The advantage of using evolutionary multi-objective optimization is that it can obtain non-dominated Pareto-frontier solutions with respect to performance and shared weights. The experimental results show that we can reduce neural network memory by $13.72\sim14.98\times$ on CIFAR-10, $11.61\sim12.99\times$ on CIFAR-100, and $7.44\sim8.58\times$ on ImageNet, showcasing the effectiveness of the proposed deep neural network compression framework.
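A hedged sketch of the two mechanics the abstract names, leaving the MOEA search itself out: O(N) uniform binning of weights into a single codebook, followed by greedy merging of neighboring bins while a simple reconstruction-error proxy stays within a tolerance. Bin count and tolerance are illustrative.

```python
import numpy as np

def uniform_codebook(w, k):
    """O(N) uniform binning: k equal-width bins, shared value = bin center."""
    lo, hi = w.min(), w.max()
    edges = np.linspace(lo, hi, k + 1)
    idx = np.clip(np.digitize(w, edges) - 1, 0, k - 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return idx, centers

def merge_neighbors(idx, centers, w, tol=1e-3):
    """Greedily merge adjacent bins while reconstruction error stays below tol."""
    idx, centers = idx.copy(), list(centers)
    i = 0
    while i < len(centers) - 1:
        trial = centers[:i] + [(centers[i] + centers[i + 1]) / 2] + centers[i + 2:]
        t_idx = np.where(idx > i, idx - 1, idx)          # bins i and i+1 collapse into i
        err = np.mean((np.array(trial)[t_idx] - w) ** 2)
        if err <= tol:
            idx, centers = t_idx, trial                  # accept the merge
        else:
            i += 1
    return idx, np.array(centers)

w = np.random.randn(10000).astype(np.float32)
idx, centers = uniform_codebook(w, 128)
idx, centers = merge_neighbors(idx, centers, w)
print(len(centers), "shared weight values remain")
```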
https://arxiv.org/abs/2501.03095
Large language models have achieved significant advancements in complex mathematical reasoning benchmarks, such as MATH. However, their substantial computational requirements present challenges for practical deployment. Model quantization has emerged as an effective strategy to reduce memory usage and computational costs by employing lower precision and bit-width representations. In this study, we systematically evaluate the impact of quantization on mathematical reasoning tasks. We introduce a multidimensional evaluation framework that qualitatively assesses specific capability dimensions and conduct quantitative analyses on the step-by-step outputs of various quantization methods. Our results demonstrate that quantization differentially affects numerical computation and reasoning planning abilities, identifying key areas where quantized models experience performance degradation.
https://arxiv.org/abs/2501.03035
The rising demand for generative large language models (LLMs) poses challenges for thermal and power management in cloud datacenters. Traditional techniques are often inadequate for LLM inference due to the fine-grained, millisecond-scale execution phases, each with distinct performance, thermal, and power profiles. Additionally, LLM inference workloads are sensitive to various configuration parameters (e.g., model parallelism, size, and quantization) that involve trade-offs between performance, temperature, power, and output quality. Moreover, clouds often co-locate SaaS and IaaS workloads, each with different levels of visibility and flexibility. We propose TAPAS, a thermal- and power-aware framework designed for LLM inference clusters in the cloud. TAPAS enhances cooling and power oversubscription capabilities, reducing the total cost of ownership (TCO) while effectively handling emergencies (e.g., cooling and power failures). The system leverages historical temperature and power data, along with the adaptability of SaaS workloads, to: (1) efficiently place new GPU workload VMs within cooling and power constraints, (2) route LLM inference requests across SaaS VMs, and (3) reconfigure SaaS VMs to manage load spikes and emergency situations. Our evaluation on a large GPU cluster demonstrates significant reductions in thermal and power throttling events, boosting system efficiency.
https://arxiv.org/abs/2501.02600
A broad range of technologies rely on remote inference, wherein acquired data is conveyed over a communication channel for inference in a remote server. Communication between the participating entities is often carried out over rate-limited channels, necessitating data compression for reducing latency. While deep learning facilitates joint design of the compression mapping along with encoding and inference rules, existing learned compression mechanisms are static and struggle to adapt their resolution to changes in channel conditions and to dynamic links. To address this, we propose Adaptive Rate Task-Oriented Vector Quantization (ARTOVeQ), a learned compression mechanism that is tailored for remote inference over dynamic links. ARTOVeQ is based on designing nested codebooks along with a learning algorithm employing progressive learning. We show that ARTOVeQ extends to support low-latency inference that is gradually refined via successive refinement principles, and that it enables the simultaneous usage of multiple resolutions when conveying high-dimensional data. Numerical results demonstrate that the proposed scheme yields remote deep inference that operates with multiple rates, supports a broad range of bit budgets, and facilitates rapid inference that gradually improves with more bits exchanged, while approaching the performance of single-rate deep quantization methods.
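Nested codebooks can be sketched as using only a prefix of one shared codebook at low rates, so the decoder's estimate is refined (never worsened) as more bits arrive. The random codebook and sizes below are placeholders, not the learned ARTOVeQ codebooks, and the progressive training procedure is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
full_codebook = rng.standard_normal((64, 8)).astype(np.float32)   # 64 codewords, dim 8

def nested_quantize(x, codebook, bits):
    """Quantize with only the first 2**bits codewords (a nested prefix)."""
    sub = codebook[: 2 ** bits]
    idx = ((x[:, None, :] - sub[None]) ** 2).sum(-1).argmin(1)
    return idx, sub[idx]

x = rng.standard_normal((32, 8)).astype(np.float32)
for bits in (2, 4, 6):                     # the decoder refines as the rate increases
    idx, x_hat = nested_quantize(x, full_codebook, bits)
    print(bits, "bits -> distortion", float(np.mean((x - x_hat) ** 2)))
```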
https://arxiv.org/abs/2501.02521
Low-precision training is considered an effective strategy for reducing both training and downstream inference costs. Previous scaling laws for precision mainly focus on integer quantization, paying less attention to the constituents of floating-point quantization, and thus cannot fit the LLM losses well in this scenario. In contrast, while floating-point quantization training is more commonly implemented in production, the research on it has been relatively superficial. In this paper, we thoroughly explore the effects of floating-point quantization targets, exponent bits, mantissa bits, and the calculation granularity of the scaling factor on the floating-point quantization training performance of LLM models. In addition to presenting an accurate unified scaling law for floating-point quantization, we also provide valuable suggestions for the community: (1) Exponent bits contribute slightly more to the model performance than mantissa bits. We provide the optimal exponent-mantissa bit ratio for different bit widths, which can serve as a reference for hardware manufacturers. (2) We identify the formation of a critical data size in low-precision LLM training: training data exceeding the critical data size will instead degrade LLM performance. (3) The optimal floating-point quantization precision is directly proportional to the computational power, but within a wide computational power range, we estimate that the best cost-performance precision lies between 4 and 8 bits.
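To make the exponent/mantissa trade-off concrete, the helper below simulates rounding to a (sign, exponent, mantissa) format and compares different 4-bit splits on a Gaussian tensor. The bias convention and clipping are simplifying assumptions, not the paper's exact quantizer or scaling-law fit.

```python
import numpy as np

def fp_quantize(x, e_bits, m_bits, scale=1.0):
    """Simulate rounding x/scale to a (1, e_bits, m_bits) floating-point format."""
    x = np.asarray(x, dtype=np.float64) / scale
    bias = 2 ** (e_bits - 1) - 1 if e_bits > 1 else 0
    max_e = 2 ** e_bits - 1 - bias                    # all exponent codes used for normals
    min_e = -bias
    mag = np.abs(x)
    e = np.clip(np.floor(np.log2(np.maximum(mag, 2.0 ** min_e))), min_e, max_e)
    step = 2.0 ** (e - m_bits)                        # mantissa spacing at that exponent
    max_val = (2.0 - 2.0 ** -m_bits) * 2.0 ** max_e   # largest representable magnitude
    q = np.sign(x) * np.minimum(np.round(mag / step) * step, max_val)
    return q * scale

w = np.random.randn(4096)
for e, m in [(1, 2), (2, 1), (3, 0)]:                 # three possible 4-bit splits
    err = np.mean((fp_quantize(w, e, m) - w) ** 2)
    print(f"E{e}M{m}: mse={err:.5f}")
```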
https://arxiv.org/abs/2501.02423
We propose a holistic approach for deploying Small Language Models (SLMs) as function-calling agents within vehicles as edge devices, offering a more flexible and robust alternative to traditional rule-based systems. By leveraging SLMs, we simplify vehicle control mechanisms and enhance the user experience. Given the in-vehicle hardware constraints, we apply state-of-the-art model compression techniques, including structured pruning, healing, and quantization, ensuring that the model fits within the resource limitations while maintaining acceptable performance. Our work focuses on optimizing a representative SLM, Microsoft's Phi-3 mini, and outlines best practices for enabling embedded models, including compression, task-specific fine-tuning, and vehicle integration. We demonstrate that, despite a significant reduction in model size that removes up to 2 billion parameters from the original model, our approach preserves the model's ability to handle complex in-vehicle tasks accurately and efficiently. Furthermore, by executing the model in a lightweight runtime environment, we achieve a generation speed of 11 tokens per second, making real-time, on-device inference feasible without hardware acceleration. Our results demonstrate the potential of SLMs to transform vehicle control systems, enabling more intuitive interactions between users and their vehicles for an enhanced driving experience.
https://arxiv.org/abs/2501.02342
The emergence of 5G and edge computing hardware has brought about a significant shift in artificial intelligence, with edge AI becoming a crucial technology for enabling intelligent applications. With the growing amount of data generated and stored on edge devices, deploying AI models for local processing and inference has become increasingly necessary. However, deploying state-of-the-art AI models on resource-constrained edge devices faces significant challenges that must be addressed. This paper presents an optimization triad for efficient and reliable edge AI deployment, including data, model, and system optimization. First, we discuss optimizing data through data cleaning, compression, and augmentation to make it more suitable for edge deployment. Second, we explore model design and compression methods at the model level, such as pruning, quantization, and knowledge distillation. Finally, we introduce system optimization techniques like framework support and hardware acceleration to accelerate edge AI workflows. Based on an in-depth analysis of various application scenarios and deployment challenges of edge AI, this paper proposes an optimization paradigm based on the data-model-system triad to enable a whole set of solutions to effectively transfer ML models, which are initially trained in the cloud, to various edge devices for supporting multiple scenarios.
https://arxiv.org/abs/2501.03265
Cloud gaming is an advanced form of Internet service that necessitates local terminals to decode within limited resources and time latency. Super-Resolution (SR) techniques are often employed on these terminals as an efficient way to reduce the required bit-rate bandwidth for cloud gaming. However, insufficient attention has been paid to SR of compressed game video content. Most SR networks amplify block artifacts and ringing effects in decoded frames while ignoring edge details of game content, leading to unsatisfactory reconstruction results. In this paper, we propose a novel lightweight network called Coding Prior-Guided Super-Resolution (CPGSR) to address the SR challenges in compressed game video content. First, we design a Compressed Domain Guided Block (CDGB) to extract features of different depths from coding priors, which are subsequently integrated with features from the U-net backbone. Then, a series of re-parameterization blocks are utilized for reconstruction. Ultimately, inspired by the quantization in video coding, we propose a partitioned focal frequency loss to effectively guide the model's focus on preserving high-frequency information. Extensive experiments demonstrate the advantages of our approach.
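The partitioned focal frequency loss can be approximated as splitting the error spectrum into radial bands and up-weighting bands that carry larger error. The sketch below is only that approximation; the band count, weighting, and partitioning scheme are chosen for illustration rather than taken from CPGSR.

```python
import numpy as np

def partitioned_frequency_loss(pred, target, n_bands=4):
    """Hedged sketch of a partitioned, focal-style frequency loss: split the 2D
    error spectrum into radial bands and weight each band by its own relative
    error magnitude, emphasizing hard (typically high-frequency) bands."""
    F = np.fft.fftshift(np.fft.fft2(pred - target))
    err = np.abs(F) ** 2
    h, w = err.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    edges = np.linspace(0, r.max() + 1e-6, n_bands + 1)
    loss = 0.0
    for i in range(n_bands):
        m = (r >= edges[i]) & (r < edges[i + 1])
        if m.any():
            band = err[m].mean()
            loss += band * (band / (err.mean() + 1e-12))   # focal-style weight
    return loss

pred = np.random.rand(64, 64)
target = np.random.rand(64, 64)
print(partitioned_frequency_loss(pred, target))
```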
https://arxiv.org/abs/2501.01773
End-to-end image and video codecs are becoming increasingly competitive, compared to traditional compression techniques that have been developed through decades of manual engineering efforts. These trainable codecs have many advantages over traditional techniques, such as their straightforward adaptation to perceptual distortion metrics and high performance in specific fields thanks to their learning ability. However, current state-of-the-art neural codecs do not fully exploit the benefits of vector quantization and the existence of the entropy gradient in decoding devices. In this paper, we propose to leverage these two properties (vector quantization and entropy gradient) to improve the performance of off-the-shelf codecs. Firstly, we demonstrate that using non-uniform scalar quantization cannot improve performance over uniform quantization. We thus suggest using predefined optimal uniform vector quantization to improve performance. Secondly, we show that the entropy gradient, available at the decoder, is correlated with the reconstruction error gradient, which is not available at the decoder. We therefore use the former as a proxy to enhance compression performance. Our experimental results show that these approaches save between 1% and 3% of the rate for the same quality across various pretrained methods. In addition, the entropy gradient based solution improves traditional codec performance significantly as well.
https://arxiv.org/abs/2501.01231
Large Language Models (LLMs) have achieved remarkable success, but their increasing size poses significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with fine-grained block-wise quantization emerging as a promising hardware-supported solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. To address this, we propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. Importantly, DialectFP4 ensures hardware efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. Furthermore, we propose a two-stage approach for online DialectFP4 activation quantization. BlockDialect achieves an 11.40% (6.90%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to the MXFP4 format with comparable bit usage per value, while being only 5.89% (3.31%) below full precision even when quantizing full-path matrix multiplication. By focusing on how to represent rather than how to scale, our work presents a promising path for energy-efficient LLM inference.
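The per-block format selection reduces to trying each dialect in the formatbook on a block of values and keeping the one with the lowest quantization error. The three level sets below are made-up stand-ins for DialectFP4's actual FP4 variants, and the block size is arbitrary.

```python
import numpy as np

# A made-up "formatbook": each dialect is a set of 4-bit representable levels
# (scaled-integer-friendly values), standing in for FP4 variants with different spacings.
FORMATBOOK = {
    "e2m1":  np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6]),
    "dense": np.array([0, 0.25, 0.5, 0.75, 1, 1.5, 2, 3]),
    "wide":  np.array([0, 1, 2, 3, 4, 6, 8, 12]),
}
FORMATBOOK = {k: np.concatenate([-v[::-1], v]) for k, v in FORMATBOOK.items()}

def quantize_block(block, levels):
    scale = np.abs(block).max() / np.abs(levels).max() + 1e-12
    idx = np.abs(block[:, None] / scale - levels[None]).argmin(1)
    return levels[idx] * scale

def blockwise_mixed_format(x, block=32):
    out, chosen = np.empty_like(x), []
    for i in range(0, len(x), block):
        b = x[i:i + block]
        errs = {name: np.mean((quantize_block(b, lv) - b) ** 2)
                for name, lv in FORMATBOOK.items()}
        name = min(errs, key=errs.get)          # best dialect for this block
        chosen.append(name)
        out[i:i + block] = quantize_block(b, FORMATBOOK[name])
    return out, chosen

w = np.random.randn(256).astype(np.float32)
w_q, dialects = blockwise_mixed_format(w)
```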
https://arxiv.org/abs/2501.01144
Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation learning model for music understanding. Distinguished from previous studies adopting random projection or existing neural codecs, the proposed model, named MuQ, is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Our Mel-RVQ utilizes a residual linear projection structure for Mel spectrum quantization to enhance the stability and efficiency of target extraction and lead to better performance. Experiments on a large variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models with only 0.9K hours of open-source pre-training data. Scaling up the data to over 160K hours and adopting iterative training consistently improve the model performance. To further validate the strength of our model, we present MuQ-MuLan, a joint music-text embedding model based on contrastive learning, which achieves state-of-the-art performance in the zero-shot music tagging task on the MagnaTagATune dataset. Code and checkpoints are open-sourced at this https URL.
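Mel-RVQ's target extraction can be pictured as residual vector quantization over Mel-spectrogram frames: each stage quantizes what the previous stages left behind. The sketch below uses plain codebook lookups with random codebooks and omits the residual linear-projection structure the paper describes; sizes are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stages, codebook_size, dim = 4, 256, 80             # e.g. 80-dim Mel frames
codebooks = rng.standard_normal((n_stages, codebook_size, dim)).astype(np.float32)

def rvq_encode(frames, codebooks):
    """Residual VQ: each stage quantizes the residual left by previous stages."""
    residual = frames.copy()
    tokens = []
    for cb in codebooks:
        d = ((residual[:, None, :] - cb[None]) ** 2).sum(-1)
        idx = d.argmin(1)
        tokens.append(idx)
        residual = residual - cb[idx]                  # pass the residual to the next stage
    return np.stack(tokens, axis=1)                    # (frames, n_stages) token ids

def rvq_decode(tokens, codebooks):
    return sum(cb[tok] for cb, tok in zip(codebooks, tokens.T))

mel = rng.standard_normal((100, dim)).astype(np.float32)
toks = rvq_encode(mel, codebooks)
recon = rvq_decode(toks, codebooks)
```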
https://arxiv.org/abs/2501.01108
Diffusion models (DMs) have demonstrated remarkable achievements in synthesizing images of high fidelity and diversity. However, the extensive computational requirements and slow generative speed of diffusion models have limited their widespread adoption. In this paper, we propose a novel post-training quantization method for diffusion models (PQD), a time-aware optimization framework based on post-training quantization. The proposed framework optimizes the inference process by selecting representative samples and conducting time-aware calibration. Experimental results show that our proposed method can directly quantize full-precision diffusion models into 8-bit or 4-bit models while maintaining comparable performance in a training-free manner, incurring only a small change in FID on ImageNet for unconditional image generation. Our approach is also compatible with, and applied for the first time to, 512x512 text-guided image generation.
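Time-aware calibration can be sketched as collecting activation statistics from representative timesteps spread across the sampling schedule, then fixing one symmetric 8-bit scale per quantized tensor. The stub model, schedule, and sample counts below are assumptions for illustration, not PQD's actual calibration procedure.

```python
import numpy as np

def fake_unet(x, t):
    """Stand-in for a diffusion U-Net activation whose scale varies with timestep."""
    return np.tanh(x) * (1.0 + t / 1000.0)

def time_aware_calibrate(sampler_steps, n_samples=16, bits=8, seed=0):
    """Gather calibration activations across the timestep schedule and derive
    one symmetric scale."""
    rng = np.random.default_rng(seed)
    picked = np.linspace(0, len(sampler_steps) - 1, n_samples).astype(int)
    absmax = 0.0
    for i in picked:                                    # representative timesteps
        x = rng.standard_normal((4, 64)).astype(np.float32)
        absmax = max(absmax, float(np.abs(fake_unet(x, sampler_steps[i])).max()))
    return absmax / (2 ** (bits - 1) - 1)

steps = np.arange(0, 1000, 20)                          # DDIM-like schedule
scale = time_aware_calibrate(steps)
def quant(x): return np.clip(np.round(x / scale), -127, 127).astype(np.int8)
```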
https://arxiv.org/abs/2501.00124