Post-training quantization (PTQ) is a popular method for compressing deep neural networks (DNNs) without modifying their original architecture or training procedures. Despite its effectiveness and convenience, the reliability of PTQ methods in extreme cases such as distribution shift and data noise remains largely unexplored. This paper first investigates this problem across various commonly used PTQ methods. We aim to answer several research questions on how calibration-set distribution variations, calibration paradigm selection, and data augmentation or sampling strategies affect PTQ reliability. A systematic evaluation is conducted across a wide range of tasks and commonly used PTQ paradigms. The results show that most existing PTQ methods are not reliable enough in terms of worst-case group performance, highlighting the need for more robust methods. Our findings provide insights for developing PTQ methods that can effectively handle distribution shift and enable the deployment of quantized DNNs in real-world applications.
https://arxiv.org/abs/2303.13003
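The paper's reliability metric above is worst-case group performance. As a minimal illustration (not the paper's evaluation code), the sketch below computes overall versus worst-group accuracy for a quantized model's predictions, assuming per-sample group labels (e.g., corruption type or domain) are available; all names and data are placeholders.

```python
import numpy as np

def worst_group_accuracy(predictions, labels, groups):
    """Return overall accuracy, worst-case group accuracy, and per-group accuracy.

    predictions, labels, groups: 1-D integer arrays of equal length, where
    `groups` assigns each sample to a distribution-shift group.
    """
    predictions, labels, groups = map(np.asarray, (predictions, labels, groups))
    overall = float((predictions == labels).mean())
    per_group = {
        int(g): float((predictions[groups == g] == labels[groups == g]).mean())
        for g in np.unique(groups)
    }
    return overall, min(per_group.values()), per_group

# Toy usage: a model that fails on group 2 looks fine on average but not in the worst case.
preds  = np.array([0, 1, 1, 0, 1, 0])
labels = np.array([0, 1, 1, 1, 0, 1])
groups = np.array([0, 0, 1, 2, 2, 2])
print(worst_group_accuracy(preds, labels, groups))
```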
We introduce LMCodec, a causal neural speech codec that provides high-quality audio at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes and is used to perform conditional entropy coding. A MUSHRA subjective test shows that the quality is comparable to reference codecs at higher bitrates. Example audio is available at this https URL.
https://arxiv.org/abs/2303.12984
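A toy sketch of the residual vector quantization that produces LMCodec's coarse-to-fine token hierarchy, with random codebooks and illustrative sizes rather than the codec's actual configuration: each stage picks the nearest codeword and passes the remaining residual to the next, finer stage.

```python
import numpy as np

rng = np.random.default_rng(0)
num_stages, codebook_size, dim = 4, 256, 64
codebooks = rng.normal(size=(num_stages, codebook_size, dim))

def rvq_encode(x, codebooks):
    """Encode a frame embedding into one token per stage; early stages are coarse,
    later stages quantize the remaining residual (fine detail)."""
    residual, tokens = x.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=-1)))  # nearest codeword
        tokens.append(idx)
        residual -= cb[idx]
    return tokens

def rvq_decode(tokens, codebooks):
    return sum(cb[i] for cb, i in zip(codebooks, tokens))

x = rng.normal(size=dim)
tokens = rvq_encode(x, codebooks)
print(tokens, np.linalg.norm(x - rvq_decode(tokens, codebooks)))
```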
In this paper, we introduce a new approach, called "Posthoc Interpretation via Quantization (PIQ)", for interpreting decisions made by trained classifiers. Our method utilizes vector quantization to transform the representations of a classifier into a discrete, class-specific latent space. The class-specific codebooks act as a bottleneck that forces the interpreter to focus on the parts of the input data deemed relevant by the classifier for making a prediction. We evaluated our method through quantitative and qualitative studies and found that PIQ generates interpretations that are more easily understood by participants in our user studies than several other interpretation methods from the literature.
https://arxiv.org/abs/2303.12659
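A rough sketch of the class-specific vector-quantization bottleneck idea: classifier features are snapped onto a small codebook belonging to the predicted class before the interpreter sees them. The codebook shapes, the hard-assignment choice, and the surrounding interpreter are assumptions here, not PIQ's exact design.

```python
import numpy as np

rng = np.random.default_rng(1)
num_classes, codes_per_class, dim = 10, 32, 128
codebooks = rng.normal(size=(num_classes, codes_per_class, dim))  # one codebook per class

def quantize_for_class(features, predicted_class):
    """Snap classifier features (tokens x dim) onto the predicted class's codebook.

    The quantized features form the bottleneck the interpretation is derived from.
    """
    cb = codebooks[predicted_class]                            # (codes_per_class, dim)
    dists = ((features[:, None, :] - cb[None]) ** 2).sum(-1)   # (tokens, codes_per_class)
    codes = dists.argmin(axis=1)
    return cb[codes], codes

features = rng.normal(size=(49, dim))   # e.g. a 7x7 feature map, flattened
quantized, codes = quantize_for_class(features, predicted_class=3)
print(quantized.shape, codes[:5])
```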
Recently, vision transformers (ViTs) have replaced convolutional neural network models in numerous tasks, including classification, detection, and segmentation. However, the high computational requirements of ViTs hinder their widespread deployment. To address this issue, researchers have proposed efficient hybrid transformer architectures that combine convolutional and transformer layers and optimize attention computation for linear complexity. Additionally, post-training quantization has been proposed as a means of mitigating computational demands. Combining quantization techniques with efficient hybrid transformer structures is crucial to maximizing the acceleration of vision transformers on mobile devices. However, no prior investigation has applied quantization to efficient hybrid transformers. In this paper, we first discover that directly applying existing PTQ methods designed for ViTs to efficient hybrid transformers results in a drastic accuracy drop due to the following challenges: (i) highly dynamic ranges, (ii) zero-point overflow, (iii) diverse normalization, and (iv) limited model parameters (<5M). To overcome these challenges, we propose a new post-training quantization method, the first to quantize efficient hybrid vision transformers (MobileViTv1 and MobileViTv2), which outperforms existing PTQ methods (EasyQuant, FQ-ViT, and PTQ4ViT) by a significant margin (an average improvement of 7.75%). We plan to release our code at this https URL.
https://arxiv.org/abs/2303.12557
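To make the "zero-point overflow" challenge concrete, here is a generic min-max asymmetric quantizer (not the paper's method): when an activation range is strongly one-sided, the computed zero-point falls far outside the unsigned integer range and must be clamped, which destroys the dequantized values.

```python
import numpy as np

def asymmetric_quantize(x, num_bits=8):
    """Standard min-max asymmetric quantization to unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(np.round(qmin - x.min() / scale))
    clamped_zp = int(np.clip(zero_point, qmin, qmax))   # overflow shows up here
    q = np.clip(np.round(x / scale) + clamped_zp, qmin, qmax)
    return (q - clamped_zp) * scale, zero_point, clamped_zp

# A highly one-sided activation range drives the zero-point far outside [0, 255].
x = np.linspace(200.0, 201.0, 16)
dequant, zp, clamped = asymmetric_quantize(x)
print(zp, clamped, np.abs(dequant - x).max())
```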
Post-training quantization (PTQ) is widely regarded as one of the most efficient compression methods in practice, benefiting from its data privacy and low computation costs. We argue that oscillation is an overlooked problem in PTQ methods. In this paper, we take the initiative to explore this problem and present a theoretical proof of why it is essential in PTQ. We then try to solve it by introducing a principled and generalized framework. In particular, we first formulate the oscillation in PTQ and prove that the problem is caused by differences in module capacity. To this end, we define module capacity (ModCap) under data-dependent and data-free scenarios, where the differentials between adjacent modules are used to measure the degree of oscillation. The problem is then solved by selecting the top-k differentials, whose corresponding modules are jointly optimized and quantized. Extensive experiments demonstrate that our method successfully reduces the performance drop and generalizes to different neural networks and PTQ methods. For example, with 2/4-bit ResNet-50 quantization, our method surpasses the previous state-of-the-art method by 1.9%. The gain becomes more significant for small-model quantization, e.g., surpassing the BRECQ method by 6.61% on MobileNetV2*0.5.
https://arxiv.org/abs/2303.11906
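A small sketch of the module-selection step as the abstract describes it: compute differentials between adjacent modules' capacities and pick the top-k pairs for joint optimization and quantization. The ModCap values below are placeholders; how capacity is actually computed is defined in the paper.

```python
import numpy as np

def select_modules_by_differential(module_capacities, k):
    """Rank adjacent-module capacity differences and return the top-k pairs
    (indices of modules to be jointly optimized and quantized)."""
    caps = np.asarray(module_capacities, dtype=float)
    diffs = np.abs(np.diff(caps))                 # differential between adjacent modules
    top = np.argsort(diffs)[::-1][:k]             # largest oscillation first
    return [(int(i), int(i + 1), float(diffs[i])) for i in sorted(top)]

# Illustrative ModCap values for a 6-module network.
print(select_modules_by_differential([1.2, 0.4, 1.1, 1.0, 0.3, 0.9], k=2))
```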
The rising performance of deep neural networks is often empirically attributed to an increase in available computational power, which allows complex models to be trained on large amounts of annotated data. However, increased model complexity leads to costly deployment of modern neural networks, while gathering such amounts of data incurs huge costs to avoid label noise. In this work, we study the ability of compression methods to tackle both of these problems at once. We hypothesize that quantization-aware training, by restricting the expressivity of neural networks, behaves as a regularizer. Thus, it may help fight overfitting on noisy data while also allowing the model to be compressed at inference. We first validate this claim on a controlled test with manually introduced label noise. Furthermore, we also test the proposed method on Facial Action Unit detection, where labels are typically noisy due to the subtlety of the task. In all cases, our results suggest that quantization significantly improves results compared with existing baselines, regularization, and other compression methods.
https://arxiv.org/abs/2303.11803
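For readers who want the mechanism behind "QAT as regularization": a minimal PyTorch fake-quantization sketch with a straight-through estimator, which is the standard way quantization-aware training restricts weight expressivity. The bit-width and symmetric rounding scheme are generic choices, not necessarily the paper's exact setup.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Round weights to a k-bit grid in the forward pass; pass gradients straight through."""
    @staticmethod
    def forward(ctx, w, num_bits=4):
        qmax = 2 ** (num_bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # straight-through estimator

w = torch.randn(8, requires_grad=True)
loss = (FakeQuant.apply(w, 4) ** 2).sum()   # stand-in for a real training loss
loss.backward()
print(w.grad is not None, FakeQuant.apply(w, 4))
```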
Data-free quantization (DFQ) recovers the performance of a quantized network (Q) without accessing the original data, generating fake samples via a generator (G) that learns from the full-precision network (P). The generator, however, is totally independent of Q and overlooks the adaptability of the knowledge in the generated samples, i.e., whether they are informative to the learning process of Q, resulting in an overflow of generalization error. This raises several critical questions: how to measure the sample adaptability to Q under varied bit-width scenarios? how to generate samples with large adaptability to improve Q's generalization? is the largest adaptability the best? To answer these questions, in this paper, we propose an Adaptive Data-Free Quantization (AdaDFQ) method, which revisits DFQ from a zero-sum-game perspective on the sample adaptability between two players -- a generator and a quantized network. Following this viewpoint, we further define disagreement and agreement samples to form two boundaries, where the margin between them is optimized to address over- and under-fitting, so as to generate samples with adaptive adaptability to Q. Our AdaDFQ reveals that: 1) the largest adaptability is NOT the best for sample generation to benefit Q's generalization; 2) the knowledge in the generated samples should not only be informative to Q, but also related to the category and distribution information of P's training data. Theoretical and empirical analyses validate the advantages of AdaDFQ over the state of the art. Our code is available at this https URL.
https://arxiv.org/abs/2303.06869
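A toy reading of the adaptive-adaptability idea (the disagreement measure, the thresholds, and the hinge form below are assumptions for illustration, not AdaDFQ's exact objective): score each generated sample by how much P and Q disagree on it, then penalize samples outside a [lower, upper] margin, since fully agreed-on samples are uninformative to Q and extreme disagreement invites overfitting.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def margin_penalty(logits_p, logits_q, lower=0.2, upper=0.8):
    """Per-sample P/Q disagreement in [0, 1], hinged into a margin."""
    gap = 0.5 * np.abs(softmax(logits_p) - softmax(logits_q)).sum(axis=-1)
    return gap, np.maximum(lower - gap, 0.0) + np.maximum(gap - upper, 0.0)

rng = np.random.default_rng(0)
logits_p = rng.normal(size=(4, 10))
logits_q = logits_p + np.array([[0.0], [0.2], [1.0], [5.0]]) * rng.normal(size=(4, 10))
gap, penalty = margin_penalty(logits_p, logits_q)
print(gap.round(3), penalty.round(3))
```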
The combination of Neural Architecture Search (NAS) and quantization has proven successful in automatically designing low-FLOPs INT8 quantized neural networks (QNNs). However, directly applying NAS to design accurate QNN models that achieve low latency on real-world devices leads to inferior performance. In this work, we find that the poor INT8 latency is due to the quantization-unfriendly issue: the operator and configuration (e.g., channel width) choices in prior-art search spaces lead to diverse quantization efficiency and can slow down INT8 inference speed. To address this challenge, we propose SpaceEvo, an automatic method for designing a dedicated, quantization-friendly search space for each target hardware. The key idea of SpaceEvo is to automatically search hardware-preferred operators and configurations to construct the search space, guided by a metric called the Q-T score that quantifies how quantization-friendly a candidate search space is. We further train a quantized-for-all supernet over our discovered search space, enabling the searched models to be directly deployed without extra retraining or quantization. Our discovered models establish new SOTA INT8 quantized accuracy under various latency constraints, achieving up to 10.1% higher accuracy on ImageNet than prior-art CNNs under the same latency. Extensive experiments on diverse edge devices demonstrate that SpaceEvo consistently outperforms existing manually designed search spaces, with up to 2.5x faster speed at the same accuracy.
https://arxiv.org/abs/2303.08308
Post-training quantization (PTQ) has recently been shown to be a practical compromise for reducing the memory consumption and/or compute cost of large language models. However, a comprehensive study of the effects of different quantization schemes, model families, PTQ methods, quantization bit precisions, etc., is still missing. In this work, we provide an extensive study of these components over tens of thousands of zero-shot experiments. Our results show that (1) fine-grained quantization and PTQ methods (instead of naive round-to-nearest quantization) are necessary to achieve good accuracy, and (2) higher bits (e.g., 5 bits) with coarse-grained quantization are more powerful than lower bits (e.g., 4 bits) with very fine-grained quantization (whose effective bit count is similar to 5 bits). We also present recommendations on how to utilize quantization for LLMs of different sizes, and leave suggestions for future opportunities and system work not resolved in this work.
https://arxiv.org/abs/2303.08302
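A small sketch contrasting naive per-tensor round-to-nearest with the fine-grained (group-wise) round-to-nearest that the study finds necessary; the 4-bit setting and the group size are illustrative.

```python
import numpy as np

def rtn_quantize(w, num_bits=4, group_size=None):
    """Symmetric round-to-nearest weight quantization.

    group_size=None -> one scale for the whole tensor (coarse-grained);
    group_size=g    -> one scale per g consecutive weights (fine-grained).
    """
    qmax = 2 ** (num_bits - 1) - 1
    flat = w.reshape(-1)
    groups = flat.reshape(1, -1) if group_size is None else flat.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    deq = np.clip(np.round(groups / scales), -qmax - 1, qmax) * scales
    return deq.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)) * np.exp(rng.normal(size=(64, 1)))  # rows with very different ranges
for g in (None, 64):
    err = np.abs(rtn_quantize(w, 4, g) - w).mean()
    print(f"group_size={g}: mean abs error {err:.4f}")
```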
Model parameter regularization is a widely used technique to improve generalization, but it can also be used to shape weight distributions for various purposes. In this work, we shed light on how weight regularization can assist model quantization and compression techniques, and then propose range regularization (R^2) to further boost the quality of model optimization by focusing on outlier prevention. By effectively regulating the minimum and maximum weight values of a distribution, we mold the overall distribution into a tight shape so that model compression and quantization techniques can better utilize their limited numeric representation power. We introduce L-inf regularization, its extension margin regularization, and a new soft-min-max regularization to be used as regularization losses during full-precision model training. Coupled with state-of-the-art quantization and compression techniques, models trained with R^2 perform better on average, particularly at lower bit widths with a 16x compression ratio. We also demonstrate that R^2 helps parameter-constrained models like MobileNetV1 achieve significant improvements of around 8% for 2-bit quantization and 7% for 1-bit compression.
https://arxiv.org/abs/2303.08253
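A minimal PyTorch sketch of the range-regularization idea: an L-inf penalty on the extreme weight magnitude, plus a smooth soft-min-max surrogate implemented here with log-sum-exp. The exact formulations and loss weighting in the paper may differ; the 1e-3 coefficient and the stand-in task loss are placeholders.

```python
import torch

def linf_range_penalty(w):
    """Penalize the largest-magnitude weight (pulls in the outliers at the tails)."""
    return w.abs().max()

def soft_min_max_penalty(w, temperature=10.0):
    """Smooth surrogate for the weight range max(w) - min(w), via log-sum-exp,
    so that every weight receives a gradient rather than only the two extremes."""
    flat = w.flatten() * temperature
    return (torch.logsumexp(flat, dim=0) + torch.logsumexp(-flat, dim=0)) / temperature

w = torch.randn(256, 256, requires_grad=True)
task_loss = (w ** 2).mean()                        # stand-in for the real training loss
loss = task_loss + 1e-3 * soft_min_max_penalty(w)  # add the range regularizer
loss.backward()
print(float(linf_range_penalty(w)), float(soft_min_max_penalty(w)))
```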
Deep neural networks have proven effective in a wide range of tasks. However, their high computational and memory costs make them impractical to deploy on resource-constrained devices. To address this issue, quantization schemes have been proposed to reduce the memory footprint and improve inference speed. While numerous quantization methods have been proposed, they lack a systematic analysis of their effectiveness. To bridge this gap, we collect and improve existing quantization methods and propose a gold guideline for post-training quantization. We evaluate the effectiveness of the proposed guideline with two popular models, ResNet50 and MobileNetV2, on the ImageNet dataset. By following our guideline, no accuracy degradation occurs even when directly quantizing the models to 8 bits without additional training. Quantization-aware training based on the guideline can further improve accuracy at lower bit widths. Moreover, we integrate a multi-stage fine-tuning strategy that works harmoniously with existing pruning techniques to reduce costs even further. Remarkably, our results reveal that a quantized MobileNetV2 with 30% sparsity actually surpasses the performance of the equivalent full-precision model, underscoring the effectiveness and resilience of the proposed scheme.
https://arxiv.org/abs/2303.07080
Many edge applications, such as collaborative robotics and spacecraft rendezvous, can benefit from 6D object pose estimation, but must do so on embedded platforms. Unfortunately, existing 6D pose estimation networks are typically too large for deployment in such situations and must therefore be compressed, while maintaining reliable performance. In this work, we present an approach to doing so by quantizing such networks. More precisely, we introduce a module-wise quantization strategy that, in contrast to uniform and mixed-precision quantization, accounts for the modular structure of typical 6D pose estimation frameworks. We demonstrate that uniquely compressing these modules outperforms uniform and mixed-precision quantization techniques. Moreover, our experiments evidence that module-wise quantization can lead to a significant accuracy boost. We showcase the generality of our approach using different datasets, quantization methodologies, and network architectures, including the recent ZebraPose.
https://arxiv.org/abs/2303.06753
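A rough sketch of what a module-wise scheme looks like in practice: each module of the pose-estimation pipeline gets its own precision, rather than one uniform bit-width. The module names, bit-widths, and the simple symmetric quantizer below are invented placeholders, not the paper's actual assignment.

```python
import numpy as np

def quantize_symmetric(w, num_bits):
    """Per-tensor symmetric quantize-dequantize at the given bit-width."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

# Hypothetical per-module precision plan (module-wise, as opposed to uniform or layer-wise mixed).
bit_plan = {"backbone": 8, "keypoint_head": 6, "pnp_refiner": 8}

rng = np.random.default_rng(0)
model = {name: rng.normal(size=(32, 32)) for name in bit_plan}
quantized = {name: quantize_symmetric(w, bit_plan[name]) for name, w in model.items()}
for name in bit_plan:
    print(name, bit_plan[name], np.abs(quantized[name] - model[name]).mean().round(5))
```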
Quantizing images into discrete representations has been a fundamental problem in unified generative modeling. Predominant approaches learn the discrete representation either deterministically, by selecting the best-matching token, or stochastically, by sampling from a predicted distribution. However, deterministic quantization suffers from severe codebook collapse and misalignment with the inference stage, while stochastic quantization suffers from low codebook utilization and a perturbed reconstruction objective. This paper presents a regularized vector quantization framework that mitigates the above issues effectively by applying regularization from two perspectives. The first is a prior distribution regularization, which measures the discrepancy between a prior token distribution and the predicted token distribution to avoid codebook collapse and low codebook utilization. The second is a stochastic mask regularization, which introduces stochasticity during quantization to strike a good balance between inference-stage misalignment and an unperturbed reconstruction objective. In addition, we design a probabilistic contrastive loss which serves as a calibrated metric to further mitigate the perturbed reconstruction objective. Extensive experiments show that the proposed quantization framework consistently outperforms prevailing vector quantization methods across different generative models, including auto-regressive models and diffusion models.
https://arxiv.org/abs/2303.06424
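A toy sketch of the prior-distribution regularization: compare the batch-averaged predicted codeword-usage distribution against a prior with a KL term, which grows when only a few codewords are ever used (codebook collapse). The uniform prior, the soft assignments, and the KL direction are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def codebook_usage_kl(assignment_logits, eps=1e-9):
    """KL(prior || usage) between a uniform prior over codewords and the batch-averaged
    predicted token distribution; large when most tokens map to a few codewords."""
    usage = softmax(assignment_logits).mean(axis=0)          # (codebook_size,)
    prior = np.full_like(usage, 1.0 / usage.size)
    return float(np.sum(prior * np.log(prior / (usage + eps))))

rng = np.random.default_rng(0)
healthy = rng.normal(size=(512, 64))                          # assignments spread over 64 codes
collapsed = np.zeros((512, 64)); collapsed[:, 0] = 10.0       # almost everything maps to code 0
print(codebook_usage_kl(healthy), codebook_usage_kl(collapsed))
```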
Learned image compression has exhibited promising compression performance, but variable bitrates over a wide range remain a challenge. State-of-the-art variable-rate methods compromise model performance and require numerous additional parameters. In this paper, we present a Quantization-error-aware Variable Rate Framework (QVRF) that utilizes a univariate quantization regulator a to achieve wide-range variable rates within a single model. Specifically, QVRF defines a quantization regulator vector coupled with predefined Lagrange multipliers to control the quantization error of all latent representations for discrete variable rates. Additionally, a reparameterization method makes QVRF compatible with a round quantizer. Exhaustive experiments demonstrate that existing fixed-rate VAE-based methods equipped with QVRF can achieve wide-range continuous variable rates within a single model without significant performance degradation. Furthermore, QVRF outperforms contemporary variable-rate methods in rate-distortion performance with minimal additional parameters.
https://arxiv.org/abs/2303.05744
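A toy sketch of the quantization-regulator mechanism: a single scalar a rescales the latent before rounding and is undone afterwards, so one parameter sweeps the rate-distortion trade-off. The plain rounding quantizer and the values of a below are illustrative; the coupling with Lagrange multipliers and the reparameterization are not reproduced.

```python
import numpy as np

def regulated_round_trip(latent, a):
    """Scale by the regulator a, round to integers (the transmitted symbols), unscale."""
    symbols = np.round(latent * a)
    return symbols / a, symbols

rng = np.random.default_rng(0)
latent = rng.normal(scale=3.0, size=10000)
for a in (0.5, 1.0, 4.0):                       # one regulator value per target rate point
    recon, symbols = regulated_round_trip(latent, a)
    mse = float(((recon - latent) ** 2).mean())
    rate_proxy = len(np.unique(symbols))        # crude proxy: number of distinct symbols
    print(f"a={a}: mse={mse:.5f}, distinct symbols={rate_proxy}")
```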
Large Language Models (LLMs) have demonstrated impressive performance on a range of Natural Language Processing (NLP) tasks. Unfortunately, the immense amount of computation and memory access required for LLM training makes it prohibitively expensive in terms of hardware cost, and thus challenging to deploy in use cases such as on-device learning. In this paper, motivated by the observation that LLM training is memory-bound, we propose a novel dynamic quantization strategy, termed Dynamic Stashing Quantization (DSQ), that puts a special focus on reducing memory operations while also enjoying the other benefits of low-precision training, such as reduced arithmetic cost. We conduct a thorough study on two translation tasks (trained from scratch) and three classification tasks (fine-tuning). DSQ reduces the amount of arithmetic operations by $20.95\times$ and the number of DRAM operations by $2.55\times$ on IWSLT17 compared to the standard 16-bit fixed-point format, which is widely used in on-device learning.
https://arxiv.org/abs/2303.05295
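The memory-bound observation motivates quantizing whatever gets "stashed" for the backward pass. A rough, framework-agnostic sketch assuming int8 storage of saved activations; the actual DSQ policy (which tensors, which bit-widths, and how they change over training) is dynamic and not reproduced here.

```python
import numpy as np

def stash(activation):
    """Store an activation needed by the backward pass as int8 plus a float32 scale."""
    scale = np.float32(np.abs(activation).max() / 127.0) or np.float32(1.0)
    return np.round(activation / scale).astype(np.int8), scale

def unstash(stashed):
    """Recover an approximate float32 activation from its low-precision stash."""
    q, scale = stashed
    return q.astype(np.float32) * scale

act = np.random.default_rng(0).normal(size=(128, 512)).astype(np.float32)
packed = stash(act)
restored = unstash(packed)
print(act.nbytes, packed[0].nbytes, float(np.abs(restored - act).max()))  # 4x fewer DRAM bytes
```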
Time series generation (TSG) studies have mainly focused on the use of Generative Adversarial Networks (GANs) combined with recurrent neural network (RNN) variants. However, the fundamental limitations and challenges of training GANs remain. In addition, the RNN family typically has difficulties with temporal consistency between distant timesteps. Motivated by successes in the image generation (IMG) domain, we propose TimeVQVAE, the first work, to our knowledge, that uses vector quantization (VQ) techniques to address the TSG problem. Moreover, the priors of the discrete latent spaces are learned with bidirectional transformer models that can better capture global temporal consistency. We also propose VQ modeling in a time-frequency domain, separated into low-frequency (LF) and high-frequency (HF) components. This allows us to retain important characteristics of the time series and, in turn, generate new synthetic signals that are of better quality, with sharper changes in modularity, than competing TSG methods. Our experimental evaluation is conducted on all datasets from the UCR archive, using well-established metrics from the IMG literature, such as Fréchet inception distance and inception scores. Our implementation is available on GitHub: this https URL.
https://arxiv.org/abs/2303.04743
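A small sketch of the low-/high-frequency split that precedes the two VQ stages: separate a series into LF and HF components with an rFFT mask. The cutoff is arbitrary here, and the paper's actual STFT-based time-frequency formulation is not reproduced.

```python
import numpy as np

def lf_hf_split(x, cutoff_bins=8):
    """Split a 1-D series into low-frequency and high-frequency parts via an rFFT mask."""
    spec = np.fft.rfft(x)
    lf_spec = spec.copy(); lf_spec[cutoff_bins:] = 0
    hf_spec = spec.copy(); hf_spec[:cutoff_bins] = 0
    lf = np.fft.irfft(lf_spec, n=len(x))
    hf = np.fft.irfft(hf_spec, n=len(x))
    return lf, hf

t = np.linspace(0, 1, 256, endpoint=False)
x = np.sin(2 * np.pi * 3 * t) + 0.3 * np.sin(2 * np.pi * 40 * t)   # slow trend + fast detail
lf, hf = lf_hf_split(x)
print(np.allclose(lf + hf, x), np.abs(lf).max(), np.abs(hf).max())  # split is lossless
```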
Deep neural networks (DNNs) are widely applied to today's 3D surface reconstruction tasks, and such methods can be divided into two categories: those that explicitly warp templates by moving vertices, and those that implicitly represent 3D surfaces as signed or unsigned distance functions. Taking advantage of both the advanced explicit learning process and the powerful representation ability of implicit functions, we propose a novel 3D representation method, Neural Vector Fields (NVF). It not only adopts the explicit learning process to manipulate meshes directly, but also leverages the implicit representation of unsigned distance functions (UDFs) to break the barriers of resolution and topology. Specifically, our method first predicts the displacements from queries towards the surface and models the shapes as Vector Fields. Rather than relying on network differentiation to obtain direction fields, as most existing UDF-based methods do, the produced vector fields encode both the distance and direction fields and mitigate the ambiguity at "ridge" points, such that the calculation of direction fields is straightforward and differentiation-free. The differentiation-free characteristic enables us to further learn a shape codebook via vector quantization, which encodes cross-object priors, accelerates the training procedure, and boosts model generalization on cross-category reconstruction. Extensive experiments on surface reconstruction benchmarks indicate that our method outperforms state-of-the-art methods in different evaluation scenarios, including watertight vs. non-watertight shapes, category-specific vs. category-agnostic reconstruction, category-unseen reconstruction, and cross-domain reconstruction. Our code will be publicly released.
https://arxiv.org/abs/2303.04341
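To make the core representation concrete: for each query point the network predicts a displacement to the surface, whose norm is the unsigned distance and whose normalized form is the direction, so no differentiation of the network is needed. Below is a toy numpy stand-in that uses nearest surface samples in place of the learned field.

```python
import numpy as np

def vector_field(queries, surface_points):
    """Displacement from each query to its nearest surface sample (stand-in for the learned NVF)."""
    diffs = surface_points[None, :, :] - queries[:, None, :]     # (Q, S, 3)
    nearest = np.linalg.norm(diffs, axis=-1).argmin(axis=1)
    return diffs[np.arange(len(queries)), nearest]

rng = np.random.default_rng(0)
surface = rng.normal(size=(2048, 3))
surface /= np.linalg.norm(surface, axis=1, keepdims=True)         # samples on a unit sphere
queries = rng.normal(size=(5, 3)) * 2.0
v = vector_field(queries, surface)
udf = np.linalg.norm(v, axis=-1)                    # distance field from the vector norm
direction = v / np.maximum(udf[:, None], 1e-9)      # direction field, no differentiation required
print(udf.round(3), direction.round(3))
```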
In this work, we present QuickSRNet, an efficient super-resolution architecture for real-time applications on mobile platforms. Super-resolution clarifies, sharpens, and upscales an image to higher resolution. Applications such as gaming and video playback along with the ever-improving display capabilities of TVs, smartphones, and VR headsets are driving the need for efficient upscaling solutions. While existing deep learning-based super-resolution approaches achieve impressive results in terms of visual quality, enabling real-time DL-based super-resolution on mobile devices with compute, thermal, and power constraints is challenging. To address these challenges, we propose QuickSRNet, a simple yet effective architecture that provides better accuracy-to-latency trade-offs than existing neural architectures for single-image super resolution. We present training tricks to speed up existing residual-based super-resolution architectures while maintaining robustness to quantization. Our proposed architecture produces 1080p outputs via 2x upscaling in 2.2 ms on a modern smartphone, making it ideal for high-fps real-time applications.
https://arxiv.org/abs/2303.04336
A popular track of network compression approaches is Quantization-Aware Training (QAT), which accelerates the forward pass during neural network training and inference. However, little prior effort has been made to quantize and accelerate the backward pass during training, even though it contributes around half of the training time. This can be partly attributed to the fact that errors from low-precision gradients during the backward pass cannot be amortized by the training objective as in the QAT setting. In this work, we propose to solve this problem by incorporating the gradients into the computation graph of the next training iteration via a hypernetwork. Various experiments on the CIFAR-10 dataset with different CNN architectures demonstrate that our hypernetwork-based approach can effectively reduce the negative effect of gradient quantization noise and successfully quantizes the gradients to INT4 with only a 0.64 drop in accuracy for VGG-16 on CIFAR-10.
https://arxiv.org/abs/2303.02347
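A minimal sketch of the operation whose noise the hypernetwork is meant to compensate: quantizing a gradient tensor to INT4. Symmetric per-tensor scaling and deterministic rounding are assumptions; the hypernetwork itself is not shown.

```python
import numpy as np

def quantize_grad_int4(grad):
    """Map a float gradient to 16 signed levels (INT4) and back."""
    qmax = 7                                        # int4 range: [-8, 7]
    scale = np.abs(grad).max() / qmax
    if scale == 0:
        return grad.copy(), np.zeros(grad.shape, dtype=np.int8)
    q = np.clip(np.round(grad / scale), -8, 7).astype(np.int8)
    return q.astype(np.float32) * scale, q

g = np.random.default_rng(0).normal(scale=1e-3, size=(64, 64)).astype(np.float32)
deq, q = quantize_grad_int4(g)
print(np.unique(q).size, float(np.abs(deq - g).mean()))   # at most 16 distinct levels
```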
Fixed-point (FXP) inference has proven suitable for embedded devices with limited computational resources, yet model training is still performed in floating point (FLP). FXP training has not been fully explored, and the non-trivial conversion from FLP to FXP incurs an unavoidable performance drop. We propose a novel method to train and obtain FXP convolutional keyword-spotting (KWS) models. We combine our methodology with two quantization-aware training (QAT) techniques -- squashed weight distribution and absolute cosine regularization for model parameters -- and propose techniques for extending QAT to transient variables, which previous paradigms neglect. Experimental results on the Google Speech Commands v2 dataset show that we can reduce model precision to 4 bits with no loss in accuracy. Furthermore, on an in-house KWS dataset, we show that our 8-bit FXP-QAT models achieve a 4-6% improvement in relative false discovery rate at a fixed false reject rate compared to full-precision FLP models. During inference, we argue that FXP-QAT eliminates q-format normalization and enables the use of low-bit accumulators while maximizing SIMD throughput to reduce user-perceived latency. We demonstrate that we can reduce execution time by 68% without compromising the KWS model's predictive performance or requiring architectural changes. Our work provides novel findings that aid future research in this area and enable accurate and efficient models.
https://arxiv.org/abs/2303.02284