Most real-world applications that employ deep neural networks (DNNs) quantize them to low precision to reduce compute requirements. We present a method to improve the robustness of quantized DNNs to white-box adversarial attacks. We first tackle the limitation of deterministic quantization to fixed ``bins'' by introducing a differentiable Stochastic Quantizer (SQ). We explore the hypothesis that different quantizations may collectively be more robust than each individual quantized DNN. We formulate a training objective that encourages different quantized DNNs to learn different representations of the input image, capturing both diversity and accuracy via the mutual information (MI) between ensemble members. Through experimentation, we demonstrate substantial improvement in robustness against $L_\infty$ attacks even when the attacker is allowed to backpropagate through SQ (e.g., > 50\% accuracy against PGD(5/255) on CIFAR10 without adversarial training), compared to vanilla DNNs as well as existing ensembles of quantized DNNs. We extend the method to detect attacks and generate robustness profiles in the adversarial information plane (AIP), correlating MI and accuracy towards a unified analysis of different threat models.
https://arxiv.org/abs/2312.00105
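Since the abstract leaves the quantizer's form implicit, here is a minimal sketch of a differentiable stochastic quantizer in PyTorch; the uniform bins in [0, 1], the bin count, and the straight-through gradient are illustrative assumptions, not the paper's exact design:

```python
import torch

def stochastic_quantize(x, n_bins=16):
    """Stochastically round x (assumed in [0, 1]) to n_bins uniform levels.

    floor(x/d + u) with u ~ U(0, 1) is an unbiased estimate of x/d, so
    repeated forward passes sample different quantizations of the same net.
    """
    d = 1.0 / (n_bins - 1)                  # bin width
    noise = torch.rand_like(x)              # fresh randomness per pass
    q = (torch.floor(x / d + noise) * d).clamp(0.0, 1.0)
    # Straight-through estimator: forward returns q, backward sees identity,
    # which is what lets an attacker (or the trainer) backpropagate through SQ.
    return x + (q - x).detach()
```

Sampling such a quantizer several times would yield the ensemble members whose pairwise mutual information the training objective then regularizes.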
Open-vocabulary querying in 3D space is challenging but essential for scene understanding tasks such as object localization and segmentation. Language-embedded scene representations have made progress by incorporating language features into 3D spaces. However, their efficacy heavily depends on neural networks that are resource-intensive in training and rendering. Although recent 3D Gaussians offer efficient and high-quality novel view synthesis, directly embedding language features in them leads to prohibitive memory usage and decreased performance. In this work, we introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks. Instead of embedding high-dimensional raw semantic features on 3D Gaussians, we propose a dedicated quantization scheme that drastically alleviates the memory requirement, and a novel embedding procedure that achieves smooth yet highly accurate queries, countering the multi-view feature inconsistencies and the high-frequency inductive bias of point-based representations. Our comprehensive experiments show that our representation achieves the best visual quality and language querying accuracy across current language-embedded representations, while maintaining real-time rendering frame rates on a single desktop GPU.
https://arxiv.org/abs/2311.18482
3D Gaussian Splatting is a new method for modeling and rendering 3D radiance fields that achieves much faster learning and rendering times than state-of-the-art NeRF methods. However, it comes with the drawback of a much larger storage demand than NeRF methods, since it needs to store the parameters of millions of 3D Gaussians. We notice that many Gaussians may share similar parameters, so we introduce a simple vector quantization method based on the K-means algorithm to quantize the Gaussian parameters. Then, we store the small codebook along with the index of the code for each Gaussian. Moreover, we compress the indices further by sorting them and using a method similar to run-length encoding. We conduct extensive experiments on standard benchmarks as well as a new benchmark that is an order of magnitude larger than the standard ones. We show that our simple yet effective method can reduce the storage cost of the original 3D Gaussian Splatting method by a factor of almost $20\times$ with a very small drop in the quality of rendered images.
https://arxiv.org/abs/2311.18159
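A toy version of the pipeline, under the assumption that a plain k-means codebook and a naive run-length encoder capture the idea (the paper's actual sorting and entropy-coding details may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def rle(a):
    """Run-length encode a 1-D integer array into (values, run lengths)."""
    starts = np.concatenate(([0], np.flatnonzero(np.diff(a)) + 1))
    lengths = np.diff(np.concatenate((starts, [len(a)])))
    return a[starts], lengths

def compress_gaussians(params, k=4096):
    """Quantize per-Gaussian parameter rows with k-means, then store the
    small codebook plus RLE-compressed sorted indices. Note: this naive
    sketch would also need the sort permutation to invert the encoding."""
    km = KMeans(n_clusters=k, n_init=1).fit(params)
    idx = np.sort(km.labels_.astype(np.int32))    # sorting creates long runs
    values, lengths = rle(idx)
    return km.cluster_centers_, values, lengths
```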
We introduce MoMask, a novel masked modeling framework for text-driven 3D human motion generation. In MoMask, a hierarchical quantization scheme is employed to represent human motion as multi-layer discrete motion tokens with high-fidelity details. Starting from a base layer of motion tokens obtained by vector quantization, residual tokens of increasing order are derived and stored at the subsequent layers of the hierarchy. Two distinct bidirectional transformers then operate on these tokens. For the base-layer motion tokens, a Masked Transformer is trained to predict randomly masked motion tokens conditioned on the text input. During the generation (i.e., inference) stage, starting from an empty sequence, the Masked Transformer iteratively fills in the missing tokens. Subsequently, a Residual Transformer learns to progressively predict the next-layer tokens based on the results from the current layer. Extensive experiments demonstrate that MoMask outperforms state-of-the-art methods on the text-to-motion generation task, with an FID of 0.045 (vs., e.g., 0.141 for T2M-GPT) on the HumanML3D dataset and 0.228 (vs. 0.514) on KIT-ML. MoMask can also be seamlessly applied to related tasks without further model fine-tuning, such as text-guided temporal inpainting.
https://arxiv.org/abs/2312.00063
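The hierarchical scheme is standard residual vector quantization (RVQ); a compact sketch with fixed, pre-trained codebooks (codebook learning and the two transformers are omitted, and shapes are assumptions):

```python
import numpy as np

def residual_vq(x, codebooks):
    """Encode vectors x (n, d) into one token sequence per hierarchy layer.

    The base codebook quantizes x directly; each subsequent codebook
    quantizes whatever error the layers above it left behind.
    """
    tokens, recon = [], np.zeros_like(x)
    for cb in codebooks:                        # cb: (k, d) codewords
        residual = x - recon
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        ids = dists.argmin(axis=1)              # token ids for this layer
        tokens.append(ids)
        recon = recon + cb[ids]                 # running reconstruction
    return tokens, recon
```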
Banding, also known as staircase-like contours, frequently occurs in flat areas of images and videos processed by compression or quantization algorithms. As an undesirable artifact, banding destroys the original image structure, degrading the user's quality of experience (QoE). In this paper, we systematically investigate the banding image quality assessment (IQA) problem, aiming to detect image banding artifacts and evaluate their perceptual visual quality. Considering that existing image banding databases contain only limited content sources and banding generation methods, and lack perceptual quality labels (i.e., mean opinion scores), we first build the largest banding IQA database so far, named the Banding Artifact Noticeable Database (BAND-2k), which consists of 2,000 banding images generated by 15 compression and quantization schemes. A total of 23 workers participated in the subjective IQA experiment, yielding over 214,000 patch-level banding class labels and 44,371 reliable image-level quality ratings. Subsequently, we develop an effective no-reference (NR) banding evaluator for banding detection and quality assessment by leveraging the frequency characteristics of banding artifacts. A dual convolutional neural network is employed to concurrently learn feature representations from the high-frequency and low-frequency maps, thereby enhancing the ability to discern banding artifacts. The quality score of a banding image is generated by pooling the banding detection maps masked by spatial frequency filters. Experiments demonstrate that our banding evaluator achieves remarkably high accuracy in banding detection and also exhibits high SRCC and PLCC against the perceptual quality labels. These findings unveil the strong correlation between the intensity of banding artifacts and perceptual visual quality, validating the necessity of banding quality assessment.
https://arxiv.org/abs/2311.17752
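The dual-branch input can be approximated with a simple Gaussian decomposition; a sketch, where the filter choice and sigma are assumptions rather than the paper's exact maps:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def frequency_maps(gray, sigma=2.0):
    """Split a grayscale image into the low- and high-frequency maps fed to
    the two CNN branches. Banding steps are faint but spatially structured,
    so they survive in the high-frequency residual of otherwise flat regions."""
    gray = gray.astype(np.float32)
    low = gaussian_filter(gray, sigma=sigma)    # smooth base layer
    high = gray - low                           # contours, edges, banding steps
    return low, high
```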
Recent advancements in real-time neural rendering using point-based techniques have paved the way for the widespread adoption of 3D representations. However, foundational approaches like 3D Gaussian Splatting come with a substantial storage overhead, as the SfM points grow into the millions, often demanding gigabyte-level disk space for a single unbounded scene, posing significant scalability challenges and hindering splatting efficiency. To address this challenge, we introduce LightGaussian, a novel method designed to transform 3D Gaussians into a more efficient and compact format. Drawing inspiration from the concept of network pruning, LightGaussian identifies Gaussians that contribute insignificantly to scene reconstruction and adopts a pruning and recovery process, effectively reducing redundancy in the Gaussian count while preserving visual effects. Additionally, LightGaussian employs distillation and pseudo-view augmentation to distill spherical harmonics to a lower degree, allowing knowledge transfer to more compact representations while maintaining reflectance. Furthermore, we propose a hybrid scheme, VecTree Quantization, to quantize all attributes, resulting in lower-bitwidth representations with minimal accuracy loss. In summary, LightGaussian achieves an average compression rate of over 15x while boosting FPS from 139 to 215, enabling an efficient representation of complex scenes on the Mip-NeRF 360 and Tanks and Temples datasets. Project website: this https URL
https://arxiv.org/abs/2311.17245
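A crude stand-in for the pruning step, assuming a significance score built from opacity and volume (the paper's global significance criterion also weighs each Gaussian's contribution across training views):

```python
import numpy as np

def prune_gaussians(opacity, scales, keep_ratio=0.4):
    """Keep the top `keep_ratio` Gaussians by a rough significance score.

    opacity: (n,) alpha per Gaussian; scales: (n, 3) per-axis extents.
    A recovery fine-tuning pass would normally follow the pruning.
    """
    score = opacity * np.prod(scales, axis=1)       # opacity x approx. volume
    k = int(len(score) * keep_ratio)
    keep = np.argsort(score)[::-1][:k]              # indices that survive
    return keep
```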
Deep neural networks (DNNs) are ubiquitous in computer vision and natural language processing, but suffer from high inference cost. This problem can be addressed by quantization, which consists in converting floating-point operations into a lower-bit-width format. With growing concerns over privacy rights, we focus our efforts on data-free methods. However, such techniques suffer from a lack of adaptability to the target device, as hardware typically only supports specific bit widths. Thus, to adapt to a variety of devices, a quantization method must be flexible enough to find good accuracy vs. speed trade-offs for every bit width and target device. To achieve this, we propose PIPE, a quantization method that leverages residual error expansion, along with group sparsity and an ensemble approximation for better parallelization. PIPE is backed by strong theoretical guarantees and achieves superior performance on every benchmarked application (from vision to NLP tasks), architecture (ConvNets, transformers), and bit width (from int8 to ternary quantization).
https://arxiv.org/abs/2311.15806
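The residual-error-expansion core is easy to sketch (group sparsity and the ensemble parallelization are left out; the uniform quantizer and expansion order are assumptions):

```python
import numpy as np

def uniform_quantize(w, n_bits):
    """Symmetric uniform quantization of a weight tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def residual_expansion(w, n_bits=4, order=2):
    """Expand w into a sum of quantized terms: the first term quantizes w,
    each later term quantizes the error the previous terms left behind.
    The terms are independent, so they can be evaluated in parallel."""
    terms, residual = [], w.copy()
    for _ in range(order):
        q = uniform_quantize(residual, n_bits)
        terms.append(q)
        residual = residual - q
    return terms        # sum(terms) approximates w
```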
The diffusion model, a prevalent framework for image generation, encounters significant challenges regarding broad applicability due to its extended inference times and substantial memory requirements. Efficient post-training quantization (PTQ) is pivotal for addressing these issues in traditional models. Unlike traditional models, diffusion models depend heavily on the time step $t$ to achieve satisfactory multi-round denoising. Usually, $t$ from the finite set $\{1, \ldots, T\}$ is encoded into a temporal feature by a few modules entirely independent of the sampled data. However, existing PTQ methods do not optimize these modules separately. They adopt inappropriate reconstruction targets and complex calibration methods, resulting in severe disturbance of the temporal feature and the denoising trajectory, as well as low compression efficiency. To solve these problems, we propose a Temporal Feature Maintenance Quantization (TFMQ) framework built upon a Temporal Information Block that depends only on the time step $t$ and not on the sampled data. Powered by this pioneering block design, we devise temporal information aware reconstruction (TIAR) and finite set calibration (FSC) to align the full-precision temporal features in a limited time. Equipped with the framework, we can retain the most temporal information and ensure end-to-end generation quality. Extensive experiments on various datasets and diffusion models prove our state-of-the-art results. Remarkably, our quantization approach, for the first time, achieves model performance nearly on par with the full-precision model under 4-bit weight quantization. Additionally, our method incurs almost no extra computational cost and accelerates quantization time by $2.0 \times$ on LSUN-Bedrooms $256 \times 256$ compared to previous works.
https://arxiv.org/abs/2311.16503
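The key observation, that the temporal feature depends only on $t \in \{1, \ldots, T\}$, makes calibration cheap because the whole input domain can be enumerated. A hedged sketch (the `time_embed` callable and the grid-searched clipping scheme are placeholders, not the paper's actual TIAR/FSC procedure):

```python
import numpy as np

def calibrate_time_embedding(time_embed, T, n_bits=8):
    """Choose a quantization scale for the time-embedding output by sweeping
    every time step in the finite set {1, ..., T} -- no image data needed."""
    feats = np.stack([time_embed(t) for t in range(1, T + 1)])
    qmax = 2 ** (n_bits - 1) - 1
    best_scale, best_err = None, np.inf
    for frac in np.linspace(0.5, 1.0, 20):          # candidate clipping ratios
        scale = np.abs(feats).max() * frac / qmax
        q = np.clip(np.round(feats / scale), -qmax - 1, qmax) * scale
        err = ((q - feats) ** 2).mean()
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale
```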
Increasing the model capacity is a known approach to enhance the adversarial robustness of deep learning networks. On the other hand, various model compression techniques, including pruning and quantization, can reduce the size of the network while preserving its accuracy. Several recent studies have addressed the relationship between model compression and adversarial robustness, while some experiments have reported contradictory results. This work summarizes available evidence and discusses possible explanations for the observed effects.
https://arxiv.org/abs/2311.15782
The focus of this study is on Unsupervised Continual Learning (UCL), as it presents an alternative to Supervised Continual Learning, which requires high-quality manually labeled data. Experiments under the UCL paradigm reveal a phenomenon where results on the first few tasks are suboptimal. This phenomenon can render the model inappropriate for practical applications. To address this issue, after analyzing the phenomenon and identifying lack of diversity as a vital factor, we propose a method named Codebook for Unsupervised Continual Learning (CUCL), which encourages the model to learn discriminative features that complete the class boundary. Specifically, we first introduce product quantization to inject diversity into the representation and apply a cross quantized contrastive loss between the original representation and the quantized one to capture discriminative information. Then, based on the quantizer, we propose an effective Codebook Rehearsal to address catastrophic forgetting. This study involves extensive experiments on the CIFAR100, TinyImageNet, and MiniImageNet benchmark datasets. Our method significantly boosts the performance of both supervised and unsupervised methods. For instance, on TinyImageNet, our method yields relative improvements of 12.76% and 7% over SimSiam and BYOL, respectively.
https://arxiv.org/abs/2311.14911
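Product quantization itself is standard; a sketch of how it produces the quantized counterpart for the contrastive objective (chunk count, codebook size, and the use of scikit-learn k-means are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def product_quantize(x, n_sub=4, k=256):
    """Split each d-dim representation into n_sub chunks and quantize every
    chunk with its own codebook: k ** n_sub effective codewords from only
    n_sub * k centroids. Returns the quantized views used as the second
    branch of the cross quantized contrastive loss."""
    n, d = x.shape
    sub = d // n_sub
    parts = []
    for i in range(n_sub):
        chunk = x[:, i * sub:(i + 1) * sub]
        km = KMeans(n_clusters=k, n_init=1).fit(chunk)
        parts.append(km.cluster_centers_[km.labels_])   # nearest codeword
    return np.hstack(parts)
```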
3D whole-body human mesh recovery aims to reconstruct the 3D human body, face, and hands from a single image. Although powerful deep learning models achieve accurate estimation on this task, they require enormous memory and computational resources. Consequently, these methods can hardly be deployed on resource-limited edge devices. In this work, we propose the Binarized Dual Residual Network (BiDRN), a novel quantization method for efficiently estimating 3D human body, face, and hand parameters. Specifically, we design a basic unit, the Binarized Dual Residual Block (BiDRB), composed of a Local Convolution Residual (LCR) and a Block Residual (BR), which preserves full-precision information as much as possible. For the LCR, we generalize it to four kinds of convolutional modules so that full-precision information can be propagated even between mismatched dimensions. We also binarize the face and hands box-prediction network as the Binarized BoxNet, which further reduces model redundancy. Comprehensive quantitative and qualitative experiments demonstrate the effectiveness of BiDRN, which improves significantly over state-of-the-art binarization algorithms. Moreover, our proposed BiDRN achieves performance comparable to the full-precision method Hand4Whole while using just 22.1% of the parameters and 14.8% of the operations. We will release all code and pretrained models.
https://arxiv.org/abs/2311.14323
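The binarized units presumably build on the usual sign-plus-scale scheme; a minimal PyTorch sketch of that base operation (the per-channel scaling and straight-through gradient are the standard recipe, while the paper's LCR/BR residual paths sit on top of it):

```python
import torch

def binarize_weights(w):
    """1-bit weights: sign(w) scaled by the mean absolute value of each
    output channel, trained with a straight-through gradient.

    Assumes w has its output-channel dimension first (conv/linear weights).
    """
    dims = tuple(range(1, w.dim()))                 # all but the out-channel dim
    alpha = w.abs().mean(dim=dims, keepdim=True)    # per-channel scale
    wb = torch.sign(w) * alpha
    return w + (wb - w).detach()    # forward: binary, backward: identity
```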
Neural Radiance Fields (NeRFs) have demonstrated remarkable potential in capturing complex 3D scenes with high fidelity. However, one persistent challenge that hinders the widespread adoption of NeRFs is the computational bottleneck due to volumetric rendering. On the other hand, 3D Gaussian splatting (3DGS) has recently emerged as an alternative representation that leverages a 3D Gaussian-based representation and adopts the rasterization pipeline to render images rather than volumetric rendering, achieving very fast rendering speed and promising image quality. However, a significant drawback arises as 3DGS entails a substantial number of 3D Gaussians to maintain the high fidelity of the rendered images, which requires a large amount of memory and storage. To address this critical issue, we place a specific emphasis on two key objectives: reducing the number of Gaussian points without sacrificing performance and compressing the Gaussian attributes, such as view-dependent color and covariance. To this end, we propose a learnable mask strategy that significantly reduces the number of Gaussians while preserving high performance. In addition, we propose a compact but effective representation of view-dependent color by employing a grid-based neural field rather than relying on spherical harmonics. Finally, we learn codebooks to compactly represent the geometric attributes of Gaussians by vector quantization. In our extensive experiments, we consistently show over 10$\times$ reduced storage and enhanced rendering speed, while maintaining the quality of the scene representation, compared to 3DGS. Our work provides a comprehensive framework for 3D scene representation, achieving high performance, fast training, compactness, and real-time rendering. Our project page is available at this https URL.
https://arxiv.org/abs/2311.13681
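A sketch of one plausible form of the learnable mask, using a sigmoid soft mask, an L1 sparsity penalty, and straight-through binarization (the paper's exact parameterization may differ):

```python
import torch

class GaussianMask(torch.nn.Module):
    """Per-Gaussian learnable mask. Training adds the returned sparsity term
    to the loss so masks are pushed toward zero; Gaussians whose hard mask
    is 0 can be physically removed after training."""
    def __init__(self, n_gaussians):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(n_gaussians))

    def forward(self, opacity):
        soft = torch.sigmoid(self.logits)
        hard = (soft > 0.5).float()
        mask = soft + (hard - soft).detach()       # straight-through
        return opacity * mask, soft.mean()         # masked opacity, sparsity term
```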
Traditionally, IoT edge devices have been perceived primarily as low-power components with limited capabilities for autonomous operations. Yet, with emerging advancements in embedded AI hardware design, a foundational shift paves the way for future possibilities. Thus, the aim of the KDT NEUROKIT2E project is to establish a new open-source framework to further facilitate AI applications on edge devices by developing new methods in quantization, pruning-aware training, and sparsification. These innovations hold the potential to expand the functional range of such devices considerably, enabling them to manage complex Machine Learning (ML) tasks utilizing local resources and laying the groundwork for innovative learning approaches. In the context of 6G's transformative potential, distributed learning among independent agents emerges as a pivotal application, attributed to 6G networks' support for ultra-reliable low-latency communication, enhanced data rates, and advanced edge computing capabilities. Our research focuses on the mechanisms and methodologies that allow edge network-enabled agents to engage in collaborative learning in distributed environments. Particularly, one of the key issues within distributed collaborative learning is determining the degree of confidence in the learning results, considering the spatio-temporal locality of data sets perceived by independent agents.
https://arxiv.org/abs/2311.13356
Parameter-efficient fine-tuning (PEFT) techniques make it possible to efficiently adapt a language model to create "expert" models that specialize to new tasks or domains. Recent techniques in model merging and compositional generalization leverage these expert models by dynamically composing modules to improve zero/few-shot generalization. Despite the efficiency of PEFT methods, the size of expert models can make it onerous to retrieve expert models per query over high-latency networks like the Internet or serve multiple experts on a single GPU. To address these issues, we present ComPEFT, a novel method for compressing fine-tuning residuals (task vectors) of PEFT-based models. ComPEFT employs sparsification and ternary quantization to reduce the size of the PEFT module without performing any additional retraining while preserving or enhancing model performance. In extensive evaluation across T5-, T0-, and LLaMA-based models with 200M - 65B parameters, ComPEFT achieves compression ratios of 8x - 50x. In particular, we show that ComPEFT improves with scale: stronger models exhibit higher compressibility and better performance. For example, we show that ComPEFT applied to LLaMA outperforms QLoRA by 4.16% on MMLU with a storage size reduction of up to 26x. In addition, we show that the compressed experts produced by ComPEFT maintain few-shot compositional generalization capabilities, facilitate efficient communication and computation, and exhibit enhanced performance when merged. Lastly, we provide an analysis of different method components, compare it with other PEFT methods, and test ComPEFT's efficacy for compressing the residual of full fine-tuning. Our code is available at this https URL.
https://arxiv.org/abs/2311.13171
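The compression step can be sketched as magnitude sparsification plus ternarization with a single rescaling scalar; a hedged PyTorch version (the exact rescaling rule in ComPEFT may differ):

```python
import torch

def compress_task_vector(w_ft, w_base, density=0.1):
    """Compress the fine-tuning residual w_ft - w_base: keep the top
    `density` fraction of entries by magnitude, reduce them to signs, and
    rescale with one scalar. Storage: a bitmask, signs, and one float."""
    tau = (w_ft - w_base).flatten()
    k = max(1, int(tau.numel() * density))
    thresh = tau.abs().kthvalue(tau.numel() - k + 1).values   # k-th largest |tau|
    mask = (tau.abs() >= thresh).float()
    gamma = (tau.abs() * mask).sum() / mask.sum()   # magnitude-preserving scale
    return (torch.sign(tau) * mask * gamma).view_as(w_ft)    # {-g, 0, +g}
```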
Post-Training Quantization (PTQ) is a powerful technique for model compression, reducing the precision of neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point quantization (FP8) in the context of PTQ for model inference. However, the exploration of floating-point formats smaller than 8 bits and their comparison with integer quantization remains relatively limited. In this work, we present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. Our work presents a novel PTQ design-space exploration, comparing minifloat and integer quantization schemes across a range of 3 to 8 bits for both weights and activations. We examine the applicability of various PTQ techniques to minifloats, including weight equalization, bias correction, SmoothQuant, gradient-based learned rounding, and the GPTQ method. Our experiments validate the effectiveness of low-precision minifloats when compared to their integer counterparts across a spectrum of accuracy-precision trade-offs on a set of reference deep learning vision workloads. Finally, we evaluate our results against an FPGA-based hardware cost model, showing that integer quantization often remains the Pareto-optimal option, given its relatively smaller hardware resource footprint.
https://arxiv.org/abs/2311.12359
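To make the format concrete, here is a toy minifloat rounding routine; the exponent bias, subnormal handling, and absence of inf/nan encodings are simplifying assumptions rather than the paper's exact definition:

```python
import numpy as np

def minifloat_quantize(x, exp_bits=2, man_bits=2):
    """Round x to the nearest value of a toy minifloat with exp_bits
    exponent bits and man_bits mantissa bits (1 sign bit implied)."""
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1
    e_min = 1 - bias                           # smallest normal exponent
    e_max = (2 ** exp_bits - 2) - bias         # top code reserved (simplification)
    mag = np.abs(x)
    # Exponent of each value, clipped to the representable range; values
    # below 2**e_min fall into a flat subnormal region with uniform steps.
    e = np.clip(np.floor(np.log2(np.maximum(mag, 2.0 ** e_min))), e_min, e_max)
    step = 2.0 ** (e - man_bits)               # mantissa ULP at that exponent
    q = np.round(mag / step) * step
    max_val = 2.0 ** e_max * (2.0 - 2.0 ** -man_bits)
    return np.sign(x) * np.minimum(q, max_val)
```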
We propose a simple approach for memory-efficient adaptation of pretrained language models. Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component. During finetuning, the quantized component remains fixed and only the low-rank component is updated. We present an integer linear programming formulation of the quantization component which enables dynamic configuration of quantization parameters (e.g., bit-width, block size) for each matrix given an overall target memory budget. We further explore a data-aware version of the algorithm which uses an approximation of the Fisher information matrix to weight the reconstruction objective during matrix decomposition. Experiments on adapting RoBERTa and LLaMA-2 (7B and 70B) demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines and moreover enables more aggressive quantization. For example, on the OpenAssistant benchmark LQ-LoRA is able to learn a 2.5-bit LLaMA-2 model that is competitive with a model finetuned with 4-bit QLoRA. When finetuned on a language modeling calibration dataset, LQ-LoRA can also be used for model compression; in this setting our 2.75-bit LLaMA-2-70B model (which has 2.85 bits on average when including the low-rank components and requires 27GB of GPU memory) is competitive with the original model in full precision.
https://arxiv.org/abs/2311.12023
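The iterative decomposition is straightforward to sketch in plain numpy; here the quantizer is simple uniform rounding, whereas the paper uses NF-style quantization with an ILP-chosen configuration per matrix:

```python
import numpy as np

def lq_decompose(w, rank=64, n_bits=3, iters=10):
    """Alternating fit of w ~ Q + L1 @ L2: the low-rank factors absorb what
    the quantizer cannot represent, and vice versa."""
    def quantize(a):
        qmax = 2 ** (n_bits - 1) - 1
        scale = np.abs(a).max() / qmax
        return np.clip(np.round(a / scale), -qmax - 1, qmax) * scale

    q = np.zeros_like(w)
    for _ in range(iters):
        u, s, vt = np.linalg.svd(w - q, full_matrices=False)
        l1 = u[:, :rank] * s[:rank]        # (m, r): left low-rank factor
        l2 = vt[:rank]                     # (r, n): right low-rank factor
        q = quantize(w - l1 @ l2)          # quantize what low-rank misses
    return q, l1, l2
```

During finetuning, `q` would stay frozen and only `l1`/`l2` receive gradients, matching the memory-efficient adaptation described above.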
From classical HPC to deep learning, MatMul is at the heart of today's computing. The recent Maddness method approximates MatMul without the need for multiplication by using a hash-based version of product quantization (PQ) indexing into a look-up table (LUT). Stella Nera is the first Maddness accelerator and it achieves 15x higher area efficiency (GMAC/s/mm^2) and more than 25x higher energy efficiency (TMAC/s/W) than direct MatMul accelerators implemented in the same technology. The hash function is a decision tree, which allows for an efficient hardware implementation as the multiply-accumulate operations are replaced by decision tree passes and LUT lookups. The entire Maddness MatMul can be broken down into parts that allow an effective implementation with small computing units and memories, allowing it to reach extreme efficiency while remaining generically applicable for MatMul tasks. In a commercial 14nm technology and scaled to 3nm, we achieve an energy efficiency of 161 TOp/s/W@0.55V with a Top-1 accuracy on CIFAR-10 of more than 92.5% using ResNet9.
https://arxiv.org/abs/2311.10207
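To see why MatMul survives without multipliers, consider plain PQ-based approximate matrix multiplication, of which Maddness is a fast, hash-encoded variant; a sketch (subspace count and codebook size are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_matmul(A, B, n_sub=4, k=16):
    """Approximate A @ B. Each row chunk of A is replaced by its nearest
    codeword id; every codeword's dot products with B are precomputed once
    into a LUT, so the inner loop is lookups and adds, not multiplies.
    Maddness swaps the k-means encoder for decision-tree hashing."""
    n, d = A.shape
    sub = d // n_sub
    out = np.zeros((n, B.shape[1]))
    for i in range(n_sub):
        As = A[:, i * sub:(i + 1) * sub]           # (n, sub) query chunks
        Bs = B[i * sub:(i + 1) * sub, :]           # (sub, M) matching rows of B
        km = KMeans(n_clusters=k, n_init=1).fit(As)
        lut = km.cluster_centers_ @ Bs             # (k, M) precomputed products
        out += lut[km.labels_]                     # gather, then accumulate
    return out
```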
Despite the scalable performance of vision transformers (ViTs), their dense computational costs (training and inference) undermine their position in industrial applications. Post-training quantization (PTQ), which tunes ViTs with a tiny dataset and runs them in a low-bit format, addresses the cost issue well but unfortunately suffers larger performance drops in lower-bit cases. In this paper, we introduce I&S-ViT, a novel method that regulates the PTQ of ViTs in an inclusive and stable fashion. I&S-ViT first identifies two issues in the PTQ of ViTs: (1) quantization inefficiency of the prevalent log2 quantizer for post-Softmax activations; (2) a rugged and magnified loss landscape under the coarse-grained quantization granularity for post-LayerNorm activations. I&S-ViT then addresses these issues by introducing: (1) a novel shift-uniform-log2 quantizer (SULQ) that incorporates a shift mechanism followed by uniform quantization to achieve both an inclusive domain representation and accurate distribution approximation; (2) a three-stage smooth optimization strategy (SOS) that amalgamates the strengths of channel-wise and layer-wise quantization to enable stable learning. Comprehensive evaluations across diverse vision tasks validate I&S-ViT's superiority over existing PTQ methods for ViTs, particularly in low-bit scenarios. For instance, I&S-ViT elevates the performance of 3-bit ViT-B by an impressive 50.68%.
https://arxiv.org/abs/2311.10126
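The abstract does not spell out SULQ's formula; one plausible reading, offered only as a guess, is uniform quantization in a shifted log2 domain, which matches the long-tailed post-Softmax distribution:

```python
import numpy as np

def shift_uniform_log2(x, n_bits=4, shift=0.03):
    """Speculative SULQ-style quantizer for post-Softmax values in [0, 1]:
    shift away from zero, quantize uniformly in -log2 space, invert."""
    levels = 2 ** n_bits - 1
    y = np.clip(-np.log2(x + shift), 0.0, None)    # long tail -> near-uniform
    y_max = -np.log2(shift)                        # x = 0 maps here
    q = np.round(np.minimum(y, y_max) / y_max * levels) / levels * y_max
    return np.clip(2.0 ** (-q) - shift, 0.0, 1.0)
```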
Pruning and quantization form the foundation of model compression for neural networks, enabling efficient inference for large language models (LLMs). Recently, various quantization and pruning techniques have demonstrated state-of-the-art performance in a post-training setting. They rely upon calibration data, a small set of unlabeled examples, to generate layer activations. However, no prior work has systematically investigated how the calibration data impacts the effectiveness of model compression methods. In this paper, we present the first extensive empirical study on the effect of calibration data upon LLM performance. We trial a variety of pruning and quantization methods, tasks, models, and datasets. Surprisingly, we find substantial variations in downstream task performance, contrasting existing work that suggests a greater level of robustness to the calibration data. Finally, we make a series of recommendations for the effective use of calibration data in LLM quantization and pruning.
https://arxiv.org/abs/2311.09755
The era of large language models calls for faster and less costly inference. Prior model compression work on LLMs tends to take a software-centric approach primarily focused on simulated quantization performance. By neglecting the feasibility of deployment, these approaches are typically unusable in real practice. They either drastically push down the quantization bit range to reduce computation, which mainstream hardware may not support, or involve sophisticated algorithms that introduce extra computation or memory-access overhead. We argue that pursuing a hardware-centric approach in the construction of quantization algorithms is crucial. In this regard, we build our compression method on top of hardware awareness, eliminating impractical algorithm choices while maximizing the benefit of hardware acceleration. Our method, OdysseyLLM, comes with a novel W4A8 kernel implementation called FastGEMM and a combined recipe of quantization strategies. Extensive experiments demonstrate the superiority of our W4A8 method, which boosts actual speed by up to \textbf{4$\times$} compared to Hugging Face FP16 inference, \textbf{2.23$\times$} vs. the state-of-the-art inference engine TensorRT-LLM in FP16, and \textbf{1.45$\times$} vs. TensorRT-LLM in INT8, without substantially harming performance.
https://arxiv.org/abs/2311.09550
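As background for what a W4A8 kernel consumes, here is a minimal int4 weight pack/unpack pair in numpy (the layout and nibble order are assumptions; FastGEMM's actual format is not described in the abstract):

```python
import numpy as np

def pack_int4(w_q):
    """Pack signed int4 values (range [-8, 7], even count) two per byte."""
    u = (w_q.astype(np.int8) & 0x0F).astype(np.uint8)   # two's-complement nibbles
    return u[0::2] | (u[1::2] << 4)

def unpack_int4(packed):
    """Recover the int8 view a W4A8 GEMM multiplies with int8 activations,
    before per-channel dequantization scales are applied."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo)      # sign-extend low nibble
    hi = np.where(hi > 7, hi - 16, hi)      # sign-extend high nibble
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out
```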