Human beings construct their perception of space by integrating sparse observations into massively interconnected synapses and neurons, offering superior parallelism and efficiency. Replicating this capability in AI finds wide applications in medical imaging, AR/VR, and embodied AI, where input data are often sparse and computing resources are limited. However, traditional signal reconstruction methods on digital computers face both software and hardware challenges. On the software front, difficulties arise from storage inefficiencies in conventional explicit signal representations. Hardware obstacles include the von Neumann bottleneck, which limits data transfer between the CPU and memory, and the limitations of CMOS circuits in supporting parallel processing. We propose a systematic approach with software-hardware co-optimizations for signal reconstruction from sparse inputs. Software-wise, we employ a neural field to implicitly represent signals via a neural network, which is further compressed using low-rank decomposition and structured pruning. Hardware-wise, we design a resistive memory-based computing-in-memory (CIM) platform, featuring a Gaussian Encoder (GE) and an MLP Processing Engine (PE). The GE harnesses the intrinsic stochasticity of resistive memory for efficient input encoding, while the PE achieves precise weight mapping through a Hardware-Aware Quantization (HAQ) circuit. We demonstrate the system's efficacy on a 40nm 256Kb resistive memory-based in-memory computing macro, achieving large improvements in energy efficiency and parallelism without compromising reconstruction quality in tasks such as 3D CT sparse reconstruction, novel view synthesis, and novel view synthesis for dynamic scenes. This work advances AI-driven signal restoration technology and paves the way for future efficient and robust medical AI and 3D vision applications.
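As a rough illustration of the software side, below is a minimal sketch of a Gaussian random-feature encoder feeding a small MLP-based neural field; the random projection matrix stands in for the role the abstract assigns to the resistive memory's intrinsic stochasticity, and all names, sizes, and scales are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_encode(x, B):
    """x: (N, d) input coordinates, B: (d, m) Gaussian projection matrix."""
    proj = 2.0 * np.pi * x @ B
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)  # (N, 2m) features

def mlp(z, weights):
    for W, b in weights[:-1]:
        z = np.maximum(z @ W + b, 0.0)            # ReLU hidden layers
    W, b = weights[-1]
    return z @ W + b                              # linear output: the reconstructed signal value

d, m, hidden = 3, 64, 128
B = rng.normal(scale=10.0, size=(d, m))           # stochastic projection (hardware noise source in the paper)
weights = [
    (rng.normal(scale=0.1, size=(2 * m, hidden)), np.zeros(hidden)),
    (rng.normal(scale=0.1, size=(hidden, hidden)), np.zeros(hidden)),
    (rng.normal(scale=0.1, size=(hidden, 1)), np.zeros(1)),
]
coords = rng.uniform(size=(4, d))                 # sparse query coordinates
print(mlp(gaussian_encode(coords, B), weights).shape)  # (4, 1)
```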
https://arxiv.org/abs/2404.09613
Diffusion models have emerged as preeminent contenders in the realm of generative models. Distinguished by their sequential generative process, characterized by hundreds or even thousands of timesteps, diffusion models progressively reconstruct images from pure Gaussian noise, with each timestep necessitating full inference of the entire model. However, the substantial computational demands inherent to these models present challenges for deployment; quantization is thus widely used to lower the bit-width and thereby reduce storage and computation overheads. Current quantization methodologies primarily focus on model-side optimization, disregarding the temporal dimension, such as the length of the timestep sequence, thereby allowing redundant timesteps to continue consuming computational resources and leaving substantial scope for accelerating the generative process. In this paper, we introduce TMPQ-DM, which jointly optimizes timestep reduction and quantization to achieve a superior performance-efficiency trade-off, addressing both temporal and model optimization aspects. For timestep reduction, we devise a non-uniform grouping scheme tailored to the non-uniform nature of the denoising process, thereby mitigating the combinatorial explosion of timestep choices. In terms of quantization, we adopt a fine-grained layer-wise approach to allocate varying bit-widths to different layers based on their respective contributions to the final generative performance, thus rectifying performance degradation observed in prior studies. To expedite the evaluation of fine-grained quantization, we further devise a super-network to serve as a precision solver by leveraging shared quantization results. These two design components are seamlessly integrated within our framework, enabling rapid joint exploration of the exponentially large decision space via a gradient-free evolutionary search algorithm.
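To make the temporal side concrete, below is a minimal sketch of one way to group timesteps non-uniformly and keep a single candidate per group, which is the kind of reduction that tames the combinatorial search space; the geometric group sizes and median representatives are illustrative assumptions, not TMPQ-DM's actual grouping rule.

```python
import numpy as np

def group_timesteps(T=1000, G=10, growth=1.6):
    """Partition T timesteps into G groups whose sizes grow geometrically."""
    sizes = np.array([growth ** g for g in range(G)])
    sizes = np.round(sizes / sizes.sum() * T).astype(int)
    sizes[-1] += T - sizes.sum()                   # make the sizes sum to exactly T
    bounds = np.concatenate([[0], np.cumsum(sizes)])
    groups = [np.arange(bounds[i], bounds[i + 1]) for i in range(G)]
    reps = [int(g[len(g) // 2]) for g in groups]   # one representative timestep per group
    return groups, reps

groups, reps = group_timesteps()
print([len(g) for g in groups])   # fine-grained groups at one end, coarse at the other
print(reps)                       # 10 candidate timesteps instead of 1000
```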
https://arxiv.org/abs/2404.09532
Recent trends have shown that autonomous agents, such as Autonomous Ground Vehicles (AGVs), Unmanned Aerial Vehicles (UAVs), and mobile robots, effectively improve human productivity in solving diverse tasks. However, since these agents are typically powered by portable batteries, they require extremely low power/energy consumption to operate over a long lifespan. To solve this challenge, neuromorphic computing has emerged as a promising solution, where bio-inspired Spiking Neural Networks (SNNs) use spikes from event-based cameras or data conversion pre-processing to perform sparse computations efficiently. However, studies of SNN deployment for autonomous agents are still at an early stage. Hence, the optimization stages for enabling efficient embodied SNN deployments for autonomous agents have not been defined systematically. Toward this, we propose a novel framework called SNN4Agents that consists of a set of optimization techniques for designing energy-efficient embodied SNNs targeting autonomous agent applications. Our SNN4Agents employs weight quantization, timestep reduction, and attention window reduction to jointly improve energy efficiency, reduce the memory footprint, and optimize processing latency, while maintaining high accuracy. In the evaluation, we investigate use cases of event-based car recognition, and explore the trade-offs among accuracy, latency, memory, and energy consumption. The experimental results show that our proposed framework can maintain high accuracy (i.e., 84.12% accuracy) with 68.75% memory saving, 3.58x speed-up, and 4.03x energy efficiency improvement compared to the state-of-the-art work on the NCARS dataset, thereby enabling energy-efficient embodied SNN deployments for autonomous agents.
https://arxiv.org/abs/2404.09331
Incorporating diffusion models in the image compression domain has the potential to produce realistic and detailed reconstructions, especially at extremely low bitrates. Previous methods focus on using diffusion models as expressive decoders robust to quantization errors in the conditioning signals, yet achieving competitive results in this manner requires costly training of the diffusion model and long inference times due to the iterative generative process. In this work, we formulate the removal of quantization error as a denoising task, using diffusion to recover lost information in the transmitted image latent. Our approach allows us to perform less than 10% of the full diffusion generative process and requires no architectural changes to the diffusion model, enabling the use of foundation models as a strong prior without additional fine-tuning of the backbone. Our proposed codec outperforms previous methods in quantitative realism metrics, and we verify that our reconstructions are qualitatively preferred by end users, even when other methods use twice the bitrate.
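As a rough illustration of why only a small tail of the reverse process is needed, the sketch below matches the variance of a uniform quantizer's error to a point on a standard linear-beta diffusion schedule; the quantizer, schedule, and matching rule are illustrative assumptions, not the codec described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)               # cumulative signal retention per timestep

z = rng.normal(size=(64, 16))                      # transmitted image latent (unit variance)
step = 0.5
z_hat = np.round(z / step) * step                  # uniform quantization: the "lost information"

err_var = np.var(z_hat - z)                        # quantization error treated as additive noise
noise_var = 1.0 - alphas_bar                       # injected-noise variance at each timestep
t_start = int(np.argmin(np.abs(noise_var - err_var)))
print(t_start, f"≈ {100 * t_start / T:.1f}% of the full reverse process")
```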
https://arxiv.org/abs/2404.08580
This work examines whether decoder-only Transformers such as LLaMA, which were originally designed for large language models (LLMs), can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention brings an attention collapse issue, resulting in the failure of network training. We suggest repositioning the class token behind the image tokens with a post-sequence class token technique to overcome this challenge, enabling causal self-attention to efficiently capture the entire image's information. Additionally, we develop a soft mask strategy that gradually introduces a causal mask to the self-attention at the onset of training to facilitate the optimization behavior. The tailored model, dubbed image LLaMA (iLLaMA), is akin to LLaMA in architecture and enables direct supervised learning. Its causal self-attention boosts computational efficiency and learns complex representations by elevating attention map ranks. iLLaMA rivals the performance of its encoder-only counterparts, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M parameters and pre-training on ImageNet-21K further enhances the accuracy to 86.0%. Extensive experiments demonstrate iLLaMA's reliable properties: calibration, shape-texture bias, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. We hope our study can kindle fresh views on visual model design in the wave of LLMs. Pre-trained models and code are available here.
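Below is a minimal sketch of a "soft mask" schedule: the causal mask is blended in gradually during early training, so self-attention transitions smoothly from bidirectional to causal. The linear ramp and penalty strength are illustrative assumptions, not iLLaMA's exact schedule.

```python
import numpy as np

def soft_causal_bias(seq_len, progress, max_penalty=30.0):
    """progress in [0, 1]: 0 = no mask (bidirectional), 1 = effectively causal."""
    future = np.triu(np.ones((seq_len, seq_len)), k=1)    # 1 where a token would peek ahead
    return -max_penalty * progress * future               # added to the attention logits

def attention(q, k, v, bias):
    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
L, D = 6, 8
q, k, v = rng.normal(size=(L, D)), rng.normal(size=(L, D)), rng.normal(size=(L, D))
out_early = attention(q, k, v, soft_causal_bias(L, progress=0.1))   # nearly bidirectional
out_late = attention(q, k, v, soft_causal_bias(L, progress=1.0))    # effectively causal
```

Under this scheme, the post-sequence class token would simply be appended after the image tokens, so that even a fully causal mask lets it attend to every image token.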
https://arxiv.org/abs/2404.06773
Fast-growing large-scale language models are delivering unprecedented performance on almost all natural language processing tasks. However, the effectiveness of large language models is reliant on an exponentially increasing number of parameters. The overwhelming computational complexity incurs a high inference latency that negatively affects user experience. Existing methods to improve inference efficiency, such as tensor parallelism and quantization, aim to reduce per-layer computing latency, yet overlook the cumulative latency due to the number of layers. Recent works that reduce the cumulative latency by removing layers, however, lead to significant performance drops. Motivated by the similarity of inputs among adjacent layers, we propose to identify quasi-independent layers, which can be computed concurrently to significantly decrease inference latency. We also introduce a bypassing technique to mitigate the effect of information loss. Empirical experiments with the proposed approach on the LLaMA models confirm that Concurrent Computation of Quasi-Independent Layers (CQIL) can reduce latency by up to 48.3% on the LLaMA-33B model, while maintaining a close level of performance.
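A toy numerical illustration of the core observation follows: when adjacent residual layers see nearly identical inputs, feeding both of them the earlier input and summing their updates changes the output only slightly, which is what allows them to be computed concurrently. The layer definition and scales below are illustrative, and the paper's bypassing technique is omitted.

```python
import numpy as np

def layer(x, W):
    return np.maximum(x @ W, 0.0)                  # toy residual branch

def sequential(x, W1, W2):
    x = x + layer(x, W1)                           # layer i
    x = x + layer(x, W2)                           # layer i+1 sees layer i's output
    return x

def quasi_parallel(x, W1, W2):
    return x + layer(x, W1) + layer(x, W2)         # both branches see the same input

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 16))
W1 = rng.normal(scale=0.05, size=(16, 16))
W2 = rng.normal(scale=0.05, size=(16, 16))
print(np.abs(sequential(x, W1, W2) - quasi_parallel(x, W1, W2)).max())  # small when layer updates are small
```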
https://arxiv.org/abs/2404.06709
In this paper, a cloud radio access network (Cloud-RAN) based collaborative edge AI inference architecture is proposed. Specifically, geographically distributed devices capture real-time, noise-corrupted sensory data samples and extract noisy local feature vectors, which are then aggregated at each remote radio head (RRH) to suppress sensing noise. To realize efficient uplink feature aggregation, we allow each RRH to receive the local feature vectors from all devices over the same resource blocks simultaneously by leveraging an over-the-air computation (AirComp) technique. Thereafter, these aggregated feature vectors are quantized and transmitted to a central processor (CP) for further aggregation and downstream inference tasks. Our aim in this work is to maximize the inference accuracy via a surrogate accuracy metric called discriminant gain, which measures the discernibility of different classes in the feature space. The key challenges lie in simultaneously suppressing the coupled sensing noise, the AirComp distortion caused by hostile wireless channels, and the quantization error resulting from the limited capacity of fronthaul links. To address these challenges, this work proposes a joint transmit precoding, receive beamforming, and quantization error control scheme to enhance the inference accuracy. Extensive numerical experiments demonstrate the effectiveness and superiority of our proposed optimization algorithm compared to various baselines.
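For readers unfamiliar with the surrogate metric, a commonly used form of the pairwise discriminant gain in related edge-inference work is the Mahalanobis-type separation below; the notation ($\boldsymbol{\mu}_l$, $\boldsymbol{\Sigma}$) is assumed here, and the exact weighting adopted in this paper should be checked against the source:

$$ G_{l,l'} = \left(\boldsymbol{\mu}_l - \boldsymbol{\mu}_{l'}\right)^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} \left(\boldsymbol{\mu}_l - \boldsymbol{\mu}_{l'}\right), $$

where $\boldsymbol{\mu}_l$ and $\boldsymbol{\mu}_{l'}$ are the centroids of classes $l$ and $l'$ in the feature space and $\boldsymbol{\Sigma}$ is their common covariance; a larger gain means the two classes remain easier to separate after sensing noise, AirComp distortion, and quantization error have been injected.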
https://arxiv.org/abs/2404.06007
Among approximate nearest neighbor search (ANNS) methods based on proximity graphs, DiskANN achieves a good recall-speed balance for large-scale datasets by using both RAM and storage. Although it claims to save memory by loading vectors compressed with product quantization (PQ), its memory usage still grows in proportion to the scale of the dataset. In this paper, we propose All-in-Storage ANNS with Product Quantization (AiSAQ), which offloads the compressed vectors to storage. Our method achieves $\sim$10 MB of memory usage during query search even with billion-scale datasets, with only minor performance degradation. AiSAQ also reduces the index load time before query search, which enables switching the index between multiple billion-scale datasets and significantly enhances the flexibility of retrieval-augmented generation (RAG). This method is applicable to all graph-based ANNS algorithms and can be combined with higher-spec ANNS methods in the future.
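A minimal sketch of the storage-offloading idea is shown below: the PQ codes live in a memory-mapped file and only the codes of nodes actually visited during graph traversal are read, so the resident set stays small. The file layout, code size, and distance lookup table are illustrative assumptions rather than AiSAQ's on-disk format.

```python
import numpy as np

N, M = 100_000, 16                                 # vectors, subquantizers (1 byte per code)
codes = np.random.randint(0, 256, size=(N, M), dtype=np.uint8)
codes.tofile("pq_codes.bin")                       # PQ codes live on storage, not in RAM

on_disk = np.memmap("pq_codes.bin", dtype=np.uint8, mode="r", shape=(N, M))

def approx_distance(node_id, lut):
    """lut: (M, 256) table of per-subspace distances precomputed for the current query."""
    c = on_disk[node_id]                           # one small read per visited graph node
    return lut[np.arange(M), c].sum()

lut = np.random.rand(M, 256).astype(np.float32)
print(approx_distance(42, lut))
```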
https://arxiv.org/abs/2404.06004
ML is shifting from the cloud to the edge. Edge computing reduces the surface exposing private data and enables reliable throughput guarantees in real-time applications. Of the panoply of devices deployed at the edge, resource-constrained MCUs, e.g., Arm Cortex-M, are more prevalent, orders of magnitude cheaper, and less power-hungry than application processors or GPUs. Thus, enabling intelligence at the deep edge is the zeitgeist, with researchers focusing on unveiling novel approaches to deploy ANNs on these constrained devices. Quantization is a well-established technique that has proved effective in enabling the deployment of neural networks on MCUs; however, the robustness of QNNs in the face of adversarial examples remains an open question. To fill this gap, we empirically evaluate the effectiveness of attacks and defenses from (full-precision) ANNs on (constrained) QNNs. Our evaluation includes three QNNs targeting TinyML applications, ten attacks, and six defenses. With this study, we draw a set of interesting findings. First, quantization increases the point distance to the decision boundary and leads the gradient estimated by some attacks to explode or vanish. Second, quantization can act as a noise attenuator or amplifier, depending on the noise magnitude, and causes gradient misalignment. Regarding adversarial defenses, we conclude that input pre-processing defenses show impressive results on small perturbations; however, they fall short as the perturbation increases. At the same time, train-based defenses increase the average point distance to the decision boundary, which holds after quantization. However, we argue that train-based defenses still need to smooth the quantization-shift and gradient misalignment phenomena to counteract the transferability of adversarial examples to QNNs. All artifacts are open-sourced to enable independent validation of results.
https://arxiv.org/abs/2404.05688
With the advancement of diffusion models (DMs) and their substantially increased computational requirements, quantization emerges as a practical solution for obtaining compact and efficient low-bit DMs. However, the highly discrete representation leads to severe accuracy degradation, hindering the quantization of diffusion models to ultra-low bit-widths. In this paper, we propose BinaryDM, a novel accurate quantization-aware training approach to push the weights of diffusion models towards the limit of 1 bit. First, we present a Learnable Multi-basis Binarizer (LMB) to recover the representations generated by the binarized DM, which improves the fine-grained detail in the representations that is crucial to the DM. Second, Low-rank Representation Mimicking (LRM) is applied to enhance the binarization-aware optimization of the DM, alleviating the optimization direction ambiguity caused by fine-grained alignment. Moreover, a progressive initialization strategy is applied when training DMs to avoid convergence difficulties. Comprehensive experiments demonstrate that BinaryDM achieves significant accuracy and efficiency gains compared to SOTA quantization methods for DMs under ultra-low bit-widths. As the first binarization method for diffusion models, BinaryDM achieves impressive savings of 16.0x in FLOPs and 27.1x in storage with 1-bit weights and 4-bit activations, showcasing its substantial advantages and potential for deploying DMs in resource-limited scenarios.
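As background for the binarizer, here is a minimal sketch of multi-basis binarization, where a weight tensor is approximated by a sum of K scaled sign tensors; the closed-form scales below stand in for the learnable scales of the paper's LMB, so this illustrates the representation rather than BinaryDM's training procedure.

```python
import numpy as np

def multi_basis_binarize(w, K=2):
    """Approximate w with a sum of K binary bases, each with its own scale."""
    residual, out, bases = w.copy(), np.zeros_like(w), []
    for _ in range(K):
        b = np.sign(residual)
        b[b == 0] = 1.0
        alpha = np.abs(residual).mean()            # closed-form scale for this basis
        out += alpha * b
        bases.append((alpha, b))
        residual = w - out                         # next basis fits what is still missing
    return out, bases

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
w1, _ = multi_basis_binarize(w, K=1)
w2, _ = multi_basis_binarize(w, K=2)
print(np.linalg.norm(w - w1), np.linalg.norm(w - w2))  # the second basis tightens the fit
```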
https://arxiv.org/abs/2404.05662
Quantization is a promising technique for reducing the bit-width of deep models to improve their runtime performance and storage efficiency, and has thus become a fundamental step for deployment. In real-world scenarios, quantized models are often faced with adversarial attacks, which cause the model to make incorrect inferences by introducing slight perturbations. However, recent studies have paid little attention to the impact of quantization on model robustness. More surprisingly, existing studies on this topic even present inconsistent conclusions, which prompted our in-depth investigation. In this paper, we conduct a first analysis of the quantization pipeline components that can incorporate robust optimization, under both Post-Training Quantization and Quantization-Aware Training settings. Through our detailed analysis, we discovered that this inconsistency arises from the use of different pipelines in different studies, specifically regarding whether robust optimization is performed and at which quantization stage it occurs. Our findings contribute insights into deploying more secure and robust quantized networks and offer practitioners a reference for scenarios with high security requirements and limited resources.
https://arxiv.org/abs/2404.05639
Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies that evaluate a model's capability via loss or benchmarks, we estimate the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. Through multiple controlled datasets, we establish that language models can, and only can, store 2 bits of knowledge per parameter, even when quantized to int8, and that such knowledge can be flexibly extracted for downstream applications. Consequently, a 7B model can store 14B bits of knowledge, surpassing the English Wikipedia and textbooks combined based on our estimation. More broadly, we present 12 results on how (1) training duration, (2) model architecture, (3) quantization, (4) sparsity constraints such as MoE, and (5) data signal-to-noise ratio affect a model's knowledge storage capacity. Notable insights include: * The GPT-2 architecture, with rotary embedding, matches or even surpasses LLaMA/Mistral architectures in knowledge storage, particularly over shorter training durations. This arises because LLaMA/Mistral uses GatedMLP, which is less stable and harder to train. * Prepending training data with domain names (e.g., this http URL) significantly increases a model's knowledge capacity: language models can autonomously identify and prioritize domains rich in knowledge, optimizing their storage capacity.
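To make the headline number concrete, here is the arithmetic behind the 7B-model claim (a back-of-the-envelope check, not additional data from the paper):

```python
params = 7e9                       # parameters in a 7B model
bits_per_param = 2                 # measured knowledge capacity per parameter
total_bits = params * bits_per_param
print(total_bits)                  # 1.4e10 = 14 billion bits of factual knowledge
print(total_bits / 8 / 1e9, "GB")  # the same capacity expressed in gigabytes (~1.75 GB)
```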
https://arxiv.org/abs/2404.05405
Model merging is a promising lightweight model empowerment technique that does not rely on expensive computing devices (e.g., GPUs) or require the collection of specific training data. Instead, it involves editing different upstream model parameters to absorb their downstream task capabilities. However, uncertified model merging can infringe upon the Intellectual Property (IP) rights of the original upstream models. In this paper, we conduct the first study on the robustness of IP protection methods in model merging scenarios. We investigate two state-of-the-art IP protection techniques: Quantization Watermarking and Instructional Fingerprint, along with various advanced model merging technologies, such as Task Arithmetic, TIES-MERGING, and so on. Experimental results indicate that current Large Language Model (LLM) watermarking techniques cannot survive in the merged models, whereas model fingerprinting techniques can. Our research aims to highlight that model merging should be an indispensable consideration in the robustness assessment of model IP protection techniques, thereby promoting the healthy development of the open-source LLM community.
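For context on one of the merging methods named above, a minimal sketch of Task Arithmetic-style merging is given below (toy shapes and an illustrative scaling coefficient): each fine-tuned model contributes a task vector, its parameters minus the base model's, and the merged model adds a scaled sum of these vectors back onto the base.

```python
import numpy as np

def merge_task_arithmetic(base, finetuned_models, lam=0.5):
    """Merged parameters = base + lambda * sum of task vectors."""
    task_vectors = [ft - base for ft in finetuned_models]
    return base + lam * sum(task_vectors)

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 8))                             # toy stand-in for one weight tensor
experts = [base + rng.normal(scale=0.01, size=(8, 8)) for _ in range(3)]
merged = merge_task_arithmetic(base, experts)
```

One intuition the sketch offers for the paper's finding is that a watermark carried in a single upstream model's weights is scaled and averaged away in the sum, whereas a fingerprint expressed as learned behavior may survive.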
https://arxiv.org/abs/2404.05188
Deep quantization methods have shown high efficiency in large-scale image retrieval. However, current models heavily rely on ground-truth information, hindering the application of quantization in label-hungry scenarios. A more realistic demand is to learn from the inexhaustible supply of uploaded images that are associated with informal tags provided by amateur users. Though such sketchy tags do not obviously reveal the labels, they actually contain useful semantic information for supervising deep quantization. To this end, we propose Weakly-Supervised Deep Hyperspherical Quantization (WSDHQ), which is the first work to learn deep quantization from weakly tagged images. Specifically, 1) we use word embeddings to represent the tags and enhance their semantic information based on a tag correlation graph; 2) to better preserve semantic information in the quantization codes and reduce quantization error, we jointly learn semantics-preserving embeddings and a supervised quantizer on the hypersphere by employing a well-designed fusion layer and tailor-made loss functions. Extensive experiments show that WSDHQ achieves state-of-the-art performance in weakly-supervised compact coding. Code is available at this https URL.
https://arxiv.org/abs/2404.04998
We introduce Gull, a generative multifunctional audio codec. Gull is a general-purpose neural audio compression and decompression model that can be applied to a wide range of tasks and applications, such as real-time communication, audio super-resolution, and codec language models. The key components of Gull include (1) universal-sample-rate modeling via subband modeling schemes motivated by recent progress in audio source separation, (2) gain-shape representations motivated by traditional audio codecs, (3) improved residual vector quantization modules for simpler training, (4) an elastic decoder network that enables user-defined model size and complexity at inference time, and (5) a built-in ability for audio super-resolution without an increase in bitrate. We compare Gull with existing traditional and neural audio codecs and show that Gull achieves on-par or better performance across various sample rates, bitrates, and model complexities in both subjective and objective evaluation metrics.
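Since the abstract highlights improved residual vector quantization (RVQ) modules, a minimal sketch of plain RVQ is given below for reference: each stage quantizes the residual left by the previous stage with its own codebook, so bitrate scales with the number of stages used. Codebook sizes and dimensions are illustrative, and Gull's improvements to the module are not reproduced here.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """x: (N, dim) latent vectors; codebooks: list of (num_codes, dim) arrays."""
    residual, indices = x.copy(), []
    for cb in codebooks:
        ids = np.argmin(((residual[:, None, :] - cb[None]) ** 2).sum(-1), axis=1)
        indices.append(ids)
        residual = residual - cb[ids]              # next stage sees only what is left
    return indices

def rvq_decode(indices, codebooks):
    return sum(cb[ids] for ids, cb in zip(indices, codebooks))

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))                       # frame-level latent vectors
codebooks = [rng.normal(size=(256, 8)) for _ in range(4)]  # 4 stages => 4 bytes per vector
x_hat = rvq_decode(rvq_encode(x, codebooks), codebooks)
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))        # reconstruction error after 4 stages
```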
https://arxiv.org/abs/2404.04947
Compression techniques have been crucial in advancing machine learning by enabling efficient training and deployment of large-scale language models. However, these techniques have received limited attention in the context of low-resource language models, which are trained on even smaller amounts of data and under computational constraints, a scenario known as the "low-resource double-bind." This paper investigates the effectiveness of pruning, knowledge distillation, and quantization on an exclusively low-resourced, small-data language model, AfriBERTa. Through a battery of experiments, we assess the effects of compression on performance across several metrics beyond accuracy. Our study provides evidence that compression techniques significantly improve the efficiency and effectiveness of small-data language models, confirming that the prevailing beliefs regarding the effects of compression on large, heavily parameterized models hold true for less-parameterized, small-data models.
https://arxiv.org/abs/2404.04759
Large Language Models (LLMs) have become very popular and have found use cases in many domains, such as chatbots, auto-task completion agents, and much more. However, LLMs are vulnerable to different types of attacks, such as jailbreaking, prompt injection attacks, and privacy leakage attacks. Foundational LLMs undergo adversarial and alignment training to learn not to generate malicious and toxic content. For specialized use cases, these foundational LLMs are subjected to fine-tuning or quantization for better performance and efficiency. We examine the impact of downstream tasks such as fine-tuning and quantization on LLM vulnerability. We test foundation models like Mistral, Llama, MosaicML, and their fine-tuned versions. Our research shows that fine-tuning and quantization reduce jailbreak resistance significantly, leading to increased LLM vulnerabilities. Finally, we demonstrate the utility of external guardrails in reducing LLM vulnerabilities.
https://arxiv.org/abs/2404.04392
If our noise-canceling headphones can understand our audio environments, they can then inform us of important sound events, tune equalization based on the types of content we listen to, and dynamically adjust noise cancellation parameters based on audio scenes to further reduce distraction. However, running multiple audio understanding models on headphones with a limited energy budget and on-chip memory remains a challenging task. In this work, we identify a new class of neural network accelerators (e.g., NE16 on GAP9) that allows network weights to be quantized to different common (e.g., 8-bit) and uncommon bit-widths (e.g., 3-bit). We then apply a differentiable neural architecture search to find the optimal bit-widths of a network on two different sound event detection tasks with potentially different requirements on quantization and prediction granularity (i.e., classification vs. embeddings for few-shot learning). We further evaluate our quantized models on actual hardware, showing that we reduce memory usage, inference latency, and energy consumption by an average of 62%, 46%, and 61%, respectively, compared to 8-bit models while maintaining floating-point performance. Our work sheds light on the benefits of such accelerators on sound event detection tasks when combined with an appropriate search method.
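A minimal sketch of the differentiable part of such a bit-width search is shown below: each layer holds softmax-weighted architecture logits over candidate bit-widths, and the mixed fake-quantized weight lets gradients reach those logits. The relaxation is generic differentiable NAS rather than the paper's exact formulation, and the candidate set is an assumption.

```python
import numpy as np

def fake_quant(w, bits):
    """Uniform symmetric fake-quantization of a weight tensor to `bits` bits."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.clip(np.round(w / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

candidates = [2, 3, 4, 8]                          # bit-widths the accelerator supports
alpha = np.zeros(len(candidates))                  # learnable architecture logits (per layer)
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))

probs = softmax(alpha)
w_mixed = sum(p * fake_quant(w, b) for p, b in zip(probs, candidates))
# After the search converges, the layer keeps argmax(probs) as its deployed bit-width.
print(probs, candidates[int(np.argmax(probs))])
```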
https://arxiv.org/abs/2404.04386
We introduce an Outlier-Efficient Modern Hopfield Model (termed $\mathtt{OutEffHop}$) and use it to address the outlier-induced challenge of quantizing gigantic transformer-based models. Our main contribution is a novel associative memory model facilitating \textit{outlier-efficient} associative memory retrievals. Interestingly, this memory model manifests a model-based interpretation of an outlier-efficient attention mechanism ($\text{Softmax}_1$): it is an approximation of the memory retrieval process of $\mathtt{OutEffHop}$. Methodologically, this allows us to debut novel outlier-efficient Hopfield layers, a powerful attention alternative with superior post-quantization performance. Theoretically, the Outlier-Efficient Modern Hopfield Model retains and improves the desirable properties of standard modern Hopfield models, including fixed-point convergence and exponential storage capacity. Empirically, we demonstrate the proposed model's efficacy across large-scale transformer-based and Hopfield-based models (including BERT, OPT, ViT, and STanHop-Net), benchmarking against state-of-the-art methods including $\mathtt{Clipped\_Softmax}$ and $\mathtt{Gated\_Attention}$. Notably, $\mathtt{OutEffHop}$ achieves on average $\sim$22\%+ reductions in both the average kurtosis and the maximum infinity norm of model outputs across 4 models.
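For reference, the outlier-efficient attention kernel the abstract refers to, $\text{Softmax}_1$, is usually written with an extra 1 in the denominator, which allows all attention scores for a query to decay toward zero instead of forcing them to sum to one (the behavior believed to spawn outlier channels); this is the commonly cited form, and the paper's exact variant should be checked against the source:

$$ \text{Softmax}_1(\boldsymbol{x})_i = \frac{e^{x_i}}{1 + \sum_{j} e^{x_j}}. $$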
https://arxiv.org/abs/2404.03828
We consider the problem of accurate quantization for language models, where both the weights and activations are uniformly quantized to 4 bits per parameter, the lowest bit-width format natively supported by GPU hardware. In this context, the key challenge is activation quantization: it is known that language models contain outlier channels whose values on average are orders of magnitude higher than those of other channels, which prevents accurate low-bit-width quantization with known techniques. We systematically study this phenomenon and find that these outlier channels emerge early in training, and that they occur more frequently in layers with residual streams. We then propose a simple strategy which regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization. We show that regularizing both the inputs and outputs is crucial for preventing a model from "migrating" the difficulty of input quantization to the weights, which would make post-training quantization (PTQ) of the weights more difficult. When combined with weight PTQ, we show that our approach can obtain a W4A4 model that performs competitively with the standard-precision W16A16 baseline.
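To illustrate the output-side regularizer, below is a minimal sketch of an activation-kurtosis penalty that discourages heavy-tailed outlier channels; the Gaussian target of 3.0 and the squared penalty are assumptions for illustration, not necessarily the paper's exact regularizer.

```python
import numpy as np

def kurtosis(a, eps=1e-6):
    """Fourth standardized moment of a flattened activation tensor."""
    a = a - a.mean()
    return (a ** 4).mean() / (a.var() + eps) ** 2

def kurtosis_penalty(activations, target=3.0):     # 3.0 = kurtosis of a Gaussian
    return (kurtosis(activations) - target) ** 2   # added to the training loss with some weight

rng = np.random.default_rng(0)
well_behaved = rng.normal(size=10_000)                           # no outlier channels
with_outliers = np.concatenate([well_behaved, 50.0 * rng.normal(size=20)])  # a few huge values
print(kurtosis_penalty(well_behaved), kurtosis_penalty(with_outliers))       # outliers blow up the penalty
```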
https://arxiv.org/abs/2404.03605