Large language models (LLMs) are rapidly emerging in Artificial Intelligence (AI) applications, especially in the fields of natural language processing and generative AI. Beyond text generation, these models lend themselves to prompt engineering, in which inputs are structured to articulate a model's purpose explicitly. A prominent example is intent-based networking, an emerging approach to automating and maintaining network operations and management. This paper presents semantic routing to achieve enhanced performance in LLM-assisted intent-based management and orchestration of 5G core networks. The work establishes an end-to-end intent extraction framework and presents a diverse dataset of sample user intents, accompanied by a thorough analysis of the effects of encoders and quantization on overall system performance. The results show that a semantic router improves the accuracy and efficiency of the LLM deployment compared to stand-alone LLMs with prompting architectures.
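A semantic router of this kind can be pictured in a few lines: candidate intents ("routes") are embedded once, an incoming utterance is embedded with the same encoder, and the query is dispatched to the closest route before any LLM call. The sketch below is illustrative only; the toy `embed` stand-in, the route names, and the threshold are assumptions, not the paper's implementation.

```python
import numpy as np

ROUTES = {
    "deploy_network_slice": ["create a new slice for video streaming",
                             "set up a low latency slice for cloud gaming"],
    "scale_core_function":  ["add more UPF capacity in the west region",
                             "scale the AMF to handle more sessions"],
}

def embed(texts, dim=256):
    # Toy hashing encoder used as a stand-in for a real sentence encoder.
    out = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for token in text.lower().split():
            out[i, hash(token) % dim] += 1.0
    return out / np.maximum(np.linalg.norm(out, axis=1, keepdims=True), 1e-9)

def build_router(routes):
    names = list(routes)
    centroids = np.stack([embed(utterances).mean(axis=0) for utterances in routes.values()])
    return names, centroids

def route(query, names, centroids, threshold=0.2):
    q = embed([query])[0]
    sims = centroids @ q / np.maximum(np.linalg.norm(centroids, axis=1), 1e-9)
    best = int(np.argmax(sims))
    # Below the threshold, fall back to a generic LLM prompt instead of a route.
    return names[best] if sims[best] >= threshold else None

names, centroids = build_router(ROUTES)
print(route("please create a slice optimised for AR gaming", names, centroids))
```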
https://arxiv.org/abs/2404.15869
Large Language Models (LLMs) have showcased exceptional performance across a wide array of Natural Language Processing (NLP) tasks. Fine-tuning techniques are commonly used to tailor pre-trained models to specific applications. While methods such as LoRA effectively tackle GPU memory constraints during fine-tuning, their performance is often limited, especially in multi-task settings. On the other hand, Mixture-of-Experts (MoE) models, such as Mixtral 8x7B, demonstrate remarkable performance across multiple NLP tasks while maintaining a reduced parameter count. However, the resource requirements of these MoEs remain challenging, particularly for consumer-grade GPUs with limited VRAM. To address these challenges, we propose MixLoRA, an approach for constructing a resource-efficient sparse MoE model based on LoRA. MixLoRA inserts multiple LoRA-based experts within the feed-forward network block of a frozen pre-trained dense model through fine-tuning, employing a commonly used top-k router. Unlike other LoRA-based MoE methods, MixLoRA enhances model performance by using independently configurable attention-layer LoRA adapters, supporting LoRA and its variants for the construction of experts, and applying an auxiliary load-balancing loss to address router imbalance. In experiments, MixLoRA achieves commendable performance across all evaluation metrics in both single-task and multi-task learning scenarios. Implemented within the m-LoRA framework, MixLoRA enables parallel fine-tuning of multiple mixture-of-experts models on a single 24GB consumer-grade GPU without quantization, reducing GPU memory consumption by 41% and training latency by 17%.
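The core block can be pictured as a frozen dense FFN whose output is modulated by a handful of LoRA experts chosen per token by a top-k router. The PyTorch sketch below is a hedged reconstruction under those assumptions; the sizes, the simplified load-balancing term, and the single-projection LoRA placement are illustrative rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpertFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=4, top_k=2, rank=8):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)     # frozen pre-trained projection
        self.down = nn.Linear(d_ff, d_model)   # frozen pre-trained projection
        for p in (*self.up.parameters(), *self.down.parameters()):
            p.requires_grad_(False)
        self.router = nn.Linear(d_model, n_experts)
        # One low-rank (A, B) pair per expert, applied to the up projection.
        self.lora_a = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(n_experts, rank, d_ff))
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d_model)
        probs = self.router(x).softmax(dim=-1)
        top_p, top_i = probs.topk(self.top_k, dim=-1)
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            idx = top_i[:, k]
            a, b = self.lora_a[idx], self.lora_b[idx]   # per-token expert adapters
            hidden = F.gelu(self.up(x) + torch.einsum("td,tdr,trf->tf", x, a, b))
            out = out + top_p[:, k:k + 1] * self.down(hidden)
        # Simplified auxiliary load-balancing loss: encourage uniform expert usage.
        load = probs.mean(dim=0)
        aux_loss = (load * load).sum() * probs.shape[-1]
        return out, aux_loss

x = torch.randn(6, 512)
y, aux = LoRAExpertFFN()(x)
print(y.shape, float(aux))
```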
https://arxiv.org/abs/2404.15159
In federated learning, particularly in cross-device scenarios, secure aggregation has recently gained popularity as it effectively defends against inference attacks by malicious aggregators. However, secure aggregation often requires additional communication overhead and can impede the convergence rate of the global model, which is particularly challenging in wireless network environments with extremely limited bandwidth. Therefore, achieving efficient communication compression under the premise of secure aggregation is a highly challenging and valuable problem. In this work, we propose a novel uplink communication compression method for federated learning, named FedMPQ, based on multiple shared-codebook product quantization. Specifically, we utilize updates from the previous round to generate sufficiently robust codebooks. Secure aggregation is then achieved through trusted execution environments (TEE) or a trusted third party (TTP). In contrast to previous works, our approach exhibits greater robustness in scenarios where data is not independently and identically distributed (non-IID) and sufficient public data is lacking. Experiments conducted on the LEAF dataset demonstrate that our proposed method achieves 99% of the baseline's final accuracy while reducing uplink communication by 90-95%.
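The uplink compression step itself is standard product quantization: the flattened update is cut into sub-vectors, each sub-vector is replaced by the index of its nearest codeword, and only the indices travel upstream. The sketch below illustrates that mechanic under the assumption (as in the abstract) that the codebook is fit from the previous round's update known to the server; it is not the paper's code.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = ((points[:, None] - centroids[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = points[assign == j].mean(axis=0)
    return centroids

def compress(update, codebook, sub_dim):
    chunks = update.reshape(-1, sub_dim)
    idx = ((chunks[:, None] - codebook[None]) ** 2).sum(-1).argmin(axis=1)
    return idx.astype(np.uint8)            # uplink payload: one codeword index per chunk

def decompress(indices, codebook):
    return codebook[indices].reshape(-1)

# Codebook fitted from the previous round's aggregated update (known to the server).
rng = np.random.default_rng(1)
prev_update = rng.normal(size=4096)
codebook = kmeans(prev_update.reshape(-1, 8), k=64)

new_update = prev_update + 0.1 * rng.normal(size=4096)
codes = compress(new_update, codebook, sub_dim=8)
reconstructed = decompress(codes, codebook)
print("compression ratio:", new_update.nbytes / codes.nbytes,
      "relative error:", np.linalg.norm(reconstructed - new_update) / np.linalg.norm(new_update))
```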
https://arxiv.org/abs/2404.13575
This paper investigates the challenging problem of learned image compression (LIC) at extremely low bitrates. Previous LIC methods based on transmitting quantized continuous features often yield blurry and noisy reconstructions due to severe quantization loss, while previous LIC methods based on learned codebooks that discretize the visual space usually give poor-fidelity reconstructions because the limited codewords lack the representation power to capture faithful details. We propose a novel dual-stream framework, HybridFlow, which combines a continuous-feature-based stream and a codebook-based stream to achieve both high perceptual quality and high fidelity at extremely low bitrates. The codebook-based stream benefits from high-quality learned codebook priors to provide quality and clarity in the reconstructed images, while the continuous-feature stream targets maintaining fidelity details. To achieve ultra-low bitrates, a masked token-based transformer is further proposed, in which we transmit only a masked portion of the codeword indices and recover the missing indices through token generation guided by information from the continuous-feature stream. We also develop a bridging correction network that merges the two streams during pixel decoding for the final image reconstruction, where the continuous-stream features rectify biases of the codebook-based pixel decoder to impose reconstructed fidelity details. Experimental results demonstrate superior performance across several datasets at extremely low bitrates, compared with existing single-stream codebook-based or continuous-feature-based LIC methods.
https://arxiv.org/abs/2404.13372
The sim-to-real gap poses a significant challenge in RL-based multi-agent exploration due to scene quantization and action discretization. Existing platforms suffer from sampling inefficiency and from a lack of diversity in Multi-Agent Reinforcement Learning (MARL) algorithms across different scenarios, restraining their widespread application. To fill these gaps, we propose MAexp, a generic platform for multi-agent exploration that integrates a broad range of state-of-the-art MARL algorithms and representative scenarios. Moreover, we employ point clouds to represent our exploration scenarios, leading to high-fidelity environment mapping and a sampling speed approximately 40 times faster than existing platforms. Furthermore, equipped with an attention-based Multi-Agent Target Generator and a Single-Agent Motion Planner, MAexp can work with arbitrary numbers of agents and accommodate various types of robots. Extensive experiments establish the first benchmark featuring several high-performance MARL algorithms across typical scenarios for robots with continuous actions, highlighting the distinct strengths of each algorithm in different scenarios.
https://arxiv.org/abs/2404.12824
The intensive computational burden of Stable Diffusion (SD) for text-to-image generation poses a significant hurdle for its practical application. To tackle this challenge, recent research focuses on methods to reduce sampling steps, such as Latent Consistency Model (LCM), and on employing architectural optimizations, including pruning and knowledge distillation. Diverging from existing approaches, we uniquely start with a compact SD variant, BK-SDM. We observe that directly applying LCM to BK-SDM with commonly used crawled datasets yields unsatisfactory results. It leads us to develop two strategies: (1) leveraging high-quality image-text pairs from leading generative models and (2) designing an advanced distillation process tailored for LCM. Through our thorough exploration of quantization, profiling, and on-device deployment, we achieve rapid generation of photo-realistic, text-aligned images in just two steps, with latency under one second on resource-limited edge devices.
https://arxiv.org/abs/2404.11925
Quantization lowers memory usage, computational requirements, and latency by using fewer bits to represent model weights and activations. In this work, we investigate the generalization properties of quantized neural networks, a characteristic that has received little attention despite its implications for model performance. First, we develop a theoretical model for quantization in neural networks and demonstrate how quantization functions as a form of regularization. Second, motivated by recent work connecting the sharpness of the loss landscape and generalization, we derive an approximate bound for the generalization of quantized models conditioned on the amount of quantization noise. We then validate our hypothesis by experimenting with over 2000 convolutional and transformer-based models trained on the CIFAR-10, CIFAR-100, and ImageNet datasets.
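The regularization view is easiest to see by writing the quantized weight as the original weight plus a noise term whose magnitude is set by the quantizer step size. A minimal numerical sketch, with illustrative bit-widths and tensor sizes rather than the paper's setup:

```python
import numpy as np

def uniform_quantize(w, bits):
    # Symmetric uniform quantizer with a per-tensor scale.
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / levels
    return np.round(w / scale) * scale, scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=10_000)

for bits in (8, 4, 2):
    w_q, scale = uniform_quantize(w, bits)
    noise = w_q - w
    # For a uniform quantizer the noise is roughly uniform over a step of width
    # `scale`, so its standard deviation is close to scale / sqrt(12).
    print(f"{bits}-bit: noise std {noise.std():.5f}  predicted {scale / np.sqrt(12):.5f}")
```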
https://arxiv.org/abs/2404.11769
Vision Transformers (ViT) have marked a paradigm shift in computer vision, outperforming state-of-the-art models across diverse tasks. However, their practical deployment is hampered by high computational and memory demands. This study addresses the challenge by evaluating four primary model compression techniques: quantization, low-rank approximation, knowledge distillation, and pruning. We methodically analyze and compare the efficacy of these techniques and their combinations in optimizing ViTs for resource-constrained environments. Our comprehensive experimental evaluation demonstrates that these methods facilitate a balanced compromise between model accuracy and computational efficiency, paving the way for wider application in edge computing devices.
https://arxiv.org/abs/2404.10407
Inductive biases are crucial in disentangled representation learning for narrowing down an underspecified solution set. In this work, we consider endowing a neural network autoencoder with three select inductive biases from the literature: data compression into a grid-like latent space via quantization, collective independence amongst latents, and minimal functional influence of any latent on how other latents determine data generation. In principle, these inductive biases are deeply complementary: they most directly specify properties of the latent space, encoder, and decoder, respectively. In practice, however, naively combining existing techniques instantiating these inductive biases fails to yield significant benefits. To address this, we propose adaptations to the three techniques that simplify the learning problem, equip key regularization terms with stabilizing invariances, and quash degenerate incentives. The resulting model, Tripod, achieves state-of-the-art results on a suite of four image disentanglement benchmarks. We also verify that Tripod significantly improves upon its naive incarnation and that all three of its "legs" are necessary for best performance.
https://arxiv.org/abs/2404.10282
In this paper, we introduce an algorithm for data quantization based on the principles of Kashin representation. This approach hinges on decomposing any given vector, matrix, or tensor into two factors. The first factor maintains a small infinity norm, while the second exhibits a similarly constrained norm when multiplied by an orthogonal matrix. Surprisingly, the entries of factors after decomposition are well-concentrated around several peaks, which allows us to efficiently replace them with corresponding centroids for quantization purposes. We study the theoretical properties of the proposed approach and rigorously evaluate our compression algorithm in the context of next-word prediction tasks and on a set of downstream tasks for text classification. Our findings demonstrate that Kashin Quantization achieves competitive or superior quality in model performance while ensuring data compression, marking a significant advancement in the field of data quantization.
https://arxiv.org/abs/2404.09737
Human beings construct perception of space by integrating sparse observations into massively interconnected synapses and neurons, offering a superior parallelism and efficiency. Replicating this capability in AI finds wide applications in medical imaging, AR/VR, and embodied AI, where input data is often sparse and computing resources are limited. However, traditional signal reconstruction methods on digital computers face both software and hardware challenges. On the software front, difficulties arise from storage inefficiencies in conventional explicit signal representation. Hardware obstacles include the von Neumann bottleneck, which limits data transfer between the CPU and memory, and the limitations of CMOS circuits in supporting parallel processing. We propose a systematic approach with software-hardware co-optimizations for signal reconstruction from sparse inputs. Software-wise, we employ neural field to implicitly represent signals via neural networks, which is further compressed using low-rank decomposition and structured pruning. Hardware-wise, we design a resistive memory-based computing-in-memory (CIM) platform, featuring a Gaussian Encoder (GE) and an MLP Processing Engine (PE). The GE harnesses the intrinsic stochasticity of resistive memory for efficient input encoding, while the PE achieves precise weight mapping through a Hardware-Aware Quantization (HAQ) circuit. We demonstrate the system's efficacy on a 40nm 256Kb resistive memory-based in-memory computing macro, achieving huge energy efficiency and parallelism improvements without compromising reconstruction quality in tasks like 3D CT sparse reconstruction, novel view synthesis, and novel view synthesis for dynamic scenes. This work advances the AI-driven signal restoration technology and paves the way for future efficient and robust medical AI and 3D vision applications.
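On the software side, the pipeline above amounts to a coordinate network: inputs pass through a fixed Gaussian random-feature encoder and then a small MLP that implicitly represents the signal. The sketch below reproduces that structure in plain PyTorch under stated assumptions: the randomness comes from a software RNG (the paper draws it from resistive memory), and the sizes and toy 2-D signal are illustrative.

```python
import math
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    def __init__(self, in_dim=2, n_features=128, sigma=10.0):
        super().__init__()
        # Fixed random projection; on the chip this randomness would come from the
        # resistive memory itself, here it comes from the software RNG.
        self.register_buffer("B", torch.randn(in_dim, n_features) * sigma)

    def forward(self, coords):                       # coords in [0, 1]^in_dim
        proj = 2 * math.pi * coords @ self.B
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

class NeuralField(nn.Module):
    def __init__(self, n_features=128, hidden=64, out_dim=1):
        super().__init__()
        self.encoder = GaussianEncoder(n_features=n_features)
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, coords):
        return self.mlp(self.encoder(coords))

# Fit the field to a toy 2-D signal from sparse samples.
torch.manual_seed(0)
model = NeuralField()
coords = torch.rand(256, 2)
target = torch.sin(4 * math.pi * coords[:, :1]) * torch.cos(4 * math.pi * coords[:, 1:])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):
    optimizer.zero_grad()
    loss = ((model(coords) - target) ** 2).mean()
    loss.backward()
    optimizer.step()
print("final training MSE:", float(loss))
```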
https://arxiv.org/abs/2404.09613
Diffusion models have emerged as preeminent contenders in the realm of generative models. Distinguished by their distinctive sequential generative processes, characterized by hundreds or even thousands of timesteps, diffusion models progressively reconstruct images from pure Gaussian noise, with each timestep requiring full inference of the entire model. However, the substantial computational demands inherent to these models present challenges for deployment, so quantization is widely used to lower the bit-width and reduce storage and computing overheads. Current quantization methodologies primarily focus on model-side optimization and disregard the temporal dimension, such as the length of the timestep sequence, thereby allowing redundant timesteps to keep consuming computational resources and leaving substantial scope for accelerating the generative process. In this paper, we introduce TMPQ-DM, which jointly optimizes timestep reduction and quantization to achieve a superior performance-efficiency trade-off, addressing both temporal and model optimization aspects. For timestep reduction, we devise a non-uniform grouping scheme tailored to the non-uniform nature of the denoising process, thereby mitigating the combinatorial explosion of timestep choices. In terms of quantization, we adopt a fine-grained layer-wise approach that allocates varying bit-widths to different layers based on their respective contributions to the final generative performance, thus rectifying the performance degradation observed in prior studies. To expedite the evaluation of fine-grained quantization, we further devise a super-network that serves as a precision solver by leveraging shared quantization results. These two design components are seamlessly integrated within our framework, enabling rapid joint exploration of the exponentially large decision space via a gradient-free evolutionary search algorithm.
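The joint decision space can be explored with a very small evolutionary loop: each candidate pairs a reduced timestep schedule with a per-layer bit-width assignment, mutation perturbs either part, and selection keeps the best candidates under a compute budget. The toy sketch below only illustrates that search shape; the fitness function is a stand-in for the quantized super-network evaluation the paper uses, and all constants are invented.

```python
import random
random.seed(0)

N_LAYERS, FULL_STEPS, BITS = 12, 100, (4, 6, 8)

def random_candidate():
    steps = sorted(random.sample(range(FULL_STEPS), 10))       # reduced timestep schedule
    bits = [random.choice(BITS) for _ in range(N_LAYERS)]       # layer-wise bit-widths
    return steps, bits

def cost(candidate):
    steps, bits = candidate
    return len(steps) * sum(bits)                               # crude compute proxy

def fitness(candidate):
    # Stand-in quality score (more kept steps / more bits is treated as better);
    # the paper would instead evaluate the candidate via its super-network.
    steps, bits = candidate
    return len(steps) + 0.1 * sum(bits)

def mutate(candidate):
    steps, bits = list(candidate[0]), list(candidate[1])
    if random.random() < 0.5:
        steps[random.randrange(len(steps))] = random.randrange(FULL_STEPS)
        steps = sorted(set(steps))
    else:
        bits[random.randrange(N_LAYERS)] = random.choice(BITS)
    return steps, bits

BUDGET = 700
population = [random_candidate() for _ in range(20)]
for _ in range(50):
    population += [mutate(random.choice(population)) for _ in range(20)]
    feasible = [c for c in population if cost(c) <= BUDGET]
    population = sorted(feasible or population, key=fitness, reverse=True)[:20]

best = population[0]
print("best schedule:", best[0])
print("best bit-widths:", best[1], "cost:", cost(best))
```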
https://arxiv.org/abs/2404.09532
Recent trends have shown that autonomous agents, such as Autonomous Ground Vehicles (AGVs), Unmanned Aerial Vehicles (UAVs), and mobile robots, effectively improve human productivity in solving diverse tasks. However, since these agents are typically powered by portable batteries, they require extremely low power/energy consumption to operate over a long lifespan. To address this challenge, neuromorphic computing has emerged as a promising solution, where bio-inspired Spiking Neural Networks (SNNs) use spikes from event-based cameras or from data-conversion pre-processing to perform sparse computations efficiently. However, studies of SNN deployments for autonomous agents are still at an early stage, and the optimization stages for enabling efficient embodied SNN deployments have not been defined systematically. Toward this, we propose a novel framework called SNN4Agents that consists of a set of optimization techniques for designing energy-efficient embodied SNNs targeting autonomous-agent applications. SNN4Agents employs weight quantization, timestep reduction, and attention-window reduction to jointly improve energy efficiency, reduce the memory footprint, and optimize processing latency, while maintaining high accuracy. In the evaluation, we investigate use cases of event-based car recognition and explore the trade-offs among accuracy, latency, memory, and energy consumption. The experimental results show that our proposed framework can maintain high accuracy (i.e., 84.12% accuracy) with 68.75% memory saving, 3.58x speed-up, and 4.03x energy-efficiency improvement compared with the state-of-the-art work on the NCARS dataset, thereby enabling energy-efficient embodied SNN deployments for autonomous agents.
https://arxiv.org/abs/2404.09331
Incorporating diffusion models in the image compression domain has the potential to produce realistic and detailed reconstructions, especially at extremely low bitrates. Previous methods focus on using diffusion models as expressive decoders robust to quantization errors in the conditioning signals, yet achieving competitive results in this manner requires costly training of the diffusion model and long inference times due to the iterative generative process. In this work, we formulate the removal of quantization error as a denoising task, using diffusion to recover lost information in the transmitted image latent. Our approach allows us to perform less than 10% of the full diffusion generative process and requires no architectural changes to the diffusion model, enabling the use of foundation models as a strong prior without additional fine-tuning of the backbone. Our proposed codec outperforms previous methods in quantitative realism metrics, and we verify that our reconstructions are qualitatively preferred by end users, even when other methods use twice the bitrate.
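Conceptually, the dequantized latent is treated as if it were a partially noised sample, so only the tail of the reverse process needs to run. The sketch below spells that idea out with a toy DDIM-style loop: the starting timestep is chosen so the schedule's noise level matches the measured quantization error, and `eps_model` is a placeholder for a pretrained noise predictor. This is an illustration of the idea under those assumptions, not the paper's codec.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def matching_timestep(error_std):
    # Pick the timestep whose schedule noise level is closest to the measured error.
    return int(np.argmin(np.abs(np.sqrt(1.0 - alphas_bar) - error_std)))

def eps_model(x, t):
    return np.zeros_like(x)               # placeholder for a pretrained noise predictor

def ddim_tail(x_t, t_start, n_steps=8):
    timesteps = np.linspace(t_start, 0, n_steps + 1).astype(int)
    x = x_t
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        eps = eps_model(x, t)
        x0 = (x - np.sqrt(1.0 - alphas_bar[t]) * eps) / np.sqrt(alphas_bar[t])
        x = np.sqrt(alphas_bar[t_prev]) * x0 + np.sqrt(1.0 - alphas_bar[t_prev]) * eps
    return x

rng = np.random.default_rng(0)
latent = rng.normal(size=(4, 32, 32))
dequantized = latent + 0.05 * rng.normal(size=latent.shape)   # quantization error as noise
t0 = matching_timestep(np.std(dequantized - latent))
restored = ddim_tail(dequantized, t0)
print("start timestep:", t0, "of", T, "reverse steps run:", 8,
      "output shape:", restored.shape)
```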
https://arxiv.org/abs/2404.08580
This work examines whether decoder-only Transformers such as LLaMA, which were originally designed for large language models (LLMs), can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention brings an attention-collapse issue, resulting in the failure of network training. We suggest repositioning the class token behind the image tokens with a post-sequence class token technique to overcome this challenge, enabling causal self-attention to efficiently capture the entire image's information. Additionally, we develop a soft mask strategy that gradually introduces a causal mask to the self-attention at the onset of training to facilitate the optimization behavior. The tailored model, dubbed image LLaMA (iLLaMA), is akin to LLaMA in architecture and enables direct supervised learning. Its causal self-attention boosts computational efficiency and learns complex representations by elevating attention-map ranks. iLLaMA rivals the performance of its encoder-only counterparts, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M parameters and pre-training on ImageNet-21K further enhances the accuracy to 86.0%. Extensive experiments demonstrate iLLaMA's reliable properties: calibration, shape-texture bias, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. We hope our study can kindle fresh views of visual model design in the wave of LLMs. Pre-trained models and code are available here.
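The two architectural tricks are easy to mock up: place the class token after the patch tokens so that, under a causal mask, it can attend to every patch; and soften the mask early in training by blending it with a fully bidirectional one. The attention sketch below is a hedged illustration only (single head, no projection matrices, and the particular blending formula are assumptions, not the paper's exact schedule).

```python
import torch

def causal_mask(n):
    return torch.tril(torch.ones(n, n))

def soft_causal_mask(n, progress):
    """progress in [0, 1]: 0 gives a fully bidirectional mask, 1 a fully causal one."""
    return (1.0 - progress) * torch.ones(n, n) + progress * causal_mask(n)

def attention(x, mask):
    scores = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
    # Hard zeros remove a connection; fractional soft-mask entries only down-weight it.
    scores = scores + torch.log(mask.clamp_min(1e-9))
    return scores.softmax(dim=-1) @ x

n_patches, dim = 196, 64
patches = torch.randn(n_patches, dim)
cls_token = torch.randn(1, dim)
x = torch.cat([patches, cls_token], dim=0)             # post-sequence class token
mask = soft_causal_mask(n_patches + 1, progress=0.5)   # mid-training soft mask
out = attention(x, mask)
print(out.shape, "class token attends to",
      int(causal_mask(n_patches + 1)[-1].sum()), "tokens under the causal mask")
```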
https://arxiv.org/abs/2404.06773
Fast-growing large-scale language models are delivering unprecedented performance on almost all natural language processing tasks. However, the effectiveness of large language models relies on an exponentially increasing number of parameters. The overwhelming computational complexity incurs a high inference latency that negatively affects user experience. Existing methods to improve inference efficiency, such as tensor parallelism and quantization, aim to reduce per-layer computing latency, yet overlook the cumulative latency due to the number of layers. Recent works that reduce cumulative latency through layer removal, however, lead to significant performance drops. Motivated by the similarity of inputs among adjacent layers, we propose to identify quasi-independent layers, which can be computed concurrently to significantly decrease inference latency. We also introduce a bypassing technique to mitigate the effect of information loss. Empirical experiments of the proposed approach on the LLaMA models confirm that Concurrent Computation of Quasi-Independent Layers (CQIL) can reduce latency by up to 48.3% on the LLaMA-33B model, while maintaining a close level of performance.
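The underlying approximation is simple to state in code: within a group of adjacent, quasi-independent blocks, every block receives the same input and their residual contributions are summed, so the blocks can run in parallel. The toy sketch below measures how far that approximation drifts from strict sequential execution on random blocks; grouping criteria and the bypassing technique are simplified away, so this is not the paper's method verbatim.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return self.ff(x)                       # returns only the residual branch

def sequential(blocks, x):
    for blk in blocks:
        x = x + blk(x)
    return x

def concurrent_groups(blocks, x, group_size=2):
    for i in range(0, len(blocks), group_size):
        group = blocks[i:i + group_size]
        # All blocks in the group see the same input; their deltas are summed,
        # so the group members could be evaluated in parallel.
        x = x + sum(blk(x) for blk in group)
    return x

torch.manual_seed(0)
blocks = nn.ModuleList(Block() for _ in range(8))
x = torch.randn(4, 64)
with torch.no_grad():
    y_seq = sequential(blocks, x)
    y_par = concurrent_groups(blocks, x)
print("relative deviation:", float((y_seq - y_par).norm() / y_seq.norm()))
```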
https://arxiv.org/abs/2404.06709
In this paper, a cloud radio access network (Cloud-RAN) based collaborative edge AI inference architecture is proposed. Specifically, geographically distributed devices capture real-time, noise-corrupted sensory data samples and extract noisy local feature vectors, which are then aggregated at each remote radio head (RRH) to suppress sensing noise. To realize efficient uplink feature aggregation, we allow each RRH to receive local feature vectors from all devices over the same resource blocks simultaneously by leveraging an over-the-air computation (AirComp) technique. Thereafter, these aggregated feature vectors are quantized and transmitted to a central processor (CP) for further aggregation and downstream inference tasks. Our aim in this work is to maximize the inference accuracy via a surrogate accuracy metric called discriminant gain, which measures the discernibility of different classes in the feature space. The key challenges lie in simultaneously suppressing the coupled sensing noise, the AirComp distortion caused by hostile wireless channels, and the quantization error resulting from the limited capacity of fronthaul links. To address these challenges, this work proposes a joint transmit precoding, receive beamforming, and quantization error control scheme to enhance the inference accuracy. Extensive numerical experiments demonstrate the effectiveness and superiority of our proposed optimization algorithm compared to various baselines.
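The signal chain can be simulated end to end in a few lines: devices pre-scale their feature vectors, the channel adds them up "in the air", the RRH rescales the superposition into an estimate of the mean feature, and a coarse quantizer models the limited fronthaul. The numpy sketch below uses a toy channel-inversion scheme and a uniform quantizer purely for illustration; the paper's optimized precoding, beamforming, and quantization-error control are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_devices, feat_dim = 8, 32

features = rng.normal(size=(n_devices, feat_dim))            # noisy local feature vectors
channels = rng.normal(size=n_devices) + 1j * rng.normal(size=n_devices)

# Toy channel-inversion pre-scaling so that the superimposed signals add coherently.
tx = features / channels[:, None]
received = (channels[:, None] * tx).sum(axis=0)              # superposition over the air
received = received + 0.05 * (rng.normal(size=feat_dim) + 1j * rng.normal(size=feat_dim))
aggregated = received.real / n_devices                       # RRH estimate of the mean feature

def uniform_quantize(x, bits=4):
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

fronthaul_payload = uniform_quantize(aggregated)             # coarse fronthaul quantization
ideal = features.mean(axis=0)
print("aggregation error:",
      np.linalg.norm(fronthaul_payload - ideal) / np.linalg.norm(ideal))
```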
https://arxiv.org/abs/2404.06007
Among approximate nearest neighbor search (ANNS) methods based on approximate proximity graphs, DiskANN achieves a good recall-speed balance for large-scale datasets by using both RAM and storage. Although it claims to save memory by loading vectors compressed with product quantization (PQ), its memory usage still increases in proportion to the scale of the dataset. In this paper, we propose All-in-Storage ANNS with Product Quantization (AiSAQ), which offloads the compressed vectors to storage. Our method achieves approximately 10 MB of memory usage during query search even on billion-scale datasets, with minor performance degradation. AiSAQ also reduces the index load time before query search, which enables switching the index between multiple billion-scale datasets and significantly enhances the flexibility of retrieval-augmented generation (RAG). This method is applicable to all graph-based ANNS algorithms and can be combined with higher-spec ANNS methods in the future.
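The storage-offloading idea is easy to prototype: keep only a memory map of the PQ codes in RAM and read a vector's codes from disk at the moment the graph traversal visits it. The sketch below shows that access pattern with numpy; the file layout, sizes, and asymmetric-distance routine are illustrative assumptions, not DiskANN's or AiSAQ's actual on-disk format.

```python
import os
import tempfile
import numpy as np

n_vectors, n_sub, sub_dim, n_codewords = 100_000, 8, 4, 256
rng = np.random.default_rng(0)
codebooks = rng.normal(size=(n_sub, n_codewords, sub_dim)).astype(np.float32)
codes = rng.integers(0, n_codewords, size=(n_vectors, n_sub), dtype=np.uint8)

# Write the PQ codes to storage and keep only a memory map of them in RAM.
path = os.path.join(tempfile.gettempdir(), "pq_codes.bin")
codes.tofile(path)
codes_on_disk = np.memmap(path, dtype=np.uint8, mode="r", shape=(n_vectors, n_sub))

def approx_distance(query, vec_id):
    # Asymmetric distance: query sub-vectors vs. the centroids named by the codes.
    code = codes_on_disk[vec_id]                      # fetched from storage on demand
    q = query.reshape(n_sub, sub_dim)
    centroids = codebooks[np.arange(n_sub), code]     # (n_sub, sub_dim)
    return float(((q - centroids) ** 2).sum())

query = rng.normal(size=n_sub * sub_dim).astype(np.float32)
candidates = [10, 42, 99_999]                         # ids visited during graph search
print(sorted(candidates, key=lambda i: approx_distance(query, i)))
```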
https://arxiv.org/abs/2404.06004
ML is shifting from the cloud to the edge. Edge computing reduces the surface exposing private data and enables reliable throughput guarantees in real-time applications. Of the panoply of devices deployed at the edge, resource-constrained MCUs, e.g., Arm Cortex-M, are more prevalent, orders of magnitude cheaper, and less power-hungry than application processors or GPUs. Thus, enabling intelligence at the deep edge is the zeitgeist, with researchers focusing on unveiling novel approaches to deploy ANNs on these constrained devices. Quantization is a well-established technique that has proved effective in enabling the deployment of neural networks on MCUs; however, it is still an open question to understand the robustness of QNNs in the face of adversarial examples. To fill this gap, we empirically evaluate the effectiveness of attacks and defenses from (full-precision) ANNs on (constrained) QNNs. Our evaluation includes three QNNs targeting TinyML applications, ten attacks, and six defenses. With this study, we draw a set of interesting findings. First, quantization increases the point distance to the decision boundary and leads the gradient estimated by some attacks to explode or vanish. Second, quantization can act as a noise attenuator or amplifier, depending on the noise magnitude, and causes gradient misalignment. Regarding adversarial defenses, we conclude that input pre-processing defenses show impressive results on small perturbations; however, they fall short as the perturbation increases. At the same time, train-based defenses increase the average point distance to the decision boundary, which holds after quantization. However, we argue that train-based defenses still need to smooth the quantization-shift and gradient misalignment phenomenons to counteract adversarial example transferability to QNNs. All artifacts are open-sourced to enable independent validation of results.
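To give a flavor of the attack/defense interplay that is evaluated, the sketch below crafts an FGSM example against a model whose weights are fake-quantized with a straight-through estimator. It is a toy stand-in (untrained model, random data), not the benchmarked QNNs, but it shows where quantization enters the gradient path that attacks rely on.

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized with a straight-through estimator."""
    def __init__(self, in_features, out_features, bits=8):
        super().__init__(in_features, out_features)
        self.bits = bits

    def forward(self, x):
        levels = 2 ** (self.bits - 1) - 1
        scale = self.weight.detach().abs().max() / levels
        w_q = torch.round(self.weight / scale) * scale
        w = self.weight + (w_q - self.weight).detach()   # forward: quantized, backward: identity
        return nn.functional.linear(x, w, self.bias)

def fgsm(model, x, y, eps):
    x = x.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()

torch.manual_seed(0)
model = nn.Sequential(FakeQuantLinear(20, 32, bits=4), nn.ReLU(),
                      FakeQuantLinear(32, 2, bits=4))
x, y = torch.randn(64, 20), torch.randint(0, 2, (64,))
x_adv = fgsm(model, x, y, eps=0.05)
with torch.no_grad():
    clean_acc = float((model(x).argmax(1) == y).float().mean())
    adv_acc = float((model(x_adv).argmax(1) == y).float().mean())
print(f"clean accuracy {clean_acc:.2f}, adversarial accuracy {adv_acc:.2f}")
```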
https://arxiv.org/abs/2404.05688
With the advancement of diffusion models (DMs) and their substantially increased computational requirements, quantization emerges as a practical solution for obtaining compact and efficient low-bit DMs. However, the highly discrete representation leads to severe accuracy degradation, hindering the quantization of diffusion models to ultra-low bit-widths. In this paper, we propose BinaryDM, a novel accurate quantization-aware training approach that pushes the weights of diffusion models towards the limit of 1-bit. First, we present a Learnable Multi-basis Binarizer (LMB) to recover the representations generated by the binarized DM, improving the detail information in representations crucial to the DM. Second, Low-rank Representation Mimicking (LRM) is applied to enhance the binarization-aware optimization of the DM, alleviating the optimization-direction ambiguity caused by fine-grained alignment. Moreover, a progressive initialization strategy is applied when training DMs to avoid convergence difficulties. Comprehensive experiments demonstrate that BinaryDM achieves significant accuracy and efficiency gains compared to SOTA quantization methods for DMs under ultra-low bit-widths. As the first binarization method for diffusion models, BinaryDM achieves an impressive 16.0x FLOPs saving and 27.1x storage saving with 1-bit weights and 4-bit activations, showcasing its substantial advantages and potential for deploying DMs in resource-limited scenarios.
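A multi-basis binarizer can be prototyped as a residual sum of sign bases, each with its own learnable scale, trained through a straight-through estimator. The PyTorch sketch below fits such a binarizer to a random target tensor; it illustrates the structure only and differs from the paper's exact LMB formulation.

```python
import torch
import torch.nn as nn

class MultiBasisBinarizer(nn.Module):
    def __init__(self, weight_shape, n_bases=2):
        super().__init__()
        self.latent = nn.Parameter(torch.randn(weight_shape) * 0.05)
        self.scales = nn.Parameter(torch.full((n_bases,), 0.05))

    def forward(self):
        w = self.latent
        approx = torch.zeros_like(w)
        residual = w
        for k in range(self.scales.numel()):
            basis = torch.sign(residual)
            # Straight-through estimator: sign() in the forward pass, identity gradient.
            basis = residual + (basis - residual).detach()
            approx = approx + self.scales[k] * basis
            residual = w - approx                      # next basis fits what is left
        return approx

binarizer = MultiBasisBinarizer((64, 64), n_bases=2)
target = torch.randn(64, 64) * 0.1
optimizer = torch.optim.Adam(binarizer.parameters(), lr=1e-2)
for _ in range(300):
    optimizer.zero_grad()
    loss = ((binarizer() - target) ** 2).mean()
    loss.backward()
    optimizer.step()
print("approximation error:", float(loss))
```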
https://arxiv.org/abs/2404.05662