Current transformer-based imitation learning approaches introduce discrete action representations and train an autoregressive transformer decoder on the resulting latent code. However, the initial quantization breaks the continuous structure of the action space, thereby limiting the capabilities of the generative model. Instead, we propose a quantization-free method that leverages Generative Infinite-Vocabulary Transformers (GIVT) as a direct, continuous policy parametrization for autoregressive transformers. This simplifies the imitation learning pipeline while achieving state-of-the-art performance on a variety of popular simulated robotics tasks. Finally, we carefully study sampling algorithms for our policy roll-outs, further improving the results.
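To make the idea concrete, here is a minimal PyTorch sketch (not the authors' code) of a continuous, GIVT-style policy head: instead of logits over a quantized action vocabulary, the transformer's hidden state parametrizes a Gaussian mixture that is sampled directly, with a temperature knob of the kind a sampling study would tune. The hidden size, action dimension, and mixture count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GMMPolicyHead(nn.Module):
    """Toy continuous policy head: predicts a Gaussian mixture over the
    next action instead of logits over a quantized action vocabulary."""

    def __init__(self, hidden_dim: int, action_dim: int, num_modes: int = 8):
        super().__init__()
        self.num_modes, self.action_dim = num_modes, action_dim
        # One linear map produces mixture weights, means, and log-stddevs.
        self.proj = nn.Linear(hidden_dim, num_modes * (1 + 2 * action_dim))

    def forward(self, h: torch.Tensor):
        p = self.proj(h)
        logits, mu, log_std = torch.split(
            p, [self.num_modes,
                self.num_modes * self.action_dim,
                self.num_modes * self.action_dim], dim=-1)
        mu = mu.reshape(*h.shape[:-1], self.num_modes, self.action_dim)
        std = log_std.reshape_as(mu).exp()
        return logits, mu, std

    @torch.no_grad()
    def sample(self, h: torch.Tensor, temperature: float = 1.0):
        # Temperature scales both mode selection and within-mode noise.
        logits, mu, std = self.forward(h)
        idx = torch.distributions.Categorical(logits=logits / temperature).sample()
        gather_idx = idx[..., None, None].expand(*idx.shape, 1, self.action_dim)
        mu_k = mu.gather(-2, gather_idx).squeeze(-2)
        std_k = std.gather(-2, gather_idx).squeeze(-2)
        return mu_k + temperature * std_k * torch.randn_like(mu_k)

head = GMMPolicyHead(hidden_dim=256, action_dim=7)
action = head.sample(torch.randn(1, 256), temperature=0.7)  # shape (1, 7)
```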
https://arxiv.org/abs/2503.14259
This work focuses on full-body co-speech gesture generation. Existing methods typically employ an autoregressive model with vector-quantized tokens for gesture generation, which results in information loss and compromises the realism of the generated gestures. To address this, inspired by the natural continuity of real-world human motion, we propose MAG, a novel multi-modal aligned framework for high-quality and diverse co-speech gesture synthesis without relying on discrete tokenization. Specifically, (1) we introduce a motion-text-audio-aligned variational autoencoder (MTA-VAE), which leverages the text and audio embeddings of pre-trained WavCaps to enhance both semantic and rhythmic alignment with motion, ultimately producing more realistic gestures. (2) Building on this, we propose a multimodal masked autoregressive model (MMAG) that enables autoregressive modeling in continuous motion embeddings through diffusion, without vector quantization. To further ensure multi-modal consistency, MMAG incorporates a hybrid-granularity audio-text fusion block, which serves as conditioning for the diffusion process. Extensive experiments on two benchmark datasets demonstrate that MAG achieves state-of-the-art performance both quantitatively and qualitatively, producing highly realistic and diverse co-speech gestures. The code will be released to facilitate future research.
https://arxiv.org/abs/2503.14040
We introduce aligned probing, a novel interpretability framework that aligns the behavior of language models (LMs), as reflected in their outputs, with their internal representations (internals). Using this framework, we examine over 20 OLMo, Llama, and Mistral models, bridging behavioral and internal perspectives on toxicity for the first time. Our results show that LMs strongly encode information about the toxicity level of inputs and subsequent outputs, particularly in lower layers. Examining how individual LMs differ offers both correlative and causal evidence that they generate less toxic output when they strongly encode information about the input's toxicity. We also highlight the heterogeneity of toxicity: model behavior and internals vary across individual attributes such as Threat. Finally, four case studies analyzing detoxification, multi-prompt evaluation, model quantization, and pre-training dynamics underline the practical impact of aligned probing and provide further concrete insights. Our findings contribute to a more holistic understanding of LMs, both within and beyond the context of toxicity.
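A minimal sketch of the probing side of such a framework, assuming mean-pooled hidden states and per-prompt toxicity scores are already available (random placeholders below): one probe per layer quantifies how strongly each layer encodes toxicity.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical setup: hidden_states[l] holds mean-pooled activations of
# layer l for each prompt, and toxicity holds one scalar toxicity score
# per prompt (e.g., from an external classifier). Random placeholders
# stand in for real model internals; dimensions are illustrative.
rng = np.random.default_rng(0)
num_prompts, num_layers, dim = 500, 12, 64
hidden_states = rng.normal(size=(num_layers, num_prompts, dim))
toxicity = rng.uniform(size=num_prompts)

# Probe sweep: fit one linear probe per layer and record how well that
# layer's representations predict input toxicity on held-out prompts.
for layer in range(num_layers):
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden_states[layer], toxicity, test_size=0.2, random_state=0)
    probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
    print(f"layer {layer:2d}: held-out R^2 = {probe.score(X_te, y_te):.3f}")
```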
https://arxiv.org/abs/2503.13390
As large language models (LLMs) scale, model compression is crucial for edge deployment and accessibility. Weight-only quantization reduces model size but suffers from performance degradation at lower bit widths. Moreover, standard finetuning is incompatible with quantized models, and alternative methods often fall short of full finetuning. In this paper, we propose ClusComp, a simple yet effective compression paradigm that clusters weight matrices into codebooks and finetunes them block-by-block. ClusComp (1) achieves superior performance in 2-4 bit quantization, (2) pushes compression to 1-bit while outperforming ultra-low-bit methods with minimal finetuning, and (3) enables efficient finetuning, even surpassing existing quantization-based approaches and rivaling full FP16 finetuning. Notably, ClusComp supports compression and finetuning of 70B LLMs on a single A6000-48GB GPU.
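A rough sketch of the clustering step as described in the abstract (the block-by-block finetuning is omitted); the group size, codebook size, and matrix shape are toy assumptions, not ClusComp's actual settings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative ClusComp-style compression of one weight matrix: split the
# weights into short vectors, cluster them into a small codebook, and store
# only the codebook plus one index per vector.
rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512)).astype(np.float32)

group = 4                                   # weights per vector
vecs = W.reshape(-1, group)
sample = vecs[rng.choice(len(vecs), 20000, replace=False)]
km = KMeans(n_clusters=256, n_init=1, random_state=0).fit(sample)
codebook = km.cluster_centers_.astype(np.float32)   # (256, group)
idx = km.predict(vecs).astype(np.uint8)             # 8-bit index per vector

W_hat = codebook[idx].reshape(W.shape)              # reconstruction
bits_per_weight = (idx.nbytes + codebook.nbytes) * 8 / W.size
print(f"~{bits_per_weight:.2f} bits/weight, "
      f"MSE {np.mean((W - W_hat) ** 2):.5f}")
# ClusComp would now finetune the codebook entries block-by-block while
# keeping the cluster assignments fixed.
```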
https://arxiv.org/abs/2503.13089
Speculative decoding (SD) has emerged as a method to accelerate LLM inference without sacrificing accuracy relative to 16-bit model inference. In a typical SD setup, the idea is to use a full-precision, small, fast model as a "draft" to generate the next few tokens and use the large "target" model to verify the draft-generated tokens. The efficacy of this method heavily relies on the acceptance ratio of the draft-generated tokens and the relative token throughput of the draft versus the target model. Nevertheless, an efficient SD pipeline requires pre-training and aligning the draft model to the target model, making it impractical for LLM inference in a plug-and-play fashion. In this work, we propose using MXFP4 models as drafts in a plug-and-play fashion, since MXFP4 Weight-Only Quantization (WOQ) merely direct-casts the BF16 target model weights to MXFP4. In practice, our plug-and-play solution gives speedups of up to 2x over the BF16 baseline. We then pursue an opportunity for further acceleration: the MXFP4 draft token generation itself can be accelerated via speculative decoding by using yet another, smaller draft. We call our method ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts, since it recursively applies speculation to accelerate draft-token generation. Combining multi-level speculative decoding with MXFP4 quantized drafts, we outperform state-of-the-art speculative decoding, yielding speedups of up to 2.72x over the BF16 baseline.
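The control flow of a single speculative-decoding step can be sketched as follows; the models are stand-in callables, and greedy agreement replaces the usual stochastic verification rule for brevity. In ML-SpecQD, the draft would be the MXFP4 direct-cast of the BF16 target, and could itself be wrapped in the same loop with an even smaller draft.

```python
import torch

def speculative_step(draft_model, target_model, ctx, k=4):
    """One simplified speculative-decoding step with greedy verification.

    draft_model / target_model map a 1-D token tensor to per-position
    logits. Greedy acceptance is a simplification of the usual stochastic
    verification rule."""
    # 1) Draft proposes k tokens autoregressively (cheap, low precision).
    proposed = ctx.clone()
    for _ in range(k):
        nxt = draft_model(proposed)[-1].argmax()
        proposed = torch.cat([proposed, nxt[None]])

    # 2) Target scores all proposed positions in ONE forward pass:
    #    predictions for tokens len(ctx) .. len(proposed), i.e. k+1 of them.
    tgt = target_model(proposed)[len(ctx) - 1:].argmax(-1)

    # 3) Accept the longest agreeing prefix, then take one guaranteed
    #    token from the target model.
    drafted = proposed[len(ctx):]
    agree = (drafted == tgt[:-1]).long().cumprod(0)
    n_ok = int(agree.sum())
    return torch.cat([ctx, drafted[:n_ok], tgt[n_ok:n_ok + 1]])

# Toy usage with stand-in "models" (random logits over a small vocab).
vocab = 100
fake = lambda toks: torch.randn(len(toks), vocab)
ctx = torch.randint(vocab, (8,))
ctx = speculative_step(fake, fake, ctx, k=4)
print(len(ctx))  # between 9 and 13 tokens after one step
```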
https://arxiv.org/abs/2503.13565
Effective training and debriefing are critical in high-stakes, mission-critical environments such as disaster response, military simulations, and industrial safety, where precision and minimizing errors are paramount. Traditional post-training analysis relies on manually reviewing 2D videos, a time-consuming process that lacks comprehensive situational awareness. To address these limitations, we introduce ACT360, a system that leverages 360-degree videos and machine learning for automated action detection and structured debriefing. ACT360 integrates 360YOWO, an enhanced You Only Watch Once (YOWO) model with spatial attention and equirectangular-aware convolution (EAC) to mitigate panoramic video distortions. To enable deployment in resource-constrained environments, we apply quantization and model pruning, reducing the model size by 74% while maintaining robust accuracy (an mAP drop of only 1.5%, from 0.865 to 0.850) and improving inference speed. We validate our approach on a publicly available dataset of 55 labeled 360-degree videos covering seven key operational actions, recorded across various real-world training sessions and environmental conditions. Additionally, ACT360 integrates 360AIE (Action Insight Explorer), a web-based interface for automatic action detection, retrieval, and textual summarization using large language models (LLMs), significantly enhancing post-incident analysis efficiency. ACT360 serves as a generalized framework for mission-critical debriefing, incorporating EAC, spatial attention, summarization, and model optimization. These innovations apply to any training environment requiring lightweight action detection and structured post-exercise analysis.
https://arxiv.org/abs/2503.12852
3D Gaussian Splatting (3DGS) enables rapid differentiable rendering for 3D reconstruction and novel view synthesis, leading to its widespread commercial use. Consequently, copyright protection via watermarking has become critical. However, because 3DGS relies on millions of Gaussians that require gigabytes of storage, efficient transfer and storage demand compression. Existing 3DGS watermarking methods are vulnerable to quantization-based compression, often resulting in the loss of the embedded watermark. To address this challenge, we propose a novel watermarking method that ensures watermark robustness after model compression while maintaining high rendering quality. In detail, we incorporate a quantization distortion layer that simulates compression during training, preserving the watermark under quantization-based compression. We also propose a learnable watermark embedding feature that embeds the watermark into the anchor feature, ensuring structural consistency and seamless integration into the 3D scene. Furthermore, we present a frequency-aware anchor growing mechanism that enhances image quality in high-frequency regions by effectively identifying Gaussians within these regions. Experimental results confirm that our method preserves the watermark and maintains superior image quality under high compression, validating it as a promising approach for a secure 3DGS model.
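A quantization distortion layer can be approximated with a generic straight-through round-trip like the one below (a sketch, not the paper's exact layer): the forward pass sees quantization noise, while gradients flow as if the rounding were identity, so watermark losses can train through it.

```python
import torch
import torch.nn as nn

class QuantizationDistortionLayer(nn.Module):
    """Simulates quantization during training so downstream losses (e.g., a
    watermark decoder) see compression-like noise. A generic straight-through
    sketch with an assumed min-max quantization grid."""

    def __init__(self, num_bits: int = 8):
        super().__init__()
        self.levels = 2 ** num_bits - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        lo = x.detach().amin()
        hi = x.detach().amax()
        scale = (hi - lo).clamp_min(1e-8) / self.levels
        q = ((x - lo) / scale).round() * scale + lo   # quantize/dequantize
        # Straight-through estimator: forward uses q, backward treats the
        # rounding as identity so gradients still reach x.
        return x + (q - x).detach()

feat = torch.randn(4, 32, requires_grad=True)
out = QuantizationDistortionLayer(num_bits=4)(feat)
out.sum().backward()                 # gradients flow despite round()
print(feat.grad.abs().sum() > 0)     # tensor(True)
```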
https://arxiv.org/abs/2503.12836
We present a versatile latent representation that enables physically simulated characters to efficiently utilize motion priors. To build a powerful motion embedding that is shared across multiple tasks, the physics controller should employ a rich latent space that is easily explored and capable of generating high-quality motion. We propose integrating continuous and discrete latent representations to build a versatile motion prior that can be adapted to a wide range of challenging control tasks. Specifically, we build a discrete latent model that captures a distinctive posterior distribution without collapse, and simultaneously augment the sampled vector with continuous residuals to generate high-quality, smooth motion without jittering. We further incorporate Residual Vector Quantization, which not only maximizes the capacity of the discrete motion prior but also efficiently abstracts the action space during the task learning phase. We demonstrate that our agent can produce diverse yet smooth motions simply by traversing the learned motion prior through unconditional motion generation. Furthermore, our model robustly satisfies sparse goal conditions with highly expressive natural motions, including head-mounted device tracking and motion in-betweening at irregular intervals, which could not be achieved with existing latent representations.
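Residual Vector Quantization itself is compact enough to sketch; the toy code below also illustrates the hybrid trick, per our reading of the abstract, of carrying the leftover continuous residual alongside the discrete codes so the reconstruction stays smooth.

```python
import torch

def rvq_encode(z, codebooks):
    """Minimal residual vector quantization: each stage quantizes what the
    previous stages left over. codebooks: list of (K, D) tensors."""
    residual, codes, z_q = z.clone(), [], torch.zeros_like(z)
    for cb in codebooks:
        d = torch.cdist(residual, cb)      # distances to codewords, (N, K)
        idx = d.argmin(dim=-1)             # nearest codeword per vector
        picked = cb[idx]
        codes.append(idx)
        z_q = z_q + picked                 # accumulate discrete estimate
        residual = residual - picked       # pass the remainder onward
    return codes, z_q, residual

# Toy usage: two quantization stages over 16-dim latents.
torch.manual_seed(0)
z = torch.randn(8, 16)
codebooks = [torch.randn(64, 16), torch.randn(64, 16)]
codes, z_q, residual = rvq_encode(z, codebooks)

# A hybrid design adds the remaining continuous residual back onto the
# discrete reconstruction to avoid jitter:
z_hybrid = z_q + residual                  # equals z exactly here
print(torch.allclose(z_hybrid, z))         # True
```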
https://arxiv.org/abs/2503.12814
The growing volume of high-resolution Whole Slide Images in digital histopathology poses significant storage, transmission, and computational efficiency challenges. Standard compression methods, such as JPEG, reduce file sizes but often fail to preserve the fine-grained phenotypic details critical for downstream tasks. In this work, we repurpose autoencoders (AEs) designed for Latent Diffusion Models as an efficient learned compression framework for pathology images. We systematically benchmark three AE models with varying compression levels and evaluate their reconstruction ability using pathology foundation models. We introduce a fine-tuning strategy that optimizes a pathology-specific learned perceptual metric to further enhance reconstruction fidelity. We validate our approach on downstream tasks, including segmentation, patch classification, and multiple instance learning, showing that replacing images with AE-compressed reconstructions leads to minimal performance degradation. Additionally, we propose a K-means clustering-based quantization method for AE latents, improving storage efficiency while maintaining reconstruction quality. We provide the weights of the fine-tuned autoencoders at this https URL.
https://arxiv.org/abs/2503.11591
Quantization-Aware Training (QAT) is one of the prevailing neural network compression solutions. However, its stability has been challenged because the inevitable quantization error can yield deteriorating performance. We find that a sharp loss landscape, which leads to a dramatic performance drop, is an essential factor causing this instability. Theoretically, we discover that perturbations in the features lead to flat local minima. However, simply adding perturbations to either the weights or the features empirically deteriorates the performance of the Full Precision (FP) model. In this paper, we propose Feature-Perturbed Quantization (FPQ), which stochastically perturbs the features and applies feature distillation to the quantized model. Our method generalizes well to different network architectures and various QAT methods. Furthermore, we mathematically show that FPQ implicitly regularizes the Hessian norm, which calibrates the smoothness of the loss landscape. Extensive experiments demonstrate that our approach significantly outperforms current state-of-the-art (SOTA) QAT methods and even the FP counterparts.
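A hedged sketch of what an FPQ-style objective could look like, with the noise scale and distillation weight as illustrative hyperparameters: intermediate features of the quantized (student) model are stochastically perturbed and distilled toward the full-precision (teacher) features.

```python
import torch
import torch.nn.functional as F

def fpq_style_loss(student_feats, teacher_feats, task_loss,
                   noise_std=0.05, distill_weight=1.0):
    """Sketch of a feature-perturbed distillation objective: perturb the
    quantized model's intermediate features, then pull them toward the
    full-precision teacher's features. Hyperparameters are illustrative."""
    distill = 0.0
    for fs, ft in zip(student_feats, teacher_feats):
        fs_noisy = fs + noise_std * torch.randn_like(fs)  # feature perturbation
        distill = distill + F.mse_loss(fs_noisy, ft.detach())
    return task_loss + distill_weight * distill

# Toy usage: two intermediate feature maps per model (placeholders).
s = [torch.randn(2, 8, requires_grad=True), torch.randn(2, 4, requires_grad=True)]
t = [torch.randn(2, 8), torch.randn(2, 4)]
loss = fpq_style_loss(s, t, task_loss=torch.tensor(0.3))
loss.backward()
print(loss.item())
```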
https://arxiv.org/abs/2503.11159
Flat minima, known to enhance generalization and robustness in supervised learning, remain largely unexplored in generative models. In this work, we systematically investigate the role of loss-surface flatness in generative models, both theoretically and empirically, with a particular focus on diffusion models. We establish the theoretical claim that flatter minima improve robustness against perturbations in target prior distributions, leading to benefits such as reduced exposure bias (where errors in noise estimation accumulate over iterations) and significantly improved resilience to model quantization, preserving generative performance even under strong quantization constraints. We further observe that Sharpness-Aware Minimization (SAM), which explicitly controls the degree of flatness, effectively enhances flatness in diffusion models, whereas other well-known methods such as Stochastic Weight Averaging (SWA) and Exponential Moving Average (EMA), which promote flatness indirectly via ensembling, are less effective. Through extensive experiments on CIFAR-10, LSUN Tower, and FFHQ, we demonstrate that flat minima in diffusion models indeed improve not only generative performance but also robustness.
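For reference, the SAM update the paper finds effective is the standard two-pass procedure: an ascent step to the worst-case weights within an L2 ball of radius rho, then a descent step whose gradient is applied back at the original weights. A generic sketch, not the authors' training code:

```python
import torch

def sam_step(model, loss_fn, optimizer, rho=0.05):
    """One Sharpness-Aware Minimization step (generic form)."""
    # 1) Ascent: move weights to the (approximate) worst-case point
    #    within an L2 ball of radius rho around the current weights.
    loss_fn(model).backward()
    grads = [p.grad.clone() for p in model.parameters()]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)).clamp_min(1e-12)
    eps = [rho * g / norm for g in grads]
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.add_(e)
    optimizer.zero_grad()

    # 2) Descent: gradient at the perturbed point, undo the perturbation,
    #    then apply the optimizer update at the original weights.
    loss_fn(model).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()

# Toy usage: minimize ||Wx - y||^2 with SAM.
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(16, 4), torch.randn(16, 1)
loss_fn = lambda m: ((m(x) - y) ** 2).mean()
for _ in range(3):
    sam_step(model, loss_fn, opt)
```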
https://arxiv.org/abs/2503.11078
We present OuroMamba, the first data-free post-training quantization (DFQ) method for vision Mamba-based models (VMMs). We identify two key challenges in enabling DFQ for VMMs: (1) VMMs' recurrent state transitions restrict the capture of long-range interactions and lead to semantically weak synthetic data; (2) VMM activations exhibit dynamic outlier variations across time-steps, rendering existing static PTQ techniques ineffective. To address these challenges, OuroMamba presents a two-stage framework: (1) OuroMamba-Gen generates semantically rich and meaningful synthetic data by applying contrastive learning on patch-level VMM features generated through neighborhood interactions in the latent state space; (2) OuroMamba-Quant employs mixed-precision quantization with lightweight dynamic outlier detection during inference. Specifically, we present a threshold-based outlier channel selection strategy for activations that is updated at every time-step. Extensive experiments across vision and generative tasks show that our data-free OuroMamba surpasses existing data-driven PTQ techniques, achieving state-of-the-art performance across diverse quantization settings. Additionally, we implement efficient GPU kernels to achieve a practical latency speedup of up to 2.36x. Code will be released soon.
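The dynamic outlier handling can be illustrated with a simple threshold rule re-evaluated at every time-step; the threshold value and tensor shapes below are assumptions, not the paper's settings.

```python
import torch

def select_outlier_channels(act: torch.Tensor, tau: float = 4.0):
    """Threshold-based outlier channel selection: channels whose peak
    magnitude exceeds tau x the median peak stay in high precision; the
    rest get low-bit quantization. tau is an illustrative choice."""
    peak = act.abs().amax(dim=0)              # per-channel max, shape (C,)
    return peak > tau * peak.median()

def mixed_precision_quantize(act, outliers, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = act.abs().amax().clamp_min(1e-8) / qmax
    q = (act / scale).round().clamp(-qmax, qmax) * scale
    out = q.clone()
    out[:, outliers] = act[:, outliers]       # outlier channels stay FP
    return out

# Per-time-step usage: the outlier set is re-estimated at each step.
torch.manual_seed(0)
for t in range(3):
    act = torch.randn(64, 128)                # (tokens, channels)
    act[:, 7] *= 30.0                         # inject one outlier channel
    mask = select_outlier_channels(act)
    deq = mixed_precision_quantize(act, mask)
    print(t, int(mask.sum()), (act - deq).abs().mean().item())
```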
https://arxiv.org/abs/2503.10959
Tomato maturity plays a pivotal role in optimizing harvest timing and ensuring product quality, but current methods struggle to achieve high accuracy and computational efficiency simultaneously. Existing deep learning approaches, while accurate, are often too computationally demanding for practical use in resource-constrained agricultural settings. In contrast, simpler techniques fail to capture the nuanced features needed for precise classification. This study aims to develop a computationally efficient tomato classification model using the ResNet-18 architecture, optimized through transfer learning, pruning, and quantization techniques. Our objective is to address the dual challenge of maintaining high accuracy while enabling real-time performance on low-power edge devices. These models were then deployed on an edge device to investigate their performance for tomato maturity classification. The quantized model achieved an accuracy of 97.81%, with an average classification time of 0.000975 seconds per image. The pruned and auto-tuned model also demonstrated significant improvements in deployment metrics, further highlighting the benefits of optimization techniques. These results underscore the potential for a balanced solution that meets the accuracy and efficiency demands of modern agricultural production, paving the way for practical, real-world deployment in resource-limited environments.
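A rough sketch of the deployment-side measurement, using PyTorch dynamic quantization as a stand-in for the study's actual scheme (which the abstract does not specify); note that dynamic quantization only covers Linear layers, while static INT8 quantization of the convolutions would shrink a ResNet-18 much further.

```python
import time
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Stand-in measurement loop: quantize a ResNet-18 and time per-image
# inference, mirroring the kind of latency numbers reported above.
model = resnet18(weights=None).eval()      # untrained placeholder weights
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    for name, m in [("fp32", model), ("int8-dynamic", qmodel)]:
        m(x)                               # warm-up
        t0 = time.perf_counter()
        for _ in range(20):
            m(x)
        dt = (time.perf_counter() - t0) / 20
        print(f"{name}: {dt * 1e3:.2f} ms/image")
```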
https://arxiv.org/abs/2503.10940
Deep Neural Networks (DNNs) have become an integral part of our daily lives, especially in vision-related applications. However, conventional lossy image compression algorithms are primarily designed for the Human Vision System (HVS) and can non-trivially compromise DNNs' validation accuracy after compression, as noted in \cite{liu2018deepn}. Developing an image compression algorithm for both humans and machines (DNNs) is therefore on the horizon. To address this challenge, in this paper we first formulate image compression as a multi-objective optimization problem that takes both human and machine perspectives into account, then solve it via linear combination, and propose a novel distortion measure for both human and machine, dubbed Human and Machine-Oriented Error (HMOE). We then develop Human And Machine Oriented Soft Decision Quantization (HMOSDQ) based on HMOE, a lossy image compression algorithm for both humans and machines (DNNs) that is fully compliant with the JPEG format. Finally, to evaluate the performance of HMOSDQ, we conduct experiments with two well-known pre-trained DNN-based image classifiers, AlexNet \cite{Alexnet} and VGG-16 \cite{simonyan2014VGG}, on two subsets of the ImageNet \cite{deng2009imagenet} validation set: one containing images with shorter side in the range of 496 to 512, the other containing images with shorter side in the range of 376 to 384. Our results demonstrate that HMOSDQ outperforms the default JPEG algorithm in terms of rate-accuracy and rate-distortion performance. Compared with the default JPEG algorithm on AlexNet, HMOSDQ improves validation accuracy by more than $0.81\%$ at $0.61$ BPP, or equivalently reduces the compression rate of default JPEG by $9.6\times$ while maintaining the same validation accuracy.
https://arxiv.org/abs/2503.10912
Vector Quantization (VQ) techniques face significant challenges in codebook utilization, limiting reconstruction fidelity in image modeling. We introduce a Dual Codebook mechanism that effectively addresses this limitation by partitioning the representation into complementary global and local components. The global codebook employs a lightweight transformer for concurrent updates of all code vectors, while the local codebook maintains precise feature representation through deterministic selection. This complementary approach is trained from scratch without requiring pre-trained knowledge. Experimental evaluation across multiple standard benchmark datasets demonstrates state-of-the-art reconstruction quality while using a compact codebook of size 512, half the size used by previous methods that require pre-training. Our approach achieves significant FID improvements across diverse image domains, particularly excelling in scene and face reconstruction tasks. These results establish Dual Codebook VQ as an efficient paradigm for high-fidelity image reconstruction with significantly reduced computational requirements.
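One way to read the mechanism is as two nearest-neighbour quantizers over complementary channel halves; the sketch below uses deterministic lookup for both branches and omits the lightweight transformer that updates the global codebook. Sizes mirror the 512-entry total but are otherwise illustrative.

```python
import torch
import torch.nn as nn

class DualCodebookQuantizer(nn.Module):
    """Sketch of a dual-codebook VQ layer: features are split along the
    channel axis into a 'global' half and a 'local' half, each quantized
    against its own codebook (k entries in total)."""

    def __init__(self, dim: int = 64, k: int = 512):
        super().__init__()
        assert dim % 2 == 0 and k % 2 == 0
        self.global_cb = nn.Parameter(torch.randn(k // 2, dim // 2))
        self.local_cb = nn.Parameter(torch.randn(k // 2, dim // 2))

    @staticmethod
    def nearest(z, cb):
        idx = torch.cdist(z, cb).argmin(-1)
        z_q = cb[idx]
        # Straight-through so encoder gradients pass the discrete lookup.
        return z + (z_q - z).detach(), idx

    def forward(self, z):                       # z: (N, dim)
        zg, zl = z.chunk(2, dim=-1)
        qg, ig = self.nearest(zg, self.global_cb)
        ql, il = self.nearest(zl, self.local_cb)
        return torch.cat([qg, ql], dim=-1), (ig, il)

vq = DualCodebookQuantizer()
z_q, codes = vq(torch.randn(10, 64))
print(z_q.shape, codes[0].shape)    # torch.Size([10, 64]) torch.Size([10])
```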
https://arxiv.org/abs/2503.10832
Automated speech recognition (ASR) models have gained prominence for applications such as captioning, speech translation, and live transcription. This paper studies Whisper and two model variants: one optimized for live speech streaming and another for offline transcription. Notably, these models have been found to generate hallucinated content, reducing transcription reliability. Furthermore, larger model variants exhibit increased latency and pose challenges for deployment on resource-constrained devices. This study analyzes the similarities and differences between the three Whisper models, qualitatively examining their distinct capabilities. Next, it quantifies the impact of model quantization on latency and evaluates its viability for edge deployment. Using the open-source LibriSpeech dataset, this paper evaluates the word error rate (WER) and latency of whispercpp under three quantization methods (INT4, INT5, INT8). Results show that quantization reduces latency by 19\% and model size by 45\% while preserving transcription accuracy. These findings provide insights into the optimal use cases of different Whisper models and the possibilities for edge device deployment. All code, datasets, and implementation details are available in a public GitHub repository: this https URL
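The WER side of such an evaluation is nearly a one-liner with the jiwer package; the transcripts below are hard-coded placeholders standing in for LibriSpeech references and whispercpp output at a given quantization level.

```python
import jiwer

# Toy WER check of the kind used to compare quantized Whisper variants.
# References would come from LibriSpeech transcripts, hypotheses from
# whispercpp output at a given quantization level (both mocked here).
references = [
    "the quick brown fox jumps over the lazy dog",
    "speech recognition systems transcribe spoken language",
]
hypotheses_int4 = [
    "the quick brown fox jumped over the lazy dog",
    "speech recognition systems transcribe spoken language",
]
wer = jiwer.wer(references, hypotheses_int4)
print(f"WER (INT4): {wer:.3f}")   # fraction of word-level errors
```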
https://arxiv.org/abs/2503.09905
Visual Mamba networks (ViMs) extend the selective state space model (Mamba) to various vision tasks and demonstrate significant potential. Vector quantization (VQ), on the other hand, decomposes network weights into codebooks and assignments, significantly reducing memory usage and computational latency to enable ViM deployment on edge devices. Although existing VQ methods have achieved extremely low-bit quantization (e.g., 3-bit, 2-bit, and 1-bit) in convolutional neural networks and Transformer-based networks, directly applying these methods to ViMs results in unsatisfactory accuracy. We identify several key challenges: 1) The weights of Mamba-based blocks in ViMs contain numerous outliers, significantly amplifying quantization errors. 2) When applied to ViMs, the latest VQ methods suffer from excessive memory consumption, lengthy calibration procedures, and suboptimal performance in the search for optimal codewords. In this paper, we propose ViM-VQ, an efficient post-training vector quantization method tailored for ViMs. ViM-VQ consists of two innovative components: 1) a fast convex combination optimization algorithm that efficiently updates both the convex combinations and the convex hulls to search for optimal codewords, and 2) an incremental vector quantization strategy that incrementally confirms optimal codewords to mitigate truncation errors. Experimental results demonstrate that ViM-VQ achieves state-of-the-art performance in low-bit quantization across various visual tasks.
https://arxiv.org/abs/2503.09509
The rapid rise of Language Models (LMs) has expanded the capabilities of natural language processing, powering applications from text generation to complex decision-making. While state-of-the-art LMs often boast hundreds of billions of parameters and are primarily deployed in data centers, recent trends show a growing focus on compact models, typically under 10 billion parameters, enabled by quantization and other model compression techniques. This shift paves the way for LMs on edge devices, offering potential benefits such as enhanced privacy, reduced latency, and improved data sovereignty. However, the inherent complexity of even these smaller models, combined with the limited computing resources of edge hardware, raises critical questions about the practical trade-offs in executing LM inference outside the cloud. To address these challenges, we present a comprehensive evaluation of generative LM inference on representative CPU-based and GPU-accelerated edge devices. Our study measures key performance indicators, including memory usage, inference speed, and energy consumption, across various device configurations. Additionally, we examine throughput-energy trade-offs, cost considerations, and usability, alongside an assessment of qualitative model performance. While quantization helps mitigate memory overhead, it does not fully eliminate resource bottlenecks, especially for larger models. Our findings quantify the memory and energy constraints that must be considered for practical real-world deployments, offering concrete insights into the trade-offs between model size, inference performance, and efficiency. The exploration of LMs at the edge is still in its early stages. We hope this study provides a foundation for future research, guiding the refinement of models, the enhancement of inference efficiency, and the advancement of edge-centric AI systems.
https://arxiv.org/abs/2503.09114
Vector Quantization (VQ) has emerged as a prominent weight compression technique, showcasing substantially lower quantization errors than uniform quantization across diverse models, particularly in extreme compression scenarios. However, its efficacy during fine-tuning is limited by the constraint of the compression format, where weight vectors assigned to the same codeword are restricted to updates in the same direction. Consequently, many quantized weights are compelled to move in directions contrary to their local gradient information. To mitigate this issue, we introduce a novel VQ paradigm, Sign-Splitting VQ (SSVQ), which decouples the sign bit of weights from the codebook. Our approach involves extracting the sign bits of uncompressed weights and performing clustering and compression on all-positive weights. We then introduce latent variables for the sign bit and jointly optimize both the signs and the codebook. Additionally, we implement a progressive freezing strategy for the learnable sign to ensure training stability. Extensive experiments on various modern models and tasks demonstrate that SSVQ achieves a significantly superior compression-accuracy trade-off compared to conventional VQ. Furthermore, we validate our algorithm on a hardware accelerator, showing that SSVQ achieves a 3$\times$ speedup over the 8-bit compressed model by reducing memory access.
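The core decoupling is easy to demonstrate: keep one exact sign bit per weight and cluster only the all-positive magnitudes. The latent-variable sign optimization and progressive freezing from the paper are omitted; shapes and codebook size are toy choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sign-splitting sketch: store a 1-bit sign per weight separately and run
# vector quantization on magnitudes only, so codewords no longer couple
# the signs of weights that share a codeword.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)

signs = np.sign(W)                        # 1 bit/weight, kept exactly
mags = np.abs(W).reshape(-1, 4)           # all-positive vectors to cluster
km = KMeans(n_clusters=256, n_init=1, random_state=0).fit(mags)
idx = km.predict(mags).astype(np.uint8)
W_hat = km.cluster_centers_[idx].reshape(W.shape) * signs

# Compare against plain VQ on signed weights at the same index budget.
km2 = KMeans(n_clusters=256, n_init=1, random_state=0).fit(W.reshape(-1, 4))
W_vq = km2.cluster_centers_[km2.predict(W.reshape(-1, 4))].reshape(W.shape)
print("sign-split VQ MSE:", float(np.mean((W - W_hat) ** 2)))
print("plain VQ MSE:     ", float(np.mean((W - W_vq) ** 2)))
```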
https://arxiv.org/abs/2503.08668
3D Gaussian Splatting (3DGS) achieves impressive rendering fidelity and speed for novel view synthesis. However, its substantial data size poses a significant challenge for practical applications. While many compression techniques have been proposed, their lack of progressivity prevents them from efficiently reusing existing bitstreams in on-demand applications, leading to a waste of resources. To address this issue, we propose PCGS (Progressive Compression of 3D Gaussian Splatting), which adaptively controls both the quantity and quality of Gaussians (or anchors) to enable effective progressivity for on-demand applications. Specifically, for quantity, we introduce a progressive masking strategy that incrementally incorporates new anchors while refining existing ones to enhance fidelity. For quality, we propose a progressive quantization approach that gradually reduces quantization step sizes to achieve finer modeling of Gaussian attributes. Furthermore, to compact the incremental bitstreams, we leverage existing quantization results to refine probability prediction, improving entropy coding efficiency across progressive levels. Overall, PCGS achieves progressivity while maintaining compression performance comparable to SoTA non-progressive methods. Code available at: this http URL.
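The progressive quantization component can be sketched in a few lines: each level shrinks the step size and encodes only the refinement over the previous level, so coarser bitstreams are reused rather than replaced. The step schedule below is illustrative, not the paper's.

```python
import numpy as np

def progressive_quantize(x, steps=(0.5, 0.25, 0.125)):
    """Sketch of progressive quantization of an attribute vector: each
    level halves the quantization step and transmits only the increment
    over the previous reconstruction."""
    recon = np.zeros_like(x)
    for q in steps:
        residual = x - recon                 # what coarser levels missed
        inc = np.round(residual / q)         # the only new data to code
        recon = recon + inc * q
        yield q, inc.astype(np.int32), recon.copy()

x = np.random.default_rng(0).normal(size=8).astype(np.float32)
for q, inc, recon in progressive_quantize(x):
    # Max error shrinks with the step size (bounded by q/2 per level).
    print(f"step {q:5.3f}  max error {np.abs(x - recon).max():.4f}")
```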
https://arxiv.org/abs/2503.08511