Diffusion models represent the cutting edge in image generation, but their high memory and computational demands hinder deployment on resource-constrained devices. Post-Training Quantization (PTQ) offers a promising solution by reducing the bitwidth of matrix operations. However, standard PTQ methods struggle with outliers, and achieving higher compression often requires transforming model weights and activations before quantization. In this work, we propose HadaNorm, a novel linear transformation that extends existing approaches and effectively mitigates outliers by normalizing activation feature channels before applying Hadamard transformations, enabling more aggressive activation quantization. We demonstrate that HadaNorm consistently reduces quantization error across the various components of transformer blocks, achieving superior efficiency-performance trade-offs when compared to state-of-the-art methods.
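As a rough illustration of the idea, the sketch below normalizes activation channels with per-channel statistics, applies an orthonormal Hadamard rotation, and then quantizes. The bit-width, normalization statistics, and function names are assumptions for illustration rather than the paper's implementation; in practice the normalization and rotation would be folded into neighboring layers so inference stays mathematically equivalent.

```python
# Hedged sketch: per-channel normalization + Hadamard rotation before quantization.
import numpy as np
from scipy.linalg import hadamard

def quantize_int(x, bits=4):
    """Symmetric uniform quantization to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

def hadanorm_quantize(acts, bits=4, eps=1e-6):
    """acts: (tokens, channels); channel count must be a power of two here."""
    d = acts.shape[-1]
    mu = acts.mean(axis=0, keepdims=True)           # per-channel mean
    sigma = acts.std(axis=0, keepdims=True) + eps   # per-channel scale
    normed = (acts - mu) / sigma                    # flatten channel-wise outliers
    H = hadamard(d) / np.sqrt(d)                    # orthonormal Hadamard matrix
    rotated = normed @ H                            # spread remaining outliers
    return quantize_int(rotated, bits)

acts = np.random.randn(128, 64) * np.array([1.0] * 63 + [20.0])  # one outlier channel
q = hadanorm_quantize(acts, bits=4)
```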
https://arxiv.org/abs/2506.09932
The Segment Anything Model 2 (SAM2) has gained significant attention as a foundational approach for promptable image and video segmentation. However, its high computational and memory cost poses a severe challenge for its application in resource-constrained scenarios. In this paper, we propose an accurate low-bit quantization method for efficient SAM2, termed Q-SAM2. To address the performance degradation caused by the singularities in weight and activation distributions during quantization, Q-SAM2 introduces two novel technical contributions. We first introduce a linear layer calibration method for low-bit initialization of SAM2, which minimizes the Frobenius norm over a small image batch to reposition weight distributions for improved quantization. We then propose a Quantization-Aware Training (QAT) pipeline that applies clipping to suppress outliers and allows the network to adapt to quantization thresholds during training. Our comprehensive experiments demonstrate that Q-SAM2 allows for highly accurate inference while substantially improving efficiency. Both quantitative and visual results show that our Q-SAM2 surpasses existing state-of-the-art general quantization schemes, especially for ultra-low 2-bit quantization. While designed for quantization-aware training, our proposed calibration technique also proves effective in post-training quantization, achieving up to a 66% mIoU accuracy improvement over non-calibrated models.
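One plausible reading of the calibration step is sketched below: search for a per-layer weight clipping factor that minimizes the Frobenius norm between full-precision and quantized linear-layer outputs on a small calibration batch. The grid search and symmetric quantizer are illustrative assumptions, not the authors' procedure.

```python
# Hedged sketch of output-reconstruction calibration for a linear layer.
import torch

def quantize_weight(w, bits, clip):
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax
    return (w.clamp(-clip, clip) / scale).round().clamp(-qmax, qmax) * scale

def calibrate_linear(weight, x_calib, bits=2, grid=50):
    """weight: (out, in); x_calib: (batch, in). Returns the best clipping value."""
    y_ref = x_calib @ weight.t()                    # full-precision reference output
    w_max = weight.abs().max().item()
    best_clip, best_err = w_max, float("inf")
    for i in range(1, grid + 1):
        clip = w_max * i / grid
        y_q = x_calib @ quantize_weight(weight, bits, clip).t()
        err = torch.norm(y_ref - y_q, p="fro")      # Frobenius-norm output error
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip

w = torch.randn(256, 128)
x = torch.randn(32, 128)    # features from a few calibration images (placeholder)
clip = calibrate_linear(w, x, bits=2)
```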
https://arxiv.org/abs/2506.09782
Despite advancements in device capabilities, efficient inference of advanced large language models (LLMs) at the edge remains challenging due to limited device memory and power constraints. Existing strategies, such as aggressive quantization, pruning, or remote inference, trade accuracy for efficiency or lead to substantial cost burdens. This position paper introduces a new approach that repurposes speculative decoding, previously viewed primarily as a decoding acceleration technique for autoregressive LLM generation, for edge computing by orchestrating computation across heterogeneous devices. We propose SLED, a method that allows lightweight edge devices to draft multiple candidate tokens locally using diverse draft models, while a single, shared edge server efficiently batches and verifies the tokens using a more precise target model. This approach supports device heterogeneity and reduces the server-side memory footprint by avoiding the need to deploy multiple target models. Our initial experiments with Jetson Orin Nano, Raspberry Pi 5, and an RTX 6000 edge server indicate substantial benefits: significantly reduced latency, improved energy efficiency, and increased concurrent inference sessions, all without sacrificing model accuracy.
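The draft-and-verify loop that SLED builds on can be sketched as follows; `draft_model` and `target_model` are assumed to be callables returning next-token logits of shape (batch, seq, vocab), greedy verification stands in for the full acceptance rule, and the cross-device batching described in the abstract is omitted.

```python
# Simplified speculative-decoding step: a small on-device model drafts k tokens,
# the server-side target model checks them in one batched forward pass and keeps
# the longest agreeing prefix plus its own correction for the first mismatch.
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, prefix, k=4):
    # 1) Edge device: draft k tokens autoregressively with the small model.
    draft = prefix.clone()
    for _ in range(k):
        logits = draft_model(draft)[:, -1]
        draft = torch.cat([draft, logits.argmax(-1, keepdim=True)], dim=-1)

    # 2) Server: one forward pass of the target model over prefix + drafted tokens.
    target_logits = target_model(draft)[:, prefix.shape[1] - 1:-1]
    target_tokens = target_logits.argmax(-1)
    drafted = draft[:, prefix.shape[1]:]

    # 3) Accept the longest prefix where draft and target agree, then append the
    #    target model's prediction at the first mismatch (if any).
    agree = (drafted == target_tokens)[0].long()
    n_accept = int(agree.cumprod(0).sum())
    accepted = drafted[:, :n_accept]
    correction = target_tokens[:, n_accept:n_accept + 1]
    return torch.cat([prefix, accepted, correction], dim=-1)
```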
https://arxiv.org/abs/2506.09397
This paper presents the deployment and performance evaluation of a quantized YOLOv4-Tiny model for real-time object detection in aerial emergency imagery on a resource-constrained edge device, the Raspberry Pi 5. The YOLOv4-Tiny model was quantized to INT8 precision using TensorFlow Lite post-training quantization techniques and evaluated for detection speed, power consumption, and thermal feasibility under embedded deployment conditions. The quantized model achieved an inference time of 28.2 ms per image with an average power consumption of 13.85 W, demonstrating a significant reduction in power usage compared to its FP32 counterpart. Detection accuracy remained robust across key emergency classes such as Ambulance, Police, Fire Engine, and Car Crash. These results highlight the potential of low-power embedded AI systems for real-time deployment in safety-critical emergency response applications.
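For reference, a representative TensorFlow Lite post-training INT8 conversion of the kind described looks like the sketch below; the SavedModel path, input resolution, and calibration data are placeholders, since the paper's exact export settings are not given here.

```python
# Representative full-integer post-training quantization with TensorFlow Lite.
import tensorflow as tf
import numpy as np

def representative_dataset():
    # Yield a few hundred preprocessed frames for calibration (placeholder data).
    for _ in range(200):
        frame = np.random.rand(1, 416, 416, 3).astype(np.float32)
        yield [frame]

converter = tf.lite.TFLiteConverter.from_saved_model("yolov4_tiny_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

with open("yolov4_tiny_int8.tflite", "wb") as f:
    f.write(converter.convert())
```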
https://arxiv.org/abs/2506.09300
This paper presents a lightweight and energy-efficient object detection solution for aerial imagery captured during emergency response situations. We focus on deploying the YOLOv4-Tiny model, a compact convolutional neural network, optimized through post-training quantization to INT8 precision. The model is trained on a custom-curated aerial emergency dataset, consisting of 10,820 annotated images covering critical emergency scenarios. Unlike prior works that rely on publicly available datasets, we created this dataset ourselves due to the lack of publicly available drone-view emergency imagery, making the dataset itself a key contribution of this work. The quantized model is evaluated against YOLOv5-small across multiple metrics, including mean Average Precision (mAP), F1 score, inference time, and model size. Experimental results demonstrate that the quantized YOLOv4-Tiny achieves comparable detection performance while reducing the model size from 22.5 MB to 6.4 MB and improving inference speed by 44%. With a 71% reduction in model size and a 44% increase in inference speed, the quantized YOLOv4-Tiny model proves highly suitable for real-time emergency detection on low-power edge devices.
https://arxiv.org/abs/2506.09299
We investigate the design of pooling methods used to summarize the outputs of transformer embedding models, primarily motivated by reinforcement learning and vision applications. This work considers problems where a subset of the input vectors contains requisite information for a downstream task (signal) while the rest are distractors (noise). By framing pooling as vector quantization with the goal of minimizing signal loss, we demonstrate that the standard methods used to aggregate transformer outputs, AvgPool, MaxPool, and ClsToken, are vulnerable to performance collapse as the signal-to-noise ratio (SNR) of inputs fluctuates. We then show that an attention-based adaptive pooling method can approximate the signal-optimal vector quantizer within derived error bounds for any SNR. Our theoretical results are first validated by supervised experiments on a synthetic dataset designed to isolate the SNR problem, then generalized to standard relational reasoning, multi-agent reinforcement learning, and vision benchmarks with noisy observations, where transformers with adaptive pooling display superior robustness across tasks.
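A minimal attention-based pooling head of the kind the paper argues for is sketched below: a learned query attends over the transformer's output tokens, so signal tokens can dominate the summary even when distractors are numerous. The module is illustrative, not the paper's exact architecture.

```python
# Learned-query attention pooling as an alternative to AvgPool/MaxPool/ClsToken.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) / dim ** 0.5)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, tokens):                     # tokens: (batch, n, dim)
        k, v = self.key(tokens), self.value(tokens)
        attn = (self.query @ k.transpose(1, 2)) / k.shape[-1] ** 0.5
        weights = attn.softmax(dim=-1)             # (batch, 1, n) token weights
        return (weights @ v).squeeze(1)            # (batch, dim) pooled summary

pooled = AttentionPool(dim=64)(torch.randn(8, 32, 64))
```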
https://arxiv.org/abs/2506.09215
As the rapid scaling of large language models (LLMs) poses significant challenges for deployment on resource-constrained devices, there is growing interest in extremely low-bit quantization, such as 2-bit. Although prior works have shown that 2-bit large models are Pareto-optimal over their 4-bit smaller counterparts in both accuracy and latency, these advancements have been limited to pre-trained LLMs and have not yet been extended to instruction-tuned models. To bridge this gap, we propose Unified Progressive Quantization (UPQ), a novel progressive quantization framework (FP16 → INT4 → INT2) that unifies block-wise post-training quantization (PTQ) with distillation-based quantization-aware training (Distill-QAT) for INT2 instruction-tuned LLM quantization. UPQ first quantizes FP16 instruction-tuned models to INT4 using block-wise PTQ to significantly reduce the quantization error introduced by subsequent INT2 quantization. Next, UPQ applies Distill-QAT to enable INT2 instruction-tuned LLMs to generate responses consistent with their original FP16 counterparts by minimizing the generalized Jensen-Shannon divergence (JSD) between the two. To the best of our knowledge, we are the first to demonstrate that UPQ can quantize open-source instruction-tuned LLMs to INT2 without relying on proprietary post-training data, while achieving state-of-the-art performances on MMLU and IFEval, two of the most representative benchmarks for evaluating instruction-tuned LLMs.
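The distillation objective can be written as a generalized JSD between the teacher's and the student's next-token distributions; the sketch below assumes an interpolation weight beta and token-level averaging, neither of which is specified in the abstract.

```python
# Generalized Jensen-Shannon divergence between FP16 teacher and INT2 student logits.
import torch
import torch.nn.functional as F

def generalized_jsd(student_logits, teacher_logits, beta=0.5):
    p = F.softmax(teacher_logits, dim=-1)            # teacher distribution
    q = F.softmax(student_logits, dim=-1)            # student distribution
    m = beta * p + (1.0 - beta) * q                  # mixture distribution
    kl_pm = (p * (p.clamp_min(1e-9).log() - m.clamp_min(1e-9).log())).sum(-1)
    kl_qm = (q * (q.clamp_min(1e-9).log() - m.clamp_min(1e-9).log())).sum(-1)
    return (beta * kl_pm + (1.0 - beta) * kl_qm).mean()

loss = generalized_jsd(torch.randn(4, 16, 32000), torch.randn(4, 16, 32000))
```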
https://arxiv.org/abs/2506.09104
This paper presents a keyword spotting (KWS) system implemented on the NXP MCXN947 microcontroller with an integrated Neural Processing Unit (NPU), enabling real-time voice interaction on resource-constrained devices. The system combines MFCC feature extraction with a CNN classifier, optimized using Quantization Aware Training to reduce model size with minimal accuracy drop. Experimental results demonstrate a 59x speedup in inference time when leveraging the NPU compared to CPU-only execution, achieving 97.06% accuracy with a model size of 30.58 KB, demonstrating the feasibility of efficient, low-power voice interfaces on embedded platforms.
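A common way to realize the described QAT step is the TensorFlow Model Optimization toolkit, sketched below with a stand-in CNN; the layer sizes, MFCC input shape, and number of keyword classes are assumptions rather than the paper's exact topology.

```python
# Quantization-aware training for a small keyword-spotting CNN on MFCC features.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(49, 13, 1)),  # MFCC frames x coefficients
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(12, activation="softmax"),       # example keyword classes
])

# Insert fake-quant nodes so the network adapts to INT8 thresholds during training.
qat_model = tfmot.quantization.keras.quantize_model(model)
qat_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
# qat_model.fit(train_ds, validation_data=val_ds, epochs=30)  # datasets not shown
```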
https://arxiv.org/abs/2506.08911
The increasing complexity of AI models requires flexible hardware capable of supporting diverse precision formats, particularly for energy-constrained edge platforms. This work presents PARV-CE, a SIMD-enabled, multi-precision MAC engine that performs efficient multiply-accumulate operations using a unified data-path for 4/8/16-bit fixed-point, floating-point, and posit formats. The architecture incorporates a layer-adaptive precision strategy to align computational accuracy with workload sensitivity, optimizing both performance and energy usage. PARV-CE integrates quantization-aware execution with a reconfigurable SIMD pipeline, enabling high-throughput processing with minimal overhead through hardware-software co-design. The results demonstrate up to a 2x improvement in PDP and a 3x reduction in resource usage compared to SoTA designs, while retaining accuracy within 1.8% of the FP32 baseline. The architecture supports both on-device training and inference across a range of workloads, including DNNs, RNNs, RL, and Transformer models. The empirical analysis establishes the PARV-CE-based POLARON as a scalable and energy-efficient solution for precision-adaptive AI acceleration at the edge.
https://arxiv.org/abs/2506.08785
The continuous improvements in image compression with variational autoencoders have led to learned codecs competitive with conventional approaches in terms of rate-distortion efficiency. Nonetheless, taking the quantization into account during the training process remains a problem, since it produces zero derivatives almost everywhere and needs to be replaced with a differentiable approximation which allows end-to-end optimization. Though there are different methods for approximating the quantization, none of them model the quantization noise correctly and thus result in suboptimal networks. Hence, we propose an additional finetuning training step: After conventional end-to-end training, parts of the network are retrained on quantized latents obtained at the inference stage. For entropy-constrained quantizers like Trellis-Coded Quantization, the impact of the quantizer is particularly difficult to approximate by rounding or adding noise, as the quantized latents are interdependently chosen through a trellis search based on both the entropy model and a distortion measure. We show that retraining on correctly quantized data consistently yields additional coding gain for both uniform scalar and especially for entropy-constrained quantization, without increasing inference complexity. For the Kodak test set, we obtain average savings between 1% and 2%, and for the TecNick test set up to 2.2% in terms of Bjøntegaard-Delta bitrate.
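The proposed extra stage can be sketched as follows: freeze the encoder after conventional training, produce the latents with the true inference-time quantizer, and retrain only the decoder on them. The model objects, data loader, and rate-distortion loss are placeholders; for Trellis-Coded Quantization, the rounding below would be replaced by the trellis search.

```python
# Hedged sketch of finetuning the decoder on genuinely quantized latents.
import torch

def finetune_decoder(encoder, decoder, rate_distortion_loss, loader, steps=1000):
    encoder.eval()                                   # encoder stays frozen
    opt = torch.optim.Adam(decoder.parameters(), lr=1e-5)
    for step, x in enumerate(loader):
        if step >= steps:
            break
        with torch.no_grad():
            y_hat = torch.round(encoder(x))          # true inference-time quantizer
        x_hat = decoder(y_hat)
        loss = rate_distortion_loss(x, x_hat, y_hat) # distortion + rate estimate
        opt.zero_grad()
        loss.backward()
        opt.step()
```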
https://arxiv.org/abs/2506.08662
The rapid adoption of Small Language Models (SLMs) for on-device and resource-constrained deployments has outpaced our understanding of their ethical risks. To the best of our knowledge, we present the first large-scale audit of instruction-tuned SLMs spanning 0.5 to 5 billion parameters, an overlooked "middle tier" between BERT-class encoders and flagship LLMs. Our evaluation includes nine open-source models from the Qwen 2.5, LLaMA 3.2, Gemma 3, and Phi families. Using the BBQ benchmark under zero-shot prompting, we analyze both utility and fairness across ambiguous and disambiguated contexts. This evaluation reveals three key insights. First, competence and fairness need not be antagonistic: Phi models achieve F1 scores exceeding 90 percent while exhibiting minimal bias, showing that efficient and ethical NLP is attainable. Second, social bias varies significantly by architecture: Qwen 2.5 models may appear fair, but this often reflects vacuous neutrality, random guessing, or evasive behavior rather than genuine ethical alignment. In contrast, LLaMA 3.2 models exhibit stronger stereotypical bias, suggesting overconfidence rather than neutrality. Third, compression introduces nuanced trade-offs: 4-bit AWQ quantization improves F1 scores in ambiguous settings for LLaMA 3.2-3B but increases disability-related bias in Phi-4-Mini by over 7 percentage points. These insights provide practical guidance for the responsible deployment of SLMs in applications demanding fairness and efficiency, particularly benefiting small enterprises and resource-constrained environments.
https://arxiv.org/abs/2506.08487
Commonly used image tokenizers produce a 2D grid of spatially arranged tokens. In contrast, so-called 1D image tokenizers represent images as highly compressed one-dimensional sequences of as few as 32 discrete tokens. We find that the high degree of compression achieved by a 1D tokenizer with vector quantization enables image editing and generative capabilities through heuristic manipulation of tokens, demonstrating that even very crude manipulations -- such as copying and replacing tokens between latent representations of images -- enable fine-grained image editing by transferring appearance and semantic attributes. Motivated by the expressivity of the 1D tokenizer's latent space, we construct an image generation pipeline leveraging gradient-based test-time optimization of tokens with plug-and-play loss functions such as reconstruction or CLIP similarity. Our approach is demonstrated for inpainting and text-guided image editing use cases, and can generate diverse and realistic samples without requiring training of any generative model.
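A hedged sketch of the generation pipeline's core loop: continuous latents over the 1D tokenizer's codebook are optimized by gradient descent against a plug-and-play loss, with a straight-through projection keeping them on the codebook. The frozen decoder, codebook, and loss function are placeholders for whatever pretrained components are used.

```python
# Gradient-based test-time optimization of a 1D token sequence (illustrative).
import torch

def optimize_tokens(decoder, codebook, loss_fn, n_tokens=32, steps=200, lr=0.1):
    # codebook: (K, d) embedding table of the frozen VQ tokenizer.
    init = codebook[torch.randint(len(codebook), (n_tokens,))]
    z = init.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        # Snap each latent to its nearest codebook entry (straight-through).
        d = torch.cdist(z, codebook)                # (n_tokens, K) distances
        z_q = codebook[d.argmin(dim=-1)]
        z_st = z + (z_q - z).detach()
        image = decoder(z_st.unsqueeze(0))          # decode the token sequence
        loss = loss_fn(image)                       # e.g. masked reconstruction or CLIP
        opt.zero_grad()
        loss.backward()
        opt.step()
    return image.detach()
```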
https://arxiv.org/abs/2506.08257
This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and BitCPM, a data-efficient ternary LLM. Regarding inference systems, we propose an inference framework that integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Extensive evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, highlighting both its efficiency and effectiveness. Notably, MiniCPM4-8B demonstrates significant speed improvements over Qwen3-8B when processing long sequences. Through further adaptation, MiniCPM4 successfully powers diverse applications, including trustworthy survey generation and tool use with model context protocol, clearly showcasing its broad usability.
https://arxiv.org/abs/2506.07900
Recent advancements in large language models (LLMs) have revitalized philosophical debates surrounding artificial intelligence. Two of the most fundamental challenges - namely, the Frame Problem and the Symbol Grounding Problem - have historically been viewed as unsolvable within traditional symbolic AI systems. This study investigates whether modern LLMs possess the cognitive capacities required to address these problems. To do so, I designed two benchmark tasks reflecting the philosophical core of each problem, administered them under zero-shot conditions to 13 prominent LLMs (both closed and open-source), and assessed the quality of the models' outputs across five trials each. Responses were scored along multiple criteria, including contextual reasoning, semantic coherence, and information filtering. The results demonstrate that while open-source models showed variability in performance due to differences in model size, quantization, and instruction tuning, several closed models consistently achieved high scores. These findings suggest that select modern LLMs may be acquiring capacities sufficient to produce meaningful and stable responses to these long-standing theoretical challenges.
https://arxiv.org/abs/2506.07896
With the remarkable progress in neural P-frame video coding, neural B-frame coding has recently emerged as a critical research direction. However, most existing neural B-frame codecs directly adopt P-frame coding tools without adequately addressing the unique challenges of B-frame compression, leading to suboptimal performance. To bridge this gap, we propose novel enhancements for motion compression and temporal fusion for neural B-frame coding. First, we design a fine-grained motion compression method. This method incorporates an interactive dual-branch motion auto-encoder with per-branch adaptive quantization steps, which enables fine-grained compression of bi-directional motion vectors while accommodating their asymmetric bitrate allocation and reconstruction quality requirements. Furthermore, this method involves an interactive motion entropy model that exploits correlations between bi-directional motion latent representations by interactively leveraging partitioned latent segments as directional priors. Second, we propose a selective temporal fusion method that predicts bi-directional fusion weights to achieve discriminative utilization of bi-directional multi-scale temporal contexts with varying qualities. Additionally, this method introduces a hyperprior-based implicit alignment mechanism for contextual entropy modeling. By treating the hyperprior as a surrogate for the contextual latent representation, this mechanism implicitly mitigates the misalignment in the fused bi-directional temporal priors. Extensive experiments demonstrate that our proposed codec outperforms state-of-the-art neural B-frame codecs and achieves comparable or even superior compression performance to the H.266/VVC reference software under random-access configurations.
https://arxiv.org/abs/2506.07709
One of the primary challenges in optimizing large language models (LLMs) for long-context inference lies in the high memory consumption of the Key-Value (KV) cache. Existing approaches, such as quantization, have demonstrated promising results in reducing memory usage. However, current quantization methods cannot take both effectiveness and efficiency into account. In this paper, we propose MoQAE, a novel mixed-precision quantization method via mixture of quantization-aware experts. First, we view different quantization bit-width configurations as experts and use the traditional mixture of experts (MoE) method to select the optimal configuration. To avoid the inefficiency caused by inputting tokens one by one into the router in the traditional MoE method, we input the tokens into the router chunk by chunk. Second, we design a lightweight router-only fine-tuning process to train MoQAE with a comprehensive loss to learn the trade-off between model accuracy and memory usage. Finally, we introduce a routing freezing (RF) and a routing sharing (RS) mechanism to further reduce the inference overhead. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art KV cache quantization approaches in both efficiency and effectiveness.
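A simplified view of the chunk-wise routing idea is sketched below: the router sees each chunk's pooled features and picks a bit-width "expert" for that chunk of the KV cache. The router architecture, chunk size, pooling, and the quantizer itself are assumptions for illustration.

```python
# Chunk-wise routing among quantization bit-width experts (illustrative).
import torch
import torch.nn as nn

def fake_quant(x, bits):
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax
    return (x / scale).round().clamp(-qmax, qmax) * scale

class ChunkBitwidthRouter(nn.Module):
    def __init__(self, dim, bit_choices=(2, 4, 8), chunk=64):
        super().__init__()
        self.bit_choices = bit_choices
        self.chunk = chunk
        self.router = nn.Linear(dim, len(bit_choices))

    def forward(self, keys):                        # keys: (seq, dim) KV-cache slice
        out = []
        for start in range(0, keys.shape[0], self.chunk):
            block = keys[start:start + self.chunk]
            expert = self.router(block.mean(dim=0)).argmax().item()
            bits = self.bit_choices[expert]
            out.append(fake_quant(block, bits))     # quantize this chunk's cache
        return torch.cat(out, dim=0)

kv = ChunkBitwidthRouter(dim=128)(torch.randn(512, 128))
```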
https://arxiv.org/abs/2506.07533
Vision-Language-Action (VLA) models have shown impressive capabilities across a wide range of robotics manipulation tasks. However, their growing model size poses significant challenges for deployment on resource-constrained robotic systems. While 1-bit pretraining has proven effective for enhancing the inference efficiency of large language models with minimal performance loss, its application to VLA models remains underexplored. In this work, we present BitVLA, the first 1-bit VLA model for robotics manipulation, in which every parameter is ternary, i.e., {-1, 0, 1}. To further reduce the memory footprint of the vision encoder, we propose the distillation-aware training strategy that compresses the full-precision encoder to 1.58-bit weights. During this process, a full-precision encoder serves as a teacher model to better align latent representations. Despite the lack of large-scale robotics pretraining, BitVLA achieves performance comparable to the state-of-the-art model OpenVLA-OFT with 4-bit post-training quantization on the LIBERO benchmark, while consuming only 29.8% of the memory. These results highlight BitVLA's promise for deployment on memory-constrained edge devices. We release the code and model weights in this https URL.
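The {-1, 0, 1} parameterization suggests BitNet-style ternary quantization; the sketch below uses the common absmean scaling with a straight-through estimator, which is an assumption about BitVLA's exact scheme rather than a statement of it.

```python
# Ternary (1.58-bit) weight quantization with a straight-through estimator.
import torch

def ternarize(w, eps=1e-5):
    scale = w.abs().mean().clamp(min=eps)           # per-tensor absmean scale
    w_t = (w / scale).round().clamp(-1, 1)          # ternary weights in {-1, 0, 1}
    return w_t * scale                              # dequantized for the matmul

def ternary_linear(x, w, training=True):
    w_q = ternarize(w)
    if training:                                    # straight-through gradient to w
        w_q = w + (w_q - w).detach()
    return x @ w_q.t()

y = ternary_linear(torch.randn(8, 128), torch.randn(256, 128))
```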
https://arxiv.org/abs/2506.07530
This paper introduces an efficient Vision-Language Model (VLM) pipeline specifically optimized for deployment on embedded devices, such as those used in robotics and autonomous driving. The pipeline significantly reduces the computational overhead by jointly leveraging patch selection to filter irrelevant camera views, a token selection module to reduce the input sequence length for the LLM, and speculative decoding to accelerate token generation. Evaluated on the NVIDIA DRIVE Thor platform for an autonomous driving application, our pipeline achieves a 2.5x end-to-end latency reduction without compromising task accuracy. The speed-up further increases to 3.2x when applying FP8 post-training quantization. These results demonstrate that our pipeline is a viable solution for enabling real-time VLM deployment in resource-constrained environments.
https://arxiv.org/abs/2506.07416
With the rapid development of deep learning, a growing number of pre-trained models have become publicly available. However, deploying these fixed models in real-world IoT applications is challenging because different devices possess heterogeneous computational and memory resources, making it impossible to deploy a single model across all platforms. Although traditional compression methods, such as pruning, quantization, and knowledge distillation, can improve efficiency, they become inflexible once applied and cannot adapt to changing resource constraints. To address these issues, we propose ReStNet, a Reusable and Stitchable Network that dynamically constructs a hybrid network by stitching two pre-trained models together. Implementing ReStNet requires addressing several key challenges, including how to select the optimal stitching points, determine the stitching order of the two pre-trained models, and choose an effective fine-tuning strategy. To systematically address these challenges and adapt to varying resource constraints, ReStNet determines the stitching point by calculating layer-wise similarity via Centered Kernel Alignment (CKA). It then constructs the hybrid model by retaining the early layers of a larger-capacity model and appending the deeper layers of a smaller one. To facilitate efficient deployment, only the stitching layer is fine-tuned. This design enables rapid adaptation to changing budgets while fully leveraging available resources. Moreover, ReStNet supports both homogeneous (CNN-CNN, Transformer-Transformer) and heterogeneous (CNN-Transformer) stitching, allowing different model families to be combined flexibly. Extensive experiments on multiple benchmarks demonstrate that ReStNet achieves flexible accuracy-efficiency trade-offs at runtime while significantly reducing training cost.
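The stitching-point criterion relies on linear Centered Kernel Alignment, which can be computed as in the sketch below on activations from a shared probe batch; the feature shapes are illustrative.

```python
# Linear CKA between two layers' activations on the same probe inputs.
import torch

def linear_cka(x, y):
    """x: (n, d1), y: (n, d2) activations of two layers on the same batch."""
    x = x - x.mean(dim=0, keepdim=True)             # center features
    y = y - y.mean(dim=0, keepdim=True)
    hsic = (x.t() @ y).norm(p="fro") ** 2           # cross-similarity term
    norm_x = (x.t() @ x).norm(p="fro")
    norm_y = (y.t() @ y).norm(p="fro")
    return (hsic / (norm_x * norm_y)).item()

# Pick the layer of the smaller model most similar to a chosen cut in the larger
# model, stitch there, and fine-tune only the stitching layer.
sim = linear_cka(torch.randn(256, 768), torch.randn(256, 384))
```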
https://arxiv.org/abs/2506.09066
Reinforcement Learning (RL) has outperformed alternative approaches in sequential decision-making and dynamic environment control. However, FPGA deployment is significantly resource-expensive, as training agents on high-quality images involves a large number of computations, which poses new challenges. In this work, we propose QForce-RL, which leverages quantization to enhance throughput and reduce the energy footprint of a lightweight RL architecture without significant performance degradation. QForce-RL draws on E2HRL to reduce the overall number of RL actions needed to learn the desired policy, and on QuaRL for quantization-based SIMD hardware acceleration. We also provide a detailed analysis of different RL environments, with emphasis on model size, parameters, and accelerated compute ops. The architecture scales to resource-constrained devices and provides parameterized, efficient deployment with flexibility in latency, throughput, power, and energy efficiency. The proposed QForce-RL delivers up to a 2.3x performance enhancement and 2.6x higher FPS compared to SoTA works.
https://arxiv.org/abs/2506.07046