Chinese-Vicuna is an open-source, resource-efficient language model designed to bridge the gap in Chinese instruction-following capabilities by fine-tuning Meta's LLaMA architecture using Low-Rank Adaptation (LoRA). Targeting low-resource environments, it enables cost-effective deployment on consumer GPUs (e.g., RTX-2080Ti for 7B models) and supports domain-specific adaptation in fields like healthcare and law. By integrating hybrid datasets (BELLE and Guanaco) and 4-bit quantization (QLoRA), the model achieves competitive performance in tasks such as translation, code generation, and domain-specific Q&A. The project provides a comprehensive toolkit for model conversion, CPU inference, and multi-turn dialogue interfaces, emphasizing accessibility for researchers and developers. Evaluations indicate competitive performance across medical tasks, multi-turn dialogue coherence, and real-time legal updates. Chinese-Vicuna's modular design, open-source ecosystem, and community-driven enhancements position it as a versatile foundation for Chinese LLM applications.
Chinese-Vicuna 是一个开源、资源高效的语言模型,旨在通过使用低秩适应(LoRA)技术对 Meta 的 LLaMA 架构进行微调,来弥补中文指令跟随能力的不足。它针对计算资源有限的环境而设计,可以在消费级 GPU(例如 RTX-2080Ti 上运行 7B 模型)上以低成本部署,并支持医疗和法律等领域的特定领域适应性。 通过整合混合数据集(如 BELLE 和 Guanaco)以及采用4位量化(QLoRA),该模型在诸如翻译、代码生成及特定领域的问答任务中表现出竞争力的性能。该项目提供了一整套工具包,涵盖模型转换、CPU 推断和多轮对话接口等功能,旨在为研究人员和开发人员提供高度可访问性。 评估结果表明,Chinese-Vicuna 在医疗任务、多轮对话连贯性和实时法律更新等方面都达到了竞争性的表现水平。凭借模块化设计、开源生态系统及社区驱动的增强功能,Chinese-Vicuna 作为中文大型语言模型应用的基础平台而具备极高的灵活性和适用性。
https://arxiv.org/abs/2504.12737
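As a rough illustration of the recipe described above (LoRA adapters on a 4-bit-quantized LLaMA base, QLoRA-style), the sketch below uses the Hugging Face transformers/peft/bitsandbytes stack; the checkpoint path, rank, and target modules are illustrative assumptions rather than Chinese-Vicuna's exact configuration.

```python
# Hypothetical QLoRA-style setup; the checkpoint path and hyperparameters are
# illustrative assumptions, not Chinese-Vicuna's exact training configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model = "path/to/llama-7b"  # placeholder for a local LLaMA checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit base weights (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```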
Neural Networks (NNs) trained through supervised learning struggle with managing edge-case scenarios common in real-world driving due to the intractability of exhaustive datasets covering all edge-cases, making knowledge-driven approaches, akin to how humans intuitively detect unexpected driving behavior, a suitable complement to data-driven methods. This work proposes a hybrid architecture combining a low-level Model Predictive Controller (MPC) with locally deployed Large Language Models (LLMs) to enhance decision-making and Human Machine Interaction (HMI). The DecisionxLLM module evaluates robotic state information against natural language instructions to ensure adherence to desired driving behavior. The MPCxLLM module then adjusts MPC parameters based on LLM-generated insights, achieving control adaptability while preserving the safety and constraint guarantees of traditional MPC systems. Further, to enable efficient on-board deployment and to eliminate dependency on cloud connectivity, we shift processing to the on-board computing platform: we propose an approach that exploits Retrieval Augmented Generation (RAG), Low-Rank Adaptation (LoRA) fine-tuning, and quantization. Experimental results demonstrate that these enhancements yield significant improvements of up to 10.45% in reasoning accuracy, as much as 52.2% in control adaptability, and up to a 10.5x increase in computational efficiency (tokens/s), validating the proposed framework's practicality for real-time deployment even on down-scaled robotic platforms. This work bridges high-level decision-making with low-level control adaptability, offering a synergistic framework for knowledge-driven and adaptive Autonomous Driving Systems (ADS).
通过监督学习训练的神经网络(NNs)在处理现实世界驾驶中常见的边缘情况时遇到困难,因为无法生成包含所有可能边缘情况的详尽数据集。因此,类似于人类直观地识别意外驾驶行为的知识驱动方法可以作为数据驱动方法的有效补充。本文提出了一种结合低级模型预测控制器(MPC)与本地部署的大语言模型(LLMs)的混合架构,以增强决策制定和人机交互(HMI)。DecisionxLLM模块评估机器人状态信息是否符合自然语言指令,确保遵守预期驾驶行为。随后,MPCxLLM模块根据LLM生成的见解调整MPC参数,在保持传统MPC系统的安全性和约束保证的同时实现控制灵活性。 为了在车载平台上高效部署并减少对云端连接的依赖,我们将处理转移到了车载计算平台:我们提出了一种利用检索增强生成(RAG)、低秩适应性(LoRA)微调和量化的方法。实验结果表明,这些改进显著提高了推理准确性(最多提高10.45%),增强了控制灵活性(最多提高52.2%),并实现了高达10.5倍的计算效率提升(每秒标记数量)。这验证了所提出的框架在即使是在简化的机器人平台上的实时部署中也具有实用性。 这项工作将高级决策制定与低级控制适应性结合起来,为知识驱动和自适应自动驾驶系统提供了协同架构。
https://arxiv.org/abs/2504.11514
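A minimal sketch of the division of labour described above, assuming a hypothetical `query_llm` call to a locally served quantized model: the LLM proposes MPC parameter changes in JSON, and the controller clamps them to fixed bounds so the safety and constraint guarantees stay on the MPC side. The parameter names and bounds are invented for illustration.

```python
# Illustrative sketch of the MPCxLLM idea: an LLM proposes MPC parameter
# changes from a natural-language instruction, and the controller clamps them
# to hard bounds so safety constraints stay with the MPC, not the LLM.
# `query_llm` is a hypothetical local-inference call, not the paper's API.
import json

MPC_BOUNDS = {"v_max": (1.0, 8.0), "a_max": (0.5, 4.0), "track_weight": (0.1, 10.0)}

def query_llm(prompt: str) -> str:
    # Placeholder for an on-board quantized LLM (e.g. served with RAG-augmented
    # prompting); here we simply return a canned JSON response.
    return '{"v_max": 6.0, "a_max": 2.5, "track_weight": 1.5}'

def adjust_mpc_params(instruction: str, current: dict) -> dict:
    prompt = (f"Instruction: {instruction}\n"
              f"Current MPC params: {json.dumps(current)}\n"
              f"Return new params as JSON.")
    proposed = json.loads(query_llm(prompt))
    safe = {}
    for key, (lo, hi) in MPC_BOUNDS.items():
        value = float(proposed.get(key, current[key]))
        safe[key] = min(max(value, lo), hi)   # constraint guarantees stay in the MPC layer
    return safe

params = {"v_max": 4.0, "a_max": 2.0, "track_weight": 1.0}
print(adjust_mpc_params("Drive more assertively but stay safe.", params))
```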
In recent years, dense retrieval has been the focus of information retrieval (IR) research. While effective, dense retrieval produces uninterpretable dense vectors and suffers from the drawback of large index size. Learned sparse retrieval (LSR) has emerged as a promising alternative, achieving competitive retrieval performance while also being able to leverage the classical inverted index data structure for efficient retrieval. However, few works have explored scaling LSR beyond BERT scale. In this work, we identify two challenges in training large language models (LLMs) for LSR: (1) training instability during the early stage of contrastive training; (2) suboptimal performance due to the pre-trained LLM's unidirectional attention. To address these challenges, we propose two corresponding techniques: (1) a lightweight adaptation training phase to eliminate training instability; (2) two model variants to enable bidirectional information. With these techniques, we are able to train LSR models with an 8B-scale LLM and achieve competitive retrieval performance with reduced index size. Furthermore, we are among the first to analyze the performance-efficiency tradeoff of LLM-based LSR models through the lens of model quantization. Our findings provide insights into adapting LLMs for efficient retrieval modeling.
近年来,密集检索(dense retrieval)成为了信息检索(IR)研究的焦点。尽管效果显著,但密集检索产生的密集向量难以解释,并且由于索引规模庞大而存在局限性。学习稀疏检索(LSR)作为一种有前景的替代方案出现,它不仅能够实现与密集检索相当的检索性能,还能利用传统的倒排索引数据结构进行高效的检索。然而,少有研究探索如何将LSR扩展到BERT规模之外的应用场景。 在这项工作中,我们识别了在大规模语言模型(LLM)中训练LSR时面临的两个挑战:(1) 对比训练早期阶段的训练不稳定;(2) 由于预训练的LLM单向注意机制而导致次优性能。为了应对这些挑战,我们提出了两项相应技术:(1) 轻量级适应性训练阶段以消除训练初期的不稳定性;(2) 两种模型变体以实现双向信息处理能力。通过采用上述技术,我们可以使用8B规模的大语言模型来训练LSR模型,并且在减少索引大小的情况下仍然能获得有竞争力的检索性能。 此外,我们是首批分析基于大语言模型(LLM)的LSR模型性能效率权衡的研究者之一,我们的分析视角基于模型量化。研究发现为如何适应大语言模型进行高效的检索建模提供了见解。
https://arxiv.org/abs/2504.10816
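The sketch below illustrates the general LSR scoring pattern the paper builds on: an encoder (faked here with random logits) produces sparse vocabulary-sized weights, and retrieval runs over an inverted index. The log(1 + relu(.)) pooling is a common LSR choice used only for illustration, not necessarily the paper's exact head.

```python
# Toy learned-sparse-retrieval scoring: sparse vocabulary vectors + inverted index.
import torch
from collections import defaultdict

torch.manual_seed(0)
vocab_size = 32000                      # LLaMA-sized vocabulary, purely illustrative

def sparse_repr(term_logits: torch.Tensor) -> torch.Tensor:
    # term_logits: (seq_len, vocab_size); log(1 + relu) + max-pool gives a sparse vector
    return torch.log1p(torch.relu(term_logits)).max(dim=0).values

# Toy "encoder outputs": shifted random logits stand in for an LLM head; the shift
# just makes the resulting vectors genuinely sparse.
docs = {d: sparse_repr(torch.randn(12, vocab_size) - 2.0) for d in ["d1", "d2", "d3"]}

# Inverted index: term id -> postings list of (doc id, weight) for nonzero terms.
index = defaultdict(list)
for doc_id, vec in docs.items():
    for term in torch.nonzero(vec).flatten().tolist():
        index[term].append((doc_id, vec[term].item()))

def search(query_logits: torch.Tensor, top_k: int = 2):
    q = sparse_repr(query_logits)
    scores = defaultdict(float)
    for term in torch.nonzero(q).flatten().tolist():
        for doc_id, w in index[term]:
            scores[doc_id] += q[term].item() * w    # dot product computed via posting lists
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

print(search(torch.randn(5, vocab_size) - 2.0))
```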
While large language models (LLMs) have revolutionized text-to-speech (TTS) synthesis through discrete tokenization paradigms, current architectures exhibit fundamental tensions between three critical dimensions: 1) irreversible loss of acoustic characteristics caused by quantization of speech prompts; 2) stringent dependence on precisely aligned prompt speech-text pairs that limit real-world deployment; and 3) catastrophic forgetting of the LLM's native text comprehension during optimization for speech token generation. To address these challenges, we propose an LLM-based text-to-speech Generation approach Optimized via a novel dual-branch ArchiTecture (GOAT-TTS). Our framework introduces two key innovations: (1) The modality-alignment branch combines a speech encoder and projector to capture continuous acoustic embeddings, enabling bidirectional correlation between paralinguistic features (language, timbre, emotion) and semantic text representations without transcript dependency; (2) The speech-generation branch employs modular fine-tuning on top-k layers of an LLM for speech token prediction while freezing the bottom-k layers to preserve foundational linguistic knowledge. Moreover, multi-token prediction is introduced to support real-time streaming TTS synthesis. Experimental results demonstrate that our GOAT-TTS achieves performance comparable to state-of-the-art TTS models while validating the efficacy of synthesized dialect speech data.
尽管通过离散标记化范式,大型语言模型(LLMs)已经革新了文本到语音(TTS)合成技术,但目前的架构在三个关键维度上存在根本性的矛盾:1) 由于对语音提示进行量化而造成的声学特征不可逆损失;2) 对精确匹配的语音-文本对的高度依赖性限制了实际部署的可能性;3) 在优化生成语音标记的过程中,LLM自身的文本理解能力会严重遗忘。为了解决这些挑战,我们提出了一种基于LLM的通过新颖双分支架构进行优化的文本到语音生成方法(GOAT-TTS)。我们的框架引入了两个关键创新点:(1) 模态对齐分支结合了一个语音编码器和投影器,以捕获连续的声学嵌入,从而能够在没有转录依赖的情况下双向关联语言、音色、情感等副语言特征与语义文本表示;(2) 语音生成分支通过在LLM的顶层k层进行模块化微调来进行语音标记预测,并冻结底层k层以保留基础的语言知识。此外,还引入了多令牌预测来支持实时流式TTS合成。实验结果表明,我们的GOAT-TTS实现了与最先进的TTS模型相当的性能,同时验证了所生成方言语音数据的有效性。
https://arxiv.org/abs/2504.12339
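A small sketch of the layer-freezing idea in the speech-generation branch: fine-tune only the top-k transformer blocks for speech-token prediction and freeze the bottom blocks to preserve the LLM's linguistic knowledge. The toy model below mimics a LLaMA-style `model.layers` layout; the layout and split point are assumptions, not GOAT-TTS's actual code.

```python
# Sketch of top-k fine-tuning / bottom-k freezing on a toy decoder-style model.
import torch.nn as nn

class Backbone(nn.Module):
    """Stand-in for a decoder-only LM body exposing .layers like LLaMA-style models."""
    def __init__(self, n_layers=8, dim=64):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in range(n_layers)
        )

class ToyLM(nn.Module):
    def __init__(self, n_layers=8, dim=64, vocab=1000):
        super().__init__()
        self.model = Backbone(n_layers, dim)
        self.speech_head = nn.Linear(dim, vocab)   # new head for speech-token prediction

def freeze_bottom_layers(model: ToyLM, num_trainable_top_layers: int):
    layers = model.model.layers
    split = len(layers) - num_trainable_top_layers
    for i, block in enumerate(layers):
        for p in block.parameters():
            p.requires_grad = i >= split           # bottom-k frozen, top-k trainable

lm = ToyLM()
freeze_bottom_layers(lm, num_trainable_top_layers=2)
trainable = sum(p.numel() for p in lm.parameters() if p.requires_grad)
total = sum(p.numel() for p in lm.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
```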
The classical limit of quantum mechanics, formally investigated through frameworks like strict deformation quantization, remains a profound area of inquiry in the philosophy of physics. This paper explores a computational approach employing a neural network to emulate the emergence of classical behavior from the quantum harmonic oscillator as Planck's constant $\hbar$ approaches zero. We develop and train a neural network architecture to learn the mapping from initial expectation values and $\hbar$ to the time evolution of the expectation value of position. By analyzing the network's predictions across different regimes of $\hbar$, we aim to provide computational insights into the nature of the quantum-classical transition. This work demonstrates the potential of machine learning as a complementary tool for exploring foundational questions in quantum mechanics and its classical limit.
量子力学的经典极限,通过严格的形变量化等框架正式研究,仍然是物理学哲学中一个深刻的探究领域。本文探讨了一种采用神经网络的计算方法,用于模拟当普朗克常数$\hbar$趋近于零时,量子谐振子系统如何展现出经典行为的过程。我们开发并训练了一个神经网络架构,使之能够学习从初始期望值和$\hbar$映射到位置期望值的时间演化过程。通过分析该网络在不同$\hbar$范围内的预测结果,我们的目标是提供关于量子-经典过渡本质的计算洞察。这项工作展示了机器学习作为探索量子力学及其经典极限基础问题的补充工具的巨大潜力。
https://arxiv.org/abs/2504.10781
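A toy stand-in for the setup above: a small MLP is trained to map (x₀, p₀, ħ, t) to the position expectation value of a harmonic oscillator (m = ω = 1), with the closed-form Ehrenfest solution providing synthetic training data. Network size, sampling ranges, and training budget are arbitrary choices; note that for the harmonic oscillator the mean trajectory is ħ-independent, so ħ enters only as an input feature here.

```python
# Toy regression of <x>(t) for the harmonic oscillator from (x0, p0, hbar, t).
import torch
import torch.nn as nn

torch.manual_seed(0)

def expectation_x(x0, p0, t):
    # Ehrenfest dynamics for the harmonic oscillator (exact for quadratic potentials)
    return x0 * torch.cos(t) + p0 * torch.sin(t)

n = 4096
x0, p0 = torch.rand(n) * 2 - 1, torch.rand(n) * 2 - 1
hbar = torch.rand(n) + 1e-3          # swept toward the classical limit; an input feature only
t = torch.rand(n) * 2 * torch.pi
inputs = torch.stack([x0, p0, hbar, t], dim=1)
targets = expectation_x(x0, p0, t).unsqueeze(1)

net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(inputs), targets)
    loss.backward()
    opt.step()
print(f"final MSE: {loss.item():.4e}")
```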
Recent advancements in audio language models have underscored the pivotal role of audio tokenization, which converts audio signals into discrete tokens, thereby facilitating the application of language model architectures to the audio domain. In this study, we introduce ALMTokenizer, a novel low-bitrate and semantically rich audio codec tokenizer for audio language models. Prior methods, such as Encodec, typically encode individual audio frames into discrete tokens without considering the use of context information across frames. Unlike these methods, we introduce a novel query-based compression strategy to capture holistic information with a set of learnable query tokens by explicitly modeling the context information across frames. This design not only enables the codec model to capture more semantic information but also encodes the audio signal with fewer token sequences. Additionally, to enhance the semantic information in audio codec models, we introduce the following: (1) A masked autoencoder (MAE) loss, (2) Vector quantization based on semantic priors, and (3) An autoregressive (AR) prediction loss. As a result, ALMTokenizer achieves competitive reconstruction performance relative to state-of-the-art approaches while operating at a lower bitrate. Within the same audio language model framework, ALMTokenizer outperforms previous tokenizers in audio understanding and generation tasks.
近期在音频语言模型方面的进步强调了音频标记化(即将音频信号转换为离散标记)的关键作用,从而使得语言模型架构能够应用于音频领域。在这项研究中,我们介绍了一种新型的低比特率且语义丰富的音频编解码器标记器ALMTokenizer,专门用于音频语言模型。先前的方法如Encodec通常将单独的音频帧编码为离散标记,并不考虑跨帧使用上下文信息。与这些方法不同,我们引入了一种基于查询的压缩策略,通过一组可学习的查询令牌来捕捉整体信息,从而显式地建模了跨帧之间的上下文信息。这一设计不仅使编解码模型能够捕获更多语义信息,而且还能用更少的标记序列来编码音频信号。 为了增强音频编解码器模型中的语义信息,我们还引入了以下内容: 1. 掩蔽自动编码器(MAE)损失; 2. 基于语义先验的向量量化; 3. 自回归(AR)预测损失。 因此,ALMTokenizer在更低的比特率下实现了与最先进方法相当的重建性能。在相同的音频语言模型框架内,ALMTokenizer在音频理解和生成任务中优于先前的标记器。
https://arxiv.org/abs/2504.10344
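A minimal sketch of the query-based compression idea: a fixed set of learnable query tokens cross-attends over a window of continuous frame features, so many frames are summarized by a few tokens before quantization. Dimensions, window length, and head count are illustrative, not ALMTokenizer's actual configuration.

```python
# Learnable query tokens summarizing a window of audio frame features.
import torch
import torch.nn as nn

class QueryCompressor(nn.Module):
    def __init__(self, dim=256, num_queries=8, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, dim) continuous frame features
        b = frames.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        summary, _ = self.attn(q, frames, frames)   # cross-frame context flows into the queries
        return summary                               # (batch, num_queries, dim), fed to the quantizer

frames = torch.randn(2, 100, 256)                    # 100 frames compressed into 8 tokens
print(QueryCompressor()(frames).shape)
```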
Improving the efficiency of inference in Large Language Models (LLMs) is a critical area of research. Post-training Quantization (PTQ) is a popular technique, but it often faces challenges at low-bit levels, particularly in downstream tasks. Quantization-aware Training (QAT) can alleviate this problem, but it requires significantly more computational resources. To tackle this, we introduce Weight-Decomposed Low-Rank Quantization-Aware Training (DL-QAT), which retains the advantages of QAT while training less than 1% of the total parameters. Specifically, we introduce a group-specific quantization magnitude to adjust the overall scale of each quantization group. Within each quantization group, we use LoRA matrices to update the weight size and direction in the quantization space. We validated the effectiveness of our method on the LLaMA and LLaMA2 model families. The results show significant improvements over our baseline method across different quantization granularities. For instance, our approach outperforms the previous state-of-the-art method by 4.2% on MMLU for the 3-bit LLaMA-7B model. Additionally, our quantization results on pre-trained models also surpass previous QAT methods, demonstrating the superior performance and efficiency of our approach.
提高大规模语言模型(LLMs)推理效率是研究中的关键领域。后训练量化(PTQ)是一种流行的技术,但在低比特级别上往往面临挑战,特别是在下游任务中。量化感知训练(QAT)可以缓解这些问题,但需要更多的计算资源。为解决这一问题,我们引入了权重分解的低秩量化感知训练(DL-QAT),这种方法在只训练不到1%的总参数的情况下结合了QAT的优势。具体而言,我们为每个量化组引入了一个特定于组的量化幅度来调整整体缩放比例。在每一个量化组内,我们使用LoRA矩阵在量化空间中更新权重的大小和方向。 我们在LLaMA和LLaMA2模型家族上验证了该方法的有效性。结果表明,在不同的量化粒度下,我们的方法显著优于基线方法。例如,在3比特的LLaMA-7B模型中,我们的方法在MMLU上的表现比之前的最先进方法高出4.2%。此外,我们在预训练模型上的量化结果也超过了以前的QAT方法,展示了我们方法的优越性能和效率。
https://arxiv.org/abs/2504.09223
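The sketch below shows a DL-QAT-flavoured parameterization under stated assumptions: a frozen base weight plus a LoRA update, scaled by a learnable per-group magnitude and fake-quantized with a straight-through estimator, so gradients reach only the LoRA factors and group scales. Bit-width, group size, and rank are illustrative.

```python
# Frozen base weight + LoRA update + per-group scale, fake-quantized with STE.
import torch
import torch.nn as nn

class QuantLoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, bits=4, group_size=64):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)          # frozen base weight
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        groups = (out_features * in_features) // group_size
        self.group_scale = nn.Parameter(torch.ones(groups, 1))   # learnable group magnitude
        self.qmax = 2 ** (bits - 1) - 1
        self.group_size = group_size

    def quantized_weight(self):
        w = self.weight + self.lora_b @ self.lora_a               # LoRA-updated weight
        g = w.reshape(-1, self.group_size) * self.group_scale     # apply group magnitude
        step = g.abs().max(dim=1, keepdim=True).values / self.qmax + 1e-8
        q = torch.round(g / step).clamp(-self.qmax - 1, self.qmax) * step
        q = g + (q - g).detach()                                  # straight-through estimator
        return q.reshape_as(w)

    def forward(self, x):
        return x @ self.quantized_weight().t()

layer = QuantLoRALinear(128, 128)
layer(torch.randn(4, 128)).sum().backward()
# gradients flow to the LoRA factors and group scales only; the base weight stays frozen
print(layer.lora_a.grad is not None, layer.group_scale.grad is not None, layer.weight.grad is None)
```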
In modern air traffic management, generating synthetic flight trajectories has emerged as a promising solution for addressing data scarcity, protecting sensitive information, and supporting large-scale analyses. In this paper, we propose a novel method for trajectory synthesis by adapting the Time-Based Vector Quantized Variational Autoencoder (TimeVQVAE). Our approach leverages time-frequency domain processing, vector quantization, and transformer-based priors to capture both global and local dynamics in flight data. By discretizing the latent space and integrating transformer priors, the model learns long-range spatiotemporal dependencies and preserves coherence across entire flight paths. We evaluate the adapted TimeVQVAE using an extensive suite of quality, statistical, and distributional metrics, as well as a flyability assessment conducted in an open-source air traffic simulator. Results indicate that TimeVQVAE outperforms a temporal convolutional VAE baseline, generating synthetic trajectories that mirror real flight data in terms of spatial accuracy, temporal consistency, and statistical properties. Furthermore, the simulator-based assessment shows that most generated trajectories maintain operational feasibility, although occasional outliers underscore the potential need for additional domain-specific constraints. Overall, our findings underscore the importance of multi-scale representation learning for capturing complex flight behaviors and demonstrate the promise of TimeVQVAE in producing representative synthetic trajectories for downstream tasks such as model training, airspace design, and air traffic forecasting.
在现代空中交通管理中,生成合成飞行轨迹已成为解决数据稀缺、保护敏感信息和支持大规模分析的有前景的方法。本文提出了一种新颖的轨迹综合方法,通过调整基于时间的向量量化变分自动编码器(TimeVQVAE)来实现。我们的方法利用了时频域处理、向量量化和基于Transformer的先验知识,以捕捉飞行数据中的全局和局部动态。通过离散化潜在空间并整合Transformer先验,模型能够学习长程时空依赖关系,并保持整个飞行路径的一致性。 我们使用了一套广泛的品质、统计和分布度量以及在开源空中交通模拟器中进行的可飞性评估来评价改进后的TimeVQVAE。实验结果显示,TimeVQVAE的表现优于基于时序卷积的VAE基准模型,在空间准确性、时间一致性及统计数据特性方面,生成的合成轨迹与真实飞行数据相似。 此外,基于模拟器的评估显示,大多数生成的轨迹在操作上是可行的,尽管偶尔会出现异常值,这可能表明需要额外加入特定领域的约束条件。总的来说,我们的研究强调了多尺度表示学习对于捕捉复杂飞行行为的重要性,并证明了TimeVQVAE在产生用于后续任务(如模型训练、空域设计和空中交通预测)的代表性合成轨迹方面的潜力。
https://arxiv.org/abs/2504.09101
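As a reference point for the discretization step above, here is a minimal vector-quantization bottleneck of the kind VQ-VAE-based models such as TimeVQVAE build on: latents are snapped to their nearest codebook entries with a straight-through estimator and the usual codebook/commitment losses. Codebook size and dimensions are illustrative.

```python
# Minimal VQ bottleneck: nearest-codebook lookup with straight-through gradients.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1 / num_codes, 1 / num_codes)
        self.beta = beta

    def forward(self, z):                                     # z: (batch, steps, dim)
        flat = z.reshape(-1, z.shape[-1])
        dist = torch.cdist(flat, self.codebook.weight)        # distances to every code
        codes = dist.argmin(dim=1)                            # discrete token ids
        z_q = self.codebook(codes).reshape_as(z)
        loss = ((z_q - z.detach()) ** 2).mean() + self.beta * ((z_q.detach() - z) ** 2).mean()
        z_q = z + (z_q - z).detach()                          # straight-through gradient
        return z_q, codes.reshape(z.shape[:-1]), loss

vq = VectorQuantizer()
z_q, tokens, vq_loss = vq(torch.randn(2, 120, 64))            # 120 trajectory steps per sample
print(z_q.shape, tokens.shape, vq_loss.item())
```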
We present PQS, which uses three techniques together - Prune, Quantize, and Sort - to achieve low-bitwidth accumulation of dot products in neural network computations. In conventional quantized (e.g., 8-bit) dot products, partial results are accumulated into wide (e.g., 32-bit) accumulators to avoid overflows when accumulating intermediate partial sums. However, such wide accumulators increase memory bandwidth usage and reduce energy efficiency. We show that iterative N:M pruning in floating point followed by quantization to 8 (or fewer) bits, and accumulation of partial products in a sorted order ("small to large") allows for accurate, compressed models with short dot product lengths that do not require wide accumulators. We design, analyze, and implement the PQS algorithm to eliminate accumulation overflows at inference time for several neural networks. Our method offers a 2.5x reduction in accumulator bitwidth while achieving model accuracy on par with floating-point baselines for multiple image classification tasks.
我们介绍了PQS技术,它结合了三种方法:修剪(Prune)、量化(Quantize)和排序(Sort),以实现神经网络计算中低比特宽度的点积累加。在传统的量化(例如8位)点积运算中,中间部分结果会被累积到较宽的累加器(如32位)中,以便在累加过程中避免溢出。然而,这种宽型累加器增加了内存带宽使用量,并降低了能效。 我们展示了通过浮点数中的迭代N:M修剪、量化至8位或更少位以及按照大小顺序(从小到大)累积部分乘积的方法,在不需要宽型累加器的情况下仍可获得准确且压缩的模型,同时缩短了点积长度。我们设计并实现了PQS算法,以消除多种神经网络在推理时的积累溢出问题。 我们的方法将累加器比特宽度减少了2.5倍,并为多个图像分类任务达到了与浮点数基准相媲美的模型精度。
https://arxiv.org/abs/2504.09064
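A small sketch of the pruning ingredient (the "P" in PQS): N:M structured pruning keeps only the N largest-magnitude weights in every group of M, which shortens each dot product before the quantize-and-sort stages. This shows only the pruning mask, not the full PQS accumulation scheme.

```python
# 2:4 structured pruning: keep the 2 largest-magnitude weights in each group of 4.
import torch

def nm_prune(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    rows, cols = weight.shape
    groups = weight.reshape(rows, cols // m, m)
    # rank entries within each group by magnitude and zero all but the top-n
    idx = groups.abs().argsort(dim=-1, descending=True)
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, idx[..., :n], 1.0)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(4, 16)
w_pruned = nm_prune(w)
print((w_pruned != 0).float().mean().item())   # ~0.5 density for 2:4 sparsity
```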
Generative masked transformers have demonstrated remarkable success across various content generation tasks, primarily due to their ability to effectively model large-scale dataset distributions with high consistency. However, in the animation domain, large datasets are not always available. Applying generative masked modeling to generate diverse instances from a single MoCap reference may lead to overfitting, a challenge that remains unexplored. In this work, we present MotionDreamer, a localized masked modeling paradigm designed to learn internal motion patterns from a given motion with arbitrary topology and duration. By embedding the given motion into quantized tokens with a novel distribution regularization method, MotionDreamer constructs a robust and informative codebook for local motion patterns. Moreover, a sliding window local attention is introduced in our masked transformer, enabling the generation of natural yet diverse animations that closely resemble the reference motion patterns. As demonstrated through comprehensive experiments, MotionDreamer outperforms the state-of-the-art methods that are typically GAN or Diffusion-based in both faithfulness and diversity. Thanks to the consistency and robustness of the quantization-based approach, MotionDreamer can also effectively perform downstream tasks such as temporal motion editing, crowd animation, and beat-aligned dance generation, all using a single reference motion. Visit our project page: this https URL
生成式遮蔽变换器在各种内容生成任务中表现出显著的成功,主要是由于它们能够有效地对大规模数据集分布进行建模,并保持高度的一致性。然而,在动画领域中,大型数据集并不总是可用的。将生成式遮蔽模型应用于从单个运动捕捉(MoCap)参考生成多样实例可能会导致过拟合,这是一个尚未被探索的挑战。 在本文工作中,我们提出了MotionDreamer,这是一种本地化遮蔽建模范式,旨在从给定具有任意拓扑和持续时间的运动中学习内部运动模式。通过使用一种新颖的分布正则化方法将给定的运动嵌入到量化令牌中,MotionDreamer构建了一个稳健且信息丰富的码本来表示局部运动模式。此外,在我们的遮蔽变换器中引入了滑动窗口本地注意力机制,使生成自然且多样化的动画成为可能,并且这些动画与参考运动模式非常相似。 通过全面的实验表明,MotionDreamer在忠实度和多样性方面超越了基于GAN或Diffusion的方法(这些方法通常是当前最佳的方法)。由于量化方法的一致性和鲁棒性,MotionDreamer还可以有效地执行下游任务,例如时间运动编辑、人群动画以及根据节拍对齐舞蹈生成,并且所有这些都仅使用单个参考动作即可完成。 欲了解更多信息,请访问我们的项目页面:this https URL
https://arxiv.org/abs/2504.08959
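A minimal sketch of a sliding-window local attention mask of the kind described above, wired into `torch.nn.MultiheadAttention`: each token may attend only to tokens within a fixed window, keeping the model focused on local motion patterns. Window size and dimensions are illustrative.

```python
# Sliding-window local attention mask for a masked transformer.
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    allowed = (idx[None, :] - idx[:, None]).abs() <= window // 2
    # True -> position is masked out, matching the boolean attn_mask convention
    # of torch.nn.MultiheadAttention.
    return ~allowed

mask = local_attention_mask(seq_len=10, window=4)
attn = torch.nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
x = torch.randn(1, 10, 32)                      # toy sequence of motion tokens
out, _ = attn(x, x, x, attn_mask=mask)
print(mask.int())
print(out.shape)
```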
Recent advances in visual synthesis have leveraged diffusion models and attention mechanisms to achieve high-fidelity artistic style transfer and photorealistic text-to-image generation. However, real-time deployment on edge devices remains challenging due to computational and memory constraints. We propose Muon-AD, a co-designed framework that integrates the Muon optimizer with attention distillation for real-time edge synthesis. By eliminating gradient conflicts through orthogonal parameter updates and dynamic pruning, Muon-AD achieves 3.2 times faster convergence compared to Stable Diffusion-TensorRT, while maintaining synthesis quality (15% lower FID, 4% higher SSIM). Our framework reduces peak memory to 7GB on Jetson Orin and enables 24FPS real-time generation through mixed-precision quantization and curriculum learning. Extensive experiments on COCO-Stuff and ImageNet-Texture demonstrate Muon-AD's Pareto-optimal efficiency-quality trade-offs. Here, we show a 65% reduction in communication overhead during distributed training and real-time 10s/image generation on edge GPUs. These advancements pave the way for democratizing high-quality visual synthesis in resource-constrained environments.
最近在视觉合成领域取得的进展利用了扩散模型和注意力机制,实现了高质量的艺术风格转换以及逼真的文本到图像生成。然而,由于计算资源和内存限制,在边缘设备上实现实时部署仍然面临挑战。我们提出了一种名为Muon-AD的框架,该框架结合了Muon优化器与注意力蒸馏技术,旨在为边缘设备上的实时合成提供解决方案。 通过正交参数更新和动态修剪消除梯度冲突,Muon-AD实现了比Stable Diffusion-TensorRT快3.2倍的收敛速度,同时保持了合成质量(FID降低了15%,SSIM提高了4%)。我们的框架将Jetson Orin上的峰值内存减少到7GB,并通过混合精度量化和课程学习实现实时生成速率高达24FPS。在COCO-Stuff和ImageNet-Texture数据集上进行的广泛实验表明,Muon-AD在效率与质量之间实现了Pareto最优权衡。 此外,我们的方法展示了分布式训练期间通信开销减少了65%,并且能够在边缘GPU上实现实时每10秒生成一张图像。这些改进为在资源受限环境中实现高质量视觉合成铺平了道路。
https://arxiv.org/abs/2504.08451
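The sketch below illustrates the orthogonalized-update idea behind the Muon optimizer referenced above: a matrix-shaped gradient (or momentum buffer) is replaced by an approximately orthogonal factor via a Newton-Schulz iteration before the weight step. It uses the plain cubic iteration rather than Muon's tuned coefficients, purely to show the mechanism.

```python
# Newton-Schulz orthogonalization of a matrix-shaped update (Muon-style idea).
import torch

torch.manual_seed(0)

def orthogonalize(g: torch.Tensor, steps: int = 20) -> torch.Tensor:
    x = g / (g.norm() + 1e-8)              # scale so all singular values are below 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.t() @ x  # cubic Newton-Schulz toward the polar factor
    return x

g = torch.randn(64, 32)                    # a matrix-shaped gradient / momentum buffer
u = orthogonalize(g)
err = (u.t() @ u - torch.eye(32)).abs().max().item()
print(f"max deviation from orthonormal columns: {err:.2e}")
```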
Early exiting has recently emerged as a promising technique for accelerating large language models (LLMs) by effectively reducing the hardware computation and memory access. In this paper, we present SpecEE, a fast LLM inference engine with speculative early exiting. (1) At the algorithm level, we propose the speculation-based lightweight predictor design by exploiting the probabilistic correlation between the speculative tokens and the correct results and high parallelism of GPUs. (2) At the system level, we point out that not all layers need a predictor and design the two-level heuristic predictor scheduling engine based on skewed distribution and contextual similarity. (3) At the mapping level, we point out that different decoding methods share the same essential characteristics, and propose the context-aware merged mapping for predictor with efficient GPU implementations to support speculative decoding, and form a framework for various existing orthogonal acceleration techniques (e.g., quantization and sparse activation) on cloud and personal computer (PC) scenarios, successfully pushing the Pareto frontier of accuracy and speedup. It is worth noting that SpecEE can be applied to any LLM by negligible training overhead in advance without affecting the model original parameters. Extensive experiments show that SpecEE achieves 2.25x and 2.43x speedup with Llama2-7B on cloud and PC scenarios respectively.
早期退出技术最近作为一种有前景的方法出现,用于通过有效减少硬件计算和内存访问来加速大型语言模型(LLM)。在本文中,我们提出了SpecEE,这是一种基于投机性早期退出的快速LLM推理引擎。 1. 在算法层面,我们提出了一种基于推测的轻量级预测器设计,该设计利用了推测令牌与正确结果之间的概率相关性和GPU的高度并行特性。 2. 在系统层面,我们指出并非所有层都需要预测器,并根据偏斜分布和上下文相似性设计了一个两级启发式预测调度引擎。 3. 在映射层面,我们指出了不同的解码方法共享相同的基本特征,并提出了一种感知上下文的合并映射方案以支持推测性解码的有效GPU实现。这为各种现有的正交加速技术(如量化和稀疏激活)在云端和个人计算机(PC)场景中的应用提供了一个框架,成功地推动了准确性和加速比之间的帕累托前沿。 值得一提的是,SpecEE只需事先付出可忽略不计的训练开销即可应用于任何LLM,并且不会影响模型的原始参数。广泛的实验表明,在云和PC场景中,SpecEE分别实现了2.25倍和2.43倍的速度提升(使用Llama2-7B)。
https://arxiv.org/abs/2504.08850
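A toy sketch of a speculation-based early-exit check in the spirit of SpecEE: a tiny predictor looks at the current layer's hidden state together with a draft token's embedding and decides whether decoding can stop at this layer. The feature choice, predictor size, and threshold are assumptions, not the paper's design.

```python
# Lightweight early-exit predictor over (hidden state, draft-token embedding).
import torch
import torch.nn as nn

class ExitPredictor(nn.Module):
    def __init__(self, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * hidden_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, hidden_state, draft_token_embedding):
        feats = torch.cat([hidden_state, draft_token_embedding], dim=-1)
        return torch.sigmoid(self.net(feats))   # probability the draft token is already correct

predictor = ExitPredictor()
h = torch.randn(1, 1024)          # hidden state after layer i
draft = torch.randn(1, 1024)      # embedding of the speculative (draft) token
if predictor(h, draft).item() > 0.9:
    print("exit early: emit the draft token without running the remaining layers")
else:
    print("continue to the next transformer layer")
```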
We present a design called Proof of Gradient Optimization (PoGO) for blockchain consensus, where miners produce verifiable evidence of training large-scale machine-learning models. Building on previous work, we incorporate quantized gradients (4-bit precision) to reduce storage and computation requirements, while still preserving the ability of verifiers to check that real progress has been made on lowering the model's loss. Additionally, we employ Merkle proofs over the full 32-bit model to handle large parameter sets and to enable random leaf checks with minimal on-chain data. We illustrate these ideas using GPT-3 (175B parameters) as a reference example and also refer to smaller but high-performance models (e.g., Gemma 3 with 27B parameters). We provide an empirical cost analysis showing that verification is significantly cheaper than training, thanks in part to quantization and sampling. We also discuss the necessity of longer block times (potentially hours) when incorporating meaningful training steps, the trade-offs when using specialized GPU hardware, and how binary diffs may incrementally optimize updates. Finally, we note that fine-tuning can be handled in a similar manner, merely changing the dataset and the manner of sampling but preserving the overall verification flow. Our protocol allows verifiers to issue either positive or negative attestations; these are aggregated at finalization to either confirm the update or slash the miner.
我们提出了一种名为“梯度优化证明”(PoGO)的区块链共识设计,其中矿工生成训练大规模机器学习模型的有效证据。在此前工作的基础上,我们采用4位精度的量化梯度来减少存储和计算需求,同时仍然保持验证者检查实际损失降低进展的能力。此外,我们在整个32位模型上使用Merkle证明处理大量参数,并允许进行随机叶节点检查,仅需极少量的链上数据。我们以GPT-3(1750亿参数)作为参考示例来阐述这些想法,并提及一些较小但高性能的模型(如具有270亿参数的Gemma 3)。我们提供了一个经验成本分析,表明由于量化和采样的原因,验证比训练要便宜得多。我们还讨论了在包含有意义的训练步骤时需要更长的区块时间(可能长达数小时),使用专用GPU硬件时的成本与收益权衡,以及二进制差分如何逐步优化更新的问题。最后,我们注意到微调可以以类似的方式处理,只需更改数据集和采样的方式即可保持整体验证流程不变。我们的协议允许验证者发布“积极”或“消极”的认证;这些在最终确定时进行汇总,确认更新或削减矿工的奖励。
https://arxiv.org/abs/2504.07540
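A self-contained sketch of the Merkle-proof component described above: parameter chunks become leaves, the root is what would be committed on-chain, and a verifier checks a randomly sampled leaf with a logarithmic-size proof. The chunking and hash choices are illustrative.

```python
# Merkle tree over parameter chunks with a single-leaf membership proof.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        if len(prev) % 2:
            prev = prev + [prev[-1]]               # duplicate last node on odd levels
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def prove(levels, index):
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        proof.append((level[index ^ 1], index % 2))  # (sibling hash, am-I-the-right-child?)
        index //= 2
    return proof

def verify(leaf, proof, root):
    node = h(leaf)
    for sibling, is_right in proof:
        node = h(sibling + node) if is_right else h(node + sibling)
    return node == root

# Toy "parameter chunks" standing in for slices of the full 32-bit model.
chunks = [f"layer-{i}-weights".encode() for i in range(10)]
levels = build_tree(chunks)
root = levels[-1][0]                               # the value committed on-chain
proof = prove(levels, index=7)                     # randomly sampled leaf to audit
print(verify(chunks[7], proof, root))              # True -> sampled leaf matches the root
```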
Post-training quantization (PTQ) reduces a model's memory footprint by mapping full precision weights into low bit weights without costly retraining, but can degrade its downstream performance especially in low 2- to 3-bit settings. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery, directly conditioning the quantization process on specific weight circuits -- which we define as sets of weights associated with downstream task performance. These weights are kept as 16-bit weights, while others are quantized, maintaining performance while only adding a marginal memory cost. Specifically, TaCQ contrasts unquantized model weights with a uniformly-quantized model to estimate the expected change in weights due to quantization and uses gradient information to predict the resulting impact on task performance, allowing us to preserve task-specific weights. We compare TaCQ-based quantization to existing mixed-precision quantization methods when conditioning both on general-purpose and task-specific data. Across QA, math reasoning, and text-to-SQL tasks for both Llama-3 and Qwen2.5, we find that TaCQ outperforms baselines using the same calibration data and a lower weight budget, achieving major improvements in the 2 and 3-bit regime. With only 3.1 bits we are able to recover 96% of Llama-3-8B-Instruct's unquantized 16-bit MMLU performance, obtaining a 5.25% absolute improvement over SPQR. We also observe consistently large gains over existing methods in the 2-bit regime, with an average gain of 14.74% over the strongest baseline, SliM-LLM. Moreover, we observe a 7.20% gain without conditioning on specific tasks, showing TaCQ's ability to identify important weights is not limited to task-conditioned settings.
训练后的量化(PTQ)通过将全精度权重映射到低比特权重来减少模型的内存占用,而无需昂贵的重新训练。然而,在2至3位的设置中,这种方法可能会降低模型在下游任务中的性能。我们开发了一种新的混合精度PTQ方法,称为任务电路量化(TaCQ),该方法借鉴了自动化电路发现的思想,直接根据特定权重电路来调整量化过程——我们将这些权重定义为与下游任务表现相关的权重集合。这些关键权重保持16位精度不变,而其他权重则进行量化处理,在仅增加边际内存成本的情况下维护性能。 具体来说,TaCQ通过对比未量化的模型权重和均匀量化的模型,估计量化导致的权重预期变化,并利用梯度信息预测对任务性能的影响,从而能够保留特定任务所需的权重。我们比较了基于TaCQ的方法与其他混合精度量化方法在一般数据集和特定任务数据集上的表现情况。 对于问答、数学推理以及文本到SQL的任务,在Llama-3和Qwen2.5模型上,TaCQ在相同的校准数据和更低的权重预算下优于基准线,并且在2位和3位设置中取得了显著改进。使用仅3.1比特的情况下,我们能够恢复Llama-3-8B-Instruct未量化的16位MMLU表现的96%,相较于SPQR方法获得了5.25%绝对性能提升。 此外,在2位设置下,TaCQ相对于现有最强基准线SliM-LLM平均表现出14.74%的优势。值得注意的是,即使在不针对特定任务的情况下使用TaCQ,我们也观察到了7.20%的显著增益,表明其识别重要权重的能力不仅仅局限于任务导向设置中。
https://arxiv.org/abs/2504.07389
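A simplified sketch of the TaCQ-style selection rule: estimate how far uniform quantization would move each weight, weight that change by task-gradient information, and keep the highest-impact weights in 16 bits. The saliency formula here is a stand-in for the paper's estimator, and the keep fraction is arbitrary.

```python
# Gradient-weighted saliency for mixed-precision weight selection.
import torch

def uniform_quantize(w: torch.Tensor, bits: int = 3) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

def mixed_precision_mask(w, grad, bits=3, keep_fraction=0.01):
    delta = uniform_quantize(w, bits) - w          # expected change under quantization
    saliency = (grad * delta).abs()                # gradient-weighted impact on the task loss
    k = max(1, int(keep_fraction * w.numel()))
    threshold = saliency.flatten().topk(k).values.min()
    return saliency >= threshold                   # True -> keep this weight in 16-bit

w = torch.randn(256, 256)
grad = torch.randn(256, 256)                       # task gradient from calibration data
keep = mixed_precision_mask(w, grad)
w_deployed = torch.where(keep, w, uniform_quantize(w))
print(f"kept in 16-bit: {keep.float().mean():.2%}")
```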
This paper proposes StreamCodec, a streamable neural audio codec designed for real-time communication. StreamCodec adopts a fully causal, symmetric encoder-decoder structure and operates in the modified discrete cosine transform (MDCT) domain, aiming for low-latency inference and real-time efficient generation. To improve codebook utilization efficiency and compensate for the audio quality loss caused by structural causality, StreamCodec introduces a novel residual scalar-vector quantizer (RSVQ). The RSVQ sequentially connects scalar quantizers and improved vector quantizers in a residual manner, constructing coarse audio contours and refining acoustic details, respectively. Experimental results confirm that the proposed StreamCodec achieves decoded audio quality comparable to advanced non-streamable neural audio codecs. Specifically, on the 16 kHz LibriTTS dataset, StreamCodec attains a ViSQOL score of 4.30 at 1.5 kbps. It has a fixed latency of only 20 ms and achieves a generation speed nearly 20 times real-time on a CPU, with a lightweight model size of just 7M parameters, making it highly suitable for real-time communication applications.
本文提出了一种名为StreamCodec的流式神经音频编解码器,专门用于实时通信。StreamCodec采用完全因果关系的对称编码-解码结构,并在修改后的离散余弦变换(MDCT)域中运行,旨在实现低延迟推理和实时高效生成。为了提高代码簿利用率并补偿由结构性因果性引起的音质损失,StreamCodec引入了一种新型残差标量向量量化器(RSVQ)。该RSVQ以残差方式顺序连接标量量化器和改进的向量量化器,分别构建粗略音频轮廓和细化声学细节。实验结果证实,所提出的StreamCodec达到了与先进的非流式神经音频编解码器相当的解码音质。具体而言,在16kHz LibriTTS数据集上,当比特率为1.5kbps时,StreamCodec获得了4.30的ViSQOL得分。它具有固定的20毫秒延迟,并且在CPU上的生成速度接近于实时的20倍,模型大小仅为7M参数,非常适合用于实时通信应用。
https://arxiv.org/abs/2504.06561
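A minimal sketch of a residual scalar-vector quantizer in the spirit of the RSVQ above: a coarse scalar stage captures the rough contour and a vector-quantization stage encodes the residual detail, with the two reconstructions summed. The scalar step and codebook size are illustrative, and the paper's improved VQ variants are not modeled.

```python
# Residual scalar-vector quantization: scalar stage first, VQ on the residual.
import torch
import torch.nn as nn

class ResidualScalarVectorQuantizer(nn.Module):
    def __init__(self, dim=64, num_codes=256, scalar_step=0.5):
        super().__init__()
        self.scalar_step = scalar_step
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.normal_(self.codebook.weight, std=0.1)

    def forward(self, z):                                                # z: (batch, frames, dim)
        coarse = torch.round(z / self.scalar_step) * self.scalar_step    # scalar stage: coarse contour
        residual = z - coarse
        flat = residual.reshape(-1, residual.shape[-1])
        codes = torch.cdist(flat, self.codebook.weight).argmin(dim=1)    # vector stage: fine detail
        fine = self.codebook(codes).reshape_as(residual)
        return coarse + fine, codes.reshape(z.shape[:-1])

rsvq = ResidualScalarVectorQuantizer()
z = torch.randn(2, 50, 64)
z_hat, tokens = rsvq(z)
print(z_hat.shape, tokens.shape, (z - z_hat).pow(2).mean().item())
```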
The rapid adoption of large language models (LLMs) has led to significant energy consumption and carbon emissions, posing a critical challenge to the sustainability of generative AI technologies. This paper explores the integration of energy-efficient optimization techniques in the deployment of LLMs to address these environmental concerns. We present a case study and framework that demonstrate how strategic quantization and local inference techniques can substantially lower the carbon footprints of LLMs without compromising their operational effectiveness. Experimental results reveal that these methods can reduce energy consumption and carbon emissions by up to 45% post quantization, making them particularly suitable for resource-constrained environments. The findings provide actionable insights for achieving sustainability in AI while maintaining high levels of accuracy and responsiveness.
大型语言模型(LLMs)的迅速采用导致了显著的能量消耗和碳排放,这对生成式AI技术的可持续性提出了严峻挑战。本文探讨了在部署LLMs时集成节能优化技术的方法,以应对这些环境问题。我们提出了一项案例研究和框架,展示了战略性量化和本地推理技术如何能够大幅降低LLMs的碳足迹而不影响其运行效率。实验结果显示,在进行量化后,这些方法可以减少高达45%的能量消耗和碳排放,使其特别适合资源受限的环境。这些发现为在保持高准确性和响应速度的同时实现AI的可持续性提供了可操作的见解。
https://arxiv.org/abs/2504.06307
Machine learning-based embedded systems for safety-critical applications, such as aerospace and autonomous driving, must be robust to perturbations caused by soft errors. As transistor geometries shrink and voltages decrease, modern electronic devices become more susceptible to background radiation, increasing the concern about failures produced by soft errors. The resilience of deep neural networks (DNNs) to these errors depends not only on target device technology but also on model structure and the numerical representation and arithmetic precision of their parameters. Compression techniques like pruning and quantization, used to reduce memory footprint and computational complexity, alter both model structure and representation, affecting soft error robustness. In this regard, although often overlooked, the choice of activation functions (AFs) impacts not only accuracy and trainability but also compressibility and error resilience. This paper explores the use of bounded AFs to enhance robustness against parameter perturbations, while evaluating their effects on model accuracy, compressibility, and computational load with a technology-agnostic approach. We focus on encoder-decoder convolutional models developed for semantic segmentation of hyperspectral images with application to autonomous driving systems. Experiments are conducted on an AMD-Xilinx's KV260 SoM.
基于机器学习的嵌入式系统在航空航天和自动驾驶等关键安全应用中,必须对由软错误引起的扰动具有鲁棒性。随着晶体管几何尺寸缩小和电压降低,现代电子设备更容易受到背景辐射的影响,从而增加了对软错误导致故障的关注。深度神经网络(DNN)对这些错误的抵抗力不仅取决于目标设备技术,还取决于模型结构以及其参数的数值表示和算术精度。压缩技术如剪枝和量化用于减少内存占用和计算复杂性时,会改变模型结构和表示形式,从而影响软错误的鲁棒性。在这方面,虽然常常被忽视,但激活函数(AFs)的选择不仅会影响准确性与可训练性,还会对可压缩性和错误抵抗力产生影响。本文探索了使用有界激活函数以增强参数扰动下的鲁棒性,并通过一种技术无关的方法评估其对模型准确度、可压缩性和计算负担的影响。我们专注于用于超光谱图像语义分割的编码-解码卷积模型,这些模型在自动驾驶系统中有应用价值。实验是在AMD-Xilinx的KV260 SoM上进行的。
https://arxiv.org/abs/2504.05119
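A toy illustration of the bounded-activation argument: flipping a high exponent bit in one weight blows up a ReLU feature, while a bounded activation (hardtanh here) caps the perturbation seen by later layers. The two-layer model and the flipped bit position are arbitrary; this is not the paper's segmentation network.

```python
# Simulated soft error (single bit flip) under unbounded vs. bounded activations.
import struct
import torch
import torch.nn as nn

def flip_bit(value: float, bit: int = 30) -> float:
    (as_int,) = struct.unpack("I", struct.pack("f", value))
    return struct.unpack("f", struct.pack("I", as_int ^ (1 << bit)))[0]

def run(activation: nn.Module, corrupt: bool) -> torch.Tensor:
    torch.manual_seed(0)                            # same weights and inputs in both runs
    layer1, layer2 = nn.Linear(16, 32), nn.Linear(32, 4)
    if corrupt:
        with torch.no_grad():
            layer1.weight[0, 0] = flip_bit(layer1.weight[0, 0].item())  # simulated soft error
    x = torch.randn(8, 16)
    return layer2(activation(layer1(x)))

for name, act in [("relu", nn.ReLU()), ("hardtanh", nn.Hardtanh())]:
    err = (run(act, corrupt=True) - run(act, corrupt=False)).abs().max().item()
    print(f"{name:9s} max output deviation after bit flip: {err:.3e}")
```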
The emergence of DeepSeek R1 and QwQ 32B has broken through performance barriers for running frontier large language models (LLMs) on home devices. While consumer hardware is getting stronger and model quantization is improving, existing end-side solutions still demand GPU clusters, large RAM/VRAM, and high bandwidth, far beyond what a common home cluster can handle. This paper introduces this http URL, a distributed inference system that runs 70B-scale models on everyday home devices using a mix of CPU/GPU, low RAM/VRAM, Wi-Fi, and cross-platform support. It uses mmap to manage model weights and introduces piped-ring parallelism with prefetching to hide disk loading. By modeling heterogeneity in computation, communication, disk, memory (and its management behavior), and OS, it optimally assigns model layers to each device's CPU and GPU, further reducing token latency. An elegant algorithm named Halda is proposed to solve this NP-hard assignment problem. We evaluate this http URL on a common four-node home cluster. It outperforms this http URL, exo, and dllama on 30B+ models while keeping memory pressure below 6%. This brings frontier 30B-70B models, such as Llama 3, DeepSeek R1, Qwen 2.5, and QwQ to home assistants, making advanced AI truly accessible to individuals. The code is open source and available at this https URL.
DeepSeek R1和QwQ 32B的出现,打破了在家庭设备上运行前沿大规模语言模型(LLMs)的性能壁垒。虽然消费者硬件正在变得更强,且模型量化也在改进,但现有的端侧解决方案仍然需要GPU集群、大容量RAM/VRAM以及高带宽,这远超出了普通家庭集群的能力范围。本文介绍了一个名为this http URL的分布式推理系统,该系统可以在日常家庭设备上运行70B规模的大模型,并使用混合CPU/GPU架构、低RAM/VRAM、Wi-Fi和跨平台支持技术实现这一目标。 该系统利用mmap来管理模型权重,并引入了带预取功能的管道环形并行机制以隐藏磁盘加载过程。通过建模计算异构性、通信、存储设备、内存(及其管理行为)以及操作系统,它能将模型层最优分配到每台设备的CPU和GPU上,进一步降低令牌延迟。提出了一种名为Halda的优雅算法来解决这一NP难问题。 我们在一个常见的四节点家庭集群中对该系统进行了评估,结果表明,在30B+规模的大模型上,该系统的表现优于this http URL、exo和dllama,并且内存压力保持在6%以下。这使得前沿的30B-70B规模大模型(如Llama 3、DeepSeek R1、Qwen 2.5和QwQ)可以在家庭助理设备上运行,使高级AI真正变得触手可及。 代码开源,并可在this https URL获取。
https://arxiv.org/abs/2504.08791
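A small sketch of the mmap idea mentioned above, using `numpy.memmap` as a stand-in: the weight file is memory-mapped rather than read eagerly, so only the pages a computation touches are faulted in from disk and the OS can evict them under memory pressure. The file name and layout are invented for illustration.

```python
# Lazily reading weights through a memory map instead of loading the whole file.
import numpy as np

path = "toy_weights.bin"                                       # placeholder weight shard
np.random.rand(1024, 1024).astype(np.float32).tofile(path)     # write a toy shard to disk

weights = np.memmap(path, dtype=np.float32, mode="r", shape=(1024, 1024))
x = np.random.rand(1024).astype(np.float32)
y = weights[:256] @ x        # only the pages backing the first 256 rows are faulted in
print(y.shape)
```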
Recent advancements in reasoning language models have demonstrated remarkable performance in complex tasks, but their extended chain-of-thought reasoning process increases inference overhead. While quantization has been widely adopted to reduce the inference cost of large language models, its impact on reasoning models remains understudied. In this study, we conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families ranging from 1.5B to 70B parameters, and QwQ-32B. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, with extensive evaluation across mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. Our findings reveal that while lossless quantization can be achieved with W8A8 or W4A16 quantization, lower bit-widths introduce significant accuracy risks. We further identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling the model sizes or reasoning steps can effectively enhance the performance. All quantized models and codes will be open-sourced in this https URL.
近期在推理型语言模型方面取得的进展,在处理复杂任务时表现出色,但其扩展的链式思维推理过程增加了推断成本。虽然量化技术已被广泛采用以减少大型语言模型的推断成本,但对于推理模型的影响却研究不足。本研究首次系统地探讨了量化的推理模型,评估了开源的DeepSeek-R1-Distilled Qwen和LLaMA系列模型(从1.5B到70B参数),以及QwQ-32B。我们的调查涵盖了使用最先进的算法在不同位宽下对权重、KV缓存和激活进行量化,并在数学(AIME、MATH-500)、科学(GPQA)和编程(LiveCodeBench)推理基准上进行了广泛的评估。我们的发现表明,虽然可以使用W8A8或W4A16量化实现无损量化,但较低的位宽会带来显著的准确性风险。我们进一步确定模型大小、模型来源以及任务难度是决定性能的关键因素。与预期相反,量化的模型并不会表现出输出长度增加的现象。此外,策略性地调整模型规模或推理步骤可以有效提升性能。所有量化的模型和代码将在this https URL开源。
https://arxiv.org/abs/2504.04823
Deep learning-based computer vision systems adopt complex and large architectures to improve performance, yet they face challenges in deployment on resource-constrained mobile and edge devices. To address this issue, model compression techniques such as pruning, quantization, and matrix factorization have been proposed; however, these compressed models are often highly vulnerable to adversarial attacks. We introduce the Efficient Ensemble Defense (EED) technique, which diversifies the compression of a single base model based on different pruning importance scores and enhances ensemble diversity to achieve high adversarial robustness and resource efficiency. EED dynamically determines the number of necessary sub-models during the inference stage, minimizing unnecessary computations while maintaining high robustness. On the CIFAR-10 and SVHN datasets, EED demonstrated state-of-the-art robustness performance compared to existing adversarial pruning techniques, along with an inference speed improvement of up to 1.86 times. This proves that EED is a powerful defense solution in resource-constrained environments.
基于深度学习的计算机视觉系统采用复杂的大型架构来提升性能,但这些系统在资源受限的移动和边缘设备上部署时面临挑战。为了解决这一问题,提出了模型压缩技术(如剪枝、量化和矩阵分解),然而,这些被压缩后的模型往往容易受到对抗性攻击的影响。我们引入了高效集成防御(EED)技术,该技术基于不同的剪枝重要性分数来多样化单个基础模型的压缩,并通过增强集成多样性实现高对抗鲁棒性和资源效率。EED在推理阶段动态确定所需的子模型数量,以最小化不必要的计算同时保持高度的鲁棒性。在CIFAR-10和SVHN数据集上,与现有的对抗性剪枝技术相比,EED展示了最先进的抗扰性能,并且推理速度提高了最多1.86倍。这证明了EED是资源受限环境中一种强大的防御解决方案。
https://arxiv.org/abs/2504.04747
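A toy sketch of EED's dynamic sub-model selection: differently pruned sub-models (random toy networks here) are evaluated one at a time, and inference stops as soon as the running ensemble prediction is confident enough. The confidence criterion and threshold are assumptions, not the paper's exact rule.

```python
# Dynamic ensemble inference: stop adding sub-models once the prediction is confident.
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-ins for differently pruned sub-models derived from one base model.
sub_models = [nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10)) for _ in range(4)]

def ensemble_predict(x: torch.Tensor, threshold: float = 0.6):
    avg_probs = torch.zeros(x.shape[0], 10)
    for used, model in enumerate(sub_models, start=1):
        avg_probs += (torch.softmax(model(x), dim=-1) - avg_probs) / used   # running mean
        confidence = avg_probs.max(dim=-1).values.mean().item()
        if confidence > threshold:
            break                      # early stop: skip the remaining sub-models
    return avg_probs.argmax(dim=-1), used

preds, models_used = ensemble_predict(torch.randn(8, 32))
print(preds, f"sub-models evaluated: {models_used}")
```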