Recent advances in speech spoofing necessitate stronger verification mechanisms in neural speech codecs to ensure authenticity. Current methods embed numerical watermarks before compression and extract them from reconstructed speech for verification, but face limitations such as separate training processes for the watermark and codec, and insufficient cross-modal information integration, leading to reduced watermark imperceptibility, extraction accuracy, and capacity. To address these issues, we propose WMCodec, the first neural speech codec to jointly train compression-reconstruction and watermark embedding-extraction in an end-to-end manner, optimizing both imperceptibility and extractability of the watermark. Furthermore, we design an iterative Attention Imprint Unit (AIU) for deeper integration of watermark and speech features, reducing the impact of quantization noise on the watermark. Experimental results show WMCodec outperforms AudioSeal with Encodec in most quality metrics for watermark imperceptibility and consistently exceeds both AudioSeal with Encodec and reinforced TraceableSpeech in watermark extraction accuracy. At a bandwidth of 6 kbps with a watermark capacity of 16 bps, WMCodec maintains over 99% extraction accuracy under common attacks, demonstrating strong robustness.
https://arxiv.org/abs/2409.12121
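The abstract above does not detail the AIU's internals; as a rough illustration of what attention-based watermark/speech feature integration could look like, here is a minimal PyTorch sketch (the module structure, dimensions, and 16-bit watermark embedding are our own assumptions, not the WMCodec design):

```python
import torch
import torch.nn as nn

class AttentionImprintSketch(nn.Module):
    """Illustrative cross-attention fusion of watermark bits into speech features.

    This is not the WMCodec AIU; it only sketches the general idea of
    attention-based watermark/speech feature integration before quantization."""
    def __init__(self, feat_dim=128, n_bits=16, n_heads=4):
        super().__init__()
        self.bit_embed = nn.Embedding(2, feat_dim)            # embed each 0/1 watermark bit
        self.pos_embed = nn.Parameter(torch.randn(n_bits, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)

    def forward(self, speech_feats, watermark_bits):
        # speech_feats: (B, T, feat_dim); watermark_bits: (B, n_bits) with values in {0, 1}
        wm = self.bit_embed(watermark_bits) + self.pos_embed   # (B, n_bits, feat_dim)
        imprinted, _ = self.cross_attn(query=speech_feats, key=wm, value=wm)
        return speech_feats + imprinted                        # residual imprint onto speech features

feats = torch.randn(2, 100, 128)
bits = torch.randint(0, 2, (2, 16))
out = AttentionImprintSketch()(feats, bits)
print(out.shape)  # torch.Size([2, 100, 128])
```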
Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, audio codecs often operate at high frame rates, resulting in slow training and inference, especially for autoregressive models. To address this challenge, we present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps bitrate and 21.5 frames per second. We demonstrate that our novel codec can make the inference of LLM-based text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
https://arxiv.org/abs/2409.12117
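Finite scalar quantization, which LFSC builds on, replaces a learned VQ codebook with a fixed, small number of levels per latent dimension. A minimal sketch of the general FSQ idea with a straight-through gradient (the level counts and shapes are illustrative, not the LFSC configuration):

```python
import torch

def fsq_quantize(z, levels):
    """Finite scalar quantization sketch: bound each latent dimension, round it
    to a small number of levels, and pass gradients straight through."""
    # levels: one (odd) level count per latent dimension, e.g. [7, 7, 5, 5]
    half = (torch.tensor(levels, dtype=z.dtype) - 1) / 2
    bounded = torch.tanh(z) * half                      # each dim bounded to (-half, half)
    quantized = torch.round(bounded)
    return bounded + (quantized - bounded).detach()     # straight-through estimator

z = torch.randn(4, 4, requires_grad=True)               # (batch, latent_dim)
zq = fsq_quantize(z, levels=[7, 7, 5, 5])
print(zq)
```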
This paper introduces the Pareto Data Framework, an approach for identifying and selecting the Minimum Viable Data (MVD) required for enabling machine learning applications on constrained platforms such as embedded systems, mobile devices, and Internet of Things (IoT) devices. We demonstrate that strategic data reduction can maintain high performance while significantly reducing bandwidth, energy, computation, and storage costs. The framework identifies Minimum Viable Data (MVD) to optimize efficiency across resource-constrained environments without sacrificing performance. It addresses common inefficient practices in IoT applications, such as sensor overprovisioning, overprecision, and signal oversampling, proposing scalable solutions for optimal sensor selection, signal extraction and transmission, and data representation. An experimental methodology demonstrates effective acoustic data characterization after downsampling, quantization, and truncation to simulate reduced-fidelity sensors and network and storage constraints; results show that up to 95% of performance can be maintained with sample rates reduced by 75% and bit depths and clip lengths reduced by 50%, which translates into substantial cost and resource reductions. These findings have implications for the design and development of constrained systems. The paper also discusses broader implications of the framework, including the potential to democratize advanced AI technologies across IoT applications and sectors such as agriculture, transportation, and manufacturing to improve access and multiply the benefits of data-driven insights.
https://arxiv.org/abs/2409.12112
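As a rough illustration of the fidelity-reduction pipeline described above (downsampling, bit-depth reduction, and clip truncation), a simulation on a waveform could look like the sketch below; the target rate, bit depth, and clip length are arbitrary placeholders rather than the paper's settings:

```python
import numpy as np

def reduce_fidelity(audio, sr, target_sr=4000, bits=8, clip_seconds=0.5):
    """Simulate a lower-fidelity sensor: downsample, quantize, truncate."""
    # Downsample by simple decimation (a real pipeline would low-pass filter first).
    step = sr // target_sr
    down = audio[::step]
    # Quantize to the target bit depth over [-1, 1].
    levels = 2 ** bits
    quantized = np.round((down + 1.0) / 2.0 * (levels - 1)) / (levels - 1) * 2.0 - 1.0
    # Truncate to a shorter clip.
    return quantized[: int(target_sr * clip_seconds)]

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
audio = 0.5 * np.sin(2 * np.pi * 440 * t)    # a 440 Hz test tone
low_fi = reduce_fidelity(audio, sr)
print(low_fi.shape)                          # (2000,)
```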
This paper provides a comprehensive overview of the principles, challenges, and methodologies associated with quantizing large-scale neural network models. As neural networks have evolved towards larger and more complex architectures to address increasingly sophisticated tasks, the computational and energy costs have escalated significantly. We explore the necessity and impact of model size growth, highlighting the performance benefits as well as the computational challenges and environmental considerations. The core focus is on model quantization as a fundamental approach to mitigate these challenges by reducing model size and improving efficiency without substantially compromising accuracy. We delve into various quantization techniques, including both post-training quantization (PTQ) and quantization-aware training (QAT), and analyze several state-of-the-art algorithms such as LLM-QAT, PEQA(L4Q), ZeroQuant, SmoothQuant, and others. Through comparative analysis, we examine how these methods address issues like outliers, importance weighting, and activation quantization, ultimately contributing to more sustainable and accessible deployment of large-scale models.
https://arxiv.org/abs/2409.11650
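As background for the survey above, the elementary operation underlying most PTQ schemes is per-channel integer quantization of weights; a minimal sketch of symmetric absmax int8 quantization (illustrative only, not any specific algorithm analyzed in the paper):

```python
import numpy as np

def quantize_per_channel_int8(w):
    """Symmetric absmax int8 quantization with one scale per output channel."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0    # (out_channels, 1)
    scale = np.maximum(scale, 1e-12)                        # guard against all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 16).astype(np.float32)   # (out_channels, in_features)
q, scale = quantize_per_channel_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())
```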
Prior research has evaluated quantized LLMs using limited metrics such as perplexity, a few basic knowledge tasks, and old datasets. Additionally, recent large-scale models such as Llama 3.1, with up to 405B parameters, have not been thoroughly examined. This paper evaluates the performance of instruction-tuned LLMs across various quantization methods (GPTQ, AWQ, SmoothQuant, and FP8) on models ranging from 7B to 405B. Using 13 benchmarks, we assess performance across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue. Our key findings reveal that (1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, except for hallucination detection and instruction following; (2) performance varies significantly with different quantization methods, model size, and bit-width, with weight-only methods often yielding better results in larger models; (3) task difficulty does not significantly impact accuracy degradation due to quantization; and (4) the MT-Bench evaluation method has limited discriminatory power among recent high-performing LLMs.
https://arxiv.org/abs/2409.11055
Large Language Models (LLMs) have been widely adopted to process long-context tasks. However, the large memory overhead of the key-value (KV) cache poses significant challenges in long-context scenarios. Existing training-free KV cache compression methods typically focus on quantization and token pruning, which have compression limits, and excessive sparsity can lead to severe performance degradation. Other methods design new architectures with less KV overhead but require significant training overhead. To address the above two drawbacks, we further explore the redundancy in the channel dimension and apply an architecture-level design with minor training costs. Therefore, we introduce CSKV, a training-efficient Channel Shrinking technique for KV cache compression: (1) We first analyze the singular value distribution of the KV cache, revealing significant redundancy and compression potential along the channel dimension. Based on this observation, we propose using low-rank decomposition for key and value layers and storing the low-dimension features. (2) To preserve model performance, we introduce a bi-branch KV cache, including a window-based full-precision KV cache and a low-precision compressed KV cache. (3) To reduce the training costs, we minimize the layer-wise reconstruction loss for the compressed KV cache instead of retraining the entire LLMs. Extensive experiments show that CSKV can reduce the memory overhead of the KV cache by 80% while maintaining the model's long-context capability. Moreover, we show that our method can be seamlessly combined with quantization to further reduce the memory overhead, achieving a compression ratio of up to 95%.
https://arxiv.org/abs/2409.10593
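A rough sketch of the channel-shrinking idea above — examine the singular values of cached keys or values, keep a low-rank channel basis, and store low-dimensional features — could look like this (the rank and shapes are illustrative, and this is not the CSKV implementation):

```python
import torch

def lowrank_kv_sketch(K, rank):
    """Compress cached keys along the channel dimension via truncated SVD.

    K: (num_tokens, head_dim) cached keys for one head.
    Returns the low-dimensional features and the basis needed to reconstruct them."""
    U, S, Vh = torch.linalg.svd(K, full_matrices=False)
    basis = Vh[:rank]                    # (rank, head_dim) channel basis
    features = K @ basis.T               # (num_tokens, rank) stored instead of K
    K_hat = features @ basis             # reconstruction when the cache is read
    return features, basis, K_hat

K = torch.randn(512, 128)                # 512 cached tokens, 128 channels
features, basis, K_hat = lowrank_kv_sketch(K, rank=32)
print(features.shape, "relative reconstruction error:",
      (torch.norm(K - K_hat) / torch.norm(K)).item())
```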
Learning compact and meaningful latent space representations has been shown to be very useful in generative modeling tasks for visual data. One particular example is applying Vector Quantization (VQ) in variational autoencoders (VQ-VAEs, VQ-GANs, etc.), which has demonstrated state-of-the-art performance in many modern generative modeling applications. Quantizing the latent space has been justified by the assumption that the data themselves are inherently discrete in the latent space (like pixel values). In this paper, we propose an alternative representation of the latent space by relaxing the structural assumption made by the VQ formulation. Specifically, we assume that the latent space can be approximated by a union-of-subspaces model corresponding to a dictionary-based representation under a sparsity constraint. The dictionary is learned/updated during the training process. We apply this approach to two models: Dictionary Learning Variational Autoencoders (DL-VAEs) and DL-VAEs with Generative Adversarial Networks (DL-GANs). We show empirically that our latent space is more expressive and leads to better representations than the VQ approach in terms of reconstruction quality, at the expense of a small computational overhead for the latent-space computation. Our results thus suggest that the true benefit of the VQ approach might not be the discretization of the latent space, but rather the lossy compression of the latent space. We confirm this hypothesis by showing that our sparse representations also address the codebook collapse issue commonly found in VQ-family models.
https://arxiv.org/abs/2409.11184
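One way to picture the dictionary-based alternative to a VQ lookup is sparse coding of each latent vector against a dictionary; a minimal sketch using orthogonal matching pursuit (the dictionary here is random for illustration, whereas DL-VAE/DL-GAN learn it during training):

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

latent_dim, n_atoms, sparsity = 64, 256, 4
rng = np.random.default_rng(0)

# Stand-in dictionary; in DL-VAE/DL-GAN it would be learned jointly with the model.
D = rng.standard_normal((latent_dim, n_atoms))
D /= np.linalg.norm(D, axis=0, keepdims=True)

z = rng.standard_normal(latent_dim)                   # an encoder output
alpha = orthogonal_mp(D, z, n_nonzero_coefs=sparsity) # sparse code over dictionary atoms
z_hat = D @ alpha                                     # union-of-subspaces approximation

print("non-zeros:", np.count_nonzero(alpha),
      "relative error:", np.linalg.norm(z - z_hat) / np.linalg.norm(z))
```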
Ultrasound imaging of the forearm has demonstrated significant potential for accurate hand gesture classification. Despite this progress, there has been limited focus on developing a stand-alone, end-to-end gesture recognition system that is mobile, real-time, and more user friendly. To bridge this gap, this paper explores the deployment of deep neural networks for forearm ultrasound-based hand gesture recognition on edge devices. Utilizing quantization techniques, we achieve substantial reductions in model size while maintaining high accuracy and low latency. Our best model, with Float16 quantization, achieves a test accuracy of 92% and an inference time of 0.31 seconds on a Raspberry Pi. These results demonstrate the feasibility of efficient, real-time gesture recognition on resource-limited edge devices, paving the way for wearable ultrasound-based systems.
https://arxiv.org/abs/2409.09915
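The abstract does not name the deployment toolchain; assuming a Keras model exported with TensorFlow Lite, Float16 post-training quantization would look roughly like this (the stand-in architecture below is not the paper's model):

```python
import tensorflow as tf

# Stand-in classifier; the actual gesture-recognition architecture is not specified here.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]   # Float16 weight quantization
tflite_model = converter.convert()

with open("gesture_fp16.tflite", "wb") as f:            # deployable on a Raspberry Pi
    f.write(tflite_model)
```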
Purpose: To develop and evaluate an automated system for extracting structured clinical information from unstructured radiology and pathology reports using open-weights large language models (LMs) and retrieval augmented generation (RAG), and to assess the effects of model configuration variables on extraction performance. Methods and Materials: The study utilized two datasets: 7,294 radiology reports annotated for Brain Tumor Reporting and Data System (BT-RADS) scores and 2,154 pathology reports annotated for isocitrate dehydrogenase (IDH) mutation status. An automated pipeline was developed to benchmark the performance of various LMs and RAG configurations. The impact of model size, quantization, prompting strategies, output formatting, and inference parameters was systematically evaluated. Results: The best-performing models achieved over 98% accuracy in extracting BT-RADS scores from radiology reports and over 90% for IDH mutation status extraction from pathology reports, with a medically fine-tuned llama3 as the top model. Larger, newer, and domain fine-tuned models consistently outperformed older and smaller models. Model quantization had minimal impact on performance. Few-shot prompting significantly improved accuracy. RAG improved performance for complex pathology reports but not for shorter radiology reports. Conclusions: Open LMs demonstrate significant potential for automated extraction of structured clinical data from unstructured clinical reports in local, privacy-preserving applications. Careful model selection, prompt engineering, and semi-automated optimization using annotated data are critical for optimal performance. These approaches could be reliable enough for practical use in research workflows, highlighting the potential for human-machine collaboration in healthcare data extraction.
https://arxiv.org/abs/2409.10576
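As an illustration of the few-shot prompting strategy the study found helpful, a structured-extraction prompt could be assembled along these lines (the example reports and scores below are invented placeholders, not data from the study):

```python
def build_bt_rads_prompt(report_text, examples):
    """Assemble a few-shot prompt asking an LM for a structured BT-RADS score."""
    shots = "\n\n".join(
        f"Report:\n{ex['report']}\nBT-RADS score: {ex['score']}" for ex in examples
    )
    return (
        "Extract the BT-RADS score from the radiology report. "
        "Answer with the score only.\n\n"
        f"{shots}\n\nReport:\n{report_text}\nBT-RADS score:"
    )

# Invented placeholder examples for illustration only.
examples = [
    {"report": "Stable post-treatment changes, no new enhancement.", "score": "1a"},
    {"report": "Increased enhancing component concerning for progression.", "score": "4"},
]
print(build_bt_rads_prompt("Slight increase in FLAIR signal, enhancement unchanged.", examples))
```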
3D Gaussian Splatting demonstrates excellent quality and speed in novel view synthesis. Nevertheless, the huge file size of the 3D Gaussians presents challenges for transmission and storage. Current works design compact models to replace the substantial volume and attributes of 3D Gaussians, along with intensive training to distill information. These endeavors demand considerable training time, presenting formidable hurdles for practical deployment. To this end, we propose MesonGS, a codec for post-training compression of 3D Gaussians. Initially, we introduce a measurement criterion that considers both view-dependent and view-independent factors to assess the impact of each Gaussian point on the rendering output, enabling the removal of insignificant points. Subsequently, we decrease the entropy of attributes through two transformations that complement subsequent entropy coding techniques to enhance the file compression rate. More specifically, we first replace rotation quaternions with Euler angles; then, we apply region adaptive hierarchical transform to key attributes to reduce entropy. Lastly, we adopt finer-grained quantization to avoid excessive information loss. Moreover, a well-crafted finetune scheme is devised to restore quality. Extensive experiments demonstrate that MesonGS significantly reduces the size of 3D Gaussians while preserving competitive quality.
https://arxiv.org/abs/2409.09756
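Two of the entropy-reducing steps above are easy to picture in isolation: replacing rotation quaternions with Euler angles, and quantizing an attribute to a small number of levels. A minimal sketch of both (not the MesonGS code, and omitting the region-adaptive hierarchical transform and finetuning):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def quaternion_to_euler(quat_xyzw):
    """Convert per-Gaussian rotation quaternions (x, y, z, w) to Euler angles."""
    return Rotation.from_quat(quat_xyzw).as_euler("xyz")

def uniform_quantize(x, n_bits=8):
    """Uniformly quantize an attribute to 2**n_bits levels over its value range."""
    lo, hi = x.min(), x.max()
    levels = 2 ** n_bits - 1
    q = np.round((x - lo) / (hi - lo) * levels)
    return q.astype(np.uint16), (lo, hi)

quats = np.random.randn(1000, 4)
quats /= np.linalg.norm(quats, axis=1, keepdims=True)
eulers = quaternion_to_euler(quats)          # 3 values per point instead of 4
codes, (lo, hi) = uniform_quantize(eulers)   # integer codes ready for entropy coding
print(eulers.shape, codes.dtype)
```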
Selecting an automatic metric that best emulates human judgments is often non-trivial, because there is no clear definition of "best emulates." A meta-metric is required to compare the human judgments to the automatic metric judgments, and metric rankings depend on the choice of meta-metric. We propose Soft Pairwise Accuracy (SPA), a new meta-metric that builds on Pairwise Accuracy (PA) but incorporates the statistical significance of both the human judgments and the metric judgments. SPA allows for more fine-grained comparisons between systems than a simplistic binary win/loss, and addresses a number of shortcomings with PA: it is more stable with respect to both the number of systems and segments used for evaluation, it mitigates the issue of metric ties due to quantization, and it produces more statistically significant results. SPA was selected as the official system-level metric for the 2024 WMT metric shared task.
https://arxiv.org/abs/2409.09598
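Pairwise Accuracy scores a metric by how often its ordering of a system pair matches the human ordering; SPA softens that binary agreement using the statistical significance of each pairwise difference. The sketch below implements plain PA and one plausible reading of the soft variant (comparing pairwise p-values); the exact SPA definition should be taken from the paper:

```python
import itertools
import numpy as np
from scipy import stats

def pairwise_accuracy(human_scores, metric_scores):
    """Fraction of system pairs where the metric orders the pair as humans do."""
    pairs = itertools.combinations(range(len(human_scores)), 2)
    agree = [
        np.sign(human_scores[i] - human_scores[j])
        == np.sign(metric_scores[i] - metric_scores[j])
        for i, j in pairs
    ]
    return float(np.mean(agree))

def soft_pairwise_accuracy(human_seg, metric_seg):
    """Sketch of a soft variant: compare pairwise significance (p-values) derived
    from human and metric segment-level scores instead of binary wins/losses."""
    terms = []
    for i, j in itertools.combinations(range(human_seg.shape[0]), 2):
        p_h = stats.ttest_rel(human_seg[i], human_seg[j]).pvalue
        p_m = stats.ttest_rel(metric_seg[i], metric_seg[j]).pvalue
        terms.append(1.0 - abs(p_h - p_m))
    return float(np.mean(terms))

rng = np.random.default_rng(0)
human = rng.normal(size=(4, 100))                           # 4 systems x 100 segments
metric = human + rng.normal(scale=0.5, size=human.shape)    # a noisy automatic metric
print(pairwise_accuracy(human.mean(1), metric.mean(1)),
      soft_pairwise_accuracy(human, metric))
```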
The discontinuous operations inherent in quantization and sparsification introduce obstacles to backpropagation. This is particularly challenging when training deep neural networks in ultra-low precision and sparse regimes. We propose a novel, robust, and universal solution: a denoising affine transform that stabilizes training under these challenging conditions. By formulating quantization and sparsification as perturbations during training, we derive a perturbation-resilient approach based on ridge regression. Our solution employs a piecewise constant backbone model to ensure a performance lower bound and features an inherent noise reduction mechanism to mitigate perturbation-induced corruption. This formulation allows existing models to be trained at arbitrarily low precision and sparsity levels with off-the-shelf recipes. Furthermore, our method provides a novel perspective on training temporal binary neural networks, contributing to ongoing efforts to narrow the gap between artificial and biological neural networks.
https://arxiv.org/abs/2409.09245
Most of the prevalent approaches in speech prosody modeling rely on learning global style representations in a continuous latent space which encode and transfer the attributes of reference speech. However, recent work on neural codecs which are based on Residual Vector Quantization (RVQ) already shows great potential offering distinct advantages. We investigate the prosody modeling capabilities of the discrete space of such an RVQ-VAE model, modifying it to operate on the phoneme level. We condition both the encoder and decoder of the model on linguistic representations and apply a global speaker embedding in order to factor out both phonetic and speaker information. We conduct an extensive set of investigations based on subjective experiments and objective measures to show that the phoneme-level discrete latent representations obtained this way achieve a high degree of disentanglement, capturing fine-grained prosodic information that is robust and transferable. The latent space turns out to have interpretable structure, with its principal components corresponding to pitch and energy.
https://arxiv.org/abs/2409.08664
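Residual vector quantization, the mechanism behind the codec whose discrete space is probed here, quantizes a vector in stages, with each codebook coding the residual left by the previous one. A minimal sketch (random codebooks, purely for illustration):

```python
import torch

def rvq_encode(z, codebooks):
    """Residual VQ sketch: each codebook quantizes what the previous ones missed."""
    residual = z
    codes, quantized = [], torch.zeros_like(z)
    for cb in codebooks:                       # cb: (codebook_size, dim)
        dists = torch.cdist(residual, cb)      # (batch, codebook_size)
        idx = dists.argmin(dim=-1)
        chosen = cb[idx]
        codes.append(idx)                      # one discrete index per stage
        quantized = quantized + chosen
        residual = residual - chosen
    return codes, quantized

dim, n_codebooks, codebook_size = 64, 4, 256
codebooks = [torch.randn(codebook_size, dim) for _ in range(n_codebooks)]
z = torch.randn(8, dim)
codes, z_hat = rvq_encode(z, codebooks)
print(len(codes), (torch.norm(z - z_hat) / torch.norm(z)).item())
```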
Diffusion Transformers (DiTs) have recently attracted significant interest from both industry and academia due to their enhanced capabilities in visual generation, surpassing the performance of traditional diffusion models that employ U-Net. However, the improved performance of DiTs comes at the expense of higher parameter counts and implementation costs, which significantly limits their deployment on resource-constrained devices like mobile phones. We propose DiTAS, a data-free post-training quantization (PTQ) method for efficient DiT inference. DiTAS relies on the proposed temporal-aggregated smoothing techniques to mitigate the impact of the channel-wise outliers within the input activations, leading to much lower quantization error under extremely low bitwidth. To further enhance the performance of the quantized DiT, we adopt the layer-wise grid search strategy to optimize the smoothing factor. Experimental results demonstrate that our approach enables 4-bit weight, 8-bit activation (W4A8) quantization for DiTs while maintaining comparable performance as the full-precision model.
https://arxiv.org/abs/2409.07756
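The smoothing idea can be pictured as migrating per-channel activation outliers into the weights before quantization, with DiTAS aggregating activation statistics across diffusion timesteps. The rebalancing below is a rough sketch in that spirit (the aggregation rule and α are illustrative, not the paper's exact procedure):

```python
import torch

def temporal_aggregated_smoothing(acts_per_step, W, alpha=0.5, eps=1e-8):
    """Compute a per-channel smoothing factor from activations aggregated over
    timesteps, then rebalance so that (X / s) @ (diag(s) W) == X @ W."""
    # acts_per_step: list of (tokens, in_features) activations, one per timestep
    act_max = torch.stack([a.abs().amax(dim=0) for a in acts_per_step]).amax(dim=0)
    w_max = W.abs().amax(dim=1)                      # per input channel, W: (in, out)
    s = ((act_max ** alpha) / (w_max ** (1 - alpha) + eps)).clamp(min=eps)
    W_smoothed = W * s.unsqueeze(1)                  # fold s into the weights offline
    def smooth_acts(X):
        return X / s                                 # applied to activations at runtime
    return smooth_acts, W_smoothed

W = torch.randn(128, 256)                            # (in_features, out_features)
acts = [torch.randn(64, 128) * (1 + 5 * torch.rand(128)) for _ in range(10)]
smooth_acts, W_s = temporal_aggregated_smoothing(acts, W)
X = acts[0]
print(torch.allclose(X @ W, smooth_acts(X) @ W_s, atol=1e-4))  # output is preserved
```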
Recent advances in implicit neural representation (INR)-based video coding have demonstrated its potential to compete with both conventional and other learning-based approaches. With INR methods, a neural network is trained to overfit a video sequence, with its parameters compressed to obtain a compact representation of the video content. However, although promising results have been achieved, the best INR-based methods are still outperformed by the latest standard codecs, such as VVC VTM, partially due to the simple model compression techniques employed. In this paper, rather than focusing on representation architectures as in many existing works, we propose a novel INR-based video compression framework, Neural Video Representation Compression (NVRC), targeting compression of the representation. Based on the novel entropy coding and quantization models proposed, NVRC, for the first time, is able to optimize an INR-based video codec in a fully end-to-end manner. To further minimize the additional bitrate overhead introduced by the entropy models, we have also proposed a new model compression framework for coding all the network, quantization, and entropy model parameters hierarchically. Our experiments show that NVRC outperforms many conventional and learning-based benchmark codecs, with a 24% average coding gain over VVC VTM (Random Access) on the UVG dataset, measured in PSNR. As far as we are aware, this is the first time an INR-based video codec has achieved such performance. The implementation of NVRC will be released at this http URL.
https://arxiv.org/abs/2409.07414
Image Transformers have shown remarkable success in image restoration tasks. Nevertheless, most transformer-based models are constrained by high memory occupancy. Our goal is to reduce the memory consumption of the Swin Transformer while speeding up the model during training. Thus, we introduce AgileIR, a group shifted attention mechanism combined with window attention, which sparsifies and simplifies the model architecture. We propose Group Shifted Window Attention (GSWA) to decompose Shift Window Multi-head Self Attention (SW-MSA) and Window Multi-head Self Attention (W-MSA) into groups across their attention heads, reducing memory usage during backpropagation. In addition, we keep shifted-window masking and its shifted learnable biases during training, in order to induce interaction across windows within each channel. We also re-allocate projection parameters to accelerate attention matrix calculation, which we found causes only a negligible decrease in performance. In experiments, compared with our baseline SwinIR and other efficient quantization models, AgileIR maintains performance at 32.20 dB on the Set5 evaluation dataset, exceeding other tailor-made efficient methods, and saves over 50% of memory when a large batch size is employed.
https://arxiv.org/abs/2409.06206
Representing speech with discrete units has been widely used in speech codec and speech generation. However, there are several unverified claims about self-supervised discrete units, such as disentangling phonetic and speaker information with k-means, or assuming information loss after k-means. In this work, we take an information-theoretic perspective to answer how much information is present (information completeness) and how much information is accessible (information accessibility), before and after residual vector quantization. We show a lower bound for information completeness and estimate completeness on discretized HuBERT representations after residual vector quantization. We find that speaker information is sufficiently present in HuBERT discrete units, and that phonetic information is sufficiently present in the residual, showing that vector quantization does not achieve disentanglement. Our results offer a comprehensive assessment on the choice of discrete units, and suggest that a lot more information in the residual should be mined rather than discarded.
https://arxiv.org/abs/2409.06109
Vector quantization (VQ) is a method for deterministically learning features through discrete codebook representations. Recent works have utilized visual tokenizers to discretize visual regions for self-supervised representation learning. However, a notable limitation of these tokenizers is their lack of semantics, as they are derived solely from the pretext task of reconstructing raw image pixels in an auto-encoder paradigm. Additionally, issues like imbalanced codebook distribution and codebook collapse can adversely impact performance due to inefficient codebook utilization. To address these challenges, we introduce SGC-VQGAN, which uses a Semantic Online Clustering method to enhance token semantics through Consistent Semantic Learning. Utilizing inference results from a segmentation model, our approach constructs a temporospatially consistent semantic codebook, addressing issues of codebook collapse and imbalanced token semantics. Our proposed Pyramid Feature Learning pipeline integrates multi-level features to capture both image details and semantics simultaneously. As a result, SGC-VQGAN achieves SOTA performance in both reconstruction quality and various downstream tasks. Its simplicity, requiring no additional parameter learning, enables its direct application in downstream tasks, presenting significant potential.
https://arxiv.org/abs/2409.06105
We present BigCodec, a low-bitrate neural speech codec. While recent neural speech codecs have shown impressive progress, their performance significantly deteriorates at low bitrates (around 1 kbps). Although a low bitrate inherently restricts performance, other factors, such as model capacity, also hinder further improvements. To address this problem, we scale the model up to 159M parameters, more than 10 times larger than popular codecs with about 10M parameters. In addition, we integrate sequential models into traditional convolutional architectures to better capture temporal dependency and adopt low-dimensional vector quantization to ensure high code utilization. Comprehensive objective and subjective evaluations show that BigCodec, with a bitrate of 1.04 kbps, significantly outperforms several existing low-bitrate codecs. Furthermore, BigCodec achieves objective performance comparable to popular codecs operating at 4-6 times higher bitrates, and even delivers better subjective perceptual quality than the ground truth.
https://arxiv.org/abs/2409.05377
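Low-dimensional vector quantization, mentioned above as a way to keep code utilization high, projects features into a small space before the nearest-neighbor lookup and projects back afterwards; a minimal sketch (dimensions and codebook size are illustrative, not BigCodec's):

```python
import torch
import torch.nn as nn

class LowDimVQSketch(nn.Module):
    """Project features to a low-dimensional space, quantize there, project back."""
    def __init__(self, feat_dim=512, code_dim=8, codebook_size=8192):
        super().__init__()
        self.down = nn.Linear(feat_dim, code_dim)
        self.up = nn.Linear(code_dim, feat_dim)
        self.codebook = nn.Embedding(codebook_size, code_dim)

    def forward(self, x):                                     # x: (batch, time, feat_dim)
        z = self.down(x)                                      # (batch, time, code_dim)
        flat = z.reshape(-1, z.shape[-1])
        dists = torch.cdist(flat, self.codebook.weight)       # (batch*time, codebook_size)
        idx = dists.argmin(dim=-1).reshape(z.shape[:-1])      # (batch, time) code indices
        zq = self.codebook(idx)
        zq = z + (zq - z).detach()                            # straight-through estimator
        return self.up(zq), idx

x = torch.randn(2, 50, 512)
y, codes = LowDimVQSketch()(x)
print(y.shape, codes.shape)   # torch.Size([2, 50, 512]) torch.Size([2, 50])
```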
The rapid advancement and increasing complexity of pretrained models, exemplified by CLIP, offer significant opportunities as well as challenges for Federated Learning (FL), a critical component of privacy-preserving artificial intelligence. This research delves into the intricacies of integrating large foundation models like CLIP within FL frameworks to enhance privacy, efficiency, and adaptability across heterogeneous data landscapes. It specifically addresses the challenges posed by non-IID data distributions, the computational and communication overheads of leveraging such complex models, and the skewed representation of classes within datasets. We propose TriplePlay, a framework that integrates CLIP as an adapter to enhance FL's adaptability and performance across diverse data distributions. This approach addresses the long-tail distribution challenge to ensure fairness while reducing resource demands through quantization and low-rank adaptation techniques. Our simulation results demonstrate that TriplePlay effectively decreases GPU usage costs and speeds up the learning process, achieving convergence with reduced communication overhead.
https://arxiv.org/abs/2409.05347
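Low-rank adaptation, one of the two cost-reduction levers above, adds a small trainable low-rank update to a frozen pretrained weight; a minimal sketch (the rank and dimensions are illustrative and unrelated to CLIP's actual configuration):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = x W^T + x (B A)^T."""
    def __init__(self, in_features, out_features, rank=4, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)          # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))   # zero-init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(768, 768)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print("trainable params:", trainable)   # only the low-rank factors A and B
x = torch.randn(4, 768)
print(layer(x).shape)                   # torch.Size([4, 768])
```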