Generative recommendation systems have achieved significant advances by leveraging semantic IDs to represent items. However, existing approaches that tokenize each modality independently face two critical limitations: (1) redundancy across modalities that reduces efficiency, and (2) failure to capture inter-modal interactions that limits item representation. We introduce FusID, a modality-fused semantic ID framework that addresses these limitations through three key components: (i) multimodal fusion that learns unified representations by jointly encoding information across modalities, (ii) representation learning that brings frequently co-occurring item embeddings closer while maintaining distinctiveness and preventing feature redundancy, and (iii) product quantization that converts the fused continuous embeddings into multiple discrete tokens to mitigate ID conflict. Evaluated on a multimodal next-song recommendation (i.e., playlist continuation) benchmark, FusID achieves zero ID conflicts, ensuring that each token sequence maps to exactly one song, mitigates codebook underutilization, and outperforms baselines in terms of MRR and Recall@k (k = 1, 5, 10, 20).
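Component (iii) is standard product quantization: split the fused embedding into sub-vectors and assign each to its nearest codeword, so an item becomes a short tuple of discrete tokens. A minimal numpy sketch with random stand-in codebooks (the real system would train them, e.g. by k-means):

```python
import numpy as np

def pq_encode(x, codebooks):
    """Encode one fused embedding as M discrete tokens (product quantization).

    x         : (D,) fused item embedding
    codebooks : list of M arrays, each (K, D // M), one codebook per sub-space
    """
    subs = np.split(x, len(codebooks))           # M sub-vectors of length D // M
    return tuple(int(np.argmin(np.linalg.norm(cb - s, axis=1)))
                 for s, cb in zip(subs, codebooks))

rng = np.random.default_rng(0)
D, M, K = 8, 4, 16                               # embedding dim, sub-spaces, codewords
codebooks = [rng.normal(size=(K, D // M)) for _ in range(M)]
tokens = pq_encode(rng.normal(size=D), codebooks)
print(tokens)                                    # a tuple of M token ids in [0, K)
```

Because each sub-space contributes one of K codes, two items collide only if they agree in all M sub-spaces, which is what mitigates ID conflict relative to a single flat codebook.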
https://arxiv.org/abs/2601.08764
Public large language models (LLMs) are typically safety-aligned during pretraining, yet task-specific fine-tuning required for deployment often erodes this alignment and introduces safety risks. Existing defenses either embed safety recovery into fine-tuning or rely on fine-tuning-derived priors for post-hoc correction, leaving safety recovery tightly coupled with training and incurring high computational overhead and a complex workflow. To address these challenges, we propose \texttt{Q-realign}, a post-hoc defense method based on post-training quantization, guided by an analysis of representational structure. By reframing quantization as a dual-objective procedure for compression and safety, \texttt{Q-realign} decouples safety alignment from fine-tuning and integrates naturally into modern deployment pipelines. Experiments across multiple models and datasets demonstrate that our method substantially reduces unsafe behaviors while preserving task performance, with significant reductions in memory usage and GPU hours. Notably, our approach can recover the safety alignment of a fine-tuned 7B LLM on a single RTX 4090 within 40 minutes. Overall, our work provides a practical, turnkey solution for safety-aware deployment.
https://arxiv.org/abs/2601.08089
Discrete motion tokenization has recently enabled Large Language Models (LLMs) to serve as versatile backbones for motion understanding and motion-language reasoning. However, existing pipelines typically decouple motion quantization from semantic embedding learning, linking them solely via token IDs. This approach fails to effectively align the intrinsic geometry of the motion space with the embedding space, thereby hindering the LLM's capacity for nuanced motion reasoning. We argue that alignment is most effective when both modalities share a unified geometric basis. Therefore, instead of forcing the LLM to reconstruct the complex geometry among motion tokens from scratch, we present a novel framework that explicitly enforces orthogonality on both the motion codebook and the LLM embedding space, ensuring that their relational structures naturally mirror each other. Specifically, we employ a decoder-only quantizer with Gumbel-Softmax for differentiable training and balanced codebook usage. To bridge the modalities, we use a sparse projection that maps motion codes into the LLM embedding space while preserving orthogonality. Finally, a two-stage orthonormal regularization schedule enforces soft constraints during tokenizer training and LLM fine-tuning to maintain geometric alignment without hindering semantic adaptation. Extensive experiments on HumanML3D demonstrate that our framework achieves a 20% performance improvement over current state-of-the-art methods, validating that a unified geometric basis effectively empowers the LLM for nuanced motion reasoning.
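The orthogonality constraint at the heart of the framework reduces to a simple penalty on the codebook (and embedding) matrix. The sketch below assumes the common Frobenius-norm form ||W W^T - I||_F^2, which may differ in detail from the paper's two-stage schedule:

```python
import numpy as np

def orthonormal_penalty(W):
    """Soft orthonormality regularizer ||W W^T - I||_F^2 over the rows of W.

    W : (K, d) codebook or embedding matrix with K <= d.
    The penalty is zero iff the K rows form an orthonormal set.
    """
    K = W.shape[0]
    G = W @ W.T                                  # (K, K) Gram matrix of the rows
    return float(np.sum((G - np.eye(K)) ** 2))

# Rows of an orthogonal matrix incur ~zero penalty; degenerate rows do not.
Q, _ = np.linalg.qr(np.random.default_rng(1).normal(size=(8, 8)))
print(orthonormal_penalty(Q[:4]))                # ~0.0 for orthonormal rows
print(orthonormal_penalty(np.ones((4, 8))))      # large for collapsed rows
```

Used as a soft constraint, this term is simply added to the tokenizer or fine-tuning loss with a schedule-dependent weight, which is how a "two-stage orthonormal regularization schedule" can enforce geometry without blocking semantic adaptation.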
https://arxiv.org/abs/2601.07632
Diffusion Transformers achieve impressive generative quality but remain computationally expensive due to iterative sampling. Recently, dynamic resolution sampling has emerged as a promising acceleration technique by reducing the resolution of early sampling steps. However, existing methods rely on heuristic re-noising at every resolution transition, injecting noise that breaks cross-stage consistency and forces the model to relearn global structure. In addition, these methods indiscriminately upsample the entire latent space at once without checking which regions have actually converged, causing accumulated errors and visible artifacts. Therefore, we propose \textbf{Fresco}, a dynamic resolution framework that unifies re-noising and global structure across stages through progressive upsampling, preserving both the efficiency of low-resolution drafting and the fidelity of high-resolution refinement, with all stages aligned toward the same final target. Fresco achieves near-lossless acceleration across diverse domains and models, including a 10$\times$ speedup on FLUX and 5$\times$ on HunyuanVideo, while remaining orthogonal to distillation, quantization, and feature caching, reaching a 22$\times$ speedup when combined with distilled models. Our code is in the supplementary material and will be released on GitHub.
https://arxiv.org/abs/2601.07462
Diffusion Transformer (DiT) models have achieved unprecedented quality in image and video generation, yet their iterative sampling process remains computationally prohibitive. To accelerate inference, feature caching methods have emerged by reusing intermediate representations across timesteps. However, existing caching approaches treat all feature components uniformly. We reveal that DiT feature spaces contain distinct principal and residual subspaces with divergent temporal behavior: the principal subspace evolves smoothly and predictably, while the residual subspace exhibits volatile, low-energy oscillations that resist accurate prediction. Building on this insight, we propose SVD-Cache, a subspace-aware caching framework that decomposes diffusion features via Singular Value Decomposition (SVD), applies exponential moving average (EMA) prediction to the dominant low-rank components, and directly reuses the residual subspace. Extensive experiments demonstrate that SVD-Cache achieves near-lossless acceleration across diverse models and methods, including a 5.55$\times$ speedup on FLUX and HunyuanVideo, and compatibility with model acceleration techniques including distillation, quantization, and sparse attention. Our code is in the supplementary material and will be released on GitHub.
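The decomposition-and-cache rule can be sketched in a few lines of numpy; the rank, the EMA blend, and the toy features below are illustrative assumptions rather than the paper's calibrated settings:

```python
import numpy as np

def split_subspaces(F, r):
    """Split a feature map into its principal (top-r) and residual parts via SVD."""
    U, S, Vt = np.linalg.svd(F, full_matrices=False)
    principal = (U[:, :r] * S[:r]) @ Vt[:r]
    return principal, F - principal

def cached_feature(F_prev, F_curr, r, beta=0.7):
    """Subspace-aware cache (sketch): EMA-smooth the smooth principal part,
    reuse the volatile residual part as-is. beta and r are assumed values."""
    P_prev, _ = split_subspaces(F_prev, r)
    P_curr, R_curr = split_subspaces(F_curr, r)
    P_pred = beta * P_curr + (1 - beta) * P_prev   # predictable low-rank component
    return P_pred + R_curr                          # residual reused directly

rng = np.random.default_rng(0)
F0, F1 = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
F_hat = cached_feature(F0, F1, r=2)
print(F_hat.shape)                                  # (16, 8)
```

Note that with beta = 1 the cache degenerates to the current feature, so the EMA coefficient directly controls how much temporal smoothing the principal subspace receives.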
https://arxiv.org/abs/2601.07396
The deployment of Large Language Models (LLMs) on resource-constrained edge devices is increasingly hindered by prohibitive memory and computational requirements. While ternary quantization offers a compelling solution by reducing weights to {-1, 0, +1}, current implementations suffer from a fundamental misalignment with commodity hardware. Most existing methods must choose between 2-bit aligned packing, which incurs significant bit wastage, or 1.67-bit irregular packing, which degrades inference speed. To resolve this tension, we propose Sherry, a hardware-efficient ternary quantization framework. Sherry introduces a 3:4 fine-grained sparsity that achieves a regularized 1.25-bit width by packing blocks of four weights into five bits, restoring power-of-two alignment. Furthermore, we identify a weight-trapping issue in sparse ternary training, which leads to representational collapse. To address this, Sherry introduces Arenas, an annealing residual synapse mechanism that maintains representational diversity during training. Empirical evaluations on LLaMA-3.2 across five benchmarks demonstrate that Sherry matches state-of-the-art ternary performance while significantly reducing model size. Notably, on an Intel i7-14700HX CPU, our 1B model achieves zero accuracy loss compared to SOTA baselines while providing 25% bit savings and a 10% speed-up. The code is available at this https URL.
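One plausible reading of the 3:4 packing arithmetic: a group of four ternary weights with exactly one zero has 4 zero positions times 2^3 sign patterns = 32 states, which fits exactly into five bits (1.25 bits per weight). The encoding below is our illustrative guess at such a scheme, not Sherry's actual kernel:

```python
def pack_group(w):
    """Pack four ternary weights with exactly one zero (3:4 sparsity) into 5 bits.

    Assumed layout: 2 bits for the position of the zero, 3 bits for the signs
    of the remaining weights; 4 * 8 = 32 codes, i.e. one value in [0, 31].
    """
    assert len(w) == 4 and w.count(0) == 1 and all(x in (-1, 0, 1) for x in w)
    zero_pos = w.index(0)
    signs = [x for x in w if x != 0]               # three values in {-1, +1}
    sign_bits = sum(((s + 1) // 2) << i for i, s in enumerate(signs))
    return (zero_pos << 3) | sign_bits

def unpack_group(code):
    """Inverse of pack_group."""
    zero_pos, sign_bits = code >> 3, code & 0b111
    signs = [1 if (sign_bits >> i) & 1 else -1 for i in range(3)]
    return signs[:zero_pos] + [0] + signs[zero_pos:]

g = [1, 0, -1, 1]
code = pack_group(g)
assert unpack_group(code) == g and 0 <= code < 32
print(code)                                        # 13
```

The point of the power-of-two width is alignment: five-bit groups of four weights pack evenly into 40-bit (five-byte) words, avoiding the straddled loads that 1.67-bit irregular packing forces on the CPU.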
https://arxiv.org/abs/2601.07892
The benefits of most large language models come with steep and often hidden economic and environmental costs due to their resource usage inefficiency during deployment. Model quantization improves energy and memory efficiency by representing model parameters with lower-precision values. However, compression below 4 bits often distorts activation distributions and degrades performance. We address this challenge by introducing a sliced Wasserstein loss function for distribution-aware calibration in ultra-low-bit post-training quantization. The proposed loss aligns the output distributions of full-precision and quantized models under random linear projections, complementing the standard mean-squared error loss without adding any computational overhead during inference. The proposed loss function can be incorporated into any post-training quantization framework that has a retraining component. We demonstrate its performance gains by incorporating it into two frontier methods, OmniQuant and TesseraQ. Compared to these two baselines, the proposed loss consistently improves both perplexity and downstream task accuracy across multiple ultra-low-bit settings. Our proposed loss function recovers 4.12-20.37% of OmniQuant's lost accuracy on the language model LLaMA-2-7B, 0.93-7.65% on OPT-6.7B, and 2.26-6.20% on LLaMA-2-13B. TesseraQ's accuracy degradation is recovered by 3.63-7.63% in relative terms when it is augmented with our proposed loss function. Taken together, these results demonstrate that distributional alignment provides a simple yet effective performance boost that can push the limits of frontier quantization methods. Our method is available on GitHub to facilitate future progress in ultra-low-bit quantization.
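The proposed loss is easy to state concretely: project both output batches onto random unit directions, sort along each direction, and penalize the gap between the sorted projections (the 1-D Wasserstein-2 distance). A minimal numpy sketch, with batch sizes and projection counts chosen arbitrarily:

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=64, seed=0):
    """Squared sliced Wasserstein-2 distance between two output batches (sketch).

    X, Y : (N, d) full-precision and quantized model outputs.
    Along each random unit direction, the 1-D W2^2 distance between two
    equal-size empirical distributions is the MSE of their sorted projections.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(X.shape[1], n_proj))
    theta /= np.linalg.norm(theta, axis=0, keepdims=True)   # unit directions
    px, py = np.sort(X @ theta, axis=0), np.sort(Y @ theta, axis=0)
    return float(np.mean((px - py) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 16))
assert sliced_wasserstein(X, X) == 0.0        # identical distributions
assert sliced_wasserstein(X, X + 3.0) > 1.0   # shifted distribution is penalized
```

Because sorting ignores sample pairing, this term complements (rather than duplicates) an element-wise MSE loss, and it only appears during calibration, hence no inference-time overhead.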
https://arxiv.org/abs/2601.07878
Embedded vision systems need efficient and robust image processing algorithms to perform in real time on resource-constrained hardware. This research investigates image processing algorithms, specifically edge detection, corner detection, and blob detection, implemented on embedded processors, including DSPs and FPGAs. To address the latency, accuracy, and power consumption issues noted in the image processing literature, optimized algorithm architectures and quantization techniques are employed. In addition, techniques for inter-frame redundancy removal and adaptive frame averaging are used to improve throughput while preserving reasonable image quality. Simulations and hardware trials of the proposed approaches show marked improvements in processing speed and energy efficiency compared to conventional implementations. The advances of this research facilitate a path toward scalable and inexpensive embedded imaging systems for the automotive, surveillance, and robotics sectors, and underscore the benefit of co-designing algorithms and hardware architectures for practical real-time embedded vision applications.
https://arxiv.org/abs/2601.06243
Agents capable of reasoning and planning in the real world require the ability to predict the consequences of their actions. While world models possess this capability, they most often require action labels, which can be costly to obtain at scale. This motivates the learning of latent action models, which can learn an action space from videos alone. Our work addresses the problem of learning latent action world models on in-the-wild videos, expanding the scope of existing works that focus on simple robotics simulations, video games, or manipulation data. While this allows us to capture richer actions, it also introduces challenges stemming from the video diversity, such as environmental noise, or the lack of a common embodiment across videos. To address some of these challenges, we discuss properties that actions should follow as well as relevant architectural choices and evaluations. We find that continuous, but constrained, latent actions are able to capture the complexity of actions from in-the-wild videos, something that the common vector quantization does not. We find, for example, that environmental changes caused by agents, such as humans entering the room, can be transferred across videos. This highlights the capability of learning actions that are specific to in-the-wild videos. In the absence of a common embodiment across videos, we are mainly able to learn latent actions that become localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface and solve planning tasks with our world model with performance similar to action-conditioned baselines. Our analyses and experiments provide a step towards scaling latent action models to the real world.
https://arxiv.org/abs/2601.05230
Three-dimensional medical image segmentation is a fundamental yet computationally demanding task due to the cubic growth of voxel processing and the redundant computation on homogeneous regions. To address these limitations, we propose \textbf{TokenSeg}, a boundary-aware sparse token representation framework for efficient 3D medical volume segmentation. Specifically, (1) we design a \emph{multi-scale hierarchical encoder} that extracts 400 candidate tokens across four resolution levels to capture both global anatomical context and fine boundary details; (2) we introduce a \emph{boundary-aware tokenizer} that combines VQ-VAE quantization with importance scoring to select 100 salient tokens, over 60\% of which lie near tumor boundaries; and (3) we develop a \emph{sparse-to-dense decoder} that reconstructs full-resolution masks through token reprojection, progressive upsampling, and skip connections. Extensive experiments on a 3D breast DCE-MRI dataset comprising 960 cases demonstrate that TokenSeg achieves state-of-the-art performance with 94.49\% Dice and 89.61\% IoU, while reducing GPU memory and inference latency by 64\% and 68\%, respectively. To verify the generalization capability, our evaluations on MSD cardiac and brain MRI benchmark datasets demonstrate that TokenSeg consistently delivers optimal performance across heterogeneous anatomical structures. These results highlight the effectiveness of anatomically informed sparse representation for accurate and efficient 3D medical image segmentation.
https://arxiv.org/abs/2601.04519
Recent advances in 3D Gaussian Splatting have allowed for real-time, high-fidelity novel view synthesis. Nonetheless, these models have significant storage requirements for large and medium-sized scenes, hindering their deployment over cloud and streaming services. Some of the most recent progressive compression techniques for these models rely on progressive masking and scalar quantization to reduce the bitrate of Gaussian attributes using spatial context models. While effective, scalar quantization may not optimally capture the correlations of high-dimensional feature vectors, which can potentially limit the rate-distortion performance. In this work, we introduce a novel progressive codec for 3D Gaussian Splatting that replaces traditional methods with a more powerful Residual Vector Quantization approach to compress the primitive features. Our key contribution is an auto-regressive entropy model, guided by a multi-resolution hash grid, that accurately predicts the conditional probability of each successive transmitted index, allowing coarse and refinement layers to be compressed with high efficiency.
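Residual Vector Quantization itself is easy to sketch: each stage quantizes what the previous stages left unexplained, yielding one index per stage from coarse to fine. The toy codebooks below stand in for the codec's learned ones, and the hash-grid entropy model is omitted:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: stage i quantizes the residual of stage i-1."""
    residual, indices = x.copy(), []
    for cb in codebooks:                           # cb : (K, d)
        i = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(i)
        residual = residual - cb[i]                # pass the leftover to the next stage
    return indices, residual

def rvq_decode(indices, codebooks):
    """Reconstruction is the sum of the selected codewords."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

rng = np.random.default_rng(0)
# Decreasing scales mimic coarse-to-fine stages (illustrative choice).
codebooks = [rng.normal(scale=s, size=(32, 4)) for s in (1.0, 0.3, 0.1)]
x = rng.normal(size=4)
idx, res = rvq_encode(x, codebooks)
print(idx, float(np.linalg.norm(x - rvq_decode(idx, codebooks))))
```

The coarse/refinement split maps naturally onto progressive transmission: decoding only the first indices gives a usable low-quality reconstruction that later indices refine.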
https://arxiv.org/abs/2601.04348
This paper presents a novel technique for embedding textual data into images using quinary combinations of pixel intensities in RGB space. Existing methods predominantly rely on least and most significant bit (LSB and MSB) manipulation, Pixel Value Differencing (PVD), spatial perturbations in RGB channels, transform-domain methods, quantization methods, edge- and region-based methods, and more recently deep learning and generative AI techniques for hiding textual information in the spatial domain of images. Most of them depend on flipping pixel intensities across multiple pixels, as in LSB-based methodologies and their combinations, or on transform coefficients, and the embedding often manifests as noise. Encoding and decoding are deterministic in most existing approaches and computationally heavy for larger models such as deep learning and generative AI approaches. The proposed method works on quinary pixel intensity combinations in RGB space, where five controlled pixel intensity variations in each of the R, G, and B channels yield up to 125 distinct pixel intensity combinations. These combinations are mapped to textual symbols, enabling the representation of uppercase and lowercase alphabetic characters, numeric digits, whitespace, and commonly used special characters. Metrics such as MSE, MAE, SNR, PSNR, SSIM, histogram comparison, and heatmap analysis were evaluated for both original and encoded images, showing no significant distortion in the images. Furthermore, the method achieves improved embedding efficiency by encoding a complete textual symbol within a single RGB pixel, in contrast to LSB- and MSB-based approaches that typically require multiple pixels or multi-step processes, as well as transform- and learning-based methods that incur higher computational overhead.
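The mapping from symbols to quinary RGB combinations can be sketched as follows: a symbol's index is written in base 5, and each digit selects one of five intensity levels per channel (5^3 = 125 combinations). The level values and the alphabet below are illustrative choices, not the paper's exact tables:

```python
import string

LEVELS = [0, 51, 102, 153, 204]      # five assumed controlled intensities per channel
ALPHABET = string.ascii_letters + string.digits + " .,;:!?'\"()-_@#$%&*+=/"  # <= 125 symbols

def encode_symbol(ch):
    """Map one symbol to a single (R, G, B) pixel via its base-5 digits."""
    n = ALPHABET.index(ch)           # 0 .. len(ALPHABET) - 1 < 125
    r, g, b = n // 25, (n // 5) % 5, n % 5
    return (LEVELS[r], LEVELS[g], LEVELS[b])

def decode_symbol(rgb):
    """Snap each channel to the nearest level and rebuild the base-5 index."""
    digits = [min(range(5), key=lambda i: abs(LEVELS[i] - c)) for c in rgb]
    return ALPHABET[digits[0] * 25 + digits[1] * 5 + digits[2]]

msg = "Hi 42!"
pixels = [encode_symbol(c) for c in msg]
assert "".join(decode_symbol(p) for p in pixels) == msg
print(pixels[:2])
```

Nearest-level decoding also illustrates why the scheme tolerates small intensity perturbations: a channel value can drift by up to half the level spacing before a digit is misread.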
https://arxiv.org/abs/2601.04302
Spatial prediction of reservoir parameters, especially permeability, is crucial for oil and gas exploration and development. However, the wide range and high variability of permeability prevent existing methods from providing reliable predictions. For the first time in subsurface spatial prediction, this study presents a quantum-enhanced long short-term memory with attention (QLSTMA) model that incorporates variational quantum circuits (VQCs) into the recurrent cell. Using quantum entanglement and superposition principles, the QLSTMA significantly improves the ability to predict complex geological parameters such as permeability. Two quantum circuit structures, QLSTMA with Shared Gates (QLSTMA-SG) and QLSTMA with Independent Gates (QLSTMA-IG), are designed to investigate and evaluate the effects of quantum structure configurations and the number of qubits on model performance. Experimental results demonstrate that the 8-qubit QLSTMA-IG model significantly outperforms the traditional long short-term memory with attention (LSTMA), reducing Mean Absolute Error (MAE) by 19% and Root Mean Squared Error (RMSE) by 20%, with particularly strong performance in regions featuring complex well-logging data. These findings validate the potential of quantum-classical hybrid neural networks for reservoir prediction, indicating that increasing the number of qubits yields further accuracy gains despite the reliance on classical simulations. This study establishes a foundational framework for the eventual deployment of such models on real quantum hardware and their extension to broader applications in petroleum engineering and geoscience.
https://arxiv.org/abs/2601.02818
Rapid motorization in emerging economies such as India has created severe enforcement asymmetries, with over 11 million recorded violations in 2023 against a human policing density of roughly one officer per 4000 vehicles. Traditional surveillance and manual ticketing cannot scale to this magnitude, motivating the need for an autonomous, cooperative, and energy-efficient edge AI perception infrastructure. This paper presents a real-time roadside perception node for multi-class traffic violation analytics and safety event dissemination within a connected and intelligent vehicle ecosystem. The node integrates YOLOv8 Nano for high-accuracy multi-object detection, DeepSORT for temporally consistent vehicle tracking, and a rule-guided OCR post-processing engine capable of recognizing degraded or multilingual license plates compliant with MoRTH AIS 159 and ISO 7591 visual contrast standards. Deployed on an NVIDIA Jetson Nano with a 128-core Maxwell GPU and optimized via TensorRT FP16 quantization, the system sustains 28 to 30 frames per second inference at 9.6 W, achieving 97.7 percent violation detection accuracy and 84.9 percent OCR precision across five violation classes, namely signal jumping, zebra crossing breach, wrong-way driving, illegal U-turn, and speeding, without manual region-of-interest calibration. Comparative benchmarking against YOLOv4-Tiny, PP-YOLOE-S, and NanoDet-Plus demonstrates a 10.7 percent mean average precision gain and a 1.4 times accuracy-per-watt improvement. Beyond enforcement, the node publishes standardized safety events of CAM and DENM type to connected vehicles and intelligent transportation system backends via V2X protocols, demonstrating that roadside edge AI analytics can augment cooperative perception and proactive road safety management within the IEEE Intelligent Vehicles ecosystem.
https://arxiv.org/abs/2601.07845
In this study, we evaluated four binarization methods: Locality-Sensitive Hashing (LSH), Iterative Quantization (ITQ), Kernel-based Supervised Hashing (KSH), and Supervised Discrete Hashing (SDH), on the ODIR dataset using deep feature embeddings. Experimental results show that SDH achieved the best performance, with an mAP@100 of 0.9184 using only 32-bit codes, outperforming LSH, ITQ, and KSH. Compared with prior studies, our method proved highly competitive: Fang et al. reported 0.7528 (Fundus-iSee, 48 bits) and 0.8856 (ASOCT-Cataract, 48 bits), while Wijesinghe et al. achieved 94.01 (KVASIR, 256 bits). Despite using significantly fewer bits, our SDH-based framework reached retrieval accuracy close to the state of the art. These findings demonstrate that SDH is the most effective approach among those tested, offering a practical balance of accuracy, storage, and efficiency for medical image retrieval and device inventory management.
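Of the four methods, random-hyperplane LSH is the simplest to sketch; this illustrative numpy snippet shows how deep feature embeddings become compact binary codes compared by Hamming distance (ITQ, KSH, and SDH additionally learn their projections or codes from data):

```python
import numpy as np

def lsh_hash(X, n_bits=32, seed=0):
    """Random-hyperplane LSH: the sign of each random projection gives one bit.

    X : (N, d) deep feature embeddings -> (N, n_bits) binary codes.
    """
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(X.shape[1], n_bits))
    return (X @ planes > 0).astype(np.uint8)

def hamming(a, b):
    """Hamming distance between two binary codes."""
    return int(np.sum(a != b))

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 128))
codes = lsh_hash(X, n_bits=32)
# A slightly perturbed embedding keeps (nearly) the same code,
# while an unrelated embedding differs on ~half the bits.
near = lsh_hash(X + 0.01 * rng.normal(size=X.shape), n_bits=32)
print(hamming(codes[0], near[0]), hamming(codes[0], codes[1]))
```

The appeal for retrieval and inventory use cases is exactly what the study measures: 32-bit codes cost 4 bytes per image, and Hamming distance is a cheap XOR-and-popcount operation.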
https://arxiv.org/abs/2601.02564
Large Language Models (LLMs) have demonstrated exceptional code generation capabilities, yet their token-level mechanisms remain underexplored, particularly in compressed models. Through systematic analysis of programming language token representations, we characterize how programming languages are encoded in LLM tokenizers by analyzing their vocabulary distribution and keyword coverage patterns. We introduce a novel cold-start probability analysis method that provides insights into model behavior without requiring explicit prompts. Additionally, we present a comprehensive evaluation of how different model optimization techniques - including quantization, distillation, model scaling, and task-specific fine-tuning - affect token-level representations and code generation quality. Our experiments, supported by comprehensive probability distribution analysis and evaluation metrics, reveal critical insights into token-level behavior and provide empirically-validated guidelines for maintaining code generation quality under various optimization constraints. These findings advance both theoretical understanding of LLM code generation and practical implementation of optimized models in production environments.
https://arxiv.org/abs/2601.02563
Running Automatic Speech Recognition (ASR) models on memory-constrained edge devices requires efficient compression. While layer-wise post-training quantization is effective, it suffers from error accumulation, especially in encoder-decoder architectures. Existing solutions like Quantization Error Propagation (QEP) are suboptimal for ASR due to the model's heterogeneity, processing acoustic features in the encoder while generating text in the decoder. To address this, we propose Fine-grained Alpha for Dynamic Quantization Error Propagation (FADE), which adaptively controls the trade-off between cross-layer error correction and local quantization. Experiments show that FADE significantly improves stability by reducing performance variance across runs, while simultaneously surpassing baselines in mean WER.
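A toy numpy sketch of layer-wise PTQ with a tunable error propagation strength, i.e. the trade-off FADE adaptively controls: feeding each layer the quantized-path activation propagates (and lets later layers compensate) accumulated error, while feeding it the clean activation keeps quantization local. The blend rule, the tanh layers, and the alpha values are illustrative assumptions, not the paper's actual mechanism:

```python
import numpy as np

def quantize(W, n_bits=4):
    """Uniform symmetric round-to-nearest weight quantization."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return np.round(W / scale).clip(-qmax - 1, qmax) * scale

def layerwise_quant(weights, x, alpha):
    """Run a full-precision and a quantized path through a toy MLP.

    alpha[i] = 1 feeds layer i the quantized activation (full error
    propagation, as in QEP); alpha[i] = 0 feeds it the clean full-precision
    activation (purely local quantization). FADE's per-layer, fine-grained
    alpha sits between these extremes.
    """
    h_fp, h_q = x, x
    for W, a in zip(weights, alpha):
        h_in = a * h_q + (1 - a) * h_fp        # blended input for the quantized path
        h_fp = np.tanh(h_fp @ W)
        h_q = np.tanh(h_in @ quantize(W))
    return h_fp, h_q

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.3, size=(16, 16)) for _ in range(4)]
x = rng.normal(size=(8, 16))
h_fp, h_q = layerwise_quant(weights, x, alpha=[0.5] * 4)
print(float(np.mean((h_fp - h_q) ** 2)))       # gap between the two paths
```

In an encoder-decoder ASR model, the abstract's point is that one global alpha is a poor fit: the acoustic encoder and text decoder plausibly want different propagation strengths, hence the fine-grained schedule.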
https://arxiv.org/abs/2601.02455
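The trade-off FADE tunes can be illustrated with a minimal layer-wise PTQ loop over linear layers. Here `alpha` blends the error-carrying quantized-path activations with the clean full-precision ones when forming each layer's calibration input; the uniform quantizer and the fixed per-layer schedule are simplifying assumptions, not the paper's method:

```python
import numpy as np

def quantize(w, n_bits=8):
    """Uniform symmetric fake-quantization of a weight matrix."""
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    return np.round(w / scale) * scale

def layerwise_ptq(weights, x, alphas):
    """Quantize a stack of linear layers one at a time.

    alpha = 1.0: calibrate on quantized-path activations, letting each
    layer see (and potentially compensate) upstream error, as in QEP.
    alpha = 0.0: calibrate on clean activations (purely local PTQ).
    FADE's contribution is choosing this alpha per layer adaptively.
    """
    x_fp, x_q = x, x
    q_weights = []
    for w, a in zip(weights, alphas):
        x_cal = a * x_q + (1.0 - a) * x_fp  # blended calibration input
        wq = quantize(w)                    # real methods fit wq against x_cal
        q_weights.append(wq)
        x_fp = x_fp @ w                     # clean full-precision path
        x_q = x_cal @ wq                    # error-propagating quantized path
    return q_weights
```

In a full implementation the quantizer would be optimized to minimize the output error on `x_cal`; the sketch only shows where the alpha-controlled blending enters the pipeline.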
In Large Language Models (LLMs), the number of parameters has grown exponentially in the past few years, e.g., from 1.5 billion in GPT-2 to 175 billion in GPT-3 and possibly more than a trillion in later versions. This poses a significant challenge for deployment, especially on edge devices. Unlike cloud computing, edge devices have very limited memory and processing power, which necessitates novel ideas to make such applications feasible. In this work, we investigate compressing weights with a special quantization that restricts values to powers of two (PoT). This saves a large amount of memory, since only exponents need to be stored; more importantly, it significantly reduces processing cost by replacing costly multiplications with low-cost bit shifts. To overcome the performance loss caused by this strict quantization, we apply Quantization-Aware Training (QAT) to recover performance through additional training. Results on GPT-2 124M show a major improvement for the quantized PoT model after additional training: a 66% perplexity improvement and only a 1% BERT-Score loss relative to the baseline GPT-2. The memory saving is estimated at 87.5%, while inference is expected to be 3-10x faster with PoT quantization versus full precision.
https://arxiv.org/abs/2601.02298
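A minimal sketch of the power-of-two idea: round each weight to the nearest power of two in the log domain, so only a sign and a small integer exponent need storing, and multiplication by the weight collapses to a bit shift. The exponent range and rounding rule below are illustrative assumptions:

```python
import numpy as np

def pot_quantize(w, min_exp=-8, max_exp=0):
    """Round weights to the nearest signed power of two (log2 domain).

    Storage: one sign bit plus a 4-bit exponent instead of a 32-bit float.
    Compute: y = x * 2**e becomes an integer bit shift (x << e or x >> -e).
    """
    sign = np.where(w < 0, -1.0, 1.0)
    mag = np.abs(w)
    safe = np.where(mag > 0, mag, 2.0 ** min_exp)  # avoid log2(0)
    exps = np.clip(np.round(np.log2(safe)), min_exp, max_exp).astype(int)
    return sign * np.power(2.0, exps), exps

wq, exps = pot_quantize(np.array([0.3, -0.6, 0.13]))
# 0.3 -> 0.25 (e=-2), -0.6 -> -0.5 (e=-1), 0.13 -> 0.125 (e=-3)
```

For an integer activation, multiplying by the quantized weight 0.25 is then just a right shift by two: `x * 0.25 == x >> 2`.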
The deployment of transformer-based models on resource-constrained edge devices represents a critical challenge in enabling real-time artificial intelligence applications. This comprehensive survey examines lightweight transformer architectures specifically designed for edge deployment, analyzing recent advances in model compression, quantization, pruning, and knowledge distillation techniques. We systematically review prominent lightweight variants including MobileBERT, TinyBERT, DistilBERT, EfficientFormer, EdgeFormer, and MobileViT, providing detailed performance benchmarks on standard datasets such as GLUE, SQuAD, ImageNet-1K, and COCO. Our analysis encompasses current industry adoption patterns across major hardware platforms (NVIDIA Jetson, Qualcomm Snapdragon, Apple Neural Engine, ARM architectures), deployment frameworks (TensorFlow Lite, ONNX Runtime, PyTorch Mobile, CoreML), and optimization strategies. Experimental results demonstrate that modern lightweight transformers can achieve 75-96% of full-model accuracy while reducing model size by 4-10x and inference latency by 3-9x, enabling deployment on devices with as little as 2-5W power consumption. We identify sparse attention mechanisms, mixed-precision quantization (INT8/FP16), and hardware-aware neural architecture search as the most effective optimization strategies. Novel findings include a memory-bandwidth bottleneck analysis revealing that models with 15-40M parameters achieve optimal hardware utilization (60-75% efficiency), quantization sweet spots for different model types, and comprehensive energy-efficiency profiling across edge platforms. We establish real-time performance boundaries and provide a practical 6-step deployment pipeline achieving 8-12x size reduction with less than 2% accuracy degradation.
https://arxiv.org/abs/2601.03290
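The INT8 side of the mixed-precision strategy this survey highlights can be made concrete with symmetric per-tensor quantization, the simplest scheme in that family: a 4x size reduction over FP32 with reconstruction error bounded by half a quantization step. This is a generic sketch, not any particular framework's implementation:

```python
import numpy as np

def int8_quantize(w):
    """Symmetric per-tensor INT8 quantization of an FP32 tensor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_dequantize(q, scale):
    """Map INT8 codes back to approximate FP32 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = int8_quantize(w)
size_ratio = w.nbytes / q.nbytes                       # FP32 -> INT8: 4x smaller
max_err = np.max(np.abs(int8_dequantize(q, scale) - w))  # <= scale / 2
```

Mixed-precision deployments keep quantization-sensitive layers (often embeddings or the final head) in FP16 and apply this INT8 scheme to the rest.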
Co-speech gesture generation is a critical area of research aimed at synthesizing speech-synchronized, human-like gestures. Existing methods often suffer from rhythmic inconsistency, motion jitter, foot sliding, and limited multi-sampling diversity. In this paper, we present SmoothSync, a novel framework that leverages quantized audio tokens in a dual-stream Diffusion Transformer (DiT) architecture to synthesize holistic gestures and enhance sampling variation. Specifically, we (1) fuse audio-motion features via complementary transformer streams to achieve superior synchronization, (2) introduce a jitter-suppression loss to improve temporal smoothness, and (3) implement probabilistic audio quantization to generate distinct gesture sequences from identical inputs. To reliably evaluate beat synchronization under jitter, we introduce Smooth-BC, a robust variant of the beat consistency metric that is less sensitive to motion noise. Comprehensive experiments on the BEAT2 and SHOW datasets demonstrate SmoothSync's superiority over state-of-the-art methods: on BEAT2 it reduces FGD by 30.6% and improves Smooth-BC by 10.3% and Diversity by 8.4%, while cutting jitter and foot sliding by 62.9% and 17.1%, respectively. The code will be released to facilitate future research.
https://arxiv.org/abs/2601.04236
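The probabilistic audio quantization in point (3) can be sketched as sampling codebook tokens from a distance-based softmax instead of taking the nearest code, so identical audio features can tokenize differently across draws. The codebook, temperature, and sampling rule below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def probabilistic_quantize(feats, codebook, temperature=1.0, rng=None):
    """Sample one codebook index per feature vector.

    temperature -> 0 recovers deterministic nearest-neighbor VQ;
    larger temperatures yield more diverse token sequences from the
    same input, which is the source of multi-sampling variation.
    """
    rng = rng or np.random.default_rng()
    # Pairwise L2 distances, shape (n_feats, n_codes).
    d = np.linalg.norm(feats[:, None, :] - codebook[None, :, :], axis=-1)
    logits = -d / temperature
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(codebook), p=row) for row in p])

codebook = np.eye(3)  # three toy codes
# Near-zero temperature: features equal to codes map to their own index.
tokens = probabilistic_quantize(codebook.copy(), codebook, temperature=1e-3)
```

At a higher temperature, repeated calls on the same features draw different token sequences, mirroring how the framework obtains distinct gestures from identical audio.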