Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents the first empirical study of temporal tokenization for event sequences, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, with log-based strategies excelling on skewed distributions and human-centric formats proving robust for mixed modalities.
https://arxiv.org/abs/2512.13618
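Two of the strategies compared above are easy to make concrete. A minimal sketch (not the paper's implementation; the bin range and count are illustrative assumptions) showing why log-scale bins suit skewed inter-event gaps where uniform bins collapse them:

```python
import math

def uniform_bin(dt, lo, hi, n_bins):
    """Classic uniform binning: equal-width bins over [lo, hi]."""
    t = min(max(dt, lo), hi)
    idx = int((t - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)

def log_bin(dt, lo, hi, n_bins):
    """Log-scale binning: equal-width bins in log-space,
    suited to skewed (e.g. log-normal) gap distributions."""
    t = min(max(dt, lo), hi)
    idx = int((math.log(t) - math.log(lo)) / (math.log(hi) - math.log(lo)) * n_bins)
    return min(idx, n_bins - 1)
```

Over a [1, 10000] range with 16 bins, gaps of 1 and 100 land in the same uniform bin but in different log bins, which mirrors the finding that log-based strategies excel on skewed distributions.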
We explore efficient strategies to fine-tune decoder-only Large Language Models (LLMs) for downstream text classification under resource constraints. Two approaches are investigated: (1) attaching a classification head to a pre-trained causal LLM and fine-tuning on the task (using the LLM's final token embedding as a sequence representation), and (2) instruction-tuning the LLM in a prompt->response format for classification. To enable single-GPU fine-tuning of models up to 8B parameters, we combine 4-bit model quantization with Low-Rank Adaptation (LoRA) for parameter-efficient training. Experiments on two datasets - a proprietary single-label dataset and the public WIPO-Alpha patent dataset (extreme multi-label classification) - show that the embedding-based method significantly outperforms the instruction-tuned method in F1-score, and is very competitive with - even surpassing - fine-tuned domain-specific models (e.g. BERT) on the same tasks. These results demonstrate that directly leveraging the internal representations of causal LLMs, along with efficient fine-tuning techniques, yields impressive classification performance under limited computational resources. We discuss the advantages of each approach while outlining practical guidelines and future directions for optimizing LLM fine-tuning in classification scenarios.
https://arxiv.org/abs/2512.12677
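The embedding-based approach above reduces, at its core, to reading off the final token's hidden state and applying a linear head. A schematic sketch in plain Python with toy dimensions (in practice the hidden states would come from the last layer of the quantized, LoRA-adapted causal LLM):

```python
def last_token_embedding(hidden_states):
    """Sequence representation: the hidden state of the final token,
    which in a causal LLM has attended to the entire input."""
    return hidden_states[-1]

def classify(hidden_states, W, b):
    """Linear classification head on the final-token embedding; returns logits."""
    h = last_token_embedding(hidden_states)
    return [sum(wi * hi for wi, hi in zip(row, h)) + bi for row, bi in zip(W, b)]
```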
Instance-level image retrieval aims to find images containing the same object as a given query, despite variations in size, position, or appearance. To address this challenging task, we propose Patchify, a simple yet effective patch-wise retrieval framework that offers high performance, scalability, and interpretability without requiring fine-tuning. Patchify divides each database image into a small number of structured patches and performs retrieval by comparing these local features with a global query descriptor, enabling accurate and spatially grounded matching. To assess not just retrieval accuracy but also spatial correctness, we introduce LocScore, a localization-aware metric that quantifies whether the retrieved region aligns with the target object. This makes LocScore a valuable diagnostic tool for understanding and improving retrieval behavior. We conduct extensive experiments across multiple benchmarks, backbones, and region selection strategies, showing that Patchify outperforms global methods and complements state-of-the-art reranking pipelines. Furthermore, we apply Product Quantization for efficient large-scale retrieval and highlight the importance of using informative features during compression, which significantly boosts performance. Project website: this https URL
https://arxiv.org/abs/2512.12610
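A toy sketch of the patch-wise matching idea (the feature layout is hypothetical; real descriptors would come from a vision backbone): each database patch is pooled into a local descriptor and scored against the global query descriptor, so the argmax both ranks the image and localizes the match, which is what LocScore then checks against the target object:

```python
def mean_pool(vectors):
    """Pool a patch's local features into a single descriptor."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def best_patch(patch_features, query_descriptor):
    """Score every database patch against the global query descriptor;
    the argmax ranks the image and localizes the matched region."""
    scores = [cosine(mean_pool(p), query_descriptor) for p in patch_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```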
Domain adaptive retrieval aims to transfer knowledge from a labeled source domain to an unlabeled target domain, enabling effective retrieval while mitigating domain discrepancies. However, existing methods encounter several fundamental limitations: 1) neglecting class-level semantic alignment and excessively pursuing pair-wise sample alignment; 2) lacking either pseudo-label reliability consideration or geometric guidance for assessing label correctness; 3) directly quantizing original features affected by domain shift, undermining the quality of learned hash codes. In view of these limitations, we propose Prototype-Based Semantic Consistency Alignment (PSCA), a two-stage framework for effective domain adaptive retrieval. In the first stage, a set of orthogonal prototypes directly establishes class-level semantic connections, maximizing inter-class separability while gathering intra-class samples. During the prototype learning, geometric proximity provides a reliability indicator for semantic consistency alignment through adaptive weighting of pseudo-label confidences. The resulting membership matrix and prototypes facilitate feature reconstruction, ensuring quantization on reconstructed rather than original features, thereby improving subsequent hash coding quality and seamlessly connecting both stages. In the second stage, domain-specific quantization functions process the reconstructed features under mutual approximation constraints, generating unified binary hash codes across domains. Extensive experiments validate PSCA's superior performance across multiple datasets.
https://arxiv.org/abs/2512.04524
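The first-stage mechanics can be sketched minimally (dot-product similarity and a softmax temperature are illustrative assumptions, not the paper's exact formulation): geometric proximity to prototypes yields soft pseudo-label confidences, and quantization then operates on the membership-weighted reconstruction rather than the raw, domain-shifted feature:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def memberships(feature, prototypes, temperature=1.0):
    """Proximity to each class prototype -> soft pseudo-label confidence."""
    sims = [sum(f * p for f, p in zip(feature, proto)) for proto in prototypes]
    return softmax([s / temperature for s in sims])

def reconstruct(feature, prototypes, temperature=1.0):
    """Membership-weighted sum of prototypes; hashing quantizes this
    reconstruction instead of the original feature."""
    w = memberships(feature, prototypes, temperature)
    dim = len(feature)
    return [sum(w[k] * prototypes[k][i] for k in range(len(prototypes)))
            for i in range(dim)]
```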
Domain models are central to software engineering, as they enable a shared understanding, guide implementation, and support automated analyses and model-driven development. Yet, despite these benefits, practitioners often skip modeling because it is time-consuming and demands scarce expertise. We address this barrier by investigating whether open-weight large language models, adapted via instruction tuning, can generate high-quality BPMN process models directly from natural language descriptions in a cost-effective and privacy-preserving way. We introduce InstruBPM, a reproducible approach that prepares paired text-diagram data and instruction tunes an open source large language model with parameter-efficient fine-tuning and quantization for on-prem deployment. We evaluate the tuned model through complementary perspectives: (i) text/code similarity using BLEU, ROUGE-L, and METEOR, (ii) structural fidelity using Relative Graph Edit Distance, (iii) guidelines conformance using external tool checks, and (iv) a small expert review. Using a curated subset of a multi-domain BPMN dataset, we compare the tuned model with untuned open-weight baselines and strong proprietary models under consistent prompting regimes. Our compact tuned model outperforms all baselines across sequence and structural metrics while requiring substantially fewer resources; guideline analysis and expert feedback further indicate that the generated diagrams largely follow BPMN best practices and are useful starting points that reduce modeling effort. Overall, instruction tuning improves structural accuracy and robustness compared to untuned baselines and reduces reliance on heavy prompt scaffolding. We publicly share the trained models and scripts to support reproducibility and further research.
https://arxiv.org/abs/2512.12063
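The abstract does not spell out how Relative Graph Edit Distance is computed, but a crude unit-cost approximation over labeled nodes and edges conveys the idea of normalizing edit operations by graph size (an assumption for illustration, not the paper's exact metric):

```python
def relative_ged(nodes_a, edges_a, nodes_b, edges_b):
    """Unit-cost approximation: edit operations ~= symmetric difference of
    labeled nodes and edges, normalized by the larger graph's size."""
    node_ops = len(set(nodes_a) ^ set(nodes_b))
    edge_ops = len(set(edges_a) ^ set(edges_b))
    size_a = len(nodes_a) + len(edges_a)
    size_b = len(nodes_b) + len(edges_b)
    return (node_ops + edge_ops) / max(size_a, size_b)
```

A score of 0 means structurally identical BPMN graphs; larger values indicate more insertions/deletions relative to the bigger model.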
Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in robotic manipulation, enabling robots to execute natural language commands through end-to-end learning from visual observations. However, deploying large-scale VLA models on affordable robotic platforms remains challenging due to computational constraints and the need for efficient adaptation to new robot embodiments. This paper presents an efficient fine-tuning methodology and real-world deployment analysis for adapting VLA models to low-cost robotic manipulation platforms. We propose a resource-efficient fine-tuning strategy using Low-Rank Adaptation (LoRA) and quantization techniques that enables multi-billion parameter VLA models (3.1B parameters) to run on consumer-grade GPUs with 8GB VRAM. Our methodology addresses the critical challenge of adapting pre-trained VLA models to new robot embodiments with limited demonstration data, focusing on the trade-offs between frozen and unfrozen vision encoders. Through real-world deployment on the SO101 robotic arm for a button-pressing manipulation task, we demonstrate that our approach achieves effective manipulation performance while maintaining computational efficiency. We provide a detailed analysis of deployment challenges, failure modes, and the relationship between training data quantity and real-world performance for a model trained on 200 demonstration episodes. Our results show that with a proper fine-tuning methodology, VLA models can be successfully deployed on affordable robotic platforms, making advanced manipulation capabilities accessible beyond expensive research robots.
https://arxiv.org/abs/2512.11921
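The LoRA side of the recipe is compact enough to sketch: the frozen base weight is augmented by a scaled low-rank product, which can be merged back for deployment. Toy matrices in plain Python (a sketch of the standard LoRA update, not this paper's training code):

```python
def lora_delta(A, B, alpha, r):
    """Low-rank update DeltaW = (alpha / r) * B @ A,
    with B of shape (d_out x r) and A of shape (r x d_in)."""
    scale = alpha / r
    d_out, d_in = len(B), len(A[0])
    return [[scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(d_in)] for i in range(d_out)]

def merge(W, A, B, alpha, r):
    """Merge the adapter into the frozen base weight: W' = W + DeltaW."""
    D = lora_delta(A, B, alpha, r)
    return [[W[i][j] + D[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]
```

Only A and B (rank r) are trained, which is what keeps the 3.1B-parameter model within an 8GB VRAM budget when combined with weight quantization.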
Existing frameworks for learned video compression suffer from a dilemma between inaccurate temporal alignment and error propagation for motion estimation and compensation (ME/MC). The separate-transform framework employs distinct transforms for intra-frame and inter-frame compression to yield impressive rate-distortion (R-D) performance but causes evident error propagation, while the unified-transform framework eliminates error propagation via shared transforms but is inferior in ME/MC in shared latent domains. To address this limitation, in this paper, we propose a novel unified-transform framework with dual-domain progressive temporal alignment and quality-conditioned mixture-of-experts (QCMoE) to enable quality-consistent and error-propagation-free streaming for learned video compression. Specifically, we propose dual-domain progressive temporal alignment for ME/MC that leverages coarse pixel-domain alignment and refined latent-domain alignment to significantly enhance temporal context modeling in a coarse-to-fine fashion. The coarse pixel-domain alignment efficiently handles simple motion patterns with optical flow estimated from a single reference frame, while the refined latent-domain alignment develops a Flow-Guided Deformable Transformer (FGDT) over latents from multiple reference frames to achieve long-term motion refinement (LTMR) for complex motion patterns. Furthermore, we design a QCMoE module for continuous bit-rate adaptation that dynamically assigns different experts to adjust quantization steps per pixel based on target quality and content, rather than relying on a single quantization step. QCMoE allows continuous and consistent rate control with appealing R-D performance. Experimental results show that the proposed method achieves competitive R-D performance compared with the state of the art, while successfully eliminating error propagation.
https://arxiv.org/abs/2512.10450
Conventional Sequential Recommender Systems (SRS) typically assign unique Hash IDs (HID) to construct item embeddings. These HID embeddings effectively learn collaborative information from historical user-item interactions, making them vulnerable to situations where most items are rarely consumed (the long-tail problem). Recent methods that incorporate auxiliary information often suffer from noisy collaborative sharing caused by co-occurrence signals or semantic homogeneity caused by flat dense embeddings. Semantic IDs (SIDs), with their capability of code sharing and multi-granular semantic modeling, provide a promising alternative. However, the collaborative overwhelming phenomenon hinders the further development of SID-based methods. The quantization mechanisms commonly compromise the uniqueness of identifiers required for modeling head items, creating a performance seesaw between head and tail items. To address this dilemma, we propose \textbf{\name}, a novel framework that harmonizes the SID and HID. Specifically, we devise a dual-branch modeling architecture that enables the model to capture both the multi-granular semantics within SID while preserving the unique collaborative identity of HID. Furthermore, we introduce a dual-level alignment strategy that bridges the two representations, facilitating knowledge transfer and supporting robust preference modeling. Extensive experiments on three real-world datasets show that \name~ effectively balances recommendation quality for both head and tail items while surpassing the existing baselines. The implementation code can be found online\footnote{this https URL}.
https://arxiv.org/abs/2512.10388
Physics-Informed Neural Networks (PINNs) have emerged as a promising paradigm for solving partial differential equations (PDEs) by embedding physical laws into neural network training objectives. However, their deployment on resource-constrained platforms is hindered by substantial computational and memory overhead, primarily stemming from higher-order automatic differentiation, intensive tensor operations, and reliance on full-precision arithmetic. To address these challenges, we present a framework that enables scalable and energy-efficient PINN training on edge devices. This framework integrates fully quantized training, Stein's estimator (SE)-based residual loss computation, and tensor-train (TT) decomposition for weight compression. It contributes three key innovations: (1) a mixed-precision training method that uses a square-block MX (SMX) format to eliminate data duplication during backpropagation; (2) a difference-based quantization scheme for the Stein's estimator that mitigates underflow; and (3) a partial-reconstruction scheme (PRS) for TT-Layers that reduces quantization-error accumulation. We further design PINTA, a precision-scalable hardware accelerator, to fully exploit the performance of the framework. Experiments on the 2-D Poisson, 20-D Hamilton-Jacobi-Bellman (HJB), and 100-D Heat equations demonstrate that the proposed framework achieves accuracy comparable to or better than full-precision, uncompressed baselines while delivering 5.5x to 83.5x speedups and 159.6x to 2324.1x energy savings. This work enables real-time PDE solving on edge devices and paves the way for energy-efficient scientific computing at scale.
https://arxiv.org/abs/2512.09202
Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations, making them computationally heavy and difficult to deploy efficiently. We introduce a progressive reparameterization strategy that compresses these models by gradually replacing dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers. A residual interpolation schedule, together with lightweight reconstruction and knowledge distillation losses, ensures that the PHM modules inherit the functional behavior of their dense counterparts during training. This transition yields substantial parameter and FLOP reductions while preserving strong multimodal alignment, enabling faster inference without degrading output quality. We evaluate the approach on multiple vision-language models (VLMs). Our method maintains performance comparable to the base models while delivering significant reductions in model size and inference latency. Progressive PHM substitution thus offers an architecture-compatible path toward more efficient multimodal reasoning and complements existing low-bit quantization techniques.
https://arxiv.org/abs/2512.08524
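A PHM layer's parameter saving comes from building the weight matrix as a sum of Kronecker products: with n product terms, the dense weight never needs to be stored, only the small factors. A minimal sketch with toy matrices (nested lists stand in for tensors; this illustrates the standard PHM construction, not the paper's training schedule):

```python
def kron(A, B):
    """Kronecker product of two matrices given as nested lists."""
    p, q = len(B), len(B[0])
    return [
        [A[i][j] * B[k][l] for j in range(len(A[0])) for l in range(q)]
        for i in range(len(A)) for k in range(p)
    ]

def phm_weight(As, Ss):
    """PHM weight: W = sum_i kron(A_i, S_i). Storing the factors instead of
    the dense W cuts parameters roughly n-fold for n hypercomplex rules."""
    terms = [kron(A, S) for A, S in zip(As, Ss)]
    rows, cols = len(terms[0]), len(terms[0][0])
    return [[sum(t[r][c] for t in terms) for c in range(cols)]
            for r in range(rows)]
```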
Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens into LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures. We introduce HTC-VLM, a hybrid framework that disentangles semantics and appearance through dual channels, i.e., a continuous pathway for fine-grained details via ViT patches and a discrete pathway for symbolic anchors using MGVQ quantization projected to four tokens. These are fused into a 580-token hybrid sequence and compressed into a single voco token via a disentanglement attention mask and bottleneck, ensuring efficient and grounded representations. HTC-VLM achieves an average performance retention of 87.2 percent across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming the leading continuous baseline at 81.0 percent with a 580-to-1 compression ratio. Attention analyses show that the compressed token prioritizes the discrete anchor, validating its semantic guidance. Our work demonstrates that a minimalist hybrid design can resolve the efficiency-fidelity dilemma and advance scalable VLMs.
https://arxiv.org/abs/2512.08240
Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90\% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: this https URL
https://arxiv.org/abs/2512.07834
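The palette-constrained Gumbel-Softmax step can be sketched as follows (the Gumbel noise is passed in explicitly to keep the example deterministic; the temperature and palette are illustrative assumptions): a relaxed one-hot distribution over palette entries keeps the discrete color choice differentiable during training, and hardens to an argmax as the temperature drops:

```python
import math

def gumbel_softmax(logits, gumbel_noise, tau):
    """Relaxed one-hot over palette entries; as tau -> 0 this approaches
    argmax, so color selection stays differentiable during training."""
    scores = [(l + g) / tau for l, g in zip(logits, gumbel_noise)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def quantize_color(weights, palette):
    """Expected RGB color under the relaxed distribution
    (replaced by a hard argmax at inference time)."""
    return [sum(w * c[ch] for w, c in zip(weights, palette)) for ch in range(3)]
```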
Mixed-Precision Quantization (MPQ) liberates Deep Neural Networks (DNNs) from the Out-Of-Memory (OOM) bottleneck and has garnered increasing research attention. However, conventional methods either search via costly differentiable optimization, which is neither efficient nor flexible, or learn a quantized DNN from a proxy (e.g., HAWQ) manually designed by human experts, which is labor-intensive and requires extensive expert knowledge. Can we design a proxy without involving any human experts or training? In this paper, we provide an affirmative answer by proposing a novel Large Language Models (LLMs)-driven Training-free Automatic Proxy (dubbed TAP) discovery framework, which reforms the design paradigm of MPQ by using LLMs to automatically find a superior TAP tailored for MPQ. In addition, to bridge the gap between black-box LLMs and the tough MPQ task, we propose simple Direct Policy Optimization (DPO)-based reinforcement learning to enhance the LLM's reasoning by optimizing prompts, which constructs a positive feedback loop between the LLM and the MPQ task and enables the LLM to generate a better TAP in the next evolution. Extensive experiments on mainstream benchmarks demonstrate that TAP achieves state-of-the-art performance. Finally, we believe that TAP will contribute to the MPQ community by providing a new perspective on LLM-driven design algorithms.
https://arxiv.org/abs/2512.07419
Generative learned image compression methods using Vector Quantization (VQ) have recently shown impressive potential in balancing distortion and perceptual quality. However, these methods typically estimate the entropy of VQ indices using a static, global probability distribution, which fails to adapt to the specific content of each image. This non-adaptive approach leaves bitrate potential untapped and makes flexible rate control difficult. To address this challenge, we introduce a Controllable Generative Image Compression framework based on a VQ hyperprior, termed HVQ-CGIC. HVQ-CGIC rigorously derives the mathematical foundation for introducing a hyperprior into the entropy model of VQ indices. Building on this foundation with a novel loss design, the framework is, to our knowledge, the first to bring rate-distortion (R-D) balance and control to vector-quantization-based generative image compression. Cooperating with a lightweight hyperprior estimation network, HVQ-CGIC achieves a significant advantage in R-D performance compared to current state-of-the-art (SOTA) generative compression methods. On the Kodak dataset, we achieve the same LPIPS as Control-GIC, CDC and HiFiC with an average of 61.3% fewer bits. We posit that HVQ-CGIC has the potential to become a foundational component for VQGAN-based image compression, analogous to the integral role of the hyperprior framework in neural image compression.
https://arxiv.org/abs/2512.07192
We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage~1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage~2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5~Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.
https://arxiv.org/abs/2512.07168
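FSQ and the mixed-radix packing are simple to sketch (the per-dimension level counts here are illustrative, not the paper's configuration): each latent dimension is bounded with tanh and rounded independently to a small set of levels, and the per-dimension codes pack reversibly into a single token id:

```python
import math

def fsq(z, levels):
    """Finite Scalar Quantization: bound each latent dim with tanh, then
    round to one of levels[i] uniformly spaced values; returns integer codes."""
    codes = []
    for zi, L in zip(z, levels):
        half = (L - 1) / 2
        codes.append(int(round(math.tanh(zi) * half + half)))
    return codes

def pack(codes, levels):
    """Mixed-radix packing of per-dimension codes into one token id."""
    idx, base = 0, 1
    for c, L in zip(codes, levels):
        idx += c * base
        base *= L
    return idx

def unpack(idx, levels):
    """Inverse of pack: recover the per-dimension codes."""
    codes = []
    for L in levels:
        codes.append(idx % L)
        idx //= L
    return codes
```

Because pack/unpack is exactly invertible, the token stream stays reversible, which is what makes the representation codec-like and language-model friendly.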
Vector quantized variational autoencoder (VQ-VAE) is a discrete auto-encoder that compresses images into discrete tokens. It is difficult to train due to discretization. In this paper, we propose a simple yet effective technique, dubbed Gaussian Quant (GQ), that converts a Gaussian VAE with a certain constraint into a VQ-VAE without training. GQ generates random Gaussian noise as a codebook and finds the closest noise to the posterior mean. Theoretically, we prove that when the logarithm of the codebook size exceeds the bits-back coding rate of the Gaussian VAE, a small quantization error is guaranteed. Practically, we propose a heuristic to train Gaussian VAE for effective GQ, named target divergence constraint (TDC). Empirically, we show that GQ outperforms previous VQ-VAEs, such as VQGAN, FSQ, LFQ, and BSQ, on both UNet and ViT architectures. Furthermore, TDC also improves upon previous Gaussian VAE discretization methods, such as TokenBridge. The source code is provided in this https URL.
https://arxiv.org/abs/2512.06609
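The core of GQ fits in a few lines (toy dimensions; the TDC training heuristic is not shown): a pseudo-random Gaussian codebook, reproducible on the decoder side from its seed alone, is searched for the entry nearest the posterior mean:

```python
import random

def gaussian_quant(mu, codebook_size, dim, seed=0):
    """GQ sketch: draw a random Gaussian codebook (regenerable from the seed,
    so it never needs to be trained or transmitted) and encode the posterior
    mean as the index of its nearest codebook entry."""
    rng = random.Random(seed)
    codebook = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                for _ in range(codebook_size)]

    def dist2(c):
        return sum((ci - mi) ** 2 for ci, mi in zip(c, mu))

    best = min(range(codebook_size), key=lambda k: dist2(codebook[k]))
    return best, codebook[best]
```

The theory quoted above says this quantization error stays small once log(codebook_size) exceeds the VAE's bits-back coding rate.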
Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. Combined with lookup table (LUT)-based inference, CPUs run these ultra-low-bit LLMs even faster than NPUs, opening new opportunities for ubiquitous on-device intelligence. However, this paper identifies that LUT-based inference underutilizes memory bandwidth during parallel inference, which is required for prefilling, test-time scaling, and other multi-token scenarios. The root cause is the scalar LUT paradigm, which performs repetitive and non-contiguous memory accesses for each token. To solve the issue, we propose vector LUT, a new lookup paradigm that constructs a unified LUT across parallel tokens, and performs a single $1 \rightarrow N$ lookup per index. To realize it efficiently, we further introduce (1) Vector LUT-Centric Tensor Layout, and (2) Cache-Aware Streamed Lookup techniques. Evaluations on 5 edge devices across 3 LLMs show that Vec-LUT outperforms state-of-the-art baselines by up to $4.2\times$. Our implementation is integrated into this http URL. The code is available at this https URL.
https://arxiv.org/abs/2512.06443
Diffusion Transformers (DiTs) have emerged as a highly scalable and effective backbone for image generation, outperforming U-Net architectures in both scalability and performance. However, their real-world deployment remains challenging due to high computational and memory demands. Mixed-Precision Quantization (MPQ), designed to push the limits of quantization, has demonstrated remarkable success in advancing U-Net quantization to sub-4bit settings while significantly reducing computational and memory overhead. Nevertheless, its application to DiT architectures remains limited and underexplored. In this work, we propose TreeQ, a unified framework addressing key challenges in DiT quantization. First, to tackle inefficient search and proxy misalignment, we introduce Tree Structured Search (TSS). This DiT-specific approach leverages the architecture's linear properties to traverse the solution space in O(n) time while improving objective accuracy through comparison-based pruning. Second, to unify optimization objectives, we propose Environmental Noise Guidance (ENG), which aligns Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) configurations using a single hyperparameter. Third, to mitigate information bottlenecks in ultra-low-bit regimes, we design the General Monarch Branch (GMB). This structured sparse branch prevents irreversible information loss, enabling finer detail generation. Through extensive experiments, our TreeQ framework demonstrates state-of-the-art performance on DiT-XL/2 under W3A3 and W4A4 PTQ/PEFT settings. Notably, our work is the first to achieve near-lossless 4-bit PTQ performance on DiT models. The code and models will be available at this https URL
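The abstract does not spell out TSS, but the flavor of a linear-time, comparison-based mixed-precision search can be illustrated with a toy per-layer bit-allocation pass. This is a hypothetical sketch, not TreeQ itself; the error tolerance and candidate bit-widths are assumptions:

```python
import numpy as np

def quant_error(w, bits):
    """Mean squared error of symmetric uniform quantization at `bits`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax) * scale
    return float(np.mean((w - q) ** 2))

def linear_bit_search(layers, candidates=(2, 3, 4), tol=0.1):
    """Single pass over the layers (O(n) in layer count); per layer,
    compare candidate bit-widths and keep the cheapest one whose
    quantization error stays under `tol` (comparison-based pruning)."""
    plan = []
    for w in layers:
        for b in candidates:             # stop at the first bit-width
            if quant_error(w, b) <= tol: # that is good enough
                plan.append(b)
                break
        else:
            plan.append(candidates[-1])  # fall back to the widest
    return plan

rng = np.random.default_rng(0)
layers = [rng.standard_normal(256) for _ in range(6)]
plan = linear_bit_search(layers)
```

Tightening the tolerance can only push per-layer bit-widths up, which is the monotone structure such comparison-based searches exploit.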
https://arxiv.org/abs/2512.06353
This work presents an independent reproducibility study of a lossy image compression technique that integrates singular value decomposition (SVD) and wavelet difference reduction (WDR). The original paper claims that combining SVD and WDR yields better visual quality and higher compression ratios than JPEG2000 and standalone WDR. I re-implemented the proposed method, carefully examined missing implementation details, and replicated the original experiments as closely as possible. I then conducted additional experiments on new images and evaluated performance using PSNR and SSIM. In contrast to the original claims, my results indicate that the SVD+WDR technique generally does not surpass JPEG2000 or WDR in terms of PSNR, and only partially improves SSIM relative to JPEG2000. The study highlights ambiguities in the original description (e.g., quantization and threshold initialization) and illustrates how such gaps can significantly impact reproducibility and reported performance.
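As a reference point, the first stage of the studied pipeline (a truncated-SVD low-rank approximation) and the PSNR metric used in the evaluation can be written compactly; the rank values and the synthetic test image below are illustrative, and the WDR residual coder is not shown:

```python
import numpy as np

def svd_rank_k(img, k):
    """Rank-k approximation via truncated SVD: the lossy SVD stage of
    the SVD+WDR pipeline (WDR coding of the residual is omitted)."""
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((ref - test) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
img = rng.uniform(0, 255, size=(32, 32))  # synthetic stand-in image
lo, hi = svd_rank_k(img, 4), svd_rank_k(img, 16)
```

By the Eckart-Young theorem, keeping more singular values can only reduce the approximation error, so PSNR rises monotonically with the retained rank.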
https://arxiv.org/abs/2512.06221
Diffusion models have demonstrated significant applications in the field of image generation. However, their high computational and memory costs pose challenges for deployment. Model quantization has emerged as a promising solution to reduce storage overhead and accelerate inference. Nevertheless, existing quantization methods for diffusion models struggle to mitigate outliers in activation matrices during inference, leading to substantial performance degradation under low-bit quantization scenarios. To address this, we propose HQ-DM, a novel Quantization-Aware Training framework that applies Single Hadamard Transformation to activation matrices. This approach effectively reduces activation outliers while preserving model performance under quantization. Compared to traditional Double Hadamard Transformation, our proposed scheme offers distinct advantages by seamlessly supporting INT convolution operations while preventing the amplification of weight outliers. For conditional generation on the ImageNet 256x256 dataset using the LDM-4 model, our W4A4 and W4A3 quantization schemes improve the Inception Score by 12.8% and 467.73%, respectively, over the existing state-of-the-art method.
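A minimal sketch of why a single Hadamard rotation helps: an orthogonal Hadamard transform spreads an activation outlier across all coordinates, so the uniform quantizer's scale is no longer dominated by one extreme value. The vector size, outlier magnitude, and 4-bit quantizer below are illustrative assumptions, not HQ-DM's exact setup:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def quantize(x, bits=4):
    """Symmetric per-tensor uniform quantization (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

n = 64
H = hadamard(n) / np.sqrt(n)   # orthonormal rotation
x = np.random.default_rng(0).standard_normal(n)
x[3] = 50.0                    # a single activation outlier

# Quantize in the rotated domain, then rotate back. Since H is orthogonal,
# the reconstruction error equals the quantization error in either domain,
# but the rotated signal has no dominant outlier setting the scale.
err_plain = np.linalg.norm(quantize(x) - x)
err_had = np.linalg.norm(H.T @ quantize(H @ x) - x)
```

With the outlier present, `err_had` comes out substantially smaller than `err_plain`, mirroring the paper's motivation for transforming activations before low-bit quantization.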
https://arxiv.org/abs/2512.05746