Training modern deep learning models is increasingly constrained by GPU memory and compute limits. While Randomized Numerical Linear Algebra (RandNLA) offers proven techniques to compress these models, the lack of a unified, production-grade library has prevented their widespread adoption. We present Panther, a PyTorch-compatible library that consolidates established RandNLA algorithms into a single high-performance framework. Panther provides efficient, drop-in replacements for standard components, including sketched linear layers, 2D convolution, multi-head attention, and randomized matrix decompositions (such as pivoted CholeskyQR). Through a custom C++/CUDA backend (pawX), Panther delivers an optimized implementation that runs on both CPUs and GPUs. We demonstrate the effectiveness of RandNLA techniques and Panther's ease of adoption: by replacing standard PyTorch linear layers with Panther layers (requiring only a few lines of code), we achieve significant memory savings (up to 75%) on BERT while maintaining comparable loss. Source code is available (MIT License) at this https URL, along with a demonstration video at this https URL.
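The memory arithmetic behind the 75% figure can be illustrated with a toy sketched linear layer: a fixed random sketch matrix S plus a small factor V replaces the dense weight. This is a minimal numpy sketch of the general RandNLA idea, not Panther's actual API or CUDA kernels; the dimensions and the initialization V = SᵀW are assumptions for illustration.

```python
import numpy as np

# Toy "sketched" linear layer: a fixed random sketch S plus a small factor V
# replaces the dense weight W, so y = (x @ S) @ V approximates x @ W.
rng = np.random.default_rng(0)
d_in, d_out, k = 1024, 1024, 128        # k is the sketch dimension

W = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)  # dense reference
S = rng.standard_normal((d_in, k)) / np.sqrt(k)         # fixed Gaussian sketch
V = S.T @ W                                             # compressed factor

x = rng.standard_normal((4, d_in))
y_sketched = (x @ S) @ V                # stands in for x @ W

savings = 1 - (S.size + V.size) / W.size
print(f"parameter savings: {savings:.0%}")  # parameter savings: 75%
```

With k = d/8, the two factors together hold a quarter of the dense parameter count, matching the 75% savings reported for BERT.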
https://arxiv.org/abs/2601.15473
Out-of-Distribution (OOD) detection is a critical task that has garnered significant attention. The emergence of CLIP has spurred extensive research into zero-shot OOD detection, often employing a training-free approach. Current methods leverage expert knowledge from large language models (LLMs) to identify potential outliers. However, these approaches tend to over-rely on knowledge in the text space, neglecting the inherent challenges involved in detecting out-of-distribution samples in the image space. In this paper, we propose a novel pipeline, MM-OOD, which leverages the multimodal reasoning capabilities of MLLMs and their ability to conduct multi-round conversations for enhanced outlier detection. Our method is designed to improve performance in both near OOD and far OOD tasks. Specifically, (1) for near OOD tasks, we directly feed ID images and corresponding text prompts into MLLMs to identify potential outliers; and (2) for far OOD tasks, we introduce the sketch-generate-elaborate framework: first, we sketch outlier exposure using text prompts, then generate corresponding visual OOD samples, and finally elaborate by using multimodal prompts. Experiments demonstrate that our method achieves significant improvements on widely used multimodal datasets such as Food-101, while also validating its scalability on ImageNet-1K.
https://arxiv.org/abs/2601.14052
Intent-Based Networking (IBN) allows operators to specify high-level network goals rather than low-level configurations. While recent work demonstrates that large language models can automate configuration tasks, a distinct class of intents requires generating optimization code to compute provably optimal solutions for traffic engineering, routing, and resource allocation. Current systems assume text-based intent expression, requiring operators to enumerate topologies and parameters in prose. Network practitioners naturally reason about structure through diagrams, yet whether Vision-Language Models (VLMs) can process annotated network sketches into correct optimization code remains unexplored. We present IntentOpt, a benchmark of 85 optimization problems across 17 categories, evaluating four VLMs (GPT-5-Mini, Claude-Haiku-4.5, Gemini-2.5-Flash, Llama-3.2-11B-Vision) under three prompting strategies on multimodal versus text-only inputs. Our evaluation shows that visual parameter extraction reduces execution success by 12-21 percentage points (pp), with GPT-5-Mini dropping from 93% to 72%. Program-of-thought prompting decreases performance by up to 13 pp, and open-source models lag behind closed-source ones, with Llama-3.2-11B-Vision reaching 18% compared to 75% for GPT-5-Mini. These results establish baseline capabilities and limitations of current VLMs for optimization code generation within an IBN system. We also demonstrate practical feasibility through a case study that deploys VLM-generated code to network testbed infrastructure using Model Context Protocol.
https://arxiv.org/abs/2601.12744
The design-build-test cycle is essential for innovation, but physical prototyping is often slow and expensive. Although physics-based simulation and strategic prototyping can reduce cost, meaningful evaluation is frequently constrained until an integrated prototype is built. This paper investigates whether a generative pretrained transformer (GPT) can predict information typically obtained through prototyping, including cost, performance, and perceived usability. We introduce a retrieval-augmented generation (RAG) method to emulate design feedback using OpenAI GPT-4o, grounded in prototyping data scraped from this http URL to increase access to relevant precedent. Two studies are reported. First, a controlled experiment compares GPT-RAG and human designers, who receive design sketches and predict cost, performance, and usability; predictions are evaluated against ground-truth results from physical prototypes. Second, we report an applied demonstration in which a physical prototype is produced from GPT-RAG recommendations and compared with a commercial baseline and a topology-optimized design. Results show that GPT-RAG provides more accurate cost and performance estimates than individual or crowd human estimates, while yielding comparable usability insights; the GPT-RAG-informed prototype also outperforms both comparison prototypes. Repeated querying with response averaging significantly improves accuracy, suggesting that LLMs can emulate crowd aggregation effects consistent with the law of large numbers.
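The closing claim, that repeated querying with response averaging improves accuracy consistent with the law of large numbers, can be sketched numerically. The numbers below (true cost 100, per-query noise with standard deviation 15, 25 repeats) are assumptions for illustration, not the paper's data:

```python
import numpy as np

# Monte Carlo sketch of the "repeated querying with response averaging"
# effect: averaging n noisy estimates shrinks the mean error ~ 1/sqrt(n).
rng = np.random.default_rng(0)
true_cost = 100.0
queries = rng.normal(true_cost, 15.0, size=(10_000, 25))  # 25 repeats each

err_single = np.abs(queries[:, 0] - true_cost).mean()        # one query
err_avg25 = np.abs(queries.mean(axis=1) - true_cost).mean()  # 25 averaged

print(err_avg25 < err_single / 4)  # True: roughly a 5x error reduction
```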
https://arxiv.org/abs/2601.12276
Large language models increasingly function as artificial reasoners: they evaluate arguments, assign credibility, and express confidence. Yet their belief-forming behavior is governed by implicit, uninspected epistemic policies. This paper argues for an epistemic constitution for AI: explicit, contestable meta-norms that regulate how systems form and express beliefs. Source attribution bias provides the motivating case: I show that frontier models enforce identity-stance coherence, penalizing arguments attributed to sources whose expected ideological position conflicts with the argument's content. When models detect systematic testing, these effects collapse, revealing that systems treat source-sensitivity as bias to suppress rather than as a capacity to execute well. I distinguish two constitutional approaches: the Platonic, which mandates formal correctness and default source-independence from a privileged standpoint, and the Liberal, which refuses such privilege, specifying procedural norms that protect conditions for collective inquiry while allowing principled source-attending grounded in epistemic vigilance. I argue for the Liberal approach, sketch a constitutional core of eight principles and four orientations, and propose that AI epistemic governance requires the same explicit, contestable structure we now expect for AI ethics.
https://arxiv.org/abs/2601.14295
Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This has motivated works that render the Transformer's input into images for prefilling, which effectively reduces the number of tokens through visual encoding, thereby alleviating the quadratic growth of attention computation. However, this partial compression saves no computational or memory cost during token-by-token inference. In this paper, we investigate global context compression, which saves tokens at both the prefilling and inference stages. Consequently, we propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encodings while depending exclusively on the visual tokens in the pre-context to predict the next text-token distribution. Around this idea, we render text chunks into sketch images and train VIST2 in multiple stages, starting with curriculum-scheduled pretraining for optical language modeling, followed by modal-interleaved instruction tuning. We conduct extensive experiments with VIST2 families scaled from 0.6B to 8B parameters to explore the training recipe and hyperparameters. At a 4× compression ratio, the resulting models significantly outperform baselines on long writing tasks, achieving, on average, a 3× speedup in first-token generation, a 77% reduction in memory usage, and a 74% reduction in FLOPs. Our code and datasets will be made public to support further studies.
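The prefilling savings described above follow from attention's quadratic scaling in token count. A back-of-the-envelope sketch with illustrative constants (the 74% end-to-end FLOPs figure reported by the paper also reflects MLP and decoding costs):

```python
# Self-attention FLOPs scale quadratically with token count, so compressing
# an n-token context by a ratio r cuts prefill attention cost by r^2.
def attention_flops(n_tokens, d_model):
    return 2 * n_tokens * n_tokens * d_model  # QK^T plus attention @ V

n, d, r = 8192, 4096, 4                 # context length, width, compression
full = attention_flops(n, d)
compressed = attention_flops(n // r, d)
print(compressed / full)  # 0.0625, i.e. 1/16 of the original prefill cost
```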
https://arxiv.org/abs/2601.10378
Person identification in forensic investigations becomes very challenging when common sources of DNA (e.g., hair strands, soft tissue) are not available. Current methods apply deep learning to face recognition; however, they lack effective mechanisms to model cross-domain structural correspondence between two different forensic modalities. In this paper, we introduce SPOT-Face, a superpixel graph-based framework designed for cross-domain forensic face identification of victims from their skeleton and sketch images. Our unified framework constructs a superpixel-based graph from an image and then uses different graph neural network (GNN) backbones to extract graph embeddings, while cross-domain correspondence is established through an attention-guided optimal transport mechanism. We evaluate our proposed framework extensively on two publicly available datasets: IIT_Mandi_S2F (S2F) and CUFS. The experimental results show significant improvement in identification metrics (i.e., Recall, mAP) over existing graph-based baselines. Furthermore, our framework proves highly effective for matching skulls and sketches to faces in forensic investigations.
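As a rough illustration of the optimal-transport step, the following computes an entropy-regularized (Sinkhorn) transport plan between two small sets of node embeddings. This is a generic Sinkhorn sketch under uniform marginals, not the paper's attention-guided variant; the embedding sizes and regularization strength are assumptions:

```python
import numpy as np

def sinkhorn(cost, reg=0.5, iters=1000):
    """Entropy-regularized optimal transport with uniform marginals."""
    n, m = cost.shape
    a, b = np.full(n, 1 / n), np.full(m, 1 / m)
    K = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
skull_emb = rng.standard_normal((6, 8))   # toy GNN node embeddings
face_emb = rng.standard_normal((6, 8))
cost = np.linalg.norm(skull_emb[:, None] - face_emb[None, :], axis=-1)
plan = sinkhorn(cost)

# A valid transport plan couples the two sets: its marginals are uniform.
print(np.allclose(plan.sum(axis=1), 1 / 6), np.allclose(plan.sum(axis=0), 1 / 6))
```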
https://arxiv.org/abs/2601.09229
Authentication and attribution of works on paper remain persistent challenges in cultural heritage, particularly when the available reference corpus is small and stylistic cues are primarily expressed through line and limited tonal variation. We present a verification-based computational framework for historical drawing authentication using one-class autoencoders trained on a compact set of interpretable handcrafted features. Ten artist-specific verifiers are trained using authenticated sketches from the Metropolitan Museum of Art open-access collection, the Ashmolean Collections Catalogue, the Morgan Library and Museum, the Royal Collection Trust (UK), the Victoria and Albert Museum Collections, and an online catalogue of the Casa Buonarroti collection and evaluated under a biometric-style protocol with genuine and impostor trials. Feature vectors comprise Fourier-domain energy, Shannon entropy, global contrast, GLCM-based homogeneity, and a box-counting estimate of fractal complexity. Across 900 verification decisions (90 genuine and 810 impostor trials), the pooled system achieves a True Acceptance Rate of 83.3% with a False Acceptance Rate of 9.5% at the chosen operating point. Performance varies substantially by artist, with near-zero false acceptance for some verifiers and elevated confusability for others. A pairwise attribution of false accepts indicates structured error pathways consistent with stylistic proximity and shared drawing conventions, whilst also motivating tighter control of digitisation artefacts and threshold calibration. The proposed methodology is designed to complement, rather than replace, connoisseurship by providing reproducible, quantitative evidence suitable for data-scarce settings common in historical sketch attribution.
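Most of the listed features can be computed in a few lines of numpy. The sketch below is illustrative, not the paper's exact pipeline: the GLCM homogeneity term is omitted, the binarization threshold is an assumption, and the box-counting slope is a crude fractal-complexity estimate:

```python
import numpy as np

def handcrafted_features(img):
    """Illustrative feature vector (GLCM homogeneity omitted)."""
    img = img.astype(float)
    # Fourier-domain energy: mean squared magnitude of the 2-D FFT.
    fourier_energy = float(np.mean(np.abs(np.fft.fft2(img)) ** 2))
    # Shannon entropy over a 256-bin intensity histogram.
    hist, _ = np.histogram(img, bins=256, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = float(-(p * np.log2(p)).sum())
    # Global contrast as the intensity standard deviation.
    contrast = float(img.std())
    # Box-counting estimate of fractal complexity of the inked pixels
    # (threshold 0.5 is an assumption: dark strokes on light paper).
    binary = img < 0.5
    n = img.shape[0]
    sizes, counts = [], []
    s = n // 2
    while s >= 2:
        blocks = binary[: n - n % s, : n - n % s].reshape(n // s, s, -1, s)
        occupied = int(blocks.any(axis=(1, 3)).sum())
        if occupied > 0:
            sizes.append(s)
            counts.append(occupied)
        s //= 2
    slope, _ = np.polyfit(np.log(1 / np.array(sizes)), np.log(counts), 1)
    return np.array([fourier_energy, entropy, contrast, slope])

rng = np.random.default_rng(1)
feats = handcrafted_features(rng.random((128, 128)))
print(feats.shape)  # (4,)
```

A one-class autoencoder per artist would then be trained to reconstruct such vectors, with reconstruction error as the verification score.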
https://arxiv.org/abs/2601.11627
Facade renovation offers a more sustainable alternative to full demolition, yet producing design proposals that preserve existing structures while expressing new intent remains challenging. Current workflows typically require detailed as-built modelling before design, which is time-consuming, labour-intensive, and often involves repeated revisions. To address this issue, we propose a three-stage framework combining generative artificial intelligence (AI) and vision-language models (VLM) that directly processes a rough structural sketch and textual descriptions to produce consistent renovation proposals. First, a fine-tuned VLM uses the input sketch to predict bounding boxes specifying where modifications are needed and which components should be added. Next, a stable diffusion model generates detailed sketches of new elements, which are merged with the original outline through a generative inpainting pipeline. Finally, ControlNet is employed to refine the result into a photorealistic image. Experiments on datasets and real industrial buildings indicate that the proposed framework can generate renovation proposals that preserve the original structure while improving facade detail quality. This approach effectively bypasses the need for detailed as-built modelling, enabling architects to rapidly explore design alternatives, iterate on early-stage concepts, and communicate renovation intentions with greater clarity.
https://arxiv.org/abs/2601.08531
Despite notable advancements in prompting methods for Large Language Models (LLMs), such as Chain-of-Thought (CoT), existing strategies still suffer from excessive token usage and limited generalisability across diverse reasoning tasks. To address these limitations, we propose an Adaptive Causal Prompting with Sketch-of-Thought (ACPS) framework, which leverages structural causal models to infer the causal effect of a query on its answer and adaptively select an appropriate intervention (i.e., standard front-door and conditional front-door adjustments). This design enables generalisable causal reasoning across heterogeneous tasks without task-specific retraining. By replacing verbose CoT with concise Sketch-of-Thought, ACPS enables efficient reasoning that significantly reduces token usage and inference cost. Extensive experiments on multiple reasoning benchmarks and LLMs demonstrate that ACPS consistently outperforms existing prompting baselines in terms of accuracy, robustness, and computational efficiency.
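The standard front-door adjustment that ACPS selects between interventions can be checked on a toy structural causal model. The sketch below builds a small binary SCM with an unobserved confounder U and verifies numerically that the front-door estimand P(y|do(x)) = Σ_m P(m|x) Σ_x' P(y|m,x') P(x') recovers the true interventional distribution (the probabilities are toy values, not anything from the paper):

```python
import numpy as np

# Toy binary SCM with front-door structure: U -> X, U -> Y, X -> M -> Y.
p_u = 0.3                        # P(U=1)
p_x_u = np.array([0.2, 0.8])     # P(X=1 | U=u)
p_m_x = np.array([0.1, 0.9])     # P(M=1 | X=x)
p_y_mu = np.array([[0.2, 0.6],   # P(Y=1 | M=m, U=u): rows m, cols u
                   [0.5, 0.9]])

joint = np.zeros((2, 2, 2, 2))   # observational P(u, x, m, y)
for u in range(2):
    for x in range(2):
        for m in range(2):
            for y in range(2):
                pu = p_u if u else 1 - p_u
                px = p_x_u[u] if x else 1 - p_x_u[u]
                pm = p_m_x[x] if m else 1 - p_m_x[x]
                py = p_y_mu[m, u] if y else 1 - p_y_mu[m, u]
                joint[u, x, m, y] = pu * px * pm * py

def front_door(x_do):
    """Front-door estimate of P(Y=1 | do(X=x_do)) from observational data."""
    pxmy = joint.sum(axis=0)                 # P(x, m, y): U is unobserved
    px = pxmy.sum(axis=(1, 2))               # P(x)
    pm_given_x = pxmy.sum(axis=2)[x_do] / px[x_do]
    total = 0.0
    for m in range(2):
        inner = sum(pxmy[xp, m, 1] / pxmy[xp, m].sum() * px[xp]
                    for xp in range(2))      # sum_x' P(y=1|m,x') P(x')
        total += pm_given_x[m] * inner
    return total

def truth(x_do):
    """Ground truth: U keeps its prior and M ~ P(m | x_do)."""
    return sum((p_u if u else 1 - p_u) *
               (p_m_x[x_do] if m else 1 - p_m_x[x_do]) * p_y_mu[m, u]
               for u in range(2) for m in range(2))

print(all(abs(front_door(x) - truth(x)) < 1e-12 for x in (0, 1)))  # True
```

The conditional front-door adjustment used by ACPS extends this by additionally conditioning on covariates.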
https://arxiv.org/abs/2601.08108
The dataset spans diverse artistic styles, including regionally grounded aesthetics from the Middle East, Northern Europe, East Asia, and South Asia, alongside general categories such as sketch and oil painting. All images are generated using the Moonworks Lunara model and intentionally crafted to embody distinct, high-quality aesthetic styles, yielding a first-of-its-kind dataset whose aesthetic scores substantially exceed those of aesthetics-focused datasets, and those of general-purpose datasets by an even larger margin. Each image is accompanied by a human-refined prompt and structured annotations that jointly describe salient objects, attributes, relationships, and stylistic cues. Unlike large-scale web-derived datasets that emphasize breadth over precision, the Lunara Aesthetic Dataset prioritizes aesthetic quality, stylistic diversity, and licensing transparency, and is released under the Apache 2.0 license to support research and unrestricted academic and commercial use.
https://arxiv.org/abs/2601.07941
Diffusion Transformers achieve impressive generative quality but remain computationally expensive due to iterative sampling. Recently, dynamic resolution sampling has emerged as a promising acceleration technique that reduces the resolution of early sampling steps. However, existing methods rely on heuristic re-noising at every resolution transition, injecting noise that breaks cross-stage consistency and forces the model to relearn global structure. In addition, these methods indiscriminately upsample the entire latent space at once without checking which regions have actually converged, causing accumulated errors and visible artifacts. We therefore propose Fresco, a dynamic resolution framework that unifies re-noising and global structure across stages with progressive upsampling, preserving both the efficiency of low-resolution drafting and the fidelity of high-resolution refinement, with all stages aligned toward the same final target. Fresco achieves near-lossless acceleration across diverse domains and models, including a 10× speedup on FLUX and 5× on HunyuanVideo, while remaining orthogonal to distillation, quantization, and feature caching, reaching a 22× speedup when combined with distilled models. Our code is included in the supplementary material and will be released on GitHub.
https://arxiv.org/abs/2601.07462
While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual understanding, they often struggle when faced with the unstructured and ambiguous nature of human-generated sketches. This limitation is particularly pronounced in the underexplored task of visual grading, where models should not only solve a problem but also diagnose errors in hand-drawn diagrams. Such diagnostic capabilities depend on complex structural, semantic, and metacognitive reasoning. To bridge this gap, we introduce SketchJudge, a novel benchmark tailored for evaluating MLLMs as graders of hand-drawn STEM diagrams. SketchJudge encompasses 1,015 hand-drawn student responses across four domains: geometry, physics, charts, and flowcharts, featuring diverse stylistic variations and distinct error types. Evaluations on SketchJudge demonstrate that even advanced MLLMs lag significantly behind humans, validating the benchmark's effectiveness in exposing the fragility of current vision-language alignment in symbolic and noisy contexts. All data, code, and evaluation scripts are publicly available at this https URL.
https://arxiv.org/abs/2601.06944
Charts are high-density visual carriers of complex data and a medium for information extraction and analysis. Because it demands precise and complex visual reasoning, automated chart understanding poses a significant challenge to existing Multimodal Large Language Models (MLLMs). Many MLLMs trained with reinforcement learning (RL) face the challenge of credit assignment: their advantage estimation, typically performed at the trajectory level, cannot distinguish between correct and incorrect reasoning steps within a single generated response. To address this limitation, we introduce SketchVL, a novel MLLM optimized with FinePO, a new RL algorithm designed for fine-grained credit assignment within each trajectory. SketchVL draws its intermediate reasoning steps as markers on the image and feeds the annotated image back to itself, creating a robust, multi-step reasoning process. During training, the FinePO algorithm leverages a Fine-grained Process Reward Model (FinePRM) to score each drawing action within a trajectory, thereby precisely assigning credit for each step. This mechanism allows FinePO to more strongly reward correct tokens when a trajectory is globally successful and to more heavily penalize incorrect tokens when the trajectory is globally suboptimal, thus achieving fine-grained reinforcement signals. Experiments show that SketchVL learns to align its step-level behavior with the FinePRM, achieving an average performance gain of 7.23% over its base model across chart datasets, natural image datasets, and mathematics, providing a promising new direction for training powerful reasoning models.
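The contrast between trajectory-level and fine-grained credit assignment can be shown with toy numbers (hypothetical per-step reward-model scores and baseline, not FinePO's actual update rule):

```python
import numpy as np

# Trajectory-level vs fine-grained credit assignment on a 3-step trajectory.
step_rewards = np.array([0.9, 0.2, 0.8])    # PRM score for each drawing step
baseline = 0.5

# Trajectory-level advantage: one scalar shared by every step.
adv_traj = np.full(3, step_rewards.mean() - baseline)

# Fine-grained advantage: each step is credited by its own score.
adv_fine = step_rewards - baseline

print(adv_traj)  # the weak middle step is still rewarded
print(adv_fine)  # the weak middle step (0.2) is now penalized
```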
https://arxiv.org/abs/2601.05688
Convolutional Neural Networks (CNNs) are known to exhibit a strong texture bias, favoring local patterns over global shape information--a tendency inherent to their convolutional architecture. While this bias is beneficial for texture-rich natural images, it often degrades performance on shape-dominant data such as illustrations and sketches. Although prior work has proposed shape-biased models to mitigate this issue, these approaches lack a quantitative metric for identifying which datasets would actually benefit from such modifications. To address this gap, we propose a data-driven metric that quantifies the shape-texture balance of a dataset by computing the Structural Similarity Index (SSIM) between each image's luminance channel and its L0-smoothed counterpart. Building on this metric, we further introduce a computationally efficient adaptation method that promotes shape bias by modifying the dilation of max-pooling operations while keeping convolutional weights frozen. Experimental results show that this approach consistently improves classification accuracy on shape-dominant datasets, particularly in low-data regimes where full fine-tuning is impractical, requiring training only the final classification layer.
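The proposed metric pairs each image's luminance channel with a smoothed counterpart and scores their similarity. The sketch below substitutes a box blur for L0 smoothing and uses a single-window (global) SSIM, both simplifying assumptions; it only illustrates why shape-dominant images score higher than texture-heavy ones:

```python
import numpy as np

def global_ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM over the whole image (standard constants)."""
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2))

def box_blur(img, k=5):
    """Crude smoother standing in for L0 smoothing (an assumption)."""
    p = np.pad(img, k // 2, mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def shape_texture_score(luminance):
    # High similarity to the smoothed image = little high-frequency texture,
    # i.e. a shape-dominant image.
    return global_ssim(luminance, box_blur(luminance))

rng = np.random.default_rng(0)
texture = rng.random((64, 64))            # noise-like, texture-heavy
shape = np.zeros((64, 64))
shape[16:48, 16:48] = 1.0                 # flat, shape-dominant square
print(shape_texture_score(shape) > shape_texture_score(texture))  # True
```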
https://arxiv.org/abs/2601.05599
We observe that Gaussians exhibit distinct roles and characteristics analogous to traditional artistic techniques -- like how artists first sketch outlines before filling in broader areas with color, some Gaussians capture high-frequency features such as edges and contours, while others represent broader, smoother regions analogous to brush strokes that add volume and depth. Based on this observation, we propose a hybrid representation that categorizes Gaussians into (i) Sketch Gaussians, which represent high-frequency, boundary-defining features, and (ii) Patch Gaussians, which cover low-frequency, smooth regions. This semantic separation naturally enables layered progressive streaming, where the compact Sketch Gaussians establish the structural skeleton before Patch Gaussians incrementally refine volumetric detail. In this work, we extend our previous method to arbitrary 3D scenes by proposing a novel hierarchical adaptive categorization framework that operates directly on the 3DGS representation. Our approach employs multi-criteria density-based clustering combined with adaptive quality-driven refinement. This method eliminates dependency on external 3D line primitives while ensuring optimal parametric encoding effectiveness. Our comprehensive evaluation across diverse scenes, including both man-made and natural environments, demonstrates that our method achieves up to a 1.74 dB improvement in PSNR, a 6.7% improvement in SSIM, and a 41.4% reduction in LPIPS at equivalent model sizes compared to uniform pruning baselines. For indoor scenes, our method maintains visual quality with only 0.5% of the original model size. This structure-aware representation enables efficient storage, adaptive streaming, and rendering of high-fidelity 3D content across bandwidth-constrained networks and resource-limited devices.
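A drastically simplified version of the Sketch/Patch split and the layered streaming order might look as follows, using per-Gaussian scale as the only criterion (the paper's framework uses multi-criteria density-based clustering; the median threshold here is an assumption):

```python
import numpy as np

# Toy Sketch/Patch split: small, edge-like Gaussians stream first.
rng = np.random.default_rng(0)
scales = rng.lognormal(mean=-3.0, sigma=1.0, size=1000)  # Gaussian extents

is_sketch = scales < np.median(scales)   # small -> Sketch, large -> Patch
stream_order = np.concatenate([np.where(is_sketch)[0],    # skeleton first
                               np.where(~is_sketch)[0]])  # then volume

n_sketch = int(is_sketch.sum())
print(bool(is_sketch[stream_order[:n_sketch]].all()))  # True
```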
https://arxiv.org/abs/2601.05394
Mobile GUI agents have shown strong potential for real-world automation and practical applications. However, most existing agents remain reactive, making decisions mainly from the current screen, which limits their performance on long-horizon tasks. Building a world model from repeated interactions enables forecasting action outcomes and supports better decision making for mobile GUI agents. This is challenging because the model must predict post-action states with spatial awareness while remaining efficient enough for practical deployment. In this paper, we propose MobileDreamer, an efficient world-model-based lookahead framework that equips GUI agents with the future imagination provided by the world model. It consists of a textual sketch world model and a rollout imagination strategy for the GUI agent. The textual sketch world model forecasts post-action states through a learning process that transforms digital images into key task-related sketches, and it employs a novel order-invariant learning strategy to preserve the spatial information of GUI elements. The rollout imagination strategy optimizes the agent's action-selection process by leveraging the prediction capability of the world model. Experiments on Android World show that MobileDreamer achieves state-of-the-art performance and improves task success by 5.25%. World-model evaluations further verify that our textual sketch modeling accurately forecasts key GUI elements.
https://arxiv.org/abs/2601.04035
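The lookahead idea above can be sketched as a simple one-step loop: imagine the post-action state with the world model, score it, and pick the best candidate. This is an illustrative sketch, not MobileDreamer's actual code; `world_model.predict` and `score_state` are hypothetical stand-ins for the learned components.

```python
# Illustrative one-step lookahead with a (hypothetical) textual-sketch world
# model: each candidate action is evaluated on the state the model imagines
# it would produce.

def select_action(current_sketch, candidate_actions, world_model, score_state):
    """Pick the candidate action whose imagined outcome scores highest."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        # The world model imagines the post-action textual sketch of the GUI.
        imagined_sketch = world_model.predict(current_sketch, action)
        # The imagined state is scored for progress toward the task goal.
        score = score_state(imagined_sketch)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

A multi-step rollout would apply `world_model.predict` recursively before scoring; the paper's strategy presumably balances rollout depth against inference cost.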
Despite the empirical success of extensive, step-by-step reasoning in large multimodal models, long reasoning processes inevitably incur substantial computational overhead, i.e., higher token costs and increased response time, which undermines inference efficiency. In contrast, humans often employ sketch-style reasoning: a concise, goal-directed cognitive process that prioritizes salient information and enables efficient problem-solving. Inspired by this cognitive efficiency, we propose SketchThinker-R1, which incentivizes sketch-style reasoning ability in large multimodal models. Our method consists of three primary stages. In the Sketch-Mode Cold Start stage, we convert standard long reasoning processes into sketch-style reasoning and finetune the base multimodal model, instilling an initial sketch-style reasoning capability. Next, we train the SketchJudge Reward Model, which explicitly evaluates the model's thinking process and assigns higher scores to sketch-style reasoning. Finally, we conduct Sketch-Thinking Reinforcement Learning under the supervision of SketchJudge to further generalize the sketch-style reasoning ability. Experimental evaluation on four benchmarks reveals that SketchThinker-R1 achieves over a 64% reduction in reasoning token cost without compromising final answer accuracy. Qualitative analysis further shows that sketch-style reasoning focuses more on key cues during problem solving.
https://arxiv.org/abs/2601.02825
Learning and Employment Record (LER) systems are emerging as critical infrastructure for securely compiling and sharing educational and work achievements. Existing blockchain-based platforms leverage verifiable credentials but typically lack automated skill-credential generation and the ability to incorporate unstructured evidence of learning. In this paper, a privacy-preserving, AI-enabled decentralized LER system is proposed to address these gaps. Digitally signed transcripts from educational institutions are accepted, and verifiable self-issued skill credentials are derived inside a trusted execution environment (TEE) by a natural language processing pipeline that analyzes formal records (e.g., transcripts, syllabi) and informal artifacts. All verification and job-skill matching are performed inside the enclave with selective disclosure, so raw credentials and private keys remain enclave-confined. Job matching relies solely on attested skill vectors and is invariant to non-skill resume fields, thereby reducing opportunities for screening bias. The NLP component was evaluated on sample learner data; the mapping follows the validated Syllabus-to-O*NET methodology, and a stability test across repeated runs observed <5% variance in top-ranked skills. Formal security statements and proof sketches are provided showing that derived credentials are unforgeable and that sensitive information remains confidential. The proposed system thus supports secure education and employment credentialing, robust transcript verification, and automated, privacy-preserving skill extraction within a decentralized framework.
https://arxiv.org/abs/2601.02720
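Matching on attested skill vectors alone, as described above, can be illustrated with a plain similarity ranking: non-skill resume fields never enter the computation, so they cannot influence the result. This is a minimal sketch under assumed conventions (a fixed skill vocabulary indexing the vectors, cosine similarity as the scoring rule), not the system's actual enclave code.

```python
import math

# Illustrative skill-vector matching: rank jobs purely by similarity of
# attested skill vectors. By construction, no non-skill field is an input.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def match_jobs(candidate_skills, jobs):
    """jobs: mapping of job name -> required-skill vector. Returns job names
    ranked by cosine similarity to the candidate's attested skill vector."""
    ranked = sorted(jobs.items(),
                    key=lambda kv: cosine_similarity(candidate_skills, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked]
```

In the proposed system this ranking would run inside the TEE over enclave-attested vectors; the sketch only shows why the score is invariant to everything outside the skill vector.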
Physics simulation of slender elastic objects often requires discretization as a polyline. However, constructing a polyline from Gaussian splatting is challenging, as Gaussian splatting lacks connectivity information and the configuration of Gaussian primitives contains much noise. This paper presents a method to extract a polyline representation of the slender parts of objects in a Gaussian splatting scene from the user's sketching input. Our method robustly constructs a polyline mesh representing the slender parts using a screen-space shortest-path analysis that can be solved efficiently with dynamic programming. We demonstrate the effectiveness of our approach on several in-the-wild examples.
https://arxiv.org/abs/2601.02072
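The screen-space shortest-path-by-dynamic-programming idea above can be sketched with a seam-style DP over a 2D cost grid. This is an illustrative sketch in the spirit of the abstract, not the paper's actual algorithm; the per-pixel `cost` grid (e.g. a penalty derived from Gaussian density near the user's stroke) and the left-to-right sweep direction are assumptions.

```python
# Illustrative DP shortest path across a 2D cost grid: advance one column at
# a time, allowing the path to move to one of the three adjacent rows, then
# backtrack from the cheapest endpoint.

def shortest_path(cost):
    rows, cols = len(cost), len(cost[0])
    # dp[r][c]: minimal accumulated cost to reach pixel (r, c) from column 0.
    dp = [[0.0] * cols for _ in range(rows)]
    back = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        dp[r][0] = cost[r][0]
    for c in range(1, cols):
        for r in range(rows):
            # Best predecessor among the adjacent rows in column c - 1.
            prev = min(range(max(r - 1, 0), min(r + 2, rows)),
                       key=lambda pr: dp[pr][c - 1])
            dp[r][c] = cost[r][c] + dp[prev][c - 1]
            back[r][c] = prev
    # Backtrack from the cheapest pixel in the last column.
    r = min(range(rows), key=lambda rr: dp[rr][cols - 1])
    path = [r]
    for c in range(cols - 1, 0, -1):
        r = back[r][c]
        path.append(r)
    path.reverse()
    return path  # row index of the path at each column
```

The returned per-column row indices form the screen-space polyline, which would then be lifted back to 3D; each column's minimum is computed from the previous column only, which is what makes the problem solvable in a single O(rows × cols) sweep.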