Federated Learning (FL) enables multiple parties to collaboratively train machine learning models without sharing raw data. However, before training, data must be preprocessed to address missing values, inconsistent formats, and heterogeneous feature scales. This preprocessing stage is critical for model performance but is largely overlooked in FL research. In practical FL systems, privacy constraints prohibit centralizing raw data, while communication efficiency introduces further challenges for distributed preprocessing. We introduce FedPS, a unified framework for federated data preprocessing based on aggregated statistics. FedPS leverages data-sketching techniques to efficiently summarize local datasets while preserving essential statistical information. Building on these summaries, we design federated algorithms for feature scaling, encoding, discretization, and missing-value imputation, and extend preprocessing-related models such as k-Means, k-Nearest Neighbors, and Bayesian Linear Regression to both horizontal and vertical FL settings. FedPS provides flexible, communication-efficient, and consistent preprocessing pipelines for practical FL deployments.
https://arxiv.org/abs/2602.10870
Influence functions and related data attribution scores take the form of $g^{\top}F^{-1}g^{\prime}$, where $F\succeq 0$ is a curvature operator. In modern overparameterized models, forming or inverting $F\in\mathbb{R}^{d\times d}$ is prohibitive, motivating scalable influence computation via random projection with a sketch $P \in \mathbb{R}^{m\times d}$. This practice is commonly justified via the Johnson--Lindenstrauss (JL) lemma, which ensures approximate preservation of Euclidean geometry for a fixed dataset. However, JL does not address how sketching behaves under inversion. Furthermore, there is no existing theory that explains how sketching interacts with other widely-used techniques, such as ridge regularization and structured curvature approximations. We develop a unified theory characterizing when projection provably preserves influence functions. When $g,g^{\prime}\in\text{range}(F)$, we show that: 1) Unregularized projection: exact preservation holds iff $P$ is injective on $\text{range}(F)$, which necessitates $m\geq \text{rank}(F)$; 2) Regularized projection: ridge regularization fundamentally alters the sketching barrier, with approximation guarantees governed by the effective dimension of $F$ at the regularization scale; 3) Factorized influence: for Kronecker-factored curvatures $F=A\otimes E$, the guarantees continue to hold for decoupled sketches $P=P_A\otimes P_E$, even though such sketches exhibit row correlations that violate i.i.d. assumptions. Beyond this range-restricted setting, we analyze out-of-range test gradients and quantify a \emph{leakage} term that arises when test gradients have components in $\ker(F)$. This yields guarantees for influence queries on general test points. Overall, this work develops a novel theory that characterizes when projection provably preserves influence and provides principled guidance for choosing the sketch size in practice.
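The range-restricted exactness claim in (1) can be checked numerically. The sketch below (NumPy; the dimensions and the Gaussian sketch matrix are illustrative assumptions, not the paper's setup) builds a low-rank PSD curvature $F$, draws $g, g' \in \text{range}(F)$, and compares the influence $g^{\top}F^{+}g'$ against its sketched counterpart with $m \geq \text{rank}(F)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, m = 50, 5, 8  # ambient dim, rank of curvature, sketch size (m >= rank)

# Low-rank PSD curvature F = V diag(lam) V^T, so range(F) = col(V)
V, _ = np.linalg.qr(rng.standard_normal((d, r)))
lam = rng.uniform(0.5, 2.0, size=r)
F = V @ np.diag(lam) @ V.T

# Gradients constrained to range(F)
g = V @ rng.standard_normal(r)
gp = V @ rng.standard_normal(r)

true_influence = g @ np.linalg.pinv(F) @ gp

# Gaussian sketch P: with m >= rank(F), P is injective on range(F) almost surely
P = rng.standard_normal((m, d))
sketched_influence = (P @ g) @ np.linalg.pinv(P @ F @ P.T) @ (P @ gp)

# With m < rank(F), P cannot be injective on range(F) and exactness is lost:
# P_small = rng.standard_normal((3, d)) would generically give a different value.
```

Shrinking the sketch below rank(F) breaks injectivity on range(F), and the equality generically fails, matching the stated barrier.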
https://arxiv.org/abs/2602.10449
Self-attention dominates the computational and memory cost of long-context LLM inference across both the prefill and decode phases. To address this challenge, we introduce Sketch&Walk Attention, a training-free sparse attention method that determines sparsity with lightweight sketches and a deterministic walk. Sketch&Walk applies Hadamard sketching to obtain inexpensive approximations of attention scores, then aggregates these estimates across layers via a walk mechanism that captures attention influence beyond direct interactions between tokens. The accumulated walk scores are used to select top-k attention blocks, enabling dynamic sparsity with a single training-free algorithm, paired with custom sparse attention kernels, that applies uniformly to both the prefill and decode phases. Across a wide range of models and tasks, Sketch&Walk maintains near-lossless accuracy at 20% attention density and can slightly outperform dense attention in some settings, while achieving up to 6x inference speedup.
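As a rough illustration of the sketching step only (not the paper's kernels or its walk aggregation, which the abstract does not specify), a subsampled randomized Hadamard transform gives cheap estimates of attention logits q.k; all dimensions below are assumptions chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def hadamard(n: int) -> np.ndarray:
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

d, m = 64, 16  # head dimension (power of two) and sketch size
H = hadamard(d)
D = np.diag(rng.choice([-1.0, 1.0], size=d))  # random sign flips
rows = rng.choice(d, size=m, replace=False)   # uniform row subsampling
S = np.sqrt(d / m) * (H @ D)[rows] / np.sqrt(d)

# H D is orthogonal up to scaling (H D (H D)^T = d I), so the subsampled
# sketch preserves inner products in expectation: E[(Sq).(Sk)] = q.k
q = rng.standard_normal(d)
k = rng.standard_normal(d)
approx_logit = (S @ q) @ (S @ k)  # m-dimensional estimate of the d-dim logit
```

Because the sketch is a fixed linear map, keys can be sketched once and reused across queries, which is what makes the score approximation inexpensive.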
https://arxiv.org/abs/2602.07397
During product conceptualization, capturing the non-linear history and cognitive intent is crucial. Traditional sketching tools often lose this context. We introduce DIMES (Design Idea Management and Evolution capture System), a web-based environment featuring sGIT (SketchGit), a custom visual version control architecture, and Generative AI. sGIT includes AEGIS, a module using hybrid Deep Learning and Machine Learning models to classify six stroke types. The system maps Git primitives to design actions, enabling implicit branching and multi-modal commits (stroke data + voice intent). In a comparative study, experts using DIMES demonstrated a 160% increase in breadth of concept exploration. Generative AI modules generated narrative summaries that enhanced knowledge transfer; novices achieved higher replication fidelity (Neural Transparency-based Cosine Similarity: 0.97 vs. 0.73) compared to manual summaries. AI-generated renderings also received higher user acceptance (Purchase Likelihood: 4.2 vs 3.1). This work demonstrates that intelligent version control bridges creative action and cognitive documentation, offering a new paradigm for design education.
https://arxiv.org/abs/2602.06047
Modern AI systems achieve remarkable capabilities at the cost of substantial energy consumption. To connect intelligence to physical efficiency, we propose two complementary bits-per-joule metrics under explicit accounting conventions: (1) Thermodynamic Epiplexity per Joule -- bits of structural information about a theoretical environment-instance variable newly encoded in an agent's internal state per unit measured energy within a stated boundary -- and (2) Empowerment per Joule -- the embodied sensorimotor channel capacity (control information) per expected energetic cost over a fixed horizon. These provide two axes of physical intelligence: recognition (model-building) and influence (action influence). Drawing on stochastic thermodynamics, we show how a Landauer-scale closed-cycle benchmark for epiplexity acquisition follows as a corollary of a standard thermodynamic-learning inequality under explicit subsystem assumptions, and we clarify how Landauer-scaled costs act as closed-cycle benchmarks under explicit reset/reuse and boundary-closure assumptions; conversely, we give a simple decoupling construction showing that without such assumptions -- and without charging for externally prepared low-entropy resources (e.g., memory) crossing the boundary -- information gain and in-boundary dissipation need not be tightly linked. For empirical settings where the latent structure variable is unavailable, we align the operational notion of epiplexity with compute-bounded MDL epiplexity and recommend reporting MDL-epiplexity / compression-gain surrogates as companions. Finally, we propose a unified efficiency framework that reports both metrics together with a minimal checklist of boundary/energy accounting, coarse-graining/noise, horizon/reset, and cost conventions to reduce ambiguity and support consistent bits-per-joule comparisons, and we sketch connections to energy-adjusted scaling analyses.
https://arxiv.org/abs/2602.05463
As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.
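CountSketch, one of the ingredients named above, can be written in a few lines (a generic illustration, not OPUS's Ghost-based implementation; dimensions are arbitrary): hashing coordinates into signed buckets yields a linear map whose sketched inner products are unbiased estimates of the originals, which is what alignment-style utility scores between an update and a target direction need.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 1000, 64  # gradient dimension, sketch width

# CountSketch: hash each coordinate to one of m buckets with a random sign
bucket = rng.integers(0, m, size=d)
sign = rng.choice([-1.0, 1.0], size=d)

def countsketch(x: np.ndarray) -> np.ndarray:
    s = np.zeros(m)
    np.add.at(s, bucket, sign * x)  # scatter-add signed coordinates per bucket
    return s

g = rng.standard_normal(d)       # e.g. a candidate's effective update
h = rng.standard_normal(d)       # e.g. a target direction

# <Sg, Sh> is an unbiased estimate of <g, h> over the random hash and signs
est = countsketch(g) @ countsketch(h)
```

Because the map is linear, sketches of per-example updates can be accumulated and compared in the m-dimensional space at a fraction of the full-gradient cost.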
https://arxiv.org/abs/2602.05400
Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty even for the strongest systems, highlighting promising directions for future research.
https://arxiv.org/abs/2602.01851
Multimodal Large Language Models (MLLMs) have achieved remarkable success in open-vocabulary perceptual tasks, yet their ability to solve complex cognitive problems remains limited, especially when visual details are abstract and require visual memory. Current approaches primarily scale Chain-of-Thought (CoT) reasoning in the text space, even when language alone is insufficient for clear and structured reasoning, and largely neglect visual reasoning mechanisms analogous to the human visuospatial sketchpad and visual imagery. To mitigate this deficiency, we introduce Cognitive Supersensing, a novel training paradigm that endows MLLMs with human-like visual imagery capabilities by integrating a Latent Visual Imagery Prediction (LVIP) head that jointly learns sequences of visual cognitive latent embeddings and aligns them with the answer, thereby forming vision-based internal reasoning chains. We further introduce a reinforcement learning stage that optimizes text reasoning paths based on this grounded visual latent. To evaluate the cognitive capabilities of MLLMs, we present CogSense-Bench, a comprehensive visual question answering (VQA) benchmark assessing five cognitive dimensions. Extensive experiments demonstrate that MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench and exhibit superior generalization on out-of-domain mathematics and science VQA benchmarks, suggesting that internal visual imagery is potentially key to bridging the gap between perceptual recognition and cognitive understanding. We will open-source the CogSense-Bench and our model weights.
https://arxiv.org/abs/2602.01541
Vision-language models have transformed multimodal representation learning, yet dominant contrastive approaches like CLIP require large batch sizes, careful negative sampling, and extensive hyperparameter tuning. We introduce NOVA, a NOn-contrastive Vision-language Alignment framework based on joint embedding prediction with distributional regularization. NOVA aligns visual representations to a frozen, domain-specific text encoder by predicting text embeddings from augmented image views, while enforcing an isotropic Gaussian structure via Sketched Isotropic Gaussian Regularization (SIGReg). This eliminates the need for negative sampling, momentum encoders, or stop-gradients, reducing the training objective to a single hyperparameter. We evaluate NOVA on zero-shot chest X-ray classification using ClinicalBERT as the text encoder and Vision Transformers trained from scratch on MIMIC-CXR. On zero-shot classification across three benchmark datasets, NOVA outperforms multiple standard baselines while exhibiting substantially more consistent training runs. Our results demonstrate that non-contrastive vision-language pretraining offers a simpler, more stable, and more effective alternative to contrastive methods.
https://arxiv.org/abs/2602.00653
We study how to extend chain-of-thought (CoT) beyond language to better handle multimodal reasoning. While CoT helps LLMs and VLMs articulate intermediate steps, its text-only form often fails on vision-intensive problems where key intermediate states are inherently visual. We introduce modal-mixed CoT, which interleaves textual tokens with compact visual sketches represented as latent embeddings. To bridge the modality gap without eroding the original knowledge and capability of the VLM, we use the VLM itself as an encoder and train the language backbone to reconstruct its own intermediate vision embeddings, to guarantee the semantic alignment of the visual latent space. We further attach a diffusion-based latent decoder, invoked by a special control token and conditioned on hidden states from the VLM. In this way, the diffusion head carries fine-grained perceptual details while the VLM specifies high-level intent, which cleanly disentangles roles and reduces the optimization pressure of the VLM. Training proceeds in two stages: supervised fine-tuning on traces that interleave text and latents with a joint next-token and latent-reconstruction objective, followed by reinforcement learning that teaches when to switch modalities and how to compose long reasoning chains. Extensive experiments across 11 diverse multimodal reasoning tasks, demonstrate that our method yields better performance than language-only and other CoT methods. Our code will be publicly released.
https://arxiv.org/abs/2602.00574
Stroke-level sketch editing aims to transplant source strokes onto a target sketch via stroke expansion or replacement, while preserving semantic consistency and visual fidelity with the target sketch. Recent studies have addressed this by relocating source strokes to appropriate canvas positions. However, since source strokes can vary significantly in both size and orientation, merely repositioning them without further adjustment may fail to produce plausible editing results. For example, anchoring an oversized source stroke onto the target without proper scaling would not yield a semantically coherent outcome. In this paper, we propose SketchMod, which refines the source stroke through transformation to align it with the target sketch's patterns, enabling flexible stroke-level sketch editing. Because the refinement is governed by the patterns of the target sketch, we learn three key offset attributes (scale, orientation, and position) for the source stroke and align it with the target by: 1) resizing to match spatial proportions by scale, 2) rotating to align with local geometry by orientation, and 3) displacing to fit the semantic layout by position. Moreover, a stroke's profile can be precisely controlled during editing via the exposed stroke attributes. Experimental results indicate that SketchMod achieves precise and flexible stroke-level sketch editing.
https://arxiv.org/abs/2602.00489
We investigate the Moore-Penrose pseudoinverse and generalized inverse of a matrix product $A=CR$ to establish a unifying framework for generalized and randomized matrix inverses. This analysis is rooted in first principles, focusing on the geometry of the four fundamental subspaces. We examine: (1) the reverse order law, $A^+ = R^+C^+$, which holds when $C$ has independent columns and $R$ has independent rows, (2) the universally correct formula, $A^+ = (C^+CR)^+(CRR^+)^+$, providing a geometric interpretation of the mappings between the involved subspaces, (3) a new generalized randomized formula, $A^+_p = (P^TA)^+P^TAQ(AQ)^+$, which gives $A^+_p = A^+$ if and only if the sketching matrices $P$ and $Q$ preserve the rank of $A$, i.e., $\mathrm{rank}(P^TA) = \mathrm{rank}(AQ) = \mathrm{rank}(A)$. The framework is extended to generalized $\{1,2\}$-inverses and specialized forms, revealing the underlying structure of established randomized linear algebra algorithms, including randomized SVD, the Nyström approximation, and CUR decomposition. We demonstrate applications in sparse sensor placement and effective resistance estimation. For the latter, we provide a rigorous quantitative analysis of an approximation scheme, establishing that it always underestimates the true resistance and deriving a worst-case spectral bound on the error of resistance differences.
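Formula (3) is easy to verify numerically. In the NumPy sketch below (illustrative sizes), a rank-3 product $A = CR$ is sketched with Gaussian $P$ and $Q$ wide enough that the rank condition $\mathrm{rank}(P^TA) = \mathrm{rank}(AQ) = \mathrm{rank}(A)$ holds almost surely, so $A^+_p$ matches $A^+$ to machine precision; the reverse order law (1) is checked in passing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-3 matrix A = C R with C (8x3, independent columns), R (3x6, independent rows)
C = rng.standard_normal((8, 3))
R = rng.standard_normal((3, 6))
A = C @ R

# Reverse order law (1): A^+ = R^+ C^+ under these rank conditions
assert np.allclose(np.linalg.pinv(A), np.linalg.pinv(R) @ np.linalg.pinv(C))

# Gaussian sketches; sketch size 4 >= rank(A) = 3, so the rank condition
# rank(P^T A) = rank(A Q) = rank(A) holds almost surely
P = rng.standard_normal((8, 4))
Q = rng.standard_normal((6, 4))

# Generalized randomized formula (3): A^+_p = (P^T A)^+ P^T A Q (A Q)^+
A_p = np.linalg.pinv(P.T @ A) @ (P.T @ A @ Q) @ np.linalg.pinv(A @ Q)
```

If either sketch drops below rank(A) (e.g. a width-2 Q here), the rank condition fails and $A^+_p$ generically differs from $A^+$.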
https://arxiv.org/abs/2602.00386
Interactive 3D model texture editing presents enhanced opportunities for creating 3D assets, with freehand drawing style offering the most intuitive experience. However, existing methods primarily support sketch-based interactions for outlining, while the utilization of coarse-grained scribble-based interaction remains limited. Furthermore, current methodologies often encounter challenges due to the abstract nature of scribble instructions, which can result in ambiguous editing intentions and unclear target semantic locations. To address these issues, we propose ScribbleSense, an editing method that combines multimodal large language models (MLLMs) and image generation models to effectively resolve these challenges. We leverage the visual capabilities of MLLMs to predict the editing intent behind the scribbles. Once the semantic intent of the scribble is discerned, we employ globally generated images to extract local texture details, thereby anchoring local semantics and alleviating ambiguities concerning the target semantic locations. Experimental results indicate that our method effectively leverages the strengths of MLLMs, achieving state-of-the-art interactive editing performance for scribble-based texture editing.
https://arxiv.org/abs/2601.22455
In recent years, Text-to-Audio Generation has achieved remarkable progress, offering sound creators powerful tools to transform textual inspirations into vivid audio. However, existing models predominantly operate directly in the acoustic latent space of a Variational Autoencoder (VAE), often leading to suboptimal alignment between generated audio and textual descriptions. In this paper, we introduce SemanticAudio, a novel framework that conducts both audio generation and editing directly in a high-level semantic space. We define this semantic space as a compact representation capturing the global identity and temporal sequence of sound events, distinct from fine-grained acoustic details. SemanticAudio employs a two-stage Flow Matching architecture: the Semantic Planner first generates these compact semantic features to sketch the global semantic layout, and the Acoustic Synthesizer subsequently produces high-fidelity acoustic latents conditioned on this semantic plan. Leveraging this decoupled design, we further introduce a training-free text-guided editing mechanism that enables precise attribute-level modifications on general audio without retraining. Specifically, this is achieved by steering the semantic generation trajectory via the difference of velocity fields derived from source and target text prompts. Extensive experiments demonstrate that SemanticAudio surpasses existing mainstream approaches in semantic alignment. Demo available at: this https URL
https://arxiv.org/abs/2601.21402
World generation is a fundamental capability for applications like video games, simulation, and robotics. However, existing approaches face three main obstacles: controllability, scalability, and efficiency. End-to-end scene generation models have been limited by data scarcity, while object-centric generation approaches rely on fixed-resolution representations, degrading fidelity for larger scenes. Training-free approaches, while flexible, are often slow and computationally expensive at inference time. We present NuiWorld, a framework that attempts to address these challenges. To overcome data scarcity, we propose a generative bootstrapping strategy that starts from a few input images. Leveraging recent 3D reconstruction and expandable scene generation techniques, we synthesize scenes of varying sizes and layouts, producing enough data to train an end-to-end model. Furthermore, our framework enables controllability through pseudo sketch labels, and demonstrates a degree of generalization to previously unseen sketches. Our approach represents scenes as a collection of variable scene chunks, which are compressed into a flattened vector-set representation. This significantly reduces the token length for large scenes, enabling consistent geometric fidelity across scene sizes while improving training and inference efficiency.
https://arxiv.org/abs/2601.19048
Accurate long-horizon vessel trajectory prediction remains challenging due to compounded uncertainty from complex navigation behaviors and environmental factors. Existing methods often struggle to maintain global directional consistency, leading to drifting or implausible trajectories when extrapolated over long time horizons. To address this issue, we propose a semantic-key-point-conditioned trajectory modeling framework, in which future trajectories are predicted by conditioning on a high-level Next Key Point (NKP) that captures navigational intent. This formulation decomposes long-horizon prediction into global semantic decision-making and local motion modeling, effectively restricting the support of future trajectories to semantically feasible subsets. To efficiently estimate the NKP prior from historical observations, we adopt a pretrain-finetune strategy. Extensive experiments on real-world AIS data demonstrate that the proposed method consistently outperforms state-of-the-art approaches, particularly for long travel durations, directional accuracy, and fine-grained trajectory prediction.
https://arxiv.org/abs/2601.18537
For architectural design, representation across multiple Levels of Details (LoD) is essential for achieving a smooth transition from conceptual massing to detailed modeling. However, traditional LoD modeling processes rely on manual operations that are time-consuming, labor-intensive, and prone to geometric inconsistencies. While the rapid advancement of generative artificial intelligence (AI) has opened new possibilities for generating multi-level architectural models from sketch inputs, its application remains limited by the lack of high-quality paired LoD training data. To address this issue, we propose an automatic LoD sketch extraction framework using generative AI models, which progressively simplifies high-detail architectural models to automatically generate geometrically consistent and hierarchically coherent multi-LoD representations. The proposed framework integrates computer vision techniques with generative AI methods to establish a progressive extraction pipeline that transitions from detailed representations to volumetric abstractions. Experimental results demonstrate that the method maintains strong geometric consistency across LoD levels, achieving SSIM values of 0.7319 and 0.7532 for the transitions from LoD3 to LoD2 and from LoD2 to LoD1, respectively, with corresponding normalized Hausdorff distances of 25.1% and 61.0% of the image diagonal, reflecting controlled geometric deviation during abstraction. These results verify that the proposed framework effectively preserves global structure while achieving progressive semantic simplification across different LoD levels, providing reliable data and technical support for AI-driven multi-level architectural generation and hierarchical modeling.
https://arxiv.org/abs/2601.17095
Training modern deep learning models is increasingly constrained by GPU memory and compute limits. While Randomized Numerical Linear Algebra (RandNLA) offers proven techniques to compress these models, the lack of a unified, production-grade library has hindered their wide adoption. We present Panther, a PyTorch-compatible library that consolidates established RandNLA algorithms into a single high-performance framework. Panther engineers efficient, drop-in replacements for standard components including sketched linear layers, 2D convolution, multi-head attention, and randomized matrix decompositions (such as pivoted CholeskyQR). By implementing a custom C++/CUDA backend (pawX), Panther provides an optimized implementation that can run on both CPUs and GPUs. We demonstrate the effectiveness of RandNLA techniques and Panther's ease of adoption: by replacing standard PyTorch linear layers with Panther layers (requiring only a few lines of code), we achieve significant memory savings (up to 75%) on BERT while maintaining comparable loss. Source code is available (MIT License) at this https URL, along with a demonstration video at this https URL.
https://arxiv.org/abs/2601.15473
Out-of-Distribution (OOD) detection is a critical task that has garnered significant attention. The emergence of CLIP has spurred extensive research into zero-shot OOD detection, often employing a training-free approach. Current methods leverage expert knowledge from large language models (LLMs) to identify potential outliers. However, these approaches tend to over-rely on knowledge in the text space, neglecting the inherent challenges involved in detecting out-of-distribution samples in the image space. In this paper, we propose a novel pipeline, MM-OOD, which leverages the multimodal reasoning capabilities of MLLMs and their ability to conduct multi-round conversations for enhanced outlier detection. Our method is designed to improve performance in both near OOD and far OOD tasks. Specifically, (1) for near OOD tasks, we directly feed ID images and corresponding text prompts into MLLMs to identify potential outliers; and (2) for far OOD tasks, we introduce the sketch-generate-elaborate framework: first, we sketch outlier exposure using text prompts, then generate corresponding visual OOD samples, and finally elaborate by using multimodal prompts. Experiments demonstrate that our method achieves significant improvements on widely used multimodal datasets such as Food-101, while also validating its scalability on ImageNet-1K.
https://arxiv.org/abs/2601.14052
Intent-Based Networking (IBN) allows operators to specify high-level network goals rather than low-level configurations. While recent work demonstrates that large language models can automate configuration tasks, a distinct class of intents requires generating optimization code to compute provably optimal solutions for traffic engineering, routing, and resource allocation. Current systems assume text-based intent expression, requiring operators to enumerate topologies and parameters in prose. Network practitioners naturally reason about structure through diagrams, yet whether Vision-Language Models (VLMs) can process annotated network sketches into correct optimization code remains unexplored. We present IntentOpt, a benchmark of 85 optimization problems across 17 categories, evaluating four VLMs (GPT-5-Mini, Claude-Haiku-4.5, Gemini-2.5-Flash, Llama-3.2-11B-Vision) under three prompting strategies on multimodal versus text-only inputs. Our evaluation shows that visual parameter extraction reduces execution success by 12-21 percentage points (pp), with GPT-5-Mini dropping from 93% to 72%. Program-of-thought prompting decreases performance by up to 13 pp, and open-source models lag behind closed-source ones, with Llama-3.2-11B-Vision reaching 18% compared to 75% for GPT-5-Mini. These results establish baseline capabilities and limitations of current VLMs for optimization code generation within an IBN system. We also demonstrate practical feasibility through a case study that deploys VLM-generated code to network testbed infrastructure using Model Context Protocol.
https://arxiv.org/abs/2601.12744