Sixth-generation (6G) radio access networks (RANs) must enforce strict service-level agreements (SLAs) for heterogeneous slices, yet sudden latency spikes remain difficult to diagnose and resolve with conventional deep reinforcement learning (DRL) or explainable RL (XRL). We propose \emph{Attention-Enhanced Multi-Agent Proximal Policy Optimization (AE-MAPPO)}, which integrates six specialized attention mechanisms into multi-agent slice control and surfaces them as zero-cost, faithful explanations. The framework operates across O-RAN timescales with a three-phase strategy: predictive, reactive, and inter-slice optimization. A URLLC case study shows AE-MAPPO resolves a latency spike in $18$ms, restores latency to $0.98$ms with $99.9999\%$ reliability, and reduces troubleshooting time by $93\%$ while maintaining eMBB and mMTC continuity. These results confirm AE-MAPPO's ability to combine SLA compliance with inherent interpretability, enabling trustworthy and real-time automation for 6G RAN slicing.
https://arxiv.org/abs/2602.11076
Current large vision-language models (LVLMs) typically rely on text-only reasoning over a single-pass visual encoding, which often leads to loss of fine-grained visual information. The recently proposed "thinking with images" paradigm attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose "chatting with images", a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and train it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.
https://arxiv.org/abs/2602.11073
We propose PuriLight, a lightweight and efficient framework for self-supervised monocular depth estimation that addresses the dual challenges of computational efficiency and detail preservation. While recent advances in self-supervised depth estimation have reduced reliance on ground-truth supervision, existing approaches remain constrained by either bulky architectures that compromise practicality or lightweight models that sacrifice structural precision. These dual limitations underscore the critical need for lightweight yet structurally precise architectures. Our framework addresses these limitations through a three-stage architecture incorporating three novel modules: the Shuffle-Dilation Convolution (SDC) module for local feature extraction, the Rotation-Adaptive Kernel Attention (RAKA) module for hierarchical feature enhancement, and the Deep Frequency Signal Purification (DFSP) module for global feature purification. Through effective collaboration, these modules enable PuriLight to achieve both lightweight and accurate feature extraction and processing. Extensive experiments demonstrate that PuriLight achieves state-of-the-art performance with minimal training parameters while maintaining exceptional computational efficiency. Code will be available at this https URL.
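The abstract names the SDC module but not its internal layout, so the following is a speculative Python sketch that simply combines the two ingredients in the name: a ShuffleNet-style channel shuffle for cross-group mixing, followed by a dilated depthwise convolution for a larger receptive field at low cost. The block order, group count, and dilation are assumptions, not PuriLight's actual design.

```python
# Speculative sketch of a Shuffle-Dilation Convolution block (layout assumed).
import torch
import torch.nn as nn

class SDCBlock(nn.Module):
    def __init__(self, channels, groups=4, dilation=2):
        super().__init__()
        self.groups = groups                                     # channels % groups == 0
        self.dw = nn.Conv2d(channels, channels, 3, padding=dilation,
                            dilation=dilation, groups=channels)  # dilated depthwise
        self.pw = nn.Conv2d(channels, channels, 1)               # pointwise mixing
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.view(b, self.groups, c // self.groups, h, w)       # channel
        x = x.transpose(1, 2).reshape(b, c, h, w)                # shuffle
        return self.act(self.pw(self.dw(x)))
```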
https://arxiv.org/abs/2602.11066
Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this perceptual pathway is key to building natural full-duplex interactive systems. We introduce a framework that models this process as multi-level perception, and then reasons over conversational behaviors via a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a high quality corpus that pairs controllable, event-rich dialogue data with human-annotated labels. The GoT framework structures streaming predictions as an evolving graph, enabling a transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.
https://arxiv.org/abs/2602.11065
Graphs are foundational across domains but remain hard to use without deep expertise. LLMs promise accessible natural language (NL) graph analytics, yet they fail to process industry-scale property graphs effectively and efficiently: such datasets are large, highly heterogeneous, structurally complex, and evolve dynamically. To address this, we devise a novel abstraction for complex multi-query analytics over such graphs. Its key idea is to replace brittle generation of graph queries directly from NL with planning over a Semantic Catalog that describes both the graph schema and the graph operations. Concretely, this induces a clean separation between a Semantic Plane for LLM planning and broader reasoning, and an Execution Plane for deterministic, database-grade query execution over the full dataset and tool implementations. This design yields substantial gains in both token efficiency and task effectiveness even with small-context LLMs. We use this abstraction as the basis of the first LLM-enhanced graph analytics framework called GraphSeek. GraphSeek achieves substantially higher success rates (e.g., 86% over enhanced LangChain) and points toward the next generation of affordable and accessible graph analytics that unify LLM reasoning with database-grade execution over large and complex property graphs.
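As a rough illustration of the Semantic Plane / Execution Plane split, the sketch below assumes a catalog exposing only the graph schema and operation signatures to the LLM planner, while a deterministic executor resolves each plan step against registered implementations. The catalog contents, operation names, and plan format are all hypothetical, not GraphSeek's actual interfaces.

```python
# Hypothetical Semantic Catalog: the LLM plans over this description alone;
# it never sees the full dataset.
SEMANTIC_CATALOG = {
    "schema": {"nodes": {"Person": ["name", "age"]},
               "edges": {"KNOWS": ["since"]}},
    "operations": {"match_nodes": "match_nodes(label, prop, value) -> ids",
                   "expand": "expand(ids, edge_label) -> ids"},
}

def execute_plan(plan, registry):
    """plan: [(op_name, kwargs), ...] emitted by the LLM planner;
    registry: database-grade implementations keyed by op_name."""
    result = None
    for op_name, kwargs in plan:
        fn = registry[op_name]           # deterministic, full-dataset execution
        result = fn(result, **kwargs)
    return result
```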
https://arxiv.org/abs/2602.11052
Despite emerging research on Language Models (LMs), few approaches analyse the invertibility of LMs. That is, given an LM and a desirable target output sequence of tokens, determining what input prompts would yield the target output remains an open problem. We formulate this problem as a classical gradient-based optimisation. First, we propose a simple algorithm to achieve end-to-end differentiability of a given (frozen) LM, and then find optimised prompts via gradient descent. Our central insight is to view LMs as functions operating on sequences of distributions over tokens (rather than the traditional view as functions on sequences of tokens). Our experiments and ablations demonstrate that our DLM-powered inversion can reliably and efficiently optimise prompts of lengths $10$ and $80$ for targets of length $20$, for several white-box LMs (out-of-the-box).
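A minimal sketch of the distribution-space view, assuming a HuggingFace-style white-box causal LM (gpt2 and the target string are only illustrative stand-ins): the prompt is a learnable matrix of logits over the vocabulary, relaxed to token distributions by softmax and mapped to expected embeddings, so the target's cross-entropy is differentiable in the prompt and can be minimized by plain gradient descent.

```python
# Hedged sketch: gradient-based prompt inversion over token distributions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")   # frozen white-box LM
tok = AutoTokenizer.from_pretrained("gpt2")
model.eval()
for p in model.parameters():
    p.requires_grad_(False)

target_ids = tok("the quick brown fox", return_tensors="pt").input_ids
prompt_len, vocab = 10, model.config.vocab_size

# Prompt as a sequence of distributions over tokens (learnable relaxed logits).
prompt_logits = torch.randn(1, prompt_len, vocab, requires_grad=True)
opt = torch.optim.Adam([prompt_logits], lr=0.1)
emb = model.get_input_embeddings().weight              # (vocab, d)

for step in range(500):
    probs = F.softmax(prompt_logits, dim=-1)           # distributions over tokens
    prompt_emb = probs @ emb                           # expected embeddings
    target_emb = emb[target_ids]                       # teacher-forced target
    inputs = torch.cat([prompt_emb, target_emb], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Each target token is predicted from the position just before it.
    pred = logits[:, prompt_len - 1 : prompt_len - 1 + target_ids.size(1)]
    loss = F.cross_entropy(pred.reshape(-1, vocab), target_ids.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```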
https://arxiv.org/abs/2602.11044
Background: Subtle changes in spontaneous language production are among the earliest indicators of cognitive decline. Identifying linguistically interpretable markers of dementia can support transparent and clinically grounded screening approaches. Methods: This study analyzes spontaneous speech transcripts from the DementiaBank Pitt Corpus using three linguistic representations: raw cleaned text, a part-of-speech (POS)-enhanced representation combining lexical and grammatical information, and a POS-only syntactic representation. Logistic regression and random forest models were evaluated under two protocols: transcript-level train-test splits and subject-level five-fold cross-validation to prevent speaker overlap. Model interpretability was examined using global feature importance, and statistical validation was conducted using Mann-Whitney U tests with Cliff's delta effect sizes. Results: Across representations, models achieved stable performance, with syntactic and grammatical features retaining strong discriminative power even in the absence of lexical content. Subject-level evaluation yielded more conservative but consistent results, particularly for POS-enhanced and POS-only representations. Statistical analysis revealed significant group differences in functional word usage, lexical diversity, sentence structure, and discourse coherence, aligning closely with machine learning feature importance findings. Conclusion: The results demonstrate that abstract linguistic features capture robust markers of early cognitive decline under clinically realistic evaluation. By combining interpretable machine learning with non-parametric statistical validation, this study supports the use of linguistically grounded features for transparent and reliable language-based cognitive screening.
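The statistical validation step uses standard tools; a small sketch of the Mann-Whitney U test paired with a Cliff's delta estimate for a single linguistic feature is shown below. The feature values are illustrative, not drawn from the Pitt Corpus.

```python
# Sketch: Mann-Whitney U test with Cliff's delta effect size for one feature.
import numpy as np
from scipy.stats import mannwhitneyu

def cliffs_delta(x, y):
    """delta = P(x > y) - P(x < y), estimated over all pairs."""
    x, y = np.asarray(x), np.asarray(y)
    greater = (x[:, None] > y[None, :]).sum()
    less = (x[:, None] < y[None, :]).sum()
    return (greater - less) / (len(x) * len(y))

# Illustrative feature values (e.g., type-token ratio per transcript).
dementia = np.array([0.41, 0.38, 0.45, 0.36, 0.40])
control = np.array([0.52, 0.49, 0.55, 0.47, 0.51])

u, p = mannwhitneyu(dementia, control, alternative="two-sided")
print(f"U={u:.1f}, p={p:.4f}, Cliff's delta={cliffs_delta(dementia, control):+.2f}")
```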
https://arxiv.org/abs/2602.11028
Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning, yet growing evidence indicates an exploration ceiling: it often reweights existing solution traces rather than discovering new strategies, limiting gains under large sampling budgets (e.g., pass-at-256). We address this limitation with PSN-RLVR, which perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration that better preserves long-horizon chain-of-thought coherence than action-space noise. To mitigate the resulting sampling-update mismatch, we incorporate truncated importance sampling (TIS). To avoid expensive KL-based adaptive noise control, we propose a computationally efficient real-time adaptive noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty. Instantiated on GRPO, a widely used RLVR method, PSN-GRPO consistently expands the effective reasoning capability boundary across multiple mathematical reasoning benchmarks and model families, yielding higher pass-at-k under large sampling budgets and outperforming prior exploration-oriented RLVR methods (e.g., Pass-at-k-style training) while remaining orthogonal and thus composable for additional gains.
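A hedged sketch of the two moving parts the abstract names, parameter-space noise before rollouts and truncated importance sampling, is given below. The noise scale, clipping constant, and the simple importance-weighted REINFORCE surrogate are illustrative; the paper's actual objective builds on GRPO.

```python
# Sketch: parameter-space noise + truncated importance sampling (TIS).
# Rollouts come from a noise-perturbed copy of the policy; updates apply to
# the clean policy with a clipped ratio correcting the sampling mismatch.
import torch

def perturb_parameters(policy, sigma=0.01):
    """Add Gaussian parameter-space noise before rollout generation."""
    with torch.no_grad():
        for p in policy.parameters():
            p.add_(sigma * torch.randn_like(p))

def tis_surrogate(logp_clean, logp_noisy, advantages, c=2.0):
    """Importance-weighted REINFORCE surrogate with a truncated ratio.
    logp_*: per-token log-probs of sampled tokens; advantages: per-token."""
    ratio = torch.exp(logp_clean - logp_noisy).detach()  # pi_theta / pi_noisy
    ratio = torch.clamp(ratio, max=c)                    # truncation bounds variance
    return -(ratio * advantages * logp_clean).mean()
```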
https://arxiv.org/abs/2602.02555
Accurate counting of surgical instruments in Operating Rooms (ORs) is a critical prerequisite for ensuring patient safety during surgery. Despite recent progress in large visual-language models and agentic AI, accurately counting such instruments remains highly challenging, particularly in dense scenarios where instruments are tightly clustered. To address this problem, we introduce Chain-of-Look, a novel visual reasoning framework that mimics the sequential human counting process by enforcing a structured visual chain, rather than relying on classic object detection, which is unordered. This visual chain guides the model to count along a coherent spatial trajectory, improving accuracy in complex scenes. To further enforce the physical plausibility of the visual chain, we introduce the neighboring loss function, which explicitly models the spatial constraints inherent to densely packed surgical instruments. We also present SurgCount-HD, a new dataset comprising 1,464 high-density surgical instrument images. Extensive experiments demonstrate that our method outperforms state-of-the-art counting approaches (e.g., CountGD, REC) as well as multimodal large language models (e.g., Qwen, ChatGPT) in the challenging task of dense surgical instrument counting.
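The abstract does not spell out the neighboring loss, so the following is one plausible form under its stated intent: penalize consecutive points in the predicted visual chain for drifting farther apart than a neighborhood radius, encouraging a spatially coherent counting trajectory. The radius and normalization are assumptions.

```python
# Hedged sketch of a neighboring loss over the ordered visual chain.
import torch

def neighboring_loss(chain_xy, radius=0.05):
    """chain_xy: (N, 2) ordered instrument centers, normalized to [0, 1]."""
    steps = chain_xy[1:] - chain_xy[:-1]        # consecutive displacements
    dists = steps.norm(dim=-1)
    return torch.relu(dists - radius).mean()    # penalize only large gaps
```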
https://arxiv.org/abs/2602.11024
Developing world models that understand complex physical interactions is essential for advancing robotic planning and manipulation. However, existing methods often struggle to accurately model the environment under conditions of data scarcity and complex contact-rich dynamics. To address these challenges, we propose ContactGaussian-WM, a differentiable physics-grounded rigid-body world model capable of learning intricate physical laws directly from sparse and contact-rich videos. The framework consists of two core components: (1) a unified Gaussian representation for both visual appearance and collision geometry, and (2) an end-to-end differentiable learning framework that differentiates through a closed-form physics engine to infer physical properties from sparse visual observations. Extensive simulations and real-world evaluations demonstrate that ContactGaussian-WM outperforms state-of-the-art methods in learning complex scenarios, exhibiting robust generalization. Furthermore, we showcase the practical utility of our framework in downstream applications, including data synthesis and real-time MPC.
https://arxiv.org/abs/2602.11021
This work addresses the problem of offline safe imitation learning (IL), where the goal is to learn safe, reward-maximizing policies from demonstrations that lack per-timestep safety cost or reward information. In many real-world domains, online learning in the environment can be risky, and specifying accurate safety costs can be difficult. However, it is often feasible to collect trajectories that reflect undesirable or unsafe behavior, implicitly conveying what the agent should avoid. We refer to these as non-preferred trajectories. We propose a novel offline safe IL algorithm, OSIL, that infers safety from non-preferred demonstrations. We formulate safe policy learning as a Constrained Markov Decision Process (CMDP). Instead of relying on explicit safety cost and reward annotations, OSIL reformulates the CMDP problem by deriving a lower bound on the reward-maximizing objective and learning a cost model that estimates the likelihood of non-preferred behavior. Our approach allows agents to learn safe, reward-maximizing behavior entirely from offline demonstrations. We empirically demonstrate that our approach can learn safer policies that satisfy cost constraints without degrading reward performance, thus outperforming several baselines.
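A hedged sketch of the cost-model idea: a discriminator over (state, action) pairs trained so that non-preferred demonstrations score high, with its output serving as the per-step CMDP cost. The architecture and binary cross-entropy objective are illustrative choices, not necessarily OSIL's exact formulation.

```python
# Sketch: a cost model estimating the likelihood of non-preferred behavior.
import torch
import torch.nn as nn

class CostModel(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return torch.sigmoid(self.net(torch.cat([obs, act], dim=-1)))

def cost_loss(model, demo_batch, nonpref_batch):
    """Binary cross-entropy: non-preferred pairs are labeled 1 (costly).
    Each batch is an (obs, act) tensor pair."""
    c_demo = model(*demo_batch)
    c_non = model(*nonpref_batch)
    return -(torch.log(1 - c_demo + 1e-8).mean()
             + torch.log(c_non + 1e-8).mean())
```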
https://arxiv.org/abs/2602.11018
Transformer-based models dominate modern AI workloads but exacerbate memory bottlenecks due to their quadratic attention complexity and ever-growing model sizes. Existing accelerators, such as Groq and Cerebras, mitigate off-chip traffic with large on-chip caches, while algorithmic innovations such as FlashAttention fuse operators to avoid materializing large attention matrices. However, as off-chip traffic decreases, our measurements show that on-chip SRAM accesses account for over 60% of energy in long-sequence workloads, making cache access the new bottleneck. We propose 3D-Flow, a hybrid-bonded, 3D-stacked spatial accelerator that enables register-to-register communication across vertically partitioned PE tiers. Unlike 2D multi-array architectures limited by NoC-based router-to-router transfers, 3D-Flow leverages sub-10 µm vertical TSVs to sustain cycle-level operator pipelining with minimal overhead. On top of this architecture, we design 3D-FlashAttention, a fine-grained scheduling method that balances latency across tiers, forming a bubble-free vertical dataflow without on-chip SRAM round trips. Evaluations on Transformer workloads (OPT and QWEN models) show that our 3D spatial accelerator reduces energy consumption by 46-93% and achieves 1.4x-7.6x speedups compared to state-of-the-art 2D and 3D designs.
https://arxiv.org/abs/2602.11016
Formal privacy metrics provide compliance-oriented guarantees but often fail to quantify actual linkability in released datasets. We introduce CVPL (Cluster-Vector-Projection Linkage), a geometric framework for post-hoc assessment of linkage risk between original and protected tabular data. CVPL represents linkage analysis as an operator pipeline comprising blocking, vectorization, latent projection, and similarity evaluation, yielding continuous, scenario-dependent risk estimates rather than binary compliance verdicts. We formally define CVPL under an explicit threat model and introduce threshold-aware risk surfaces, R(lambda, tau), that capture the joint effects of protection strength and attacker strictness. We establish a progressive blocking strategy with monotonicity guarantees, enabling anytime risk estimation with valid lower bounds. We demonstrate that the classical Fellegi-Sunter linkage emerges as a special case of CVPL under restrictive assumptions, and that violations of these assumptions can lead to systematic over-linking bias. Empirical validation on 10,000 records across 19 protection configurations demonstrates that formal k-anonymity compliance may coexist with substantial empirical linkability, with a significant portion arising from non-quasi-identifier behavioral patterns. CVPL provides interpretable diagnostics identifying which features drive linkage feasibility, supporting privacy impact assessment, protection mechanism comparison, and utility-risk trade-off analysis.
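A minimal sketch of the similarity-evaluation stage and the threshold-aware risk surface, assuming both tables have already passed through blocking, vectorization, and latent projection, and that protected row i originates from original row i: link each protected record to its nearest original by cosine similarity and report the fraction of correct links clearing the attacker threshold tau.

```python
# Hedged sketch of the final CVPL similarity stage and R(lambda, tau).
import numpy as np

def linkage_risk(original_vecs, protected_vecs, tau):
    """Rows are record embeddings; row i of each matrix is the same entity."""
    a = original_vecs / np.linalg.norm(original_vecs, axis=1, keepdims=True)
    b = protected_vecs / np.linalg.norm(protected_vecs, axis=1, keepdims=True)
    sims = b @ a.T                              # cosine similarity matrix
    nearest = sims.argmax(axis=1)               # attacker's best guess per record
    best = sims.max(axis=1)
    correct = nearest == np.arange(len(b))      # true-pair linkage
    return float(np.mean(correct & (best >= tau)))

# Sweeping tau, and regenerating protected_vecs per protection strength lambda,
# traces out the risk surface R(lambda, tau).
```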
https://arxiv.org/abs/2602.11015
We present ROCKET, a training-free model compression method that achieves state-of-the-art performance in comparison with factorization, structured-sparsification, and dynamic compression baselines. Operating under a global compression budget, ROCKET comprises two key innovations. First, it formulates layer-wise compression allocation as a multi-choice knapsack problem, selecting the optimal compression level for each layer to minimize total reconstruction error while adhering to a target model size. Second, it introduces a single-step sparse matrix factorization inspired by dictionary learning: using only a small calibration set, it sparsifies weight coefficients based on activation-weight sensitivity and then updates the dictionary in closed form via least squares, bypassing iterative optimization, sparse coding, and backpropagation entirely. ROCKET consistently outperforms existing compression approaches across different model architectures at 20-50% compression rates. Notably, it retains over 90% of the original model's performance at 30% compression without any fine-tuning. Moreover, when applying a light fine-tuning phase, recovery is substantially enhanced: for instance, compressing Qwen3-14B to an 8B-parameter model and healing it with just 30 million tokens yields performance nearly on par with the original Qwen3-8B. The code for ROCKET is at this http URL.
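The allocation step is a textbook multi-choice knapsack; a small dynamic-programming sketch is shown below, with layer sizes assumed to be discretized to integers (e.g., MB) and per-level reconstruction errors taken as given. Backpointers for recovering the chosen level per layer are omitted for brevity.

```python
# Sketch: multi-choice knapsack -- each layer picks exactly one compression
# level (size, error); minimize total error under a global size budget.
def allocate(layers, budget):
    """layers: per-layer option lists [(size, error), ...] with integer sizes."""
    INF = float("inf")
    dp = [0.0] + [INF] * budget          # dp[s] = min error at total size s
    for options in layers:
        new = [INF] * (budget + 1)
        for s in range(budget + 1):
            if dp[s] == INF:
                continue
            for size, err in options:   # each layer must pick one level
                t = s + size
                if t <= budget and dp[s] + err < new[t]:
                    new[t] = dp[s] + err
        dp = new
    best = min(range(budget + 1), key=lambda s: dp[s])
    return best, dp[best]                # chosen total size, total error

# e.g., allocate([[(4, 0.9), (6, 0.3)], [(3, 0.5), (5, 0.1)]], 10) -> (9, 0.8)
```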
https://arxiv.org/abs/2602.11008
Query-based 3D scene instance segmentation from point clouds has attained notable performance. However, existing methods suffer from a query initialization dilemma due to the sparse nature of point clouds, and rely on computationally intensive attention mechanisms in query decoders. We accordingly introduce LaSSM, prioritizing simplicity and efficiency while maintaining competitive performance. Specifically, we propose a hierarchical semantic-spatial query initializer that derives the query set from superpoints by considering both semantic cues and spatial distribution, achieving comprehensive scene coverage and accelerated convergence. We further present a coordinate-guided state space model (SSM) decoder that progressively refines queries. The novel decoder features a local aggregation scheme that restricts the model to focus on geometrically coherent regions, and a spatial dual-path SSM block that captures underlying dependencies within the query set by integrating associated coordinate information. Our design enables efficient instance prediction, avoiding the incorporation of noisy information and reducing redundant computation. LaSSM ranks first on the latest ScanNet++ V2 leaderboard, outperforming the previous best method by 2.5% mAP with only 1/3 of the FLOPs, demonstrating its superiority in challenging large-scale scene instance segmentation. LaSSM also achieves competitive performance on the ScanNet, ScanNet200, S3DIS, and ScanNet++ V1 benchmarks at lower computational cost. Extensive ablation studies and qualitative results validate the effectiveness of our design. The code and weights are available at this https URL.
https://arxiv.org/abs/2602.11007
Monocular depth estimation is a central problem in computer vision with applications in robotics, AR, and autonomous driving, yet the self-attention mechanisms that drive modern Transformer architectures remain opaque. We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), providing the first spectrally structured formulation of attention for dense prediction tasks. SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions, enabling attention maps that are intrinsically interpretable rather than post-hoc approximations. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead. More importantly, SVDA unlocks six spectral indicators that quantify entropy, rank, sparsity, alignment, selectivity, and robustness. These reveal consistent cross-dataset and depth-wise patterns in how attention organizes during training, insights that remain inaccessible in standard Transformers. By shifting the role of attention from opaque mechanism to quantifiable descriptor, SVDA redefines interpretability in monocular depth estimation and opens a principled avenue toward transparent dense prediction models.
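A hedged single-head sketch of the SVDA operator as the abstract describes it: queries and keys are L2-normalized so their dot product captures directional alignment, and a learnable diagonal re-weights channels as a spectral modulation before the softmax. The full DPT integration, multi-head layout, and exact normalization are the paper's; this only shows the core operator.

```python
# Sketch: SVD-Inspired Attention with a learnable diagonal (single head).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVDAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.sigma = nn.Parameter(torch.ones(dim))   # learnable diagonal

    def forward(self, x):                            # x: (batch, tokens, dim)
        q = F.normalize(self.q(x), dim=-1)           # unit-norm directions
        k = F.normalize(self.k(x), dim=-1)
        scores = (q * self.sigma) @ k.transpose(-2, -1)  # q diag(sigma) k^T
        return F.softmax(scores, dim=-1) @ self.v(x)
```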
https://arxiv.org/abs/2602.11005
Autonomous vehicles (AVs) rely on sensors and deep neural networks (DNNs) to perceive their surrounding environment and make maneuver decisions in real time. However, achieving real-time DNN inference in the AV's perception pipeline is challenging due to the large gap between the computation requirement and the AV's limited resources. Most, if not all, existing studies focus on optimizing DNN inference time to achieve faster perception by compressing the DNN model with pruning and quantization. In contrast, we present a Predictable Perception system with DNNs (PP-DNN) that reduces the amount of image data to be processed, while maintaining the same level of accuracy for multi-tenant DNNs, by dynamically selecting critical frames and regions of interest (ROIs). PP-DNN is based on our key insight that critical frames and ROIs for AVs vary with the AV's surrounding environment. However, it is challenging to identify and use critical frames and ROIs in multi-tenant DNNs for predictable inference. Given image-frame streams, PP-DNN leverages an ROI generator to identify critical frames and ROIs based on the similarities of consecutive frames and traffic scenarios. PP-DNN then leverages a FLOPs predictor to predict multiply-accumulate operations (MACs) from the dynamic critical frames and ROIs. The ROI scheduler coordinates the processing of critical frames and ROIs with multiple DNN models. Finally, we design a detection predictor for the perception of non-critical frames. We have implemented PP-DNN in an ROS-based AV pipeline and evaluated it with the BDD100K and nuScenes datasets. PP-DNN is observed to significantly enhance perception predictability, increasing the number of fusion frames by up to 7.3x, reducing the fusion delay by >2.6x and fusion-delay variations by >2.3x, and improving detection completeness by 75.4% and cost-effectiveness by up to 98% over the baseline.
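A rough sketch of the critical-frame idea: flag a frame as critical when it differs enough from the last processed frame, otherwise reuse prior results. The grayscale mean-difference score and threshold below are illustrative stand-ins for PP-DNN's similarity and traffic-scenario criteria.

```python
# Hedged sketch: select critical frames by consecutive-frame dissimilarity.
import numpy as np

def select_critical(frames, threshold=0.1):
    """frames: iterable of HxW grayscale uint8 arrays; returns critical indices."""
    critical, last = [], None
    for i, f in enumerate(frames):
        if last is None:
            critical.append(i); last = f; continue
        diff = np.abs(f.astype(np.float32) - last.astype(np.float32))
        if diff.mean() / 255.0 > threshold:     # normalized mean change
            critical.append(i); last = f        # non-critical frames are skipped
    return critical
```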
https://arxiv.org/abs/2602.11004
Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs) demonstrate strong capabilities in general sequential code generation, they face significant challenges in GPU code generation because of the scarcity of high-quality labeled training data, compiler biases when generating synthetic solutions, and limited generalization across hardware generations. This precludes supervised fine-tuning (SFT) as a scalable methodology for improving current LLMs. In contrast, reinforcement learning (RL) offers a data-efficient and adaptive alternative but requires access to relevant tools, careful selection of training problems, and a robust evaluation environment. We present Makora's environment and tools for reinforcement learning finetuning of frontier models and report our results from fine-tuning GPT-5 for Triton code generation. In the single-attempt setting, our fine-tuned model improves kernel correctness from 43.7% to 77.0% (+33.3 percentage points) and increases the fraction of problems outperforming TorchInductor from 14.8% to 21.8% (+7 percentage points) compared to baseline GPT-5, while exceeding prior state-of-the-art models on KernelBench. When integrated into a full coding agent, it is able to solve up to 97.4% of problems in an expanded KernelBench suite, outperforming the PyTorch TorchInductor compiler on 72.9% of problems with a geometric mean speedup of 2.12x. Our work demonstrates that targeted post-training with reinforcement learning can unlock LLM capabilities in highly specialized technical domains where traditional supervised learning is limited by data availability, opening new pathways for AI-assisted accelerator programming.
https://arxiv.org/abs/2602.11000
Agentic coding requires agents to effectively interact with runtime environments, e.g., command-line interfaces (CLIs), to complete tasks like resolving dependency issues, fixing system problems, etc. However, how such environment-intensive tasks can be obtained at scale to enhance agents' capabilities remains underexplored. To address this, based on an analogy between the Dockerfile and the agentic task, we propose to employ agents to simulate and explore environment histories, guided by execution feedback. By tracing the history of a healthy environment, its state can be inverted to an earlier one with runtime failures, from which a task can be derived by packing the buggy state and the corresponding error messages. With our method, named CLI-Gym, a total of 1,655 environment-intensive tasks are derived, the largest collection of its kind. Moreover, with curated successful trajectories, our fine-tuned model, named LiberCoder, achieves a substantial absolute improvement of +21.1% (to 46.1%) on Terminal-Bench, outperforming various strong baselines. To our knowledge, this is the first public pipeline for scalable derivation of environment-intensive tasks.
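A hedged sketch of the history-inversion recipe using the standard docker CLI: rebuild an image from a truncated Dockerfile (an earlier, potentially broken environment state), run a verification command, and pack any failure into a task record. The function and record format are hypothetical; CLI-Gym's actual pipeline is agent-driven and feedback-guided.

```python
# Hedged sketch: derive an environment-intensive task by inverting a healthy
# Dockerfile to an earlier state and capturing the resulting runtime failure.
import subprocess

def derive_task(dockerfile_lines, cut, verify_cmd, tag="cli-gym-task"):
    truncated = "\n".join(dockerfile_lines[:cut])    # earlier environment state
    subprocess.run(["docker", "build", "-t", tag, "-"],
                   input=truncated.encode(), check=True)  # Dockerfile via stdin
    proc = subprocess.run(["docker", "run", "--rm", tag] + verify_cmd,
                          capture_output=True, text=True)
    if proc.returncode == 0:
        return None                                  # no failure, no task
    return {"image": tag, "command": verify_cmd,
            "error": proc.stderr.strip()}            # buggy state + error message
```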
https://arxiv.org/abs/2602.10999
Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. In this work, we adapt our previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, introducing a geometrically grounded formulation that enhances interpretability, sparsity, and spectral structure. We apply the interpretability indicators -- originally proposed with SVDA -- to monitor attention dynamics during training and assess structural properties of the learned representations. Experimental evaluations on four widely used benchmarks -- CIFAR-10, FashionMNIST, CIFAR-100, and ImageNet-100 -- demonstrate that SVDA consistently yields more interpretable attention patterns without sacrificing classification accuracy. While the current framework offers descriptive insights rather than prescriptive guidance, our results establish SVDA as a comprehensive and informative tool for analyzing and developing structured attention models in computer vision. This work lays the foundation for future advances in explainable AI, spectral diagnostics, and attention-based model compression.
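Two such indicators can be computed directly on a row-stochastic attention map; the definitions below (mean row entropy and SVD-based effective rank) are common choices and may differ from the paper's exact formulas.

```python
# Sketch: two spectral indicators on one attention map A (rows sum to 1).
import numpy as np

def attention_entropy(A, eps=1e-12):
    """Mean Shannon entropy of the attention rows (higher = more diffuse)."""
    return float(-(A * np.log(A + eps)).sum(axis=-1).mean())

def effective_rank(A, eps=1e-12):
    """Exponential of the entropy of the normalized singular-value spectrum."""
    s = np.linalg.svd(A, compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-(p * np.log(p + eps)).sum()))
```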
https://arxiv.org/abs/2602.10994