LLM-based approaches have recently achieved impressive results in zero-shot stance detection. However, they still struggle in complex real-world scenarios, where stance understanding requires dynamic background knowledge, target definitions involve compound entities or events that must be explicitly linked to stance labels, and rhetorical devices such as irony often obscure the author's actual intent. To address these challenges, we propose MSME, a Multi-Stage, Multi-Expert framework for zero-shot stance detection. MSME consists of three stages: (1) Knowledge Preparation, where relevant background knowledge is retrieved and stance labels are clarified; (2) Expert Reasoning, involving three specialized modules: a Knowledge Expert distills salient facts and reasons from a knowledge perspective, a Label Expert refines stance labels and reasons accordingly, and a Pragmatic Expert detects rhetorical cues such as irony to infer intent from a pragmatic angle; and (3) Decision Aggregation, where a Meta-Judge integrates all expert analyses to produce the final stance prediction. Experiments on three public datasets show that MSME achieves state-of-the-art performance across the board.
https://arxiv.org/abs/2512.04492
Deep neural networks (DNNs) have made significant strides in Natural Language Processing (NLP), yet their interpretability remains elusive, particularly when evaluating their intricate decision-making processes. Traditional methods often rely on post-hoc interpretations, such as saliency maps or feature visualization, which might not be directly applicable to the discrete nature of word data in NLP. Addressing this, we introduce the Model-agnostic Saliency Estimation (MASE) framework. MASE offers local explanations for text-based predictive models without necessitating in-depth knowledge of a model's internal architecture. By leveraging Normalized Linear Gaussian Perturbations (NLGP) on the embedding layer instead of raw word inputs, MASE efficiently estimates input saliency. Our results indicate MASE's superiority over other model-agnostic interpretation methods, especially in terms of Delta Accuracy, positioning it as a promising tool for elucidating the operations of text-based models in NLP.
https://arxiv.org/abs/2512.04386
Hate speech in online videos poses an increasingly serious threat to digital platforms, especially as video content becomes more multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning, a structured three-stage process in which a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences, providing complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.
https://arxiv.org/abs/2512.02743
Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9% pruning rate, while delivering an end-to-end inference speedup.
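The centrifugal, near-to-far selection can be sketched as a tiny greedy loop. This is only our reading of the paradigm, not the released algorithm: the scoring inputs, the distance penalty standing in for the BSS criterion, and the function name are all illustrative assumptions.

```python
import numpy as np

def centrifugal_select(scores, positions, n_keep, dist_penalty=1.0):
    """Toy near-to-far token selection: seed with the highest-scoring token,
    then repeatedly pick the best remaining token after discounting
    candidates by their distance to the already-selected set, which defers
    (buffers) spatially distant tokens rather than excluding them."""
    selected = [int(np.argmax(scores))]
    remaining = set(range(len(scores))) - set(selected)
    while len(selected) < n_keep and remaining:
        sel_pos = positions[selected]
        best, best_val = None, -np.inf
        for i in remaining:
            # Distance to the nearest kept token penalises far candidates.
            d = np.linalg.norm(positions[i] - sel_pos, axis=1).min()
            val = scores[i] - dist_penalty * d
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
        remaining.remove(best)
    return selected
```

With the penalty enabled, a high-scoring but spatially distant token is deferred in favour of tokens near the seed; with the penalty at zero, the loop degenerates to plain importance ranking.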
https://arxiv.org/abs/2512.02700
Recent advances in object-centric representation learning have shown that slot attention-based methods can effectively decompose visual scenes into object slot representations without supervision. However, existing approaches typically process foreground and background regions indiscriminately, often resulting in background interference and suboptimal instance discovery performance on real-world data. To address this limitation, we propose Foreground-Aware Slot Attention (FASA), a two-stage framework that explicitly separates foreground from background to enable precise object discovery. In the first stage, FASA performs a coarse scene decomposition to distinguish foreground from background regions through a dual-slot competition mechanism. These slots are initialized via a clustering-based strategy, yielding well-structured representations of salient regions. In the second stage, we introduce a masked slot attention mechanism where the first slot captures the background while the remaining slots compete to represent individual foreground objects. To further address over-segmentation of foreground objects, we incorporate pseudo-mask guidance derived from a patch affinity graph constructed with self-supervised image features to guide the learning of foreground slots. Extensive experiments on both synthetic and real-world datasets demonstrate that FASA consistently outperforms state-of-the-art methods, validating the effectiveness of explicit foreground modeling and pseudo-mask guidance for robust scene decomposition and object-coherent representation. Code will be made publicly available.
https://arxiv.org/abs/2512.02685
While recent advances in deep learning for surgical scene segmentation have demonstrated promising results on single-centre, single-imaging-modality data, these methods usually do not generalise to unseen distributions (i.e., data from other centres) or unseen modalities. Generalisation to out-of-distribution (OOD) data and across domain gaps caused by modality changes has been widely researched, but mostly for natural scene data. These methods cannot be directly applied to surgical scenes, which offer limited visual cues and often far more diverse scenarios than natural scene data. Inspired by this work on pushing generalisability to OOD data in natural scenes, we hypothesise that exploiting the style and content information in surgical scenes can minimise appearance variability, making the representation less sensitive to sudden changes such as blood or imaging artefacts. We achieve this by performing instance normalisation and feature covariance mapping for robust and generalisable feature representations. Further, to eliminate the risk of removing salient feature representations associated with the objects of interest, we introduce a restitution module within the ResNet feature-learning backbone that enables the retention of useful task-relevant features. To tackle the lack of multiclass and multicentre data for surgical scene segmentation, we also provide a newly curated dataset that can be vital for addressing generalisability in this domain. When trained on CholecSeg8K and evaluated on the unseen-centre HeiCholSeg dataset, our proposed RobustSurg obtains a nearly 23% improvement in mean IoU over the DeepLabv3+ baseline and 10-32% improvements over SOTA methods. Similarly, RobustSurg obtains a nearly 22% improvement over the baseline and a nearly 11% improvement over a recent SOTA method on the target set of the EndoUDA polyp dataset.
https://arxiv.org/abs/2512.02188
Contact-rich robotic manipulation requires representations that encode local geometry. Vision provides global context but lacks direct measurements of properties such as texture and hardness, whereas touch supplies these cues. Modern visuo-tactile sensors capture both modalities in a single fused image, yielding intrinsically aligned inputs that are well suited to manipulation tasks requiring visual and tactile information. Most self-supervised learning (SSL) frameworks, however, compress feature maps into a global vector, discarding spatial structure and misaligning with the needs of manipulation. To address this, we propose SARL, a spatially-aware SSL framework that augments the Bootstrap Your Own Latent (BYOL) architecture with three map-level objectives: Saliency Alignment (SAL), Patch-Prototype Distribution Alignment (PPDA), and Region Affinity Matching (RAM), which keep attentional focus, part composition, and geometric relations consistent across views. These losses act on intermediate feature maps, complementing the global objective. SARL consistently outperforms nine SSL baselines across six downstream tasks with fused visual-tactile data. On the geometry-sensitive edge-pose regression task, SARL achieves a Mean Absolute Error (MAE) of 0.3955, a 30% relative improvement over the next-best SSL method (0.5682 MAE), approaching the supervised upper bound. These findings indicate that, for fused visual-tactile data, the most effective signal is structured spatial equivariance, in which features vary predictably with object geometry, enabling more capable robotic perception.
https://arxiv.org/abs/2512.01908
Generative models, like flows and diffusions, have recently emerged as popular and efficacious policy parameterizations in robotics. There has been much speculation as to the factors underlying their successes, ranging from capturing multi-modal action distribution to expressing more complex behaviors. In this work, we perform a comprehensive evaluation of popular generative control policies (GCPs) on common behavior cloning (BC) benchmarks. We find that GCPs do not owe their success to their ability to capture multi-modality or to express more complex observation-to-action mappings. Instead, we find that their advantage stems from iterative computation, as long as intermediate steps are supervised during training and this supervision is paired with a suitable level of stochasticity. As a validation of our findings, we show that a minimum iterative policy (MIP), a lightweight two-step regression-based policy, essentially matches the performance of flow GCPs, and often outperforms distilled shortcut models. Our results suggest that the distribution-fitting component of GCPs is less salient than commonly believed, and point toward new design spaces focusing solely on control performance. Project page: this https URL
https://arxiv.org/abs/2512.01809
In this paper, we introduce SAM3-UNet, a simplified variant of Segment Anything Model 3 (SAM3), designed to adapt SAM3 for downstream tasks at a low cost. Our SAM3-UNet consists of three components: a SAM3 image encoder, a simple adapter for parameter-efficient fine-tuning, and a lightweight U-Net-style decoder. Preliminary experiments on multiple tasks, such as mirror detection and salient object detection, demonstrate that the proposed SAM3-UNet outperforms the prior SAM2-UNet and other state-of-the-art methods, while requiring less than 6 GB of GPU memory during training with a batch size of 12. The code is publicly available at this https URL.
https://arxiv.org/abs/2512.01789
Accurate segmentation of retinal vessels is crucial for the clinical diagnosis of numerous ophthalmic and systemic diseases. However, traditional Convolutional Neural Network (CNN) methods exhibit inherent limitations, struggling to capture long-range dependencies and complex nonlinear relationships. To address the above limitations, an Adaptive Dual Branch Kolmogorov-Arnold UNet (DB-KAUNet) is proposed for retinal vessel segmentation. In DB-KAUNet, we design a Heterogeneous Dual-Branch Encoder (HDBE) that features parallel CNN and Transformer pathways. The HDBE strategically interleaves standard CNN and Transformer blocks with novel KANConv and KAT blocks, enabling the model to form a comprehensive feature representation. To optimize feature processing, we integrate several critical components into the HDBE. First, a Cross-Branch Channel Interaction (CCI) module is embedded to facilitate efficient interaction of channel features between the parallel pathways. Second, an attention-based Spatial Feature Enhancement (SFE) module is employed to enhance spatial features and fuse the outputs from both branches. Building upon the SFE module, an advanced Spatial Feature Enhancement with Geometrically Adaptive Fusion (SFE-GAF) module is subsequently developed. In the SFE-GAF module, adaptive sampling is utilized to focus on true vessel morphology precisely. The adaptive process strengthens salient vascular features while significantly reducing background noise and computational overhead. Extensive experiments on the DRIVE, STARE, and CHASE_DB1 datasets validate that DB-KAUNet achieves leading segmentation performance and demonstrates exceptional robustness.
https://arxiv.org/abs/2512.01657
As Large Language Models (LLMs) continue to scale in parameter count, deploying them on commodity hardware has become increasingly challenging. Post-Training Quantization (PTQ) addresses this by reducing the precision of model weights, typically to 4-bit or lower. However, uniform quantization often leads to significant performance degradation due to the presence of "outlier features": weights that, while few in number, are critical for maintaining model accuracy. Current state-of-the-art methods such as AWQ (Activation-aware Weight Quantization) and SpQR (Sparse Quantization Representations) rely on calibration data to identify these salient weights via activation magnitudes or Hessian sensitivity. In scenarios where data privacy is paramount or calibration data is unavailable, these methods are inapplicable. In this work, we propose a data-free, structure-aware hypothesis: that the weights identified as Principal Components via Singular Value Decomposition (SVD) are intrinsically important to the model's downstream performance. We introduce a novel selection heuristic that preserves the top-$k$ weights aligned with the principal components in FP32, while aggressively quantizing the residual weights. We compare our method against activation-aware (AWQ) and second-order (SpQR) methods across GLUE benchmarks (MRPC, RTE, QNLI) using a DistilBERT backbone. Our experiments reveal that structural importance is highly correlated with functional importance. On the challenging RTE task, our SVD-based method achieves an accuracy of 66.06%, outperforming both AWQ (65.34%) and SpQR (65.34%) at high protection budgets, validating that intrinsic matrix structure can serve as a robust proxy for weight saliency without the need for forward passes or calibration data.
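The selection heuristic lends itself to a short sketch: score each weight by its magnitude in a rank-$k$ SVD reconstruction, keep the top fraction in FP32, and uniformly quantize the rest. Everything here (the function name, the rank-$k$ scoring rule, the symmetric uniform quantizer) is an illustrative assumption consistent with the description above, not the paper's exact procedure.

```python
import numpy as np

def svd_protect_quantize(W, k_components=4, protect_frac=0.02, n_bits=3):
    """Score weights by their alignment with the top singular directions,
    keep the highest-scoring fraction in full precision, and uniformly
    quantize the residual weights."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Rank-k reconstruction: weights with large magnitude here lie along
    # the principal components of W.
    W_pc = (U[:, :k_components] * S[:k_components]) @ Vt[:k_components]
    saliency = np.abs(W_pc)

    n_protect = max(1, int(protect_frac * W.size))
    thresh = np.partition(saliency.ravel(), -n_protect)[-n_protect]
    protect_mask = saliency >= thresh

    # Symmetric uniform quantization of the non-salient weights.
    levels = 2 ** (n_bits - 1) - 1
    scale = np.abs(W[~protect_mask]).max() / levels
    W_q = np.round(W / scale).clip(-levels, levels) * scale

    return np.where(protect_mask, W, W_q), protect_mask  # salient stay FP32
```

Since the protected entries are stored exactly, the reconstruction error can only be lower than quantizing every weight with the same grid.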
https://arxiv.org/abs/2512.01343
Streaming Video Large Language Models (VideoLLMs) have demonstrated impressive performance across various video understanding tasks, but they face significant challenges in real-time deployment due to the high computational cost of processing dense visual tokens from continuous video streams. In streaming video scenarios, the primary bottleneck lies in the Vision Transformer (ViT) encoding stage, where redundant processing of temporally similar frames leads to inefficiency. Additionally, inflated token sequences during LLM pre-filling further exacerbate latency and memory overhead. To address these challenges, we propose Streaming Token Compression (STC), a plug-and-play hierarchical framework that seamlessly integrates into existing streaming VideoLLMs, optimizing both ViT encoding and LLM pre-filling stages to accelerate processing. STC introduces two token-level accelerators: STC-Cacher, which reduces ViT encoding overhead by caching and reusing features from temporally similar frames, and STC-Pruner, which compresses the visual token sequence before it enters the LLM, preserving only the most salient tokens based on both spatial and temporal relevance. Extensive experiments on four baseline streaming VideoLLMs across five benchmarks demonstrate that STC outperforms other compression methods. Notably, STC retains up to 99% of accuracy on the ReKV framework while reducing ViT encoding latency and LLM pre-filling latency by 24.5% and 45.3%, respectively.
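The caching idea behind STC-Cacher, reusing features of temporally similar frames, can be illustrated with a minimal cache. The class name, the cosine-similarity test, and the single-frame cache are our assumptions; the real system caches ViT features inside the model.

```python
import numpy as np

class FrameFeatureCache:
    """If an incoming frame is close enough to the last encoded frame,
    reuse its cached features instead of re-running the (expensive)
    encoder; otherwise encode and refresh the cache."""

    def __init__(self, encoder, sim_threshold=0.95):
        self.encoder = encoder
        self.sim_threshold = sim_threshold
        self.last_frame = None
        self.cached_features = None
        self.encoder_calls = 0

    def encode(self, frame):
        if self.last_frame is not None:
            a, b = frame.ravel(), self.last_frame.ravel()
            cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
            if cos >= self.sim_threshold:
                return self.cached_features  # cache hit: skip the encoder
        self.encoder_calls += 1
        self.last_frame = frame
        self.cached_features = self.encoder(frame)
        return self.cached_features
```

On a stream of near-duplicate frames, only a small fraction of frames actually trigger an encoder call, which is where the ViT latency saving comes from.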
https://arxiv.org/abs/2512.00891
We introduce HBLLM, a wavelet-enhanced high-fidelity $1$-bit post-training quantization method for Large Language Models (LLMs). By leveraging Haar wavelet transforms to enhance expressive capacity through frequency decomposition, HBLLM significantly improves quantization fidelity while maintaining minimal overhead. This approach features two innovative structure-aware grouping strategies: (1) frequency-aware multi-parameter intra-row grouping and (2) $\ell_2$-norm-based saliency-driven column selection. For non-salient weights, a shared mean is employed across quantization groups within each frequency band to optimize storage efficiency. Experiments conducted on the OPT and LLaMA models demonstrate that HBLLM achieves state-of-the-art performance in $1$-bit quantization, attaining a perplexity of $6.71$ on LLaMA$2$-$13$B with an average weight storage of only $1.08$ bits. Code available at: this https URL.
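The wavelet step can be illustrated on a single weight row: one Haar level splits it into a low- and a high-frequency band, and each band is binarized with one shared magnitude, echoing the shared-mean idea for non-salient weights. This is a toy sketch under our own assumptions, not HBLLM's actual grouping or saliency scheme.

```python
import numpy as np

def haar_1bit_quantize(w):
    """One level of an orthonormal Haar transform, then 1-bit quantization
    per frequency band (sign times the band's mean absolute coefficient,
    i.e. a single shared scalar per band), then the inverse transform."""
    assert w.size % 2 == 0
    pairs = w.reshape(-1, 2)
    low = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)   # approximation band
    high = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)  # detail band

    def binarize(band):
        return np.sign(band) * np.abs(band).mean()   # 1 bit + shared scale

    low_q, high_q = binarize(low), binarize(high)
    # Inverse Haar transform back to the weight domain.
    a = (low_q + high_q) / np.sqrt(2)
    b = (low_q - high_q) / np.sqrt(2)
    return np.stack([a, b], axis=1).ravel()
```

Because the transform is orthonormal, binarizing in the wavelet domain preserves the correlation structure of the weights while storing only one bit plus one scalar per band.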
https://arxiv.org/abs/2512.00862
Long video understanding is essential for human-like intelligence, enabling coherent perception and reasoning over extended temporal contexts. While the emerging thinking-with-frames paradigm, which alternates between global temporal reasoning and local frame examination, has advanced the reasoning capabilities of video multi-modal large language models (MLLMs), it suffers from a significant efficiency bottleneck due to the progressively growing and redundant multi-modal context. To address this, we propose SpecTemp, a reinforcement learning-based Speculative Temporal reasoning framework that decouples temporal perception from reasoning via a cooperative dual-model design. In SpecTemp, a lightweight draft MLLM rapidly explores and proposes salient frames from densely sampled temporal regions, while a powerful target MLLM focuses on temporal reasoning and verifies the draft's proposals, iteratively refining its attention until convergence. This design mirrors the collaborative pathways of the human brain, balancing efficiency with accuracy. To support training, we construct the SpecTemp-80K dataset, featuring synchronized dual-level annotations for coarse evidence spans and fine-grained frame-level evidence. Experiments across multiple video understanding benchmarks demonstrate that SpecTemp not only maintains competitive accuracy but also significantly accelerates inference compared with existing thinking-with-frames methods.
https://arxiv.org/abs/2512.00805
Intelligent driving systems are vulnerable to physical adversarial attacks on traffic signs. These attacks can cause misclassification, leading to erroneous driving decisions that compromise road safety. Moreover, within V2X networks, such misinterpretations can propagate, inducing cascading failures that disrupt overall traffic flow and system stability. However, a key limitation of current physical attacks is their lack of stealth. Most methods apply perturbations to central regions of the sign, resulting in visually salient patterns that are easily detectable by human observers, thereby limiting their real-world practicality. This study proposes TESP-Attack, a novel stealth-aware adversarial patch method for traffic sign classification. Based on the observation that human visual attention primarily focuses on the central regions of traffic signs, we employ instance segmentation to generate edge-aligned masks that conform to the shape characteristics of the signs. A U-Net generator is utilized to craft adversarial patches, which are then optimized through color and texture constraints along with frequency domain analysis to achieve seamless integration with the background environment, resulting in highly effective visual concealment. The proposed method demonstrates outstanding attack success rates across traffic sign classification models with varied architectures, achieving over 90% under limited query budgets. It also exhibits strong cross-model transferability and maintains robust real-world performance that remains stable under varying angles and distances.
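The edge-aligned masks can be approximated with plain binary erosion; this construction (and both function names) is our own illustration, not the paper's instance-segmentation pipeline.

```python
import numpy as np

def erode(mask):
    """4-neighbour binary erosion with zero padding."""
    p = np.pad(mask, 1, mode="constant")
    return (mask & p[:-2, 1:-1] & p[2:, 1:-1]
                 & p[1:-1, :-2] & p[1:-1, 2:])

def edge_ring_mask(shape_mask, width=2):
    """Erode the sign's binary silhouette `width` times and keep only the
    eroded-away border, so a patch placed there hugs the sign's outline
    and stays out of the central region that viewers fixate on."""
    m = shape_mask.astype(bool)
    eroded = m
    for _ in range(width):
        eroded = erode(eroded)
    return m & ~eroded
```

The resulting ring conforms to the sign's shape regardless of whether the silhouette is circular, triangular, or octagonal, which matches the motivation for edge-aligned masks.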
https://arxiv.org/abs/2512.00765
Document understanding is a long-standing practical task. Vision Language Models (VLMs) have gradually become a primary approach in this domain, demonstrating effective performance on single-page tasks. However, their effectiveness diminishes when handling long documents. In such scenarios, clues are often scattered across multiple pages and modalities, and redundancy from lengthy inputs can impair the model's judgment. While retrieval-augmented generation mitigates this issue by filtering for question-relevant content, the retrieved results still contain substantial redundancy. To address these limitations, we propose SLEUTH, a multi-agent framework. Concretely, SLEUTH orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy. It ultimately synthesizes a distilled, evidence-dense multimodal context to generate the final prediction. SLEUTH is model-agnostic and scalable. When paired with advanced VLM backbones, it consistently improves performance on multiple long-document benchmarks, achieving state-of-the-art results. Ablation studies verify each module's effectiveness and confirm the benefits of our hierarchical refinement paradigm.
https://arxiv.org/abs/2511.22850
Spatial Transcriptomics (ST) merges the benefits of pathology images and gene expression, linking molecular profiles with tissue structure to analyze spot-level function comprehensively. Predicting gene expression from histology images is a cost-effective alternative to expensive ST technologies. However, existing methods mainly focus on spot-level image-to-gene matching but fail to leverage the full hierarchical structure of ST data, especially on the gene expression side, leading to incomplete image-gene alignment. Moreover, a challenge arises from the inherent information asymmetry: gene expression profiles contain more molecular details that may lack salient visual correlates in histological images, demanding a sophisticated representation learning approach to bridge this modality gap. We propose HyperST, a framework for ST prediction that learns multi-level image-gene representations by modeling the data's inherent hierarchy within hyperbolic space, a natural geometric setting for such structures. First, we design Multi-Level Representation Extractors to capture both spot-level and niche-level representations from each modality, providing context-aware information beyond individual spot-level image-gene pairs. Second, a Hierarchical Hyperbolic Alignment module is introduced to unify these representations, performing spatial alignment while hierarchically structuring image and gene embeddings. This alignment strategy enriches the image representations with molecular semantics, significantly improving cross-modal prediction. HyperST achieves state-of-the-art performance on four public datasets from different tissues, paving the way for more scalable and accurate spatial transcriptomics prediction.
https://arxiv.org/abs/2511.22107
Transformer-based models have become state-of-the-art tools in various machine learning tasks, including time series classification, yet their complexity makes understanding their internal decision-making challenging. Existing explainability methods often focus on input-output attributions, leaving the internal mechanisms largely opaque. This paper addresses this gap by adapting various Mechanistic Interpretability techniques (activation patching, attention saliency, and sparse autoencoders) from NLP to transformer architectures designed explicitly for time series classification. We systematically probe the internal causal roles of individual attention heads and timesteps, revealing causal structures within these models. Through experimentation on a benchmark time series dataset, we construct causal graphs illustrating how information propagates internally, highlighting key attention heads and temporal positions driving correct classifications. Additionally, we demonstrate the potential of sparse autoencoders for uncovering interpretable latent features. Our findings provide both methodological contributions to transformer interpretability and novel insights into the functional mechanics underlying transformer performance in time series classification tasks.
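Activation patching itself is easy to demonstrate on a toy attention layer: cache an activation from a clean run, splice it into a corrupted run, and measure how the readout moves. The toy model below (random weights, a single layer, a linear readout on the last timestep) is our own minimal illustration of the technique, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 6, 8, 2                      # timesteps, model dim, heads
Dh = D // H
Wq, Wk, Wv, Wo = (rng.normal(size=(D, D)) * 0.3 for _ in range(4))
w_cls = rng.normal(size=D)             # linear readout on the last timestep

def attention(x, patch=None):
    """One toy multi-head self-attention layer over a time series.
    `patch=(head, t, cached)` overrides that head's output at timestep t
    with a cached activation from another run: activation patching."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    head_outs = []
    for h in range(H):
        s = slice(h * Dh, (h + 1) * Dh)
        att = q[:, s] @ k[:, s].T / np.sqrt(Dh)
        att = np.exp(att - att.max(axis=-1, keepdims=True))
        att /= att.sum(axis=-1, keepdims=True)
        out = att @ v[:, s]
        if patch is not None and patch[0] == h:
            out = out.copy()
            out[patch[1]] = patch[2]
        head_outs.append(out)
    return np.concatenate(head_outs, axis=1) @ Wo, head_outs

x_clean = rng.normal(size=(T, D))
x_corr = rng.normal(size=(T, D))
y_clean, heads_clean = attention(x_clean)
y_corr, heads_corr = attention(x_corr)

# Causal effect of (head 0, last timestep): patch its clean activation into
# the corrupted run and measure how much the readout moves.
y_patched, _ = attention(x_corr, patch=(0, T - 1, heads_clean[0][T - 1]))
effect = abs((y_patched[-1] - y_corr[-1]) @ w_cls)
```

Repeating this over every (head, timestep) pair yields exactly the kind of causal map the paper builds: components whose patched activation moves the readout the most are the ones driving the classification.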
https://arxiv.org/abs/2511.21514
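Activation patching, the core technique the abstract adapts, reduces to a simple recipe: cache an internal activation from a clean run, splice it into a corrupted run, and measure how much of the clean output is restored. A minimal sketch follows; `tiny_model` is a stand-in two-stage function, not a transformer, and the restoration metric is one common convention rather than the paper's exact measure.

```python
def tiny_model(x, patch=None):
    """Toy two-stage 'model'. Stage 1 plays the role of an attention head's
    output; stage 2 is the readout. Returns (output, stage-1 activations)."""
    h = [xi * 2.0 for xi in x]   # stage 1
    if patch is not None:
        h = patch                # activation patching: swap in cached activations
    y = sum(h)                   # stage 2 readout (stand-in for a logit)
    return y, h

def patching_effect(clean_x, corrupt_x):
    """Causal effect of the stage-1 activations: run the corrupted input but
    patch in the clean run's cached activations, then normalize the recovered
    output. 1.0 means the patched component fully explains the clean output."""
    y_clean, h_clean = tiny_model(clean_x)
    y_corrupt, _ = tiny_model(corrupt_x)
    y_patched, _ = tiny_model(corrupt_x, patch=h_clean)
    return (y_patched - y_corrupt) / (y_clean - y_corrupt)
```

For a real time series transformer the same loop would iterate over (head, timestep) pairs, yielding the per-position causal scores from which the paper's causal graphs could be built.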
We propose SemImage, a novel method for representing a text document as a two-dimensional semantic image to be processed by convolutional neural networks (CNNs). In a SemImage, each word is represented as a pixel in a 2D image: rows correspond to sentences, and an additional boundary row is inserted between sentences to mark semantic transitions. Each pixel is not a typical RGB value but a vector in a disentangled HSV color space, encoding different linguistic features: Hue, split into two components H_cos and H_sin to account for circularity, encodes the topic; Saturation encodes the sentiment; and Value encodes intensity or certainty. We enforce this disentanglement via a multi-task learning framework: a ColorMapper network maps each word embedding to the HSV space, and auxiliary supervision is applied to the Hue and Saturation channels to predict topic and sentiment labels, alongside the main task objective. The insertion of dynamically computed boundary rows between sentences yields sharp visual boundaries in the image when consecutive sentences are semantically dissimilar, effectively making paragraph breaks salient. We integrate SemImage with standard 2D CNNs (e.g., ResNet) for document classification. Experiments on multi-label datasets (with both topic and sentiment annotations) and single-label benchmarks demonstrate that SemImage can achieve competitive or better accuracy than strong text classification baselines (including BERT and hierarchical attention networks) while offering enhanced interpretability. An ablation study confirms the importance of the multi-channel HSV representation and the dynamic boundary rows. Finally, we present visualizations of SemImage that qualitatively reveal clear patterns corresponding to topic shifts and sentiment changes in the generated image, suggesting that our representation makes these linguistic features visible to both humans and machines.
https://arxiv.org/abs/2512.00088
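Two mechanics in the SemImage abstract are easy to make concrete: the circular hue encoding (why two components H_cos/H_sin are needed) and the dynamically computed boundary rows. The sketch below is a plausible reading, not the paper's code; in the actual method the ColorMapper network is learned, whereas here topic ids and sentence vectors are given directly for illustration.

```python
import math

def hue_components(topic_id, n_topics):
    """Place a topic on the hue circle. Using (cos, sin) instead of a raw
    angle avoids the wrap-around discontinuity at 0/360 degrees."""
    angle = 2.0 * math.pi * topic_id / n_topics
    return math.cos(angle), math.sin(angle)

def word_pixel(topic_id, n_topics, saturation, value):
    """Assemble one 4-channel pixel: [H_cos, H_sin, S, V]."""
    h_cos, h_sin = hue_components(topic_id, n_topics)
    return [h_cos, h_sin, saturation, value]

def boundary_intensity(sent_a, sent_b):
    """Boundary-row value between adjacent sentence vectors:
    1 - cosine similarity, so semantically dissimilar neighbours
    produce a bright, salient boundary row."""
    dot = sum(a * b for a, b in zip(sent_a, sent_b))
    na = math.sqrt(sum(a * a for a in sent_a))
    nb = math.sqrt(sum(b * b for b in sent_b))
    return 1.0 - dot / (na * nb)
```

Stacking one row of `word_pixel` values per sentence, interleaved with constant rows scaled by `boundary_intensity`, yields the 2D tensor a ResNet-style CNN would consume.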
Spiking neural networks (SNNs) have emerged as prominent candidates for embedded and edge AI. Their inherent low power consumption makes them far more efficient than conventional ANNs in scenarios where energy budgets are tightly constrained. In parallel, federated learning (FL) has become the prevailing training paradigm in such settings, enabling on-device learning while limiting the exposure of raw data. However, gradient inversion attacks represent a critical privacy threat in FL, where sensitive training data can be reconstructed directly from shared gradients. While this vulnerability has been widely investigated in conventional ANNs, its implications for SNNs remain largely unexplored. In this work, we present the first comprehensive empirical study of gradient leakage in SNNs across diverse data domains. SNNs are inherently non-differentiable and are typically trained using surrogate gradients, which we hypothesized would be less correlated with the original input and thus less informative from a privacy perspective. To investigate this, we adapt different gradient leakage attacks to the spike domain. Our experiments reveal a striking contrast with conventional ANNs: whereas ANN gradients reliably expose salient input content, SNN gradients yield noisy, temporally inconsistent reconstructions that fail to recover meaningful spatial or temporal structure. These results indicate that the combination of event-driven dynamics and surrogate-gradient training substantially reduces gradient informativeness. To the best of our knowledge, this work provides the first systematic benchmark of gradient inversion attacks for spiking architectures, highlighting the inherent privacy-preserving potential of neuromorphic computation.
https://arxiv.org/abs/2511.21181
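The surrogate-gradient mechanism the SNN abstract hinges on can be sketched briefly. The forward spike is a non-differentiable threshold, and backprop substitutes a smooth surrogate; the fast-sigmoid surrogate and the slope value below are common illustrative choices, not necessarily the paper's configuration.

```python
def spike(v, threshold=1.0):
    """Non-differentiable forward pass: a neuron fires iff its membrane
    potential reaches the threshold (Heaviside step)."""
    return 1.0 if v >= threshold else 0.0

def surrogate_grad(v, threshold=1.0, slope=10.0):
    """Fast-sigmoid surrogate used in place of the Heaviside derivative
    during backprop. It is smooth and saturating, which is one intuition
    for why SNN gradients leak less input-specific information to
    gradient inversion attacks than exact ANN gradients."""
    x = slope * (v - threshold)
    return slope / (1.0 + abs(x)) ** 2
```

The surrogate peaks at the threshold and decays away from it, so shared gradients reflect proximity to the firing threshold rather than the raw input values an inversion attack would need.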