State-of-the-art text-to-image diffusion models can produce impressive visuals but may memorize and reproduce training images, creating copyright and privacy risks. Existing prompt perturbations applied at inference time, such as random token insertion or embedding noise, may reduce copying but often harm image-prompt alignment and overall fidelity. To address this, we introduce two complementary methods. First, Region-Aware Prompt Augmentation (RAPTA) uses an object detector to find salient regions and turn them into semantically grounded prompt variants, which are randomly sampled during training to increase diversity while maintaining semantic alignment. Second, Attention-Driven Multimodal Copy Detection (ADMCD) aggregates local patch, global semantic, and texture cues with a lightweight transformer to produce a fused representation, and applies simple thresholded decision rules to detect copying without training on large annotated datasets. Experiments show that RAPTA reduces overfitting while maintaining high synthesis quality, and that ADMCD reliably detects copying, outperforming single-modal metrics.
https://arxiv.org/abs/2603.13070
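ADMCD's thresholded decision rule described above can be sketched as a weighted fusion of the three cue similarities checked against per-cue and fused thresholds. The cue weights and threshold values below are illustrative assumptions, not figures from the paper.

```python
def copy_score(patch_sim, global_sim, texture_sim, weights=(0.4, 0.4, 0.2)):
    """Fuse local-patch, global-semantic, and texture similarities into one
    score (a scalar stand-in for the transformer-fused representation)."""
    cues = (patch_sim, global_sim, texture_sim)
    return sum(w * c for w, c in zip(weights, cues))

def is_copy(patch_sim, global_sim, texture_sim, fused_thresh=0.75, cue_thresh=0.9):
    """Flag copying when the fused score, or any single cue, exceeds its
    threshold -- one simple instance of a thresholded decision rule."""
    fused = copy_score(patch_sim, global_sim, texture_sim)
    return fused >= fused_thresh or max(patch_sim, global_sim, texture_sim) >= cue_thresh
```

Such a rule needs no annotated training data; only the two thresholds must be chosen, e.g. on a small validation set.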
Computational musicology enables systematic analysis of performative and structural traits in recorded music, yet existing approaches remain largely tailored to notated, score-based repertoires. This study advances a methodology for analyzing voice-guitar interaction in Carlos Paredes's vocal collaborations - an oral-tradition context where compositional and performative layers co-emerge. Using source-separated stems, physics-informed harmonic modelling, and beat-level audio descriptors, we examine melodic, harmonic, and rhythmic relationships across eight recordings with four singers. Our commonality-diversity framework, combining multi-scale correlation analysis with residual-based detection of structural deviations, reveals that expressive coordination is predominantly piece-specific rather than corpus-wide. Diversity events systematically align with formal boundaries and textural shifts, demonstrating that the proposed approach can identify musically salient reorganizations with minimal human annotation. The framework further offers a generalizable computational strategy for repertoires without notated blueprints, extending Music Performance Analysis into oral-tradition and improvisation-inflected practices.
https://arxiv.org/abs/2603.12854
This paper focuses on the inconsistency of salient regions between RGB and thermal images. To address this issue, we propose the Region-guided Selective Optimization Network (RSONet) for RGB-T salient object detection, which consists of a region guidance stage and a saliency generation stage. In the region guidance stage, three parallel branches with the same encoder-decoder structure, equipped with the context interaction (CI) module and the spatial-aware fusion (SF) module, are designed to generate guidance maps from which similarity scores are calculated. Then, in the saliency generation stage, the selective optimization (SO) module fuses RGB and thermal features based on the previously obtained similarity values to mitigate the impact of the inconsistent distribution of salient targets between the two modalities. After that, to generate high-quality detection results, the dense detail enhancement (DDE) module, which adopts multiple dense connections and visual state space blocks, is applied to low-level features to refine detail information. In addition, the mutual interaction semantic (MIS) module is applied to high-level features to mine location cues through a mutual fusion strategy. We conduct extensive experiments on RGB-T datasets, and the results demonstrate that the proposed RSONet achieves competitive performance against 27 state-of-the-art SOD methods.
https://arxiv.org/abs/2603.12685
Remote sensing images captured from aerial perspectives often exhibit significant scale variations and complex backgrounds, posing challenges for salient object detection (SOD). Existing methods typically extract multi-level features at a single scale using uniform attention mechanisms, leading to suboptimal representations and incomplete detection results. To address these issues, we propose a GeoGran-Aware Hierarchical Feature Fusion Network (G2HFNet) that fully exploits geometric and granular cues in optical remote sensing images. Specifically, G2HFNet adopts Swin Transformer as the backbone to extract multi-level features and integrates three key modules: the multi-scale detail enhancement (MDE) module to handle object scale variations and enrich fine details, the dual-branch geo-gran complementary (DGC) module to jointly capture fine-grained details and positional information in mid-level features, and the deep semantic perception (DSP) module to refine high-level positional cues via self-attention. Additionally, a local-global guidance fusion (LGF) module is introduced to replace traditional convolutions for effective multi-level feature integration. Extensive experiments demonstrate that G2HFNet achieves high-quality saliency maps and significantly improves detection performance in challenging remote sensing scenarios.
https://arxiv.org/abs/2603.12680
Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis. However, most existing methods either rely solely on cameras or use LiDAR only for Gaussian initialization or depth supervision, while the rich scene information contained in point clouds, such as reflectance, and the complementarity between LiDAR and RGB have not been fully exploited, leading to degradation in challenging self-driving scenes, such as those with high ego-motion and complex lighting. To address these issues, we propose a robust and efficient LiDAR-reflectance-guided Salient Gaussian Splatting method (LR-SGS) for self-driving scenes, which introduces a structure-aware Salient Gaussian representation, initialized from geometric and reflectance feature points extracted from LiDAR and refined through a salient transform and improved density control to capture edge and planar structures. Furthermore, we calibrate LiDAR intensity into reflectance and attach it to each Gaussian as a lighting-invariant material channel, jointly aligned with RGB to enforce boundary consistency. Extensive experiments on the Waymo Open Dataset demonstrate that LR-SGS achieves superior reconstruction performance with fewer Gaussians and shorter training time. In particular, on Complex Lighting scenes, our method surpasses OmniRe by 1.18 dB PSNR.
https://arxiv.org/abs/2603.12647
Reverse engineering and rapid prototyping of computer-aided design (CAD) models from 3D scans, sketches, or simple text prompts are vital in industrial product design. However, recent advances in geometric deep learning techniques lack a multi-modal understanding of parametric CAD features stored in their boundary representation (BRep). This study presents the largest compilation of 10 million multi-modal annotations and metadata for 1 million ABC CAD models, namely A2Z, to unlock an unprecedented level of BRep learning. A2Z comprises (i) high-resolution meshes with salient 3D scanning features, (ii) 3D hand-drawn sketches equipped with (iii) geometric and topological information about BRep co-edges, corners, and surfaces, and (iv) textual captions and tags describing the product in the mechanical world. Creating such carefully structured, large-scale data, which occupies nearly 5 terabytes of storage and enables unparalleled CAD learning/retrieval tasks, is very challenging. The scale, quality, and diversity of our multi-modal annotations are assessed using novel metrics, GPT-5, Gemini, and extensive human feedback mechanisms. To this end, we also merge an additional 25,000 CAD models of electronic enclosures (e.g., tablets, ports) designed by skilled professionals into our A2Z dataset. Subsequently, we train and benchmark a foundation model on a subset of 150K CAD models to detect BRep co-edges and corner vertices from 3D scans, a key downstream task in CAD reverse engineering. The annotated dataset, metrics, and checkpoints will be publicly released to support numerous research directions.
https://arxiv.org/abs/2603.12605
Salient object detection (SOD) in remote sensing images faces significant challenges due to large variations in object sizes, the computational cost of self-attention mechanisms, and the limitations of CNN-based extractors in capturing global context and long-range dependencies. Existing methods that rely on fixed convolution kernels often struggle to adapt to diverse object scales, leading to detail loss or irrelevant feature aggregation. To address these issues, this work aims to enhance robustness to scale variations and achieve precise object localization. We propose the Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network (RDNet), which replaces the CNN backbone with the Swin Transformer for global context modeling and introduces three key modules: (1) the Dynamic Adaptive Detail-aware (DAD) module, which applies varied convolution kernels guided by object region proportions; (2) the Frequency-matching Context Enhancement (FCE) module, which enriches contextual information through wavelet interactions and attention; and (3) the Region Proportion-aware Localization (RPL) module, which employs cross-attention to highlight semantic details and integrates a Proportion Guidance (PG) block to assist the DAD module. By combining these modules, RDNet achieves robustness against scale variations and accurate localization, delivering superior detection performance compared with state-of-the-art methods.
https://arxiv.org/abs/2603.12215
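As a toy illustration of the "varied convolution kernels guided by object region proportions" idea behind the DAD module above, one could map the proportion of the image a salient region occupies to a kernel size. The bin edges and kernel sizes here are invented for illustration, not taken from the paper.

```python
def select_kernel_size(region_proportion, bins=((0.05, 3), (0.25, 5), (1.0, 7))):
    """Pick a convolution kernel size from the salient-region proportion:
    small regions get small kernels (fine detail), large regions get large
    kernels (wider context). Bin edges and sizes are hypothetical."""
    for upper_bound, kernel in bins:
        if region_proportion <= upper_bound:
            return kernel
    return bins[-1][1]  # fall back to the largest kernel
```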
Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities indicating transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10\% token retention, ForensicZip achieves $2.97\times$ speedup and over 90\% FLOPs reduction while maintaining state-of-the-art detection performance.
https://arxiv.org/abs/2603.12208
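The forgery-driven token scoring above (transport-based novelty with a slack dummy node, blended with a high-frequency prior) might look roughly like the following toy sketch. Nearest-neighbour distance stands in for the Birth-Death Optimal Transport cost, and all parameter values are our assumptions.

```python
import math

def token_scores(prev_tokens, cur_tokens, hf_energy, birth_cost=1.0, alpha=0.5):
    """Score each current-frame token by transport novelty (distance to the
    nearest previous-frame token, capped by a dummy-node 'birth' cost) blended
    with a per-token high-frequency prior. A crude stand-in for the paper's
    optimal-transport formulation; tokens are plain feature vectors."""
    scores = []
    for tok, hf in zip(cur_tokens, hf_energy):
        if prev_tokens:
            nearest = min(math.dist(tok, p) for p in prev_tokens)
        else:
            nearest = birth_cost
        novelty = min(nearest, birth_cost)  # the slack dummy node caps the cost
        scores.append(alpha * novelty + (1 - alpha) * hf)
    return scores

def retain_top(scores, ratio=0.1):
    """Indices of the top-`ratio` fraction of tokens (at least one kept)."""
    k = max(1, int(len(scores) * ratio))
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

A token that jumps away from everything in the previous frame, or carries high-frequency energy, scores high and survives large-ratio compression.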
Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to achieve accurate temporal segmentation aligned with true event boundaries, as they rely on heuristic strategies that overlook ground truth event boundaries. The proposed framework, \textbf{STaRC}, overcomes this limitation by supervising frame-level saliency through a highlight detection module. Note that the highlight detection module is trained on binary labels derived directly from DVC ground truth annotations without the need for additional annotation. We also propose to utilize the saliency scores as a unified temporal signal that drives retrieval via saliency-guided segmentation and informs caption generation through explicit Saliency Prompts injected into the decoder. By enforcing saliency-constrained segmentation, our method produces temporally coherent segments that align closely with actual event transitions, leading to more accurate retrieval and contextually grounded caption generation. We conduct comprehensive evaluations on the YouCook2 and ViTT benchmarks, where STaRC achieves state-of-the-art performance across most of the metrics. Our code is available at this https URL
https://arxiv.org/abs/2603.11460
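Deriving binary frame-level highlight labels from DVC ground-truth event spans, as the STaRC abstract describes, can be sketched as follows. The `(start_sec, end_sec)` annotation shape and the frame-rate handling are our assumptions about the data format.

```python
def frame_saliency_labels(num_frames, events, fps=1.0):
    """Mark frames inside any ground-truth event span as highlights (1) and
    all other frames as background (0) -- no extra annotation needed."""
    labels = [0] * num_frames
    for start_sec, end_sec in events:
        lo = max(0, int(start_sec * fps))
        hi = min(num_frames, int(end_sec * fps) + 1)
        for i in range(lo, hi):
            labels[i] = 1
    return labels
```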
The study of the attribution of input features to the output of neural network models is an active area of research. While numerous Explainable AI (XAI) techniques have been proposed to interpret these models, the systematic and automated evaluation of these methods in sequence-to-sequence (seq2seq) models is less explored. This paper introduces a new approach for evaluating explainability methods in transformer-based seq2seq models. We use teacher-derived attribution maps as a structured side signal to guide a student model, and quantify the utility of different attribution methods through the student's ability to simulate targets. Using the Inseq library, we extract attribution scores over source-target sequence pairs and inject these scores into the attention mechanism of a student transformer model under four composition operators (addition, multiplication, averaging, and replacement). Across three language pairs (de-en, fr-en, ar-en) and attributions from Marian-MT and mBART models, Attention, Value Zeroing, and Layer Gradient $\times$ Activation consistently yield the largest gains in BLEU (and corresponding improvements in chrF) relative to baselines. In contrast, other gradient-based methods (Saliency, Integrated Gradients, DeepLIFT, Input $\times$ Gradient, GradientShap) lead to smaller and less consistent improvements. These results suggest that different attribution methods capture distinct signals and that attention-derived attributions better capture alignment between source and target representations in seq2seq models. Finally, we introduce an Attributor transformer that, given a source-target pair, learns to reconstruct the teacher's attribution map. Our findings demonstrate that the more accurately the Attributor can reproduce attribution maps, the more useful an injection of those maps is for the downstream task. The source code can be found on GitHub.
https://arxiv.org/abs/2603.11342
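The four composition operators for injecting attribution scores into attention can be illustrated on a single attention row. Renormalizing the mixed weights back to a distribution is our assumption; the paper's exact injection point inside the attention mechanism may differ.

```python
def compose(attention, attribution, op):
    """Combine one row of attention weights with a matching row of
    attribution scores under the four operators named in the abstract,
    then renormalize to a distribution (renormalization is an assumption)."""
    if op == "addition":
        mixed = [a + b for a, b in zip(attention, attribution)]
    elif op == "multiplication":
        mixed = [a * b for a, b in zip(attention, attribution)]
    elif op == "averaging":
        mixed = [(a + b) / 2 for a, b in zip(attention, attribution)]
    elif op == "replacement":
        mixed = list(attribution)
    else:
        raise ValueError(f"unknown operator: {op}")
    total = sum(mixed)
    return [m / total for m in mixed] if total > 0 else mixed
```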
Large Multimodal Models (LMMs) struggle to adapt to varying computational budgets due to their numerous visual tokens. Previous methods attempted to reduce the number of visual tokens before or within LLMs. However, these strategies inevitably result in the loss of visual semantics. To address these issues, we introduce FMVR, a plug-and-play and extremely simple Frequency-Modulated Visual Restoration strategy to boost the reasoning ability of LMMs under visual token reduction. Specifically, FMVR disentangles the visual representation of fewer visual tokens into low- and high-frequency components through AvgPool and MaxPool. The derived frequencies are subsequently modulated using lightweight learnable parameters. The high-frequency component from AvgPool acts as a saliency filter to enhance salient visual semantics, while the low-frequency component from MaxPool acts as an anti-saliency filter to strengthen weak visual semantics. This enables the preservation of visual semantics dominated by a few visual tokens and the restoration of diluted visual semantics. Additionally, we inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, enabling the number of visual tokens to be adjusted elastically during inference while maintaining comparable performance. Experiments across 10 image-based and 4 video-based benchmarks demonstrate that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89% while maintaining almost 100% of the original accuracy. The code will be released.
https://arxiv.org/abs/2603.11220
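A minimal 1-D sketch of frequency disentangling with learnable modulation, in the spirit of FMVR. We use the conventional reading that a pooled output is the low-frequency view and the residual (token minus pooled value) is the high-frequency detail; the abstract's pairing of pools and frequencies, and the per-component gains below, are simplifications of the paper's lightweight learnable parameters.

```python
def fmvr_restore(tokens, window=2, g_low=1.0, g_high=1.0):
    """Disentangle a 1-D token sequence into a pooled low-frequency view and
    a high-frequency residual, modulate each with a learnable gain, and
    recombine. With g_low = g_high = 1 this is the identity."""
    out = []
    for i in range(len(tokens)):
        win = tokens[max(0, i - window + 1): i + 1]
        low = sum(win) / len(win)   # AvgPool: smoothed, low-frequency view
        high = tokens[i] - low      # residual: high-frequency detail
        out.append(g_low * low + g_high * high)
    return out
```

Setting `g_high > 1` amplifies detail diluted by token reduction; setting `g_high = 0` leaves only the smoothed semantics.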
Scene Graph Generation (SGG) aims to extract a detailed graph structure from an image, a representation that holds significant promise as a robust intermediate step for complex downstream tasks like reasoning for embodied agents. However, practical deployment in real-world applications - especially on resource-constrained edge devices - requires speed and resource efficiency, challenges that have received limited attention in existing research. To bridge this gap, we introduce DSFlash, a low-latency model for panoptic scene graph generation designed to overcome these limitations. DSFlash can process a video stream at 56 frames per second on a standard RTX 3090 GPU without compromising performance against existing state-of-the-art methods. Crucially, unlike prior approaches that often restrict themselves to salient relationships, DSFlash computes comprehensive scene graphs, offering richer contextual information while maintaining its superior latency. Furthermore, DSFlash is light on resources, requiring less than 24 hours to train on a single, nine-year-old GTX 1080 GPU. This accessibility makes DSFlash particularly well-suited for researchers and practitioners operating with limited computational resources, empowering them to adapt and fine-tune SGG models for specialized applications.
https://arxiv.org/abs/2603.10538
Deep learning models achieve remarkable predictive performance, yet their black-box nature limits transparency and trustworthiness. Although numerous explainable artificial intelligence (XAI) methods have been proposed, they primarily provide saliency maps or concepts (i.e., unstructured interpretability). Existing approaches often rely on auxiliary models (e.g., GPT, CLIP) to describe model behavior, thereby compromising faithfulness to the original models. We propose Interpretability to Explainability (I2X), a framework that builds structured explanations directly from unstructured interpretability by quantifying progress at selected checkpoints during training using prototypes extracted from post-hoc XAI methods (e.g., GradCAM). I2X answers the question of "why does it look there" by providing a structured view of both intra- and inter-class decision making during training. Experiments on MNIST and CIFAR10 demonstrate the effectiveness of I2X in revealing the prototype-based inference process of various image classification models. Moreover, we demonstrate that I2X can be used to improve predictions across different model architectures and datasets: we identify uncertain prototypes via I2X and then apply targeted perturbation to the corresponding samples during fine-tuning, ultimately improving accuracy. Thus, I2X not only faithfully explains model behavior but also provides a practical approach to guide optimization toward desired targets.
https://arxiv.org/abs/2603.10234
Human vision achieves remarkable perceptual performance while operating under strict metabolic constraints. A key ingredient is the selective attention mechanism, driven by rapid saccadic eye movements that constantly reposition the high-resolution fovea onto task-relevant locations, unlike conventional AI systems that process entire images with equal emphasis. Our work draws inspiration from the human visual system to create smarter, more efficient image processing models. Using DINO, a self-supervised Vision Transformer that produces attention maps strikingly similar to human gaze patterns, we explore a saccade-inspired method that focuses processing on key regions in visual space. To do so, we use the ImageNet dataset in a standard classification task and measure how each successive saccade affects the model's class scores. This selective-processing strategy preserves most of the full-image classification performance and can even outperform it in certain cases. By benchmarking against established saliency models built for human gaze prediction, we demonstrate that DINO provides superior fixation guidance for selecting informative regions. These findings highlight Vision Transformer attention as a promising basis for biologically inspired active vision and open new directions for efficient, neuromorphic visual processing.
https://arxiv.org/abs/2603.09613
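A toy version of attention-guided saccade selection as described above: fixations are chosen greedily from an attention grid, with a small inhibited neighbourhood around each previous fixation (inhibition of return, our assumption) so successive saccades land on new regions. A real DINO attention map would replace the toy grid.

```python
def saccade_sequence(attn_map, num_saccades=3, inhibit=1):
    """Pick successive fixation points from a 2-D attention map: take the
    argmax, then suppress a (2*inhibit+1)^2 neighbourhood before the next
    saccade so fixations spread over distinct salient regions."""
    grid = [row[:] for row in attn_map]  # work on a copy
    fixations = []
    for _ in range(num_saccades):
        _, r, c = max((v, r, c) for r, row in enumerate(grid)
                      for c, v in enumerate(row))
        fixations.append((r, c))
        for dr in range(-inhibit, inhibit + 1):
            for dc in range(-inhibit, inhibit + 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < len(grid) and 0 <= cc < len(grid[0]):
                    grid[rr][cc] = float("-inf")
    return fixations
```

Each fixation would then define a high-resolution crop fed to the classifier, mimicking foveated processing.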
Accurate medical image segmentation requires effective modeling of both long-range dependencies and fine-grained boundary details. While transformers mitigate the issue of insufficient semantic information arising from the limited receptive field inherent in convolutional neural networks, they introduce new challenges: standard self-attention incurs quadratic computational complexity and often assigns non-negligible attention weights to irrelevant regions, diluting focus on discriminative structures and ultimately compromising segmentation accuracy. Existing attention variants, although effective in reducing computational complexity, fail to suppress redundant computation and inadvertently impair global context modeling. Furthermore, conventional fusion strategies in encoder-decoder architectures, typically based on simple concatenation or summation, cannot adaptively integrate high-level semantic information with low-level spatial details. To address these limitations, we propose DCAU-Net, a novel yet efficient segmentation framework with two key ideas. First, a new Differential Cross Attention (DCA) is designed to compute the difference between two independent softmax attention maps to adaptively highlight discriminative structures. By replacing pixel-wise key and value tokens with window-level summary tokens, DCA dramatically reduces computational complexity without sacrificing precision. Second, a Channel-Spatial Feature Fusion (CSFF) strategy is introduced to adaptively recalibrate features from skip connections and up-sampling paths using sequential channel and spatial attention, effectively suppressing redundant information and amplifying salient cues. Experiments on two public benchmarks demonstrate that DCAU-Net achieves competitive performance with enhanced segmentation accuracy and robustness.
https://arxiv.org/abs/2603.09530
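The core of the DCA idea above, the difference between two independent softmax attention maps over window-level summary keys, can be sketched on plain vectors. The dot-product score and the mixing weight `lam` are our assumptions about details the abstract leaves open.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def differential_attention(query, keys_a, keys_b, lam=0.5):
    """Compute two independent softmax attention maps against two key sets
    (window-level summary tokens in DCA) and return their difference;
    subtracting the second map suppresses weights the two maps agree on,
    sharpening focus on discriminative structure."""
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    map_a = softmax([dot(query, k) for k in keys_a])
    map_b = softmax([dot(query, k) for k in keys_b])
    return [a - lam * b for a, b in zip(map_a, map_b)]
```

Using a handful of window summaries instead of every pixel as keys is what drops the quadratic cost.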
Medical image retrieval aims to identify clinically relevant lesion cases to support diagnostic decision making, education, and quality control. In practice, retrieval queries often combine a reference lesion image with textual descriptors such as dermoscopic features. We study composed vision-language retrieval for skin cancer, where each query consists of an image-text pair and the database contains biopsy-confirmed, multi-class disease cases. We propose a transformer-based framework that learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images. Local alignment aggregates discriminative regions via multiple spatial attention masks, while global alignment provides holistic semantic supervision. The final similarity is computed through a convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency. Experiments on the public Derm7pt dataset demonstrate consistent improvements over state-of-the-art methods. The proposed framework enables efficient access to relevant medical records and supports practical clinical deployment.
https://arxiv.org/abs/2603.09108
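The convex, domain-informed weighting of local and global similarity described above reduces to a one-line combination. Aggregating the region scores by max is our assumption; the paper aggregates over multiple spatial attention masks.

```python
def composed_similarity(local_sims, global_sim, w_local=0.6):
    """Convex combination of aggregated local evidence and the global
    similarity; constraining the weight to [0, 1] is what keeps the
    combination convex, so neither cue can dominate with a negative weight."""
    if not 0.0 <= w_local <= 1.0:
        raise ValueError("w_local must lie in [0, 1] for a convex combination")
    local_evidence = max(local_sims)  # assumed aggregation over region scores
    return w_local * local_evidence + (1.0 - w_local) * global_sim
```

A larger `w_local` emphasizes clinically salient local evidence, as the abstract suggests, while the global term preserves overall consistency.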
Vision-language-action (VLA) models encode visual observations as 2D patch tokens with no intrinsic geometric structure. We introduce GST-VLA with two contributions. First, the Gaussian Spatial Tokenizer (GST) converts frozen dense depth and frozen semantic patch features into $N_g{=}128$ anisotropic 3D Gaussian primitives, each parameterized by a metric residual mean $\mu \in \mathbb{R}^3$, log-scale covariance $\log \sigma \in \mathbb{R}^3$, and learned opacity $\alpha \in (0,1)$. The covariance eigenstructure encodes local surface orientation, and opacity provides per-primitive geometric confidence, both inaccessible from scalar depth. Spatial attention pooling with learned queries concentrates the fixed token budget on geometrically salient regions rather than distributing uniformly. Second, 3D Depth-Aware Chain-of-Thought (DA-CoT) reasoning supervises four structured intermediate spatial thoughts, covering 3D object grounding, grasp affordance contact geometry, pairwise metric distances, and coarse SE(3) waypoints, as explicit generation targets in the training loss. A cross-attention sublayer at every VLM transformer block provides direct access to the raw 256-primitive Gaussian field during DA-CoT generation. A 300M-parameter flow-matching action expert with mixture-of-experts feedforward sublayers decodes 7-DoF delta action chunks via conditional ODE integration, conditioned on both VLM hidden states and DA-CoT outputs through dual cross-attention. Trained with composite $\mathcal{L}_\mathrm{flow} + \mathcal{L}_\mathrm{CoT} + \mathcal{L}_\mathrm{depth}$ across three progressive stages, GST-VLA achieves 96.4% on LIBERO (+2.0%) and 80.2% on SimplerEnv (+5.4%). Ablations isolate the contribution of each GST component, each DA-CoT thought, and each training stage, confirming independent and synergistic gains concentrated on precision-demanding tasks.
https://arxiv.org/abs/2603.09079
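The Gaussian primitive parameterization above (metric residual mean, log-scale covariance, opacity in the open unit interval) can be written down directly. Storing opacity as a logit mapped through a sigmoid is our assumption about how the $(0,1)$ constraint is enforced.

```python
import math
from dataclasses import dataclass

@dataclass
class GaussianPrimitive:
    """One anisotropic 3D Gaussian: residual mean mu in R^3, per-axis
    log-scale covariance in R^3, and an opacity logit mapped to (0, 1)."""
    mu: tuple            # metric residual mean
    log_sigma: tuple     # per-axis log scales
    opacity_logit: float

    @property
    def sigma(self):
        """Exponentiating log-scales guarantees strictly positive extents."""
        return tuple(math.exp(s) for s in self.log_sigma)

    @property
    def opacity(self):
        """Sigmoid keeps opacity -- per-primitive geometric confidence -- in (0, 1)."""
        return 1.0 / (1.0 + math.exp(-self.opacity_logit))
```

The log-scale and logit parameterizations make both quantities freely optimizable while the constraints hold by construction.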
Few-shot adaptation of vision-language models (VLMs) like CLIP typically relies on learning textual prompts matched to global image embeddings. Recent works extend this paradigm by incorporating local image-text alignment to capture fine-grained visual cues, yet these approaches often select local regions independently for each prompt, leading to redundant local feature usage and prompt overlap. We propose SOT-GLP, which introduces a shared sparse patch support and balanced optimal transport allocation to explicitly partition salient visual regions among class-specific local prompts while preserving global alignment. Our method learns shared global prompts and class-specific local prompts. The global branch maintains standard image-text matching for robust category-level alignment. The local branch constructs a class-conditioned sparse patch set using V-V attention and aligns it to multiple class-specific prompts via balanced entropic optimal transport, yielding a soft partition of patches that prevents prompt overlap and collapse. We evaluate our method on two complementary objectives: (i) few-shot classification accuracy on 11 standard benchmarks and (ii) out-of-distribution (OOD) detection. On the standard 11-dataset benchmark with 16-shot ViT-B/16, SOT-GLP achieves 85.1% average accuracy, outperforming prior prompt-learning methods. We identify a distinct accuracy-robustness trade-off in prompt learning: while learnable projections optimize in-distribution fit, they alter the foundational feature space. We demonstrate that a projection-free local alignment preserves the native geometry of the CLIP manifold, yielding state-of-the-art OOD detection performance (94.2% AUC) that surpasses fully adapted models. Implementation available at: this https URL
https://arxiv.org/abs/2603.08347
Multiple instance learning (MIL) has enabled substantial progress in computational histopathology, where large numbers of patches from gigapixel whole slide images are aggregated into slide-level predictions. Heatmaps are widely used to validate MIL models and to discover tissue biomarkers. Yet, the validity of these heatmaps has barely been investigated. In this work, we introduce a general framework for evaluating the quality of MIL heatmaps without requiring additional labels. We conduct a large-scale benchmark experiment to assess six explanation methods across histopathology task types (classification, regression, survival), MIL model architectures (Attention-, Transformer-, Mamba-based), and patch encoder backbones (UNI2, Virchow2). Our results show that explanation quality mostly depends on MIL model architecture and task type, with perturbation ("Single"), layer-wise relevance propagation (LRP), and integrated gradients (IG) consistently outperforming attention-based and gradient-based saliency heatmaps, which often fail to reflect model decision mechanisms. We further demonstrate the advanced capabilities of the best-performing explanation methods: (i) We provide a proof-of-concept that MIL heatmaps of a bulk gene expression prediction model can be correlated with spatial transcriptomics for biological validation, and (ii) showcase the discovery of distinct model strategies for predicting human papillomavirus (HPV) infection from head and neck cancer slides. Our work highlights the importance of validating MIL heatmaps and establishes that improved explainability can enable more reliable model validation and yield biological insights, making a case for a broader adoption of explainable AI in digital pathology. Our code is provided in a public GitHub repository: this https URL
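How integrated gradients turn a slide-level MIL prediction into a patch-level heatmap can be illustrated on a toy attention-MIL scorer. This is a numpy sketch under stated assumptions, not the paper's pipeline: the model, the zero baseline, and the finite-difference gradients are all simplifications for illustration.

```python
import numpy as np

def attention_mil_score(X, w_attn, w_cls):
    """Toy attention-MIL: softmax attention pooling over patches, linear head.

    X: (n_patches, d) patch embeddings; returns a scalar slide score.
    """
    a = np.exp(X @ w_attn)
    a = a / a.sum()
    return (a @ X) @ w_cls

def integrated_gradients(X, score_fn, steps=64, eps=1e-5):
    """Patch-level IG attributions against an all-zeros baseline.

    Integrates finite-difference gradients along the straight path
    t*X for t in (0, 1] (midpoint rule), then sums over features so
    each patch gets one heatmap value. Completeness: the values sum
    to approximately score_fn(X) - score_fn(0).
    """
    grad_sum = np.zeros_like(X)
    for t in np.linspace(0.5 / steps, 1 - 0.5 / steps, steps):
        Xt = t * X
        g = np.zeros_like(X)
        for idx in np.ndindex(X.shape):       # central differences per entry
            Xp = Xt.copy(); Xp[idx] += eps
            Xm = Xt.copy(); Xm[idx] -= eps
            g[idx] = (score_fn(Xp) - score_fn(Xm)) / (2 * eps)
        grad_sum += g
    return (X * grad_sum / steps).sum(axis=1)  # (n_patches,) heatmap values
```

In practice one would use autodiff rather than finite differences; the sketch only makes the attribution mechanics explicit, including the completeness property that lets heatmap mass be compared across methods.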
https://arxiv.org/abs/2603.08328
Reliable unmanned aerial vehicle (UAV) detection is critical for autonomous airspace monitoring but remains challenging when integrating sensor streams that differ substantially in resolution, perspective, and field of view. Conventional fusion methods, such as wavelet-, Laplacian-, and decision-level approaches, often fail to preserve spatial correspondence across modalities and suffer from annotation inconsistencies, limiting their robustness in real-world settings. This study introduces two fusion strategies, Registration-aware Guided Image Fusion (RGIF) and Reliability-Gated Modality-Attention Fusion (RGMAF), designed to overcome these limitations. RGIF employs Enhanced Correlation Coefficient (ECC)-based affine registration combined with guided filtering to maintain thermal saliency while enhancing structural detail. RGMAF integrates affine and optical-flow registration with a reliability-weighted attention mechanism that adaptively balances thermal contrast and visual sharpness. Experiments were conducted on the Multi-Sensor and Multi-View Fixed-Wing (MMFW)-UAV dataset comprising 147,417 annotated air-to-air frames collected from infrared, wide-angle, and zoom sensors. Among single-modality detectors, YOLOv10x demonstrated the most stable cross-domain performance and was selected as the detection backbone for evaluating fused imagery. RGIF improved the visual baseline by 2.13% mAP@50 (achieving 97.65%), while RGMAF attained the highest recall of 98.64%. These findings show that registration-aware and reliability-adaptive fusion provides a robust framework for integrating heterogeneous modalities, substantially enhancing UAV detection performance in multimodal environments.
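The reliability-gated idea behind RGMAF, weighting each registered modality per pixel by a local reliability cue, can be sketched in a few lines of numpy. This is a simplified illustration under assumptions (local standard deviation as the reliability cue for both modalities, a softmax gate with temperature `tau`), not the paper's attention mechanism, and it presumes the inputs are already registered.

```python
import numpy as np

def local_contrast(img, k=3):
    """Local standard deviation in a k x k window as a reliability map."""
    pad = k // 2
    p = np.pad(img, pad, mode="reflect")
    windows = np.lib.stride_tricks.sliding_window_view(p, (k, k))
    return windows.std(axis=(-1, -2))          # same shape as img

def reliability_gated_fusion(thermal, visual, tau=1.0):
    """Per-pixel softmax gate over modality reliabilities.

    thermal, visual: registered, same-shape float images. The fused
    output is a convex combination of the two inputs at every pixel,
    leaning toward whichever modality is locally more informative
    (thermal contrast vs. visual sharpness).
    """
    w = np.stack([local_contrast(thermal), local_contrast(visual)]) / tau
    w = np.exp(w - w.max(axis=0))
    w = w / w.sum(axis=0)                      # (2, H, W), sums to 1 per pixel
    return w[0] * thermal + w[1] * visual
```

Because the gate is a per-pixel convex combination, the fused image never leaves the range spanned by the two inputs, which is one simple way to keep thermal saliency from being washed out by a blurry visual frame.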
https://arxiv.org/abs/2603.08208