The success of VLMs often relies on dynamic high-resolution schemes that adaptively split the input image into multiple crops so that image details are retained. However, such approaches produce a large number of redundant visual tokens, significantly reducing the efficiency of the VLMs. To improve VLM efficiency without introducing extra training costs, many works have been proposed to reduce the visual tokens by filtering out uninformative tokens or aggregating their information. Some approaches reduce the visual tokens according to the self-attention of the VLM, which is biased and can lead to inaccurate responses. Token reduction approaches that rely solely on visual cues are text-agnostic and fail to focus on the areas most relevant to the question, especially when the queried objects are not salient in the image. In this work, we first conduct experiments showing that the original text embeddings are aligned with the visual tokens, without bias toward the tailed visual tokens. We then propose a self-adaptive cross-modality attention mixture mechanism that dynamically balances visual saliency and text-to-image similarity in the pre-LLM layers to select the informative visual tokens. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art training-free VLM acceleration, especially when the reduction rate is sufficiently large.
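To make the selection step concrete, here is a minimal PyTorch sketch of ranking visual tokens by a mixture of visual saliency and text-to-image similarity; the mixing weight, min-max normalization, and function names are illustrative assumptions rather than the paper's exact mechanism.

```python
import torch

def select_visual_tokens(visual_tokens, text_embeds, saliency, keep_ratio=0.25, mix_weight=0.5):
    """Keep the top-k visual tokens ranked by a mixture of visual saliency
    and text-to-image similarity (a simplified, hypothetical scoring rule).

    visual_tokens: (N, D) visual token embeddings from the pre-LLM layers
    text_embeds:   (T, D) text (question) embeddings in the same space
    saliency:      (N,)   per-token visual saliency scores
    """
    v = torch.nn.functional.normalize(visual_tokens, dim=-1)
    t = torch.nn.functional.normalize(text_embeds, dim=-1)
    t2i = (v @ t.T).max(dim=-1).values                        # best match over question tokens

    def minmax(x):                                            # put both cues on [0, 1]
        return (x - x.min()) / (x.max() - x.min() + 1e-6)

    score = mix_weight * minmax(t2i) + (1 - mix_weight) * minmax(saliency)
    k = max(1, int(keep_ratio * visual_tokens.size(0)))
    keep = score.topk(k).indices.sort().values                # preserve spatial order
    return visual_tokens[keep], keep
```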
https://arxiv.org/abs/2501.09532
Video synthetic aperture radar (ViSAR) has attracted substantial attention in the moving target detection (MTD) field due to its ability to continuously monitor changes in the target area. In ViSAR, the shadows of moving targets do not offset or defocus, which is widely used as a feature for MTD. However, the shadows are difficult to distinguish from the low-scattering regions in the background, which causes more missed detections and false alarms. It is therefore worth investigating how to enhance the distinction between the shadows and the background. In this study, we propose the Shadow Enhancement and Background Suppression for ViSAR (SE-BSFV) algorithm. SE-BSFV is based on low-rank representation (LRR) theory and adopts an online subspace learning technique to enhance shadows and suppress the background in ViSAR images. First, we use a registration algorithm to register the ViSAR images and model the ViSAR data with a Gaussian mixture distribution (GMD). Second, the knowledge learned from previous frames is leveraged to estimate the GMD parameters of the current frame, and the expectation-maximization (EM) algorithm is used to estimate the subspace parameters; the foreground matrix of the current frame is then obtained. Finally, the alternating direction method of multipliers (ADMM) is used to eliminate strong scattering objects in the foreground matrix to obtain the final results. The experimental results indicate that SE-BSFV significantly enhances the shadows' saliency and greatly improves detection performance while remaining efficient compared with several other advanced pre-processing algorithms.
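For readers unfamiliar with the low-rank machinery, the following NumPy sketch shows a generic ADMM-based low-rank plus sparse decomposition (robust PCA) applied to a stack of registered frames; SE-BSFV itself additionally uses GMD modeling, online subspace learning, and EM, which are not reproduced here.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svt(x, tau):
    """Singular value thresholding (proximal operator of the nuclear norm)."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return u @ np.diag(soft_threshold(s, tau)) @ vt

def rpca_admm(D, lam=None, mu=None, n_iter=200, tol=1e-7):
    """Decompose D into a low-rank background L and a sparse foreground S."""
    m, n = D.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    mu = mu or 0.25 * m * n / (np.abs(D).sum() + 1e-12)
    L = np.zeros_like(D); S = np.zeros_like(D); Y = np.zeros_like(D)
    for _ in range(n_iter):
        L = svt(D - S + Y / mu, 1.0 / mu)                 # low-rank update
        S = soft_threshold(D - L + Y / mu, lam / mu)      # sparse update
        R = D - L - S
        Y += mu * R                                       # dual update
        if np.linalg.norm(R) / (np.linalg.norm(D) + 1e-12) < tol:
            break
    return L, S

# Usage: stack registered ViSAR frames as columns of D (pixels x frames);
# L captures the static background, S the moving shadows/targets.
```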
https://arxiv.org/abs/2501.09341
We present a simple usage of pre-trained Vision Transformers (ViTs) for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories, such as different bird species or dog breeds. Pre-trained ViTs such as DINO have shown remarkable capabilities to extract localized, informative features. However, saliency maps like Grad-CAM can hardly point out the traits: they often locate the whole object with a blurred, coarse heatmap rather than the traits. We propose a novel approach, Prompt Class Attention Map (Prompt-CAM), to the rescue. Prompt-CAM learns class-specific prompts for a pre-trained ViT and uses the corresponding outputs for classification. To classify an image correctly, the true-class prompt must attend to the unique image patches not seen in other classes' images, i.e., traits. As such, the true class's multi-head attention maps reveal traits and their locations. Implementation-wise, Prompt-CAM is almost a free lunch, requiring only a modification of the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM fairly easy to train and apply, sharply contrasting other interpretable methods that design specific models and training processes. It is even simpler than the recently published INterpretable TRansformer (INTR), whose encoder-decoder architecture prevents it from leveraging pre-trained ViTs. Extensive empirical studies on a dozen datasets from various domains (e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate Prompt-CAM's superior interpretation capability.
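A minimal sketch of the class-prompt idea in PyTorch: learnable class prompts attend over frozen ViT patch tokens, classification is read from the prompt outputs, and the attention weights double as trait maps. The head below is a simplification of Prompt-CAM's actual VPT-based implementation; dimensions are assumptions.

```python
import torch
import torch.nn as nn

class PromptCAMHead(nn.Module):
    """C learnable class prompts attend over frozen ViT patch tokens; the
    score for class c is read from prompt c, and its attention weights over
    patches serve as the trait map."""
    def __init__(self, dim=768, num_classes=200, num_heads=12):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, patch_tokens):                 # (B, N, dim) from a frozen ViT
        B = patch_tokens.size(0)
        q = self.prompts.unsqueeze(0).expand(B, -1, -1)          # (B, C, dim)
        out, attn = self.attn(q, patch_tokens, patch_tokens,
                              need_weights=True, average_attn_weights=False)
        logits = self.score(out).squeeze(-1)                      # (B, C)
        return logits, attn    # attn: (B, heads, C, N) -> per-class trait maps
```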
https://arxiv.org/abs/2501.09333
In this work we introduce Salient Information Preserving Adversarial Training (SIP-AT), an intuitive method for relieving the robustness-accuracy trade-off incurred by traditional adversarial training. SIP-AT uses salient image regions to guide the adversarial training process so that fragile features deemed meaningful by an annotator remain unperturbed during training, allowing models to learn highly predictive non-robust features without sacrificing overall robustness. This technique is compatible with both human-based and automatically generated salience estimates, allowing SIP-AT to be used as part of human-driven model development without being reliant upon additional human data. We perform experiments across multiple datasets and architectures and demonstrate that SIP-AT is able to boost the clean accuracy of models while maintaining a high degree of robustness against attacks at multiple epsilon levels. We complement our central experiments with an observational study measuring the rate at which human subjects successfully identify perturbed images. This study helps build a more intuitive understanding of adversarial attack strength and demonstrates the heightened importance of low-epsilon robustness. Our results demonstrate the efficacy of SIP-AT and provide valuable insight into the risks posed by adversarial samples of various strengths.
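A rough PyTorch sketch of the core idea, assuming a binary salience mask and an L-infinity PGD attack: the perturbation is suppressed inside salient regions so that annotator-meaningful fragile features stay unperturbed. The masking rule and hyperparameters are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def salience_masked_pgd(model, x, y, salience_mask, eps=8/255, alpha=2/255, steps=10):
    """PGD attack whose perturbation is zeroed where salience_mask == 1,
    i.e., only non-salient pixels are perturbed during adversarial training."""
    delta = torch.empty_like(x).uniform_(-eps, eps)
    keep = 1.0 - salience_mask                      # 1 where perturbation is allowed
    for _ in range(steps):
        delta = (delta * keep).detach().requires_grad_(True)
        loss = F.cross_entropy(model(torch.clamp(x + delta, 0, 1)), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = torch.clamp(delta + alpha * grad.sign(), -eps, eps)
    return torch.clamp(x + (delta * keep).detach(), 0, 1)

# Training step (sketch): loss = F.cross_entropy(model(salience_masked_pgd(...)), y)
```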
https://arxiv.org/abs/2501.09086
Foundation models (FMs) have shown transformative potential in radiology by performing diverse, complex tasks across imaging modalities. Here, we developed CT-FM, a large-scale 3D image-based pre-trained model designed explicitly for various radiological tasks. CT-FM was pre-trained using 148,000 computed tomography (CT) scans from the Imaging Data Commons through label-agnostic contrastive learning. We evaluated CT-FM across four categories of tasks, namely whole-body and tumor segmentation, head CT triage, medical image retrieval, and semantic understanding, showing superior performance against state-of-the-art models. Beyond quantitative success, CT-FM demonstrated the ability to cluster regions anatomically and identify similar anatomical and structural concepts across scans. Furthermore, it remained robust across test-retest settings, and the salient regions associated with its embeddings were reasonable. This study demonstrates the value of large-scale medical imaging foundation models and, by open-sourcing the model weights, code, and data, aims to support more adaptable, reliable, and interpretable AI solutions in radiology.
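As a reminder of what label-agnostic contrastive pretraining looks like, here is a generic InfoNCE loss over two augmented views of the same CT sub-volume; this is a standard formulation used as an illustration, not the exact CT-FM recipe.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Label-agnostic contrastive loss: paired views are positives, all other
    items in the batch are negatives. z1, z2: (B, D) embeddings of paired views."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature                # (B, B): positives on the diagonal
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```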
https://arxiv.org/abs/2501.09001
Feature extraction techniques are crucial in medical image classification; however, classical feature extractors combined with traditional machine learning classifiers often exhibit significant limitations in providing sufficient discriminative information for complex image sets. While Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have shown promise in feature extraction, they are prone to overfitting due to the inherent characteristics of medical imaging data, including small sample sizes and high intra-class variance. In this work, the Medical Image Attention-based Feature Extractor (MIAFEx) is proposed, a novel method that employs a learnable refinement mechanism to enhance the classification token within the Transformer encoder architecture. This mechanism adjusts the token based on learned weights, improving the extraction of salient features and enhancing the model's adaptability to the challenges presented by medical imaging data. The quality of the MIAFEx output features is compared against that of classical feature extractors using traditional and hybrid classifiers. The performance of these features is also compared against modern CNN and ViT models in classification tasks, demonstrating superior accuracy and robustness across multiple complex medical imaging classification datasets. This advantage is particularly pronounced in scenarios with limited training data, where traditional and modern models often struggle to generalize effectively. The source code of this proposal can be found at this https URL
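A minimal sketch of a learnable refinement of the classification token, assuming per-channel learned weights applied before the classifier; MIAFEx's actual attention-based mechanism is more elaborate than this simplification.

```python
import torch
import torch.nn as nn

class RefinedCLSHead(nn.Module):
    """Re-weight the Transformer [CLS] token with learned parameters before
    classification, then also return the refined features for downstream
    traditional/hybrid classifiers."""
    def __init__(self, dim=768, num_classes=4):
        super().__init__()
        self.refine = nn.Parameter(torch.ones(dim))   # learned per-channel weights
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, cls_token):                     # (B, dim) from a ViT encoder
        refined = self.norm(cls_token * torch.sigmoid(self.refine))
        return self.fc(refined), refined              # logits and refined features
```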
https://arxiv.org/abs/2501.08562
In this paper, we address the challenges in unsupervised video object segmentation (UVOS) by proposing an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. Unlike previous methods that focus solely on integrating appearance with motion or on modeling temporal relations, our method combines both aspects within a unified framework. MTNet is devised by effectively merging appearance and motion features during the feature extraction process within the encoders, promoting a more complementary representation. To capture the intricate long-range contextual dynamics and information embedded within videos, a temporal transformer module is introduced, facilitating efficacious inter-frame interactions throughout a video clip. Furthermore, we employ a cascade of decoders across all feature levels to optimally exploit the derived features, aiming to generate increasingly precise segmentation masks. As a result, MTNet provides a strong and compact framework that explores both temporal and cross-modality knowledge to robustly and efficiently localize and track the primary object in various challenging scenarios. Extensive experiments across diverse benchmarks conclusively show that our method not only attains state-of-the-art performance in unsupervised video object segmentation but also delivers competitive results in video salient object detection. These findings highlight the method's robust versatility and its adeptness in adapting to a range of segmentation tasks. Source code is available on this https URL.
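A small PyTorch sketch of the appearance-motion fusion step inside the encoders, using a gated residual as an illustrative stand-in for MTNet's fusion block; the gating form is an assumption.

```python
import torch
import torch.nn as nn

class AppearanceMotionFusion(nn.Module):
    """Fuse per-level appearance (RGB) and motion (optical-flow) features with
    a gated residual; one instance per encoder level."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid())
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, f_app, f_mot):                  # both (B, C, H, W)
        g = self.gate(torch.cat([f_app, f_mot], dim=1))
        fused = self.proj(torch.cat([f_app, g * f_mot], dim=1))
        return f_app + fused                          # complementary residual fusion
```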
https://arxiv.org/abs/2501.07806
Roadside billboards and other forms of outdoor advertising play a crucial role in marketing initiatives; however, they can also distract drivers, potentially contributing to accidents. This study delves into the significance of roadside advertising in images captured from a driver's perspective. Firstly, it evaluates the effectiveness of neural networks in detecting advertising along roads, focusing on the YOLOv5 and Faster R-CNN models. Secondly, the study addresses the determination of billboard significance using methods for saliency extraction. The UniSal and SpectralResidual methods were employed to create saliency maps for each image. The study establishes a database of eye tracking sessions captured during city highway driving to assess the saliency models.
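Of the two saliency methods used, SpectralResidual is simple enough to sketch directly; the following NumPy/OpenCV implementation follows the classic spectral-residual formulation (the working resolution and smoothing parameters are assumptions).

```python
import cv2
import numpy as np

def spectral_residual_saliency(image_bgr, size=64):
    """Spectral-residual saliency map (Hou & Zhang, 2007), rescaled to [0, 1]."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (size, size)).astype(np.float32)
    f = np.fft.fft2(small)
    log_amp = np.log(np.abs(f) + 1e-8)
    phase = np.angle(f)
    residual = log_amp - cv2.blur(log_amp, (3, 3))    # spectral residual
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    sal = cv2.GaussianBlur(sal, (9, 9), 2.5)
    sal = cv2.resize(sal, (image_bgr.shape[1], image_bgr.shape[0]))
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)
```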
https://arxiv.org/abs/2501.07342
In recent years, point cloud upsampling has been widely applied in fields such as 3D reconstruction. Our study investigates the factors influencing point cloud upsampling at both the global and local levels through representation learning. Specifically, we feed the global and local information of the same point cloud object into two encoders to extract these features, fuse them, and then pass the combined features to an upsampling decoder. The goal is to address the sparsity and noise of point clouds by leveraging prior knowledge from both global and local inputs, and the proposed framework can be applied to any state-of-the-art point cloud upsampling neural network. Experiments were conducted on a series of autoencoder-based deep learning models, yielding interpretability for both global and local inputs, and the results show that our proposed framework can further improve the upsampling performance of previous state-of-the-art works. At the same time, the saliency maps reflect the differences between global and local feature inputs, as well as the effectiveness of training with both inputs in parallel.
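A skeleton of the dual-encoder idea: a global encoder summarizes the whole cloud, a local encoder processes a patch, and the fused features drive an r-times upsampling decoder. Layer sizes, the shared-MLP encoders, and the offset-based decoder are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualBranchUpsampler(nn.Module):
    """Global + local encoders whose fused features predict r offsets per point."""
    def __init__(self, ratio=4, dim=128):
        super().__init__()
        def encoder():
            return nn.Sequential(nn.Conv1d(3, 64, 1), nn.ReLU(),
                                 nn.Conv1d(64, dim, 1), nn.ReLU())
        self.global_enc, self.local_enc = encoder(), encoder()
        self.decoder = nn.Sequential(nn.Conv1d(2 * dim, dim, 1), nn.ReLU(),
                                     nn.Conv1d(dim, 3 * ratio, 1))
        self.ratio = ratio

    def forward(self, global_pts, local_pts):          # (B, 3, N_g), (B, 3, N)
        g = self.global_enc(global_pts).max(dim=-1, keepdim=True).values
        l = self.local_enc(local_pts)                   # (B, dim, N)
        fused = torch.cat([l, g.expand(-1, -1, l.size(-1))], dim=1)
        offsets = self.decoder(fused)                   # (B, 3*r, N)
        B, _, N = offsets.shape
        up = local_pts.unsqueeze(2) + offsets.view(B, 3, self.ratio, N)
        return up.reshape(B, 3, self.ratio * N)         # upsampled patch
```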
https://arxiv.org/abs/2501.07076
Most current salient object detection approaches use deeper networks with large backbones to produce more accurate predictions, which results in a significant increase in computational complexity. A great number of network designs follow the plain U-Net and Feature Pyramid Network (FPN) architectures, whose feature extraction and aggregation abilities are limited. This motivated us to design a lightweight post-decoder refinement module, the crossed post-decoder refinement (CPDR), to enhance the feature representation of a standard FPN or U-Net framework. Specifically, we introduce Attention Down Sample Fusion (ADF), which employs channel attention with attention maps generated by the high-level representation to refine the low-level features, and Attention Up Sample Fusion (AUF), which leverages the low-level information to guide the high-level features through spatial attention. Additionally, we propose the Dual Attention Cross Fusion (DACF) built upon ADF and AUF, which reduces the number of parameters while maintaining the performance. Experiments on five benchmark datasets demonstrate that our method outperforms previous state-of-the-art approaches.
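Minimal sketches of the two fusion directions, assuming squeeze-style channel attention for ADF and single-channel spatial attention for AUF; the exact CPDR blocks differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDownFusion(nn.Module):
    """ADF-style: channel attention from the high-level feature refines the low-level one."""
    def __init__(self, channels):
        super().__init__()
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, low, high):                      # low: (B,C,2H,2W), high: (B,C,H,W)
        low = low * self.ca(high)                      # channel re-weighting
        return low + F.interpolate(high, size=low.shape[-2:], mode="bilinear",
                                   align_corners=False)

class AttentionUpFusion(nn.Module):
    """AUF-style: spatial attention from the low-level feature guides the high-level one."""
    def __init__(self, channels):
        super().__init__()
        self.sa = nn.Sequential(nn.Conv2d(channels, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, low, high):
        w = F.interpolate(self.sa(low), size=high.shape[-2:], mode="bilinear",
                          align_corners=False)         # (B, 1, H, W)
        return high * w + high
```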
https://arxiv.org/abs/2501.06441
Salient Object Detection (SOD) aims to identify and segment prominent regions within a scene. Traditional models rely on manually annotated pseudo labels with precise pixel-level accuracy, which is time-consuming. We developed a low-cost, high-precision annotation method by leveraging large foundation models to address these challenges. Specifically, we use a weakly supervised approach to guide large models in generating pseudo-labels through textual prompts. Since large models do not effectively focus on the salient regions of images, we manually annotate a subset of text to fine-tune the model. Based on this approach, which enables precise and rapid generation of pseudo-labels, we introduce a new dataset, BDS-TR. Compared to the previous DUTS-TR dataset, BDS-TR is considerably larger in scale and encompasses a wider variety of categories and scenes. This expansion will enhance our model's applicability across a broader range of scenarios and provide a more comprehensive foundational dataset for future SOD research. Additionally, we present an edge decoder based on dynamic upsampling, which focuses on object edges while gradually recovering image feature resolution. Comprehensive experiments on five benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches and also surpasses several existing fully-supervised SOD methods. The code and results will be made available.
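One way to read "edge decoder based on dynamic upsampling" is a decoder stage with learned upsampling and an auxiliary edge head, as in the hypothetical sketch below; it only illustrates the idea and is not the paper's design.

```python
import torch
import torch.nn as nn

class EdgeAwareDecoderBlock(nn.Module):
    """One decoder stage: learned (pixel-shuffle) x2 upsampling of the coarse
    feature, fusion with the skip feature, plus an auxiliary edge prediction."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Sequential(nn.Conv2d(in_ch, out_ch * 4, 3, padding=1),
                                nn.PixelShuffle(2))
        self.fuse = nn.Sequential(nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1),
                                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.edge_head = nn.Conv2d(out_ch, 1, 1)       # supervised with edge maps

    def forward(self, x, skip):                        # skip is 2x the resolution of x
        x = self.fuse(torch.cat([self.up(x), skip], dim=1))
        return x, self.edge_head(x)
```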
https://arxiv.org/abs/2501.04582
Prostate cancer is a leading cause of cancer-related mortality in men. The registration of magnetic resonance (MR) and transrectal ultrasound (TRUS) can provide guidance for the targeted biopsy of prostate cancer. In this study, we propose a salient region matching framework for fully automated MR-TRUS registration. The framework consists of prostate segmentation, rigid alignment and deformable registration. Prostate segmentation is performed using two segmentation networks on MR and TRUS respectively, and the predicted salient regions are used for the rigid alignment. The rigidly-aligned MR and TRUS images serve as initialization for the deformable registration. The deformable registration network has a dual-stream encoder with cross-modal spatial attention modules to facilitate multi-modality feature learning, and a salient region matching loss to consider both structure and intensity similarity within the prostate region. Experiments on a public MR-TRUS dataset demonstrate that our method achieves satisfactory registration results, outperforming several cutting-edge methods. The code is publicly available at this https URL.
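A hedged sketch of what a salient region matching loss could look like: a Dice term between the warped MR and TRUS prostate masks plus an intensity-similarity term restricted to the prostate region. The weighting and the normalized-L2 similarity are assumptions, not the paper's exact formulation.

```python
import torch

def salient_region_matching_loss(warped_mr, trus, warped_mask, trus_mask, alpha=0.5):
    """Structure term: Dice between warped MR mask and TRUS mask.
    Intensity term: normalized L2 inside the overlapping prostate region."""
    inter = (warped_mask * trus_mask).sum()
    dice = 1.0 - 2.0 * inter / (warped_mask.sum() + trus_mask.sum() + 1e-6)
    region = (warped_mask * trus_mask) > 0.5
    if region.any():
        a, b = warped_mr[region], trus[region]
        a = (a - a.mean()) / (a.std() + 1e-6)
        b = (b - b.mean()) / (b.std() + 1e-6)
        intensity = ((a - b) ** 2).mean()
    else:
        intensity = torch.zeros((), device=warped_mr.device)
    return alpha * dice + (1 - alpha) * intensity
```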
https://arxiv.org/abs/2501.03510
In the domain of 3D object classification, a fundamental challenge lies in addressing the scarcity of labeled data, which limits the applicability of traditional data-intensive learning paradigms. This challenge is particularly pronounced in few-shot learning scenarios, where the objective is to achieve robust generalization from minimal annotated samples. To overcome these limitations, it is crucial to identify and leverage the most salient and discriminative features of 3D objects, thereby enhancing learning efficiency and reducing dependency on large-scale labeled datasets. This work introduces RW-Net, a novel framework designed to address the challenges above by integrating Rate-Distortion Explanation (RDE) and wavelet transform into a state-of-the-art projection-based 3D object classification architecture. The proposed method capitalizes on RDE to extract critical features by identifying and preserving the most informative data components while reducing redundancy. This process ensures the retention of essential information for effective decision-making, optimizing the model's ability to learn from limited data. Complementing RDE, incorporating the wavelet transform further enhances the framework's capability to generalize in low-data regimes. By emphasizing low-frequency components of the input data, the wavelet transform captures fundamental geometric and structural attributes of 3D objects. These attributes are instrumental in mitigating overfitting and improving the robustness of the learned representations across diverse tasks and domains. To validate the effectiveness of our RW-Net, we conduct extensive experiments on three datasets: ModelNet40, ModelNet40-C, and ScanObjectNN for few-shot 3D object classification. The results demonstrate that our approach achieves state-of-the-art performance and exhibits superior generalization and robustness in few-shot learning scenarios.
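The wavelet component is easy to illustrate: keep only the low-frequency sub-band of a 2D projection so that coarse geometric structure dominates. The sketch below uses PyWavelets; the wavelet family and decomposition level are assumptions, not RW-Net's configuration.

```python
import numpy as np
import pywt

def low_frequency_projection(depth_image, wavelet="haar", levels=2):
    """Keep only the low-frequency wavelet sub-band of a 2D projection of a
    3D object, emphasizing coarse geometric and structural attributes."""
    coeffs = pywt.wavedec2(depth_image, wavelet=wavelet, level=levels)
    # zero out all detail (high-frequency) sub-bands
    coeffs = [coeffs[0]] + [tuple(np.zeros_like(d) for d in detail)
                            for detail in coeffs[1:]]
    recon = pywt.waverec2(coeffs, wavelet=wavelet)
    return recon[:depth_image.shape[0], :depth_image.shape[1]]
```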
https://arxiv.org/abs/2501.03221
Multiple Instance Learning (MIL) methods allow for gigapixel Whole-Slide Image (WSI) analysis with only slide-level annotations. Interpretability is crucial for safely deploying such algorithms in high-stakes medical domains. Traditional MIL methods offer explanations by highlighting salient regions. However, such spatial heatmaps provide limited insights for end users. To address this, we propose a novel inherently interpretable WSI-classification approach that uses human-understandable pathology concepts to generate explanations. Our proposed Concept MIL model leverages recent advances in vision-language models to directly predict pathology concepts based on image features. The model's predictions are obtained through a linear combination of the concepts identified on the top-K patches of a WSI, enabling inherent explanations by tracing each concept's influence on the prediction. In contrast to traditional concept-based interpretable models, our approach eliminates the need for costly human annotations by leveraging the vision-language model. We validate our method on two widely used pathology datasets: Camelyon16 and PANDA. On both datasets, Concept MIL achieves AUC and accuracy scores over 0.9, putting it on par with state-of-the-art models. We further find that 87.1% (Camelyon16) and 85.3% (PANDA) of the top 20 patches fall within the tumor region. A user study shows that the concepts identified by our model align with the concepts used by pathologists, making it a promising strategy for human-interpretable WSI classification.
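The interpretable prediction head is essentially a linear model over concept scores; a minimal PyTorch sketch (with the vision-language concept-scoring pipeline assumed, not reproduced) looks like this.

```python
import torch
import torch.nn as nn

class ConceptMILHead(nn.Module):
    """Linear combination of concept scores aggregated over the top-K patches,
    so each concept's contribution to the slide-level prediction is explicit."""
    def __init__(self, num_concepts, num_classes=2):
        super().__init__()
        self.linear = nn.Linear(num_concepts, num_classes)

    def forward(self, concept_scores):                 # (B, K, num_concepts)
        bag = concept_scores.mean(dim=1)               # aggregate top-K patches
        logits = self.linear(bag)
        # per-concept contribution to each class: activation * linear weight
        contributions = bag.unsqueeze(-1) * self.linear.weight.t().unsqueeze(0)
        return logits, contributions                   # (B, C), (B, num_concepts, C)
```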
https://arxiv.org/abs/2501.02922
RGB-D salient object detection (SOD), aiming to highlight prominent regions of a given scene by jointly modeling RGB and depth information, is one of the challenging pixel-level prediction tasks. Recently, dual-attention mechanisms have been applied to this area due to their ability to strengthen the detection process. However, most existing methods directly fuse attentional cross-modality features under a manual-mandatory fusion paradigm without considering the inherent discrepancy between RGB and depth, which may lead to a reduction in performance. Moreover, the long-range dependencies derived from global and local information make it difficult to leverage a unified efficient fusion strategy. Hence, in this paper, we propose GL-DMNet, a novel dual mutual learning network with global-local awareness. Specifically, we present a position mutual fusion module and a channel mutual fusion module to exploit the interdependencies among different modalities in the spatial and channel dimensions. Besides, we adopt an efficient decoder based on cascade transformer-infused reconstruction to jointly integrate multi-level fusion features. Extensive experiments on six benchmark datasets demonstrate that our proposed GL-DMNet performs better than 24 RGB-D SOD methods, achieving an average improvement of ~3% across four evaluation metrics compared to the second-best model (S3Net). Codes and results are available at this https URL.
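A compact sketch of the mutual-gating idea behind channel mutual fusion, where each modality's channel descriptor gates the other; GL-DMNet's actual modules (including the position mutual fusion) are more elaborate, so treat this as an illustration only.

```python
import torch
import torch.nn as nn

class ChannelMutualFusion(nn.Module):
    """Mutual channel attention between RGB and depth features: depth gates
    RGB, RGB gates depth, and the gated features are summed."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        def gate():
            return nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(channels, channels // reduction, 1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(channels // reduction, channels, 1),
                                 nn.Sigmoid())
        self.rgb_gate, self.depth_gate = gate(), gate()

    def forward(self, f_rgb, f_depth):                 # both (B, C, H, W)
        rgb_refined = f_rgb * self.depth_gate(f_depth)     # depth gates RGB
        depth_refined = f_depth * self.rgb_gate(f_rgb)     # RGB gates depth
        return rgb_refined + depth_refined
```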
https://arxiv.org/abs/2501.01648
In this study, we introduce the Multi-Head Explainer (MHEX), a versatile and modular framework that enhances both the explainability and accuracy of Convolutional Neural Networks (CNNs) and Transformer-based models. MHEX consists of three core components: an Attention Gate that dynamically highlights task-relevant features, Deep Supervision that guides early layers to capture fine-grained details pertinent to the target class, and an Equivalent Matrix that unifies refined local and global representations to generate comprehensive saliency maps. Our approach demonstrates superior compatibility, enabling effortless integration into existing residual networks like ResNet and Transformer architectures such as BERT with minimal modifications. Extensive experiments on benchmark datasets in medical imaging and text classification show that MHEX not only improves classification accuracy but also produces highly interpretable and detailed saliency scores.
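A minimal sketch of an attention-gate block with a deeply supervised auxiliary head, which is the flavor of two of MHEX's three components; the layer shapes and the single-channel gate are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """A 1x1 convolution scores how task-relevant each spatial location is,
    and an auxiliary (deeply supervised) classifier is attached to the gated feature."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.aux_head = nn.Linear(channels, num_classes)   # deep supervision branch

    def forward(self, feat):                            # (B, C, H, W)
        a = self.gate(feat)                             # (B, 1, H, W) saliency-like map
        gated = feat * a
        aux_logits = self.aux_head(F.adaptive_avg_pool2d(gated, 1).flatten(1))
        return gated, a, aux_logits
```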
https://arxiv.org/abs/2501.01311
Humans can perceive speakers' characteristics (e.g., identity, gender, personality, and emotion) from their appearance, which is generally aligned with their voice style. Recently, vision-driven Text-to-Speech (TTS) scholars have grounded their investigations on real-person faces, thereby preventing effective speech synthesis from being applied to the vast range of potential usage scenarios with diverse characters and image styles. To solve this issue, we introduce a novel FaceSpeak approach. It extracts salient identity characteristics and emotional representations from a wide variety of image styles. Meanwhile, it mitigates extraneous information (e.g., background, clothing, and hair color), resulting in synthesized speech closely aligned with a character's persona. Furthermore, to overcome the scarcity of multi-modal TTS data, we have devised an innovative dataset, namely Expressive Multi-Modal TTS, which is diligently curated and annotated to facilitate research in this domain. The experimental results demonstrate that our proposed FaceSpeak can generate portrait-aligned voice with satisfactory naturalness and quality.
https://arxiv.org/abs/2501.03181
Recently, zero-shot image captioning has gained increasing attention, where only text data is available for training. The remarkable progress in text-to-image diffusion models presents the potential to resolve this task by employing synthetic image-caption pairs generated by such a pre-trained prior. Nonetheless, defective details in the salient regions of the synthetic images introduce semantic misalignment between the synthetic image and text, leading to compromised results. To address this challenge, we propose a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism to adaptively mitigate the unfaithful contents in a fine-grained manner during training, which can be integrated into most encoder-decoder frameworks, introducing our PCM-Net. Specifically, for each input image, salient visual concepts in the image are first detected considering the image-text similarity in CLIP space. Next, the patch-wise visual features of the input image are selectively fused with the textual features of the salient visual concepts, leading to a mixed-up feature map with less defective content. Finally, a visual-semantic encoder is exploited to refine the derived feature map, which is further incorporated into the sentence decoder for caption generation. Additionally, to facilitate model training with synthetic data, a novel CLIP-weighted cross-entropy loss is devised to prioritize the high-quality image-text pairs over the low-quality counterparts. Extensive experiments on the MSCOCO and Flickr30k datasets demonstrate the superiority of our PCM-Net compared with state-of-the-art VLM-based approaches. It is noteworthy that our PCM-Net ranks first in both in-domain and cross-domain zero-shot image captioning. The synthetic dataset SynthImgCap and code are available at this https URL.
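The CLIP-weighted cross-entropy is the most self-contained piece to illustrate: per-pair caption losses are re-weighted by the CLIP similarity of the synthetic image-text pair so higher-quality pairs dominate training. The softmax normalization and temperature below are assumptions, not the paper's exact weighting.

```python
import torch
import torch.nn.functional as F

def clip_weighted_cross_entropy(logits, targets, clip_sims, ignore_index=0):
    """logits:    (B, T, V) caption-token logits
    targets:   (B, T)    ground-truth token ids (ignore_index marks padding)
    clip_sims: (B,)      CLIP image-text similarity of each synthetic pair"""
    B, T, V = logits.shape
    token_loss = F.cross_entropy(logits.reshape(B * T, V), targets.reshape(B * T),
                                 ignore_index=ignore_index,
                                 reduction="none").view(B, T)
    mask = (targets != ignore_index).float()
    per_pair = (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    weights = torch.softmax(clip_sims / 0.1, dim=0) * B   # emphasize high-similarity pairs
    return (weights * per_pair).mean()
```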
https://arxiv.org/abs/2501.00437
Introduction: Healthcare AI models often inherit biases from their training data. While efforts have primarily targeted bias in structured data, mental health heavily depends on unstructured data. This study aims to detect and mitigate linguistic differences related to non-biological differences in the training data of AI models designed to assist in pediatric mental health screening. Our objectives are: (1) to assess the presence of bias by evaluating outcome parity across sex subgroups, (2) to identify bias sources through textual distribution analysis, and (3) to develop a de-biasing method for mental health text data. Methods: We examined classification parity across demographic groups and assessed how gendered language influences model predictions. A data-centric de-biasing method was applied, focusing on neutralizing biased terms while retaining salient clinical information. This methodology was tested on a model for automatic anxiety detection in pediatric patients. Results: Our findings revealed a systematic under-diagnosis of female adolescent patients, with a 4% lower accuracy and a 9% higher False Negative Rate (FNR) compared to male patients, likely due to disparities in information density and linguistic differences in patient notes. Notes for male patients were on average 500 words longer, and linguistic similarity metrics indicated distinct word distributions between genders. Implementing our de-biasing approach reduced diagnostic bias by up to 27%, demonstrating its effectiveness in enhancing equity across demographic groups. Discussion: We developed a data-centric de-biasing framework to address gender-based content disparities within clinical text. By neutralizing biased language and enhancing focus on clinically essential information, our approach demonstrates an effective strategy for mitigating bias in AI healthcare models trained on text.
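A toy sketch of the data-centric neutralization step: gendered terms are mapped to neutral equivalents while the rest of the note, including clinical content, is left untouched. The term map here is illustrative only; the study's actual lexicon and procedure are not reproduced.

```python
import re

# Illustrative term map only, not the study's de-biasing lexicon.
NEUTRAL_MAP = {
    r"\bshe\b": "the patient", r"\bhe\b": "the patient",
    r"\bher\b": "their", r"\bhis\b": "their",
    r"\bgirl\b": "adolescent", r"\bboy\b": "adolescent",
}

def neutralize_gendered_terms(note: str) -> str:
    """Replace gendered terms in a clinical note with neutral equivalents."""
    out = note
    for pattern, repl in NEUTRAL_MAP.items():
        out = re.sub(pattern, repl, out, flags=re.IGNORECASE)
    return out

# Example: neutralize_gendered_terms("She reports her anxiety worsened at school.")
# -> "the patient reports their anxiety worsened at school."
```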
https://arxiv.org/abs/2501.00129
The AUTO-PCOS Classification Challenge seeks to advance the diagnostic capabilities of artificial intelligence (AI) in identifying Polycystic Ovary Syndrome (PCOS) through automated classification of healthy and unhealthy ultrasound frames. This report outlines our methodology for building a robust AI pipeline utilizing transfer learning with the InceptionV3 architecture to achieve high accuracy in binary classification. Preprocessing steps ensured the dataset was optimized for training, validation, and testing, while interpretability methods like LIME and saliency maps provided valuable insights into the model's decision-making. Our approach achieved an accuracy of 90.52%, with precision, recall, and F1-score metrics exceeding 90% on validation data, demonstrating its efficacy. The project underscores the transformative potential of AI in healthcare, particularly in addressing diagnostic challenges like PCOS. Key findings, challenges, and recommendations for future enhancements are discussed, highlighting the pathway for creating reliable, interpretable, and scalable AI-driven medical diagnostic tools.
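A minimal transfer-learning sketch with torchvision's InceptionV3, freezing the backbone and replacing the classification heads; hyperparameters and preprocessing are illustrative, not the submission's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_pcos_classifier(num_classes=2, freeze_backbone=True):
    """InceptionV3 transfer learning for binary healthy / unhealthy frame classification."""
    model = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
    if freeze_backbone:
        for p in model.parameters():
            p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)              # new main head
    model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, num_classes)
    return model

# Training note: InceptionV3 expects 299x299 inputs and, in train mode, returns
# (logits, aux_logits); combine the losses, e.g. loss = ce(main) + 0.4 * ce(aux).
```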
https://arxiv.org/abs/2501.01984