Saliency maps have become one of the most widely used interpretability techniques for convolutional neural networks (CNNs) due to their simplicity and the quality of the insights they provide. However, doubts remain about whether these insights are a trustworthy representation of what CNNs use to arrive at their predictions. This paper explores how recovering the sign of the gradients in the saliency map can lead to a deeper understanding of multi-class classification problems. Using both pretrained and from-scratch CNNs, we show that considering the sign and the effect not only of the correct class, but also the influence of the other classes, allows us to better identify the pixels of the image that the network is really focusing on. Furthermore, how occluding or altering those pixels is expected to affect the outcome also becomes clearer.
https://arxiv.org/abs/2309.12913
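To make the signed-gradient idea above concrete, here is a minimal sketch assuming a standard PyTorch classifier; the model, the class indices, and the subtraction of a competing class's map are illustrative, not the paper's exact procedure.

```python
import torch

def signed_saliency(model, x, target_class):
    """Signed gradient of one class score w.r.t. the input pixels.

    Positive values mark pixels whose increase raises the class score,
    negative values mark pixels whose increase lowers it.
    """
    model.eval()
    x = x.clone().requires_grad_(True)
    scores = model(x)                        # (N, num_classes) logits
    scores[:, target_class].sum().backward()
    return x.grad.detach()                   # keep the sign, do not take abs()

# Contrasting the correct class with a competing class, in the spirit of the abstract:
# sal_cat  = signed_saliency(model, x, target_class=281)  # hypothetical "cat" index
# sal_dog  = signed_saliency(model, x, target_class=207)  # hypothetical "dog" index
# relative = sal_cat - sal_dog   # pixels pushing the decision toward "cat"
```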
While image data has started to benefit from the simple-but-effective self-supervised learning scheme built upon masking and a self-reconstruction objective, thanks to the introduction of the tokenization procedure and the vision transformer backbone, convolutional neural networks, another important and widely adopted architecture for image data, still struggle to leverage such a straightforward and general masking operation to significantly benefit their learning process, even though contrastive-learning techniques drive their self-supervised learning. In this work, we aim to ease the inclusion of the masking operation into the contrastive-learning framework for convolutional neural networks as an extra augmentation method. In addition to the additive but unwanted edges (between masked and unmasked regions) and other adverse effects of masking for ConvNets, which have been discussed in prior works, we identify the potential problem that, for one view in a contrastive sample pair, the randomly sampled masking regions could be overly concentrated on important/salient objects and thus yield misleading contrastiveness with respect to the other view. To this end, we propose to explicitly take a saliency constraint into consideration so that the masked regions are more evenly distributed between foreground and background when realizing the masking-based augmentation. Moreover, we introduce hard negative samples by masking larger regions of salient patches in an input image. Extensive experiments conducted on various datasets, contrastive learning mechanisms, and downstream tasks verify the efficacy and superior performance of our proposed method with respect to several state-of-the-art baselines.
https://arxiv.org/abs/2309.12757
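A rough NumPy sketch of the saliency-constrained masking described above; the patch size, the median-based foreground/background split, and the 50/50 masking budget are assumptions for illustration rather than the paper's exact scheme.

```python
import numpy as np

def saliency_balanced_mask(saliency, mask_ratio=0.5, patch=16, rng=None):
    """Sample a patch-level mask whose masked patches are split evenly
    between salient (foreground) and non-salient (background) patches.

    saliency: (H, W) map in [0, 1]; a patch counts as 'salient' if its mean
    saliency exceeds the global median (an illustrative threshold).
    """
    rng = np.random.default_rng(rng)
    H, W = saliency.shape
    gh, gw = H // patch, W // patch
    patch_sal = saliency[:gh * patch, :gw * patch] \
        .reshape(gh, patch, gw, patch).mean(axis=(1, 3)).ravel()
    fg = np.where(patch_sal >= np.median(patch_sal))[0]
    bg = np.where(patch_sal <  np.median(patch_sal))[0]

    n_mask = int(mask_ratio * gh * gw)
    n_fg = min(len(fg), n_mask // 2)            # half of the budget on foreground
    n_bg = min(len(bg), n_mask - n_fg)          # remainder on background
    masked = np.concatenate([rng.choice(fg, n_fg, replace=False),
                             rng.choice(bg, n_bg, replace=False)])
    mask = np.zeros(gh * gw, dtype=bool)
    mask[masked] = True
    return mask.reshape(gh, gw)                 # True = patch is masked out
```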
Surface defect inspection is a very challenging task in which surface defects usually show weak appearances or exist under complex backgrounds. Most high-accuracy defect detection methods require expensive computation and storage overhead, making them less practical in some resource-constrained defect detection applications. Although some lightweight methods achieve real-time inference speed with fewer parameters, they show poor detection accuracy in complex defect scenarios. To this end, we develop a Global Context Aggregation Network (GCANet) for lightweight saliency detection of surface defects, built on an encoder-decoder structure. First, we introduce a novel transformer encoder on the top layer of the lightweight backbone, which captures global context information through a novel Depth-wise Self-Attention (DSA) module. The proposed DSA performs element-wise similarity in the channel dimension while maintaining linear complexity. In addition, we introduce a novel Channel Reference Attention (CRA) module before each decoder block to strengthen the representation of multi-level features in the bottom-up path. The proposed CRA exploits the channel correlation between features at different layers to adaptively enhance feature representation. Experimental results on three public defect datasets demonstrate that the proposed network achieves a better trade-off between accuracy and running efficiency than 17 other state-of-the-art methods. Specifically, GCANet achieves competitive accuracy (91.79% $F_{\beta}^{w}$, 93.55% $S_\alpha$, and 97.35% $E_\phi$) on SD-saliency-900 while running at 272 fps on a single GPU.
https://arxiv.org/abs/2309.12641
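The Depth-wise Self-Attention module is not specified in full in the abstract; the sketch below shows the general idea of computing attention over the channel dimension so that cost stays linear in the number of pixels. It is an illustrative stand-in, not the paper's DSA.

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Self-attention over the channel dimension (an illustrative stand-in for
    Depth-wise Self-Attention): similarity is computed between C-dimensional
    channel descriptors, so cost grows as O(C^2 * HW) rather than
    O((HW)^2 * C), i.e. linearly in the number of pixels."""

    def __init__(self, channels):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).reshape(b, 3, c, h * w).unbind(dim=1)   # each (B, C, HW)
        attn = torch.softmax(q @ k.transpose(-2, -1) / (h * w) ** 0.5, dim=-1)  # (B, C, C)
        out = (attn @ v).reshape(b, c, h, w)
        return self.proj(out) + x    # residual connection
```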
Video memorability is a measure of how likely a particular video is to be remembered by a viewer who has no emotional connection with its content. It is an important characteristic, as videos that are more memorable are more likely to be shared, viewed, and discussed. This paper presents the results of a series of experiments in which we improved the memorability of a video by selectively cropping frames based on image saliency. We present results for a basic fixed crop as well as for dynamic cropping, where both the size of the crop and its position within the frame move as the video plays and saliency is tracked. Our results indicate that, especially for videos of low initial memorability, the memorability score can be improved.
https://arxiv.org/abs/2309.11881
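A minimal sketch of the dynamic, saliency-tracking crop described above; the crop fraction and the exponential smoothing of the crop centre are assumptions, as the abstract does not fix them.

```python
import numpy as np

def saliency_crop(frames, saliency_maps, crop_frac=0.8, smooth=0.9):
    """Crop each frame around the saliency centroid, smoothing the crop
    centre over time so the window drifts rather than jumps.

    frames:        list of (H, W, 3) frames
    saliency_maps: matching list of (H, W) saliency maps
    crop_frac:     crop side length relative to the frame (assumed fixed here)
    """
    H, W = saliency_maps[0].shape
    ch, cw = int(H * crop_frac), int(W * crop_frac)
    ys, xs = np.mgrid[0:H, 0:W]
    centre = np.array([H / 2, W / 2])
    out = []
    for frame, sal in zip(frames, saliency_maps):
        w = sal / (sal.sum() + 1e-8)
        target = np.array([(ys * w).sum(), (xs * w).sum()])   # saliency centroid
        centre = smooth * centre + (1 - smooth) * target      # temporal smoothing
        y0 = int(np.clip(centre[0] - ch / 2, 0, H - ch))
        x0 = int(np.clip(centre[1] - cw / 2, 0, W - cw))
        out.append(frame[y0:y0 + ch, x0:x0 + cw])
    return out
```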
Recently, several approaches have emerged for generating neural representations with multiple levels of detail (LODs). LODs can improve rendering by using lower resolutions and smaller model sizes when appropriate. However, existing methods generally focus on a few discrete LODs, which suffer from aliasing and flicker artifacts as detail levels change and whose coarse granularity limits adaptation to resource constraints. In this paper, we propose a method to encode light field networks with continuous LODs, allowing finely tuned adaptation to rendering conditions. Our training procedure uses summed-area table filtering, allowing efficient and continuous filtering at various LODs. Furthermore, we use saliency-based importance sampling, which enables our light field networks to devote their capacity, particularly limited at lower LODs, to the details viewers are most likely to focus on. Incorporating continuous LODs into neural representations enables progressive streaming of neural representations, decreasing the latency and resource utilization of rendering.
https://arxiv.org/abs/2309.11591
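Summed-area table (integral image) filtering, used above for efficient continuous filtering across LODs, is a standard construction; a minimal NumPy sketch (the continuous-LOD training loop itself is not reproduced here):

```python
import numpy as np

def summed_area_table(img):
    """Integral image with a zero row/column prepended so box sums
    become four lookups."""
    sat = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return np.pad(sat, ((1, 0), (1, 0)))

def box_filter(sat, y, x, r):
    """Mean of the (2r+1)x(2r+1) window centred at (y, x), in O(1) per query.
    The window is assumed to lie inside the image for brevity."""
    y0, y1 = y - r, y + r + 1
    x0, x1 = x - r, x + r + 1
    total = sat[y1, x1] - sat[y0, x1] - sat[y1, x0] + sat[y0, x0]
    return total / ((2 * r + 1) ** 2)

# img = np.random.rand(64, 64)
# sat = summed_area_table(img)
# box_filter(sat, 32, 32, r=4)   # matches img[28:37, 28:37].mean()
```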
The adoption of machine learning in healthcare calls for model transparency and explainability. In this work, we introduce Signature Activation, a saliency method that generates holistic and class-agnostic explanations for Convolutional Neural Network (CNN) outputs. Our method exploits the fact that certain kinds of medical images, such as angiograms, have clear foreground and background objects. We give a theoretical explanation to justify our method. We show its potential use in clinical settings by evaluating its efficacy in aiding the detection of lesions in coronary angiograms.
https://arxiv.org/abs/2309.11443
We present a set of metrics that utilize vision priors to effectively assess the performance of saliency methods on image classification tasks. To understand behavior in deep learning models, many methods provide visual saliency maps emphasizing the image regions that contribute most to a model's prediction. However, there is limited work on analyzing the reliability of saliency methods in explaining model decisions. We propose the metric COnsistency-SEnsitivity (COSE), which quantifies the equivariant and invariant properties of visual model explanations using simple data augmentations. Through our metrics, we show that although saliency methods are thought to be architecture-independent, most methods explain transformer-based models better than convolution-based models. In addition, GradCAM outperforms other methods in terms of COSE but shows limitations such as a lack of variability for fine-grained datasets. The duality between consistency and sensitivity allows saliency methods to be analyzed from different angles. Ultimately, we find that balancing these two metrics is important for a saliency map to faithfully show model behavior.
https://arxiv.org/abs/2309.10989
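COSE's exact definitions are given in the paper; the sketch below only shows the flavour of an equivariance (consistency) check under a horizontal flip, with cosine similarity as an assumed comparison function and `saliency_fn` standing in for any saliency method.

```python
import numpy as np

def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def flip_consistency(saliency_fn, image):
    """Equivariance check under a horizontal flip: the saliency of the
    flipped image should match the flipped saliency of the original.

    saliency_fn: callable mapping an (H, W, 3) image to an (H, W) map.
    Returns a similarity in [-1, 1]; higher means more consistent.
    """
    sal         = saliency_fn(image)
    sal_of_flip = saliency_fn(image[:, ::-1])
    return cosine(sal[:, ::-1], sal_of_flip)
```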
Accurately determining salient regions of an image is challenging when labeled data is scarce. DINO-based self-supervised approaches have recently leveraged meaningful image semantics captured by patch-wise features for locating foreground objects. Recent methods have also incorporated intuitive priors and demonstrated value in unsupervised methods for object partitioning. In this paper, we propose SEMPART, which jointly infers coarse and fine bi-partitions over an image's DINO-based semantic graph. Furthermore, SEMPART preserves fine boundary details using graph-driven regularization and successfully distills the coarse mask semantics into the fine mask. Our salient object detection and single object localization findings suggest that SEMPART produces high-quality masks rapidly without additional post-processing and benefits from co-optimizing the coarse and fine branches.
https://arxiv.org/abs/2309.10972
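SEMPART's joint coarse/fine inference is not reproduced here, but a spectral bipartition over a DINO patch-similarity graph, the kind of semantic graph the abstract refers to, can be sketched as follows; the threshold `tau` and the median split are illustrative choices.

```python
import numpy as np

def bipartition_from_patch_features(feats, grid_hw, tau=0.2):
    """Split DINO patch features into two groups with a spectral cut over the
    patch-similarity graph (an illustrative baseline, not SEMPART itself).

    feats:   (N, D) patch embeddings, N = grid_hw[0] * grid_hw[1]
    returns: (gh, gw) boolean mask, True for the smaller (foreground) side
    """
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    A = f @ f.T
    A = np.where(A > tau, 1.0, 1e-5)           # sparsified affinity graph
    D = np.diag(A.sum(axis=1))
    L = D - A                                   # unnormalised graph Laplacian
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]                        # second-smallest eigenvector
    part = fiedler > np.median(fiedler)
    if part.sum() > len(part) / 2:              # call the smaller side foreground
        part = ~part
    return part.reshape(grid_hw)
```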
We present DFormer, a novel RGB-D pretraining framework to learn transferable representations for RGB-D segmentation tasks. DFormer has two key innovations: 1) Unlike previous works that aim to encode RGB features, DFormer comprises a sequence of RGB-D blocks, tailored to encode both RGB and depth information through a novel building-block design; 2) We pre-train the backbone using image-depth pairs from ImageNet-1K, so DFormer is endowed with the capacity to encode RGB-D representations. This avoids the mismatched encoding of 3D geometry relationships in depth maps by RGB-pretrained backbones, which is widespread in existing methods but has not been resolved. We fine-tune the pre-trained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results show that our DFormer achieves new state-of-the-art performance on these two tasks with less than half the computational cost of the current best methods on two RGB-D segmentation datasets and five RGB-D saliency datasets. Our code is available at: this https URL.
https://arxiv.org/abs/2309.09668
Although multiple instance learning (MIL) methods are widely used for automatic tumor detection on whole slide images (WSI), they suffer from the extreme class imbalance within the small tumor WSIs. This occurs when the tumor comprises only a few isolated cells. For early detection, it is of utmost importance that MIL algorithms can identify small tumors, even when they are less than 1% of the size of the WSI. Existing studies have attempted to address this issue using attention-based architectures and instance selection-based methodologies, but have not yielded significant improvements. This paper proposes cross-attention-based salient instance inference MIL (CASiiMIL), which involves a novel saliency-informed attention mechanism, to identify breast cancer lymph node micro-metastasis on WSIs without the need for any annotations. Apart from this new attention mechanism, we introduce a negative representation learning algorithm to facilitate the learning of saliency-informed attention weights for improved sensitivity on tumor WSIs. The proposed model outperforms the state-of-the-art MIL methods on two popular tumor metastasis detection datasets, and demonstrates great cross-center generalizability. In addition, it exhibits excellent accuracy in classifying WSIs with small tumor lesions. Moreover, we show that the proposed model has excellent interpretability attributed to the saliency-informed attention weights. We strongly believe that the proposed method will pave the way for training algorithms for early tumor detection on large datasets where acquiring fine-grained annotations is practically impossible.
https://arxiv.org/abs/2309.09412
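For context, a generic attention-based MIL pooling layer, the family of baselines the paper builds on, looks like the sketch below; CASiiMIL's cross-attention mechanism and saliency-informed weights are not reproduced here.

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Generic attention-based MIL pooling over patch (instance) embeddings."""

    def __init__(self, dim, hidden=128):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.classifier = nn.Linear(dim, 1)

    def forward(self, instances):             # (num_patches, dim) for one WSI
        weights = torch.softmax(self.attn(instances), dim=0)   # (num_patches, 1)
        bag = (weights * instances).sum(dim=0)                  # (dim,)
        return self.classifier(bag), weights   # slide-level logit + attention map
```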
Automatically evaluating the correctness of programming assignments is rather straightforward using unit and integration tests. However, programming tasks can be solved in multiple ways, many of which, although correct, are inelegant. For instance, excessive branching, poor naming, or repetitiveness make the code hard to understand and maintain. These subjective qualities of code are hard to assess automatically using current techniques. In this work we investigate the use of CodeBERT to automatically assign quality scores to Java code. We experiment with different models and training paradigms. We explore the accuracy of the models on a novel dataset for code quality assessment. Finally, we assess the quality of the predictions using saliency maps. We find that code quality is to some extent predictable and that transformer-based models using task-adapted pre-training can solve the task more efficiently than other techniques.
https://arxiv.org/abs/2309.09264
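A hedged sketch of scoring a Java snippet with CodeBERT plus a single-output regression head via Hugging Face Transformers; the checkpoint name is the public CodeBERT base and the head is untrained here, so this only illustrates the setup, not the paper's training paradigms.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=1, problem_type="regression")

snippet = "public int add(int a, int b) { return a + b; }"
inputs = tokenizer(snippet, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    quality = model(**inputs).logits.item()   # regression head must be fine-tuned first
```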
Detecting firearms and accurately localizing individuals carrying them in images or videos is of paramount importance in security, surveillance, and content customization. However, this task presents significant challenges in complex environments due to clutter and the diverse shapes of firearms. To address this problem, we propose a novel approach that leverages human-firearm interaction information, which provides valuable clues for localizing firearm carriers. Our approach incorporates an attention mechanism that effectively distinguishes humans and firearms from the background by focusing on relevant areas. Additionally, we introduce a saliency-driven locality-preserving constraint to learn essential features while preserving foreground information in the input image. By combining these components, our approach achieves exceptional results on a newly proposed dataset. To handle inputs of varying sizes, we pass paired human-firearm instances with attention masks as channels through a deep network for feature computation, utilizing an adaptive average pooling layer. We extensively evaluate our approach against existing methods in human-object interaction detection and achieve significant results (AP=77.8\%) compared to the baseline approach (AP=63.1\%). This demonstrates the effectiveness of leveraging attention mechanisms and saliency-driven locality preservation for accurate human-firearm interaction detection. Our findings contribute to advancing the fields of security and surveillance, enabling more efficient firearm localization and identification in diverse scenarios.
https://arxiv.org/abs/2309.09236
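The "attention masks as extra channels plus adaptive average pooling" idea from the abstract can be sketched as below; the 5-channel layout, the tiny backbone, and the two-way classification head are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PairedInstanceHead(nn.Module):
    """Illustrative sketch: an RGB crop plus two binary attention masks
    (human, firearm) are stacked as a 5-channel input; adaptive average
    pooling makes the head agnostic to input size."""

    def __init__(self, feat_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(5, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))            # any input size -> (feat_dim, 1, 1)
        self.fc = nn.Linear(feat_dim, 2)        # carrier / non-carrier

    def forward(self, rgb, human_mask, firearm_mask):
        x = torch.cat([rgb, human_mask, firearm_mask], dim=1)   # (N, 5, H, W)
        feats = self.backbone(x).flatten(1)
        return self.fc(feats)
```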
Plant diseases remain a considerable threat to food security and agricultural sustainability. Rapid and early identification of these diseases has become a significant concern, motivating several studies to rely on increasing global digitalization and recent advances in computer vision based on deep learning. In fact, plant disease classification based on deep convolutional neural networks has shown impressive performance. However, these methods have yet to be adopted globally due to concerns regarding their robustness and transparency, and their lack of explainability compared with their human expert counterparts. Methods such as saliency-based approaches, which relate the network output to perturbations of the input pixels, have been proposed to give insight into these algorithms. Still, they are not easily comprehensible or intuitive for human users and are susceptible to bias. In this work, we deploy a method called Testing with Concept Activation Vectors (TCAV) that shifts the focus from pixels to user-defined concepts. To the best of our knowledge, our paper is the first to employ this method in the field of plant disease classification. Important concepts such as color, texture, and disease-related concepts were analyzed. The results suggest that concept-based explanation methods can significantly benefit automated plant disease identification.
https://arxiv.org/abs/2309.08739
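TCAV itself is a published, concept-level method; a compact sketch of its two steps (fit a linear concept classifier on layer activations, then measure the fraction of class gradients aligned with the resulting Concept Activation Vector) is shown below, with scikit-learn's logistic regression as an assumed linear classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def tcav_score(concept_acts, random_acts, grads):
    """TCAV in two steps: (1) fit a linear classifier separating activations of
    concept examples from random examples -- its normal is the Concept
    Activation Vector (CAV); (2) report the fraction of class gradients with a
    positive directional derivative along the CAV.

    concept_acts, random_acts: (n, d) layer activations
    grads: (m, d) gradients of the class logit w.r.t. the same layer
    """
    X = np.vstack([concept_acts, random_acts])
    y = np.r_[np.ones(len(concept_acts)), np.zeros(len(random_acts))]
    cav = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]
    cav /= np.linalg.norm(cav)
    return float((grads @ cav > 0).mean())
```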
Pre-trained vision transformers bring strong representation benefits to various downstream tasks. Recently, many parameter-efficient fine-tuning (PEFT) methods have been proposed, and their experiments demonstrate that tuning only 1% extra parameters can surpass full fine-tuning in low-data-resource scenarios. However, these methods overlook task-specific information when fine-tuning diverse downstream tasks. In this paper, we propose a simple yet effective method called "Salient Channel Tuning" (SCT) that leverages task-specific information by forwarding task images through the model to select a subset of channels in a feature map, enabling us to tune only 1/8 of the channels and thus significantly lowering parameter costs. Our method outperforms full fine-tuning on 18 out of 19 tasks in the VTAB-1K benchmark while adding only 0.11M parameters to ViT-B, 780$\times$ fewer than its full fine-tuning counterpart. Furthermore, on domain generalization and few-shot learning it surpasses other PEFT methods at lower parameter costs, demonstrating the strong capability and effectiveness of our tuning technique in the low-data regime.
https://arxiv.org/abs/2309.08513
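The channel-selection step of Salient Channel Tuning can be sketched roughly as follows; for brevity the sketch assumes a CNN-style (N, C, H, W) feature map, uses mean absolute activation as a stand-in saliency criterion, and tunes the selected channels through a small per-channel affine adapter, whereas the paper works with ViT features and its own selection criterion.

```python
import torch
import torch.nn as nn

def select_salient_channels(model, task_images, layer, keep_frac=0.125):
    """Forward a handful of task images, rank the channels of one feature map
    by mean absolute activation (illustrative proxy), keep the top 1/8."""
    acts = []
    handle = layer.register_forward_hook(lambda m, i, o: acts.append(o.detach()))
    with torch.no_grad():
        model(task_images)
    handle.remove()
    score = acts[0].abs().mean(dim=(0, 2, 3))           # (C,) per-channel score
    k = max(1, int(keep_frac * score.numel()))
    return score.topk(k).indices                        # indices of salient channels

class SalientChannelAdapter(nn.Module):
    """Tune only the selected channels by learning a per-channel scale and
    shift for them, leaving the frozen backbone untouched."""
    def __init__(self, salient_idx):
        super().__init__()
        self.idx = salient_idx
        self.scale = nn.Parameter(torch.ones(len(salient_idx)))
        self.shift = nn.Parameter(torch.zeros(len(salient_idx)))

    def forward(self, x):                                # x: (N, C, H, W)
        out = x.clone()
        out[:, self.idx] = x[:, self.idx] * self.scale[None, :, None, None] \
                           + self.shift[None, :, None, None]
        return out
```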
Most existing salient object detection methods use a U-Net or feature pyramid structure, which simply aggregates feature maps of different scales, ignoring their uniqueness, their interdependence, and their respective contributions to the final prediction. To overcome this, we propose M$^3$Net, i.e., the Multilevel, Mixed, and Multistage attention network for Salient Object Detection (SOD). First, we propose a Multiscale Interaction Block that innovatively introduces cross-attention to achieve interaction between multilevel features, allowing high-level features to guide low-level feature learning and thus enhancing salient regions. Second, since previous Transformer-based SOD methods locate salient regions using only global self-attention and thus inevitably overlook the details of complex objects, we propose the Mixed Attention Block. This block combines global self-attention and window self-attention, aiming to model context at both global and local levels to further improve the accuracy of the prediction map. Finally, we propose a multilevel supervision strategy to optimize the aggregated features stage by stage. Experiments on six challenging datasets demonstrate that the proposed M$^3$Net surpasses recent CNN- and Transformer-based SOD methods on four metrics. Codes are available at this https URL.
https://arxiv.org/abs/2309.08365
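The "high-level features guide low-level feature learning" interaction can be illustrated with a cross-attention block in which low-level tokens act as queries and high-level tokens as keys/values; this is a generic sketch, not M$^3$Net's exact Multiscale Interaction Block.

```python
import torch
import torch.nn as nn

class MultilevelCrossAttention(nn.Module):
    """Cross-attention between feature levels: low-level tokens are queries
    and high-level tokens are keys/values, so semantics from the deeper level
    can re-weight the shallow, detail-rich features."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, low_tokens, high_tokens):
        # low_tokens:  (B, N_low, dim)   fine spatial detail
        # high_tokens: (B, N_high, dim)  coarse semantics
        guided, _ = self.attn(query=low_tokens, key=high_tokens, value=high_tokens)
        return self.norm(low_tokens + guided)
```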
Video saliency prediction and detection are thriving research domains that enable computers to simulate the distribution of visual attention in a way akin to how humans perceive dynamic scenes. While many approaches have crafted task-specific training paradigms for either video saliency prediction or video salient object detection, little attention has been devoted to devising a generalized saliency modeling framework that seamlessly bridges these two distinct tasks. In this study, we introduce the Unified Saliency Transformer (UniST) framework, which comprehensively utilizes the essential attributes of video saliency prediction and video salient object detection. In addition to extracting representations of frame sequences, a saliency-aware transformer is designed to learn spatio-temporal representations at progressively increased resolutions, while incorporating effective cross-scale saliency information to produce a robust representation. Furthermore, a task-specific decoder is proposed to perform the final prediction for each task. To the best of our knowledge, this is the first work to explore designing a transformer structure for both saliency modeling tasks. Convincing experiments demonstrate that the proposed UniST achieves superior performance across seven challenging benchmarks for the two tasks and significantly outperforms the other state-of-the-art methods.
https://arxiv.org/abs/2309.08220
Existing methods for Salient Object Detection in Optical Remote Sensing Images (ORSI-SOD) mainly adopt Convolutional Neural Networks (CNNs) as the backbone, such as VGG and ResNet. Since CNNs can only extract features within certain receptive fields, most ORSI-SOD methods generally follow the local-to-contextual paradigm. In this paper, we propose a novel Global Extraction Local Exploration Network (GeleNet) for ORSI-SOD following the global-to-local paradigm. Specifically, GeleNet first adopts a transformer backbone to generate four-level feature embeddings with global long-range dependencies. Then, GeleNet employs a Direction-aware Shuffle Weighted Spatial Attention Module (D-SWSAM) and its simplified version (SWSAM) to enhance local interactions, and a Knowledge Transfer Module (KTM) to further enhance cross-level contextual interactions. D-SWSAM comprehensively perceives the orientation information in the lowest-level features through directional convolutions to adapt to various orientations of salient objects in ORSIs, and effectively enhances the details of salient objects with an improved attention mechanism. SWSAM discards the direction-aware part of D-SWSAM to focus on localizing salient objects in the highest-level features. KTM models the contextual correlation knowledge of two middle-level features of different scales based on the self-attention mechanism, and transfers the knowledge to the raw features to generate more discriminative features. Finally, a saliency predictor is used to generate the saliency map based on the outputs of the above three modules. Extensive experiments on three public datasets demonstrate that the proposed GeleNet outperforms relevant state-of-the-art methods. The code and results of our method are available at this https URL.
https://arxiv.org/abs/2309.08206
Optimizing video inference efficiency has become increasingly important with the growing demand for video analysis in various fields. Some existing methods achieve high efficiency by explicitly discarding spatial or temporal information, which poses challenges in fast-changing and fine-grained scenarios. To address these issues, we propose an efficient video representation network with a Differentiable Resolution Compression and Alignment mechanism, which compresses non-essential information in the early stage of the network to reduce computational costs while maintaining consistent temporal correlations. Specifically, we leverage a Differentiable Context-aware Compression Module to encode the saliency and non-saliency frame features, refining and updating the features into a high-low resolution video sequence. To process the new sequence, we introduce a new Resolution-Align Transformer Layer to capture global temporal correlations among frame features of different resolutions, while reducing spatial computation costs quadratically by utilizing fewer spatial tokens in low-resolution non-saliency frames. The entire network can be optimized end-to-end via the integration of the differentiable compression module. Experimental results show that our method achieves the best trade-off between efficiency and performance on near-duplicate video retrieval and competitive results on dynamic video classification compared to state-of-the-art methods. Code: this https URL
https://arxiv.org/abs/2309.08167
Omni-directional images are used in a wide range of applications. For such applications, it is useful to estimate saliency maps representing the probability distribution of gaze points with a head-mounted display, in order to detect important regions in the omni-directional images. This paper proposes a novel saliency-map estimation model for omni-directional images that extracts overlapping 2-dimensional (2D) plane images from the omni-directional images at various directions and angles of view. While 2D saliency maps tend to have high probability at the center of the image (center bias), the high-probability region appears at horizontal directions in omni-directional saliency maps when a head-mounted display is used (equator bias). Therefore, a 2D saliency model with a center-bias layer was fine-tuned on an omni-directional dataset by replacing the center-bias layer with an equator-bias layer conditioned on the elevation angle at which the 2D plane image is extracted. The limited availability of omni-directional images in saliency datasets is compensated by using a well-established 2D saliency model pretrained on a large number of training images with ground-truth 2D saliency maps. In addition, this paper proposes a multi-scale estimation method that extracts 2D images at multiple angles of view to detect objects of various sizes with variable receptive fields. The saliency maps estimated from the multiple angles of view are integrated using pixel-wise attention weights computed in an integration layer, which weights the optimal scale for each object. The proposed method was evaluated on a publicly available dataset with evaluation metrics for omni-directional saliency maps, and it was confirmed that the proposed method improves the accuracy of the saliency maps.
https://arxiv.org/abs/2309.08139
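An equator-bias prior of the kind described above can be approximated by a Gaussian over elevation; the vertical field of view and the Gaussian width below are assumptions for illustration, not the paper's learned layer.

```python
import numpy as np

def equator_bias(height, width, elevation_deg, sigma_deg=20.0):
    """Illustrative equator-bias prior for a 2D plane extracted from an
    omni-directional image: a Gaussian over the elevation of each pixel row,
    centred on the equator (0 degrees)."""
    # Rows of the extracted plane are assumed to span +/-45 degrees of
    # vertical field of view around the extraction elevation.
    row_elev = elevation_deg + np.linspace(45.0, -45.0, height)
    bias = np.exp(-row_elev ** 2 / (2 * sigma_deg ** 2))
    return np.tile(bias[:, None], (1, width))   # (height, width) prior map

# A 2D saliency map can then be modulated, e.g. `saliency * equator_bias(...)`,
# in place of the usual centre-bias prior.
```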
In text documents such as news articles, the content and key events usually revolve around a subset of all the entities mentioned in a document. These entities, often deemed salient entities, provide useful cues about the aboutness of a document to a reader. Identifying the salience of entities has been found helpful in several downstream applications such as search, ranking, and entity-centric summarization, among others. Prior work on salient entity detection mainly focused on machine learning models that require heavy feature engineering. We show that fine-tuning medium-sized language models with a cross-encoder-style architecture yields substantial performance gains over feature engineering approaches. To this end, we conduct a comprehensive benchmarking of four publicly available datasets using models representative of the medium-sized pre-trained language model family. Additionally, we show that zero-shot prompting of instruction-tuned language models yields inferior results, indicating the task's uniqueness and complexity.
https://arxiv.org/abs/2309.07990
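A cross-encoder for entity salience packs the document and the candidate entity into one input pair and scores it with a classification head; the sketch below uses a generic RoBERTa checkpoint as an assumed stand-in for the medium-sized models benchmarked in the paper, and the example text is hypothetical.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base",
                                                           num_labels=2)

document = "NASA's Artemis I mission sent the Orion capsule around the Moon..."
entity = "Orion"
# Document and entity are encoded jointly as a sentence pair (cross-encoder style).
inputs = tokenizer(document, entity, truncation=True, max_length=512,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
salience_prob = torch.softmax(logits, dim=-1)[0, 1]   # head must be fine-tuned first
```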