We present Agglomerative Token Clustering (ATC), a novel token merging method that consistently outperforms previous token merging and pruning methods across image classification, image synthesis, and object detection & segmentation tasks. ATC merges clusters through bottom-up hierarchical clustering, without the introduction of extra learnable parameters. We find that ATC achieves state-of-the-art performance across all tasks, and can even perform on par with prior state-of-the-art when applied off-the-shelf, i.e. without fine-tuning. ATC is particularly effective when applied with low keep rates, where only a small fraction of tokens are kept and retaining task performance is especially difficult.
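As a rough illustration of the bottom-up merging that ATC performs, here is a minimal sketch built on SciPy's agglomerative clustering. The average-linkage/cosine-distance choice and merging each cluster by taking the mean token are assumptions for illustration, not necessarily the paper's exact recipe.

```python
# Minimal sketch: merge ViT tokens via bottom-up agglomerative clustering.
# Assumptions (not from the paper): average linkage on cosine distance,
# merged tokens are the mean of their cluster members.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def agglomerative_token_merge(tokens: np.ndarray, keep_rate: float) -> np.ndarray:
    """tokens: (N, D) array of token embeddings; returns the merged tokens."""
    n_tokens = tokens.shape[0]
    n_keep = max(1, int(round(n_tokens * keep_rate)))   # target cluster count
    dists = pdist(tokens, metric="cosine")              # pairwise cosine distances
    tree = linkage(dists, method="average")             # bottom-up hierarchical clustering
    labels = fcluster(tree, t=n_keep, criterion="maxclust")
    merged = np.stack([tokens[labels == c].mean(axis=0) for c in np.unique(labels)])
    return merged

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(197, 384)).astype(np.float32)   # e.g. ViT-S tokens
    print(agglomerative_token_merge(tokens, keep_rate=0.25).shape)  # ~49 merged tokens
```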
https://arxiv.org/abs/2409.11923
State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.
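To make the interleaving concrete, below is a schematic PyTorch sketch of alternating SSM-style and attention blocks. The `SSMBlockStub` is only a placeholder where a real Mamba/S6 block (with its selective-scan kernel) would sit, and the 1:1 interleaving ratio is an assumption.

```python
# Schematic sketch of a Mamba-Attention interleaved stack (PyTorch).
# SSMBlockStub stands in for a real Mamba/S6 block; the 1:1 ratio is assumed.
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, N, D)
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out                         # residual connection

class SSMBlockStub(nn.Module):
    """Stand-in for a Mamba block: gated depthwise-conv token mixer."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=4, padding=3, groups=dim)
        self.gate = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, N, D)
        h = self.norm(x)
        mixed = self.conv(h.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        return x + self.proj(mixed * torch.sigmoid(self.gate(h)))

class InterleavedBackbone(nn.Module):
    """Alternate SSM-style and attention blocks, as the abstract describes."""
    def __init__(self, dim: int = 384, depth: int = 12):
        super().__init__()
        blocks = [SSMBlockStub(dim) if i % 2 == 0 else AttentionBlock(dim)
                  for i in range(depth)]
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):
        return self.blocks(x)

tokens = torch.randn(2, 196, 384)              # (batch, tokens, dim)
print(InterleavedBackbone()(tokens).shape)     # torch.Size([2, 196, 384])
```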
https://arxiv.org/abs/2409.11867
Tuberculosis (TB) is caused by the bacterium Mycobacterium tuberculosis, primarily affecting the lungs. Early detection is crucial for improving treatment effectiveness and reducing transmission risk. Artificial intelligence (AI), particularly through image classification of chest X-rays, can assist in TB detection. However, class imbalance in TB chest X-ray datasets presents a challenge for accurate classification. In this paper, we propose a few-shot learning (FSL) approach using the Prototypical Network algorithm to address this issue. We compare the performance of ResNet-18, ResNet-50, and VGG16 in feature extraction from the TBX11K Chest X-ray dataset. Experimental results demonstrate classification accuracies of 98.93% for ResNet-18, 98.60% for ResNet-50, and 33.33% for VGG16. These findings indicate that the proposed method outperforms others in mitigating data imbalance, which is particularly beneficial for disease classification applications.
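For readers unfamiliar with Prototypical Networks, the following minimal sketch shows the core mechanism: class prototypes are mean support embeddings, and queries are classified by distance to them. The ResNet-18 backbone setup and the Euclidean metric are illustrative assumptions; TBX11K data loading is omitted.

```python
# Minimal sketch of Prototypical Network classification for few-shot TB screening.
# Backbone choice and preprocessing are assumptions; TBX11K loading is omitted.
import torch
import torch.nn.functional as F
import torchvision

backbone = torchvision.models.resnet18(weights=None)   # pretrained weights could be loaded here
backbone.fc = torch.nn.Identity()                       # use 512-d penultimate features
backbone.eval()

@torch.no_grad()
def prototypical_predict(support_x, support_y, query_x, n_classes):
    """support_x: (S, 3, H, W), support_y: (S,), query_x: (Q, 3, H, W)."""
    z_support = backbone(support_x)             # (S, 512)
    z_query = backbone(query_x)                 # (Q, 512)
    prototypes = torch.stack(                   # class prototype = mean support embedding
        [z_support[support_y == c].mean(dim=0) for c in range(n_classes)])
    dists = torch.cdist(z_query, prototypes)    # Euclidean distance to each prototype
    return F.softmax(-dists, dim=1)             # closer prototype -> higher probability

# Toy 3-way 5-shot episode with random tensors standing in for chest X-rays.
support_x = torch.randn(15, 3, 224, 224)
support_y = torch.arange(3).repeat_interleave(5)
query_x = torch.randn(4, 3, 224, 224)
print(prototypical_predict(support_x, support_y, query_x, n_classes=3).shape)
```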
https://arxiv.org/abs/2409.11644
Convolutional neural networks (CNNs) perform well in hyperspectral image (HSI) classification tasks, but their high energy consumption and complex network structure make them difficult to apply directly on edge computing devices. Spiking neural networks (SNNs) have recently developed rapidly in HSI classification due to their low energy consumption and event-driven characteristics, but they usually require many time steps to reach optimal accuracy. In response to these problems, this paper builds a spiking neural network (SNN-SWMR) based on the leaky integrate-and-fire (LIF) neuron model for HSI classification tasks. The network uses the spiking width mixed residual (SWMR) module as its basic unit for feature extraction; the module is composed of spiking mixed convolution (SMC), which can effectively extract spatial-spectral features. Secondly, this paper designs a simple and efficient arcsine approximate derivative (AAD), which addresses the non-differentiability of spike firing by approximating the Dirac function, allowing supervised spiking neural networks to be trained directly. Finally, this paper conducts comparative experiments against multiple advanced SNN-based HSI classification algorithms on six public hyperspectral datasets. Experimental results show that the AAD function is robust and provides a good fit. Meanwhile, compared with other algorithms, SNN-SWMR reduces the number of time steps by about 84% and training and testing time by about 63% and 70%, respectively, at the same accuracy. This study addresses key problems of SNN-based HSI classification algorithms and has practical significance for promoting their deployment on edge devices such as spaceborne and airborne platforms.
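The sketch below shows how a surrogate spike gradient of this kind can be trained with a custom autograd function. The specific functional form used here, the derivative of arcsin(kx)/π + 1/2, which integrates to one and narrows toward a Dirac impulse as k grows, is an assumption; the paper's exact AAD may differ.

```python
# Hedged sketch of a surrogate spike gradient in PyTorch.
# The arcsine-derivative form below is an assumption, not the paper's exact AAD.
import torch

class ArcsineSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass, arcsine-style surrogate in backward."""

    @staticmethod
    def forward(ctx, membrane_potential, k):
        ctx.save_for_backward(membrane_potential)
        ctx.k = k
        return (membrane_potential >= 0).float()           # non-differentiable spike

    @staticmethod
    def backward(ctx, grad_output):
        (u,) = ctx.saved_tensors
        k = ctx.k
        surrogate = torch.zeros_like(u)
        inside = (k * u).abs() < 1.0                        # support of the approximation
        surrogate[inside] = k / (torch.pi * torch.sqrt(1.0 - (k * u[inside]) ** 2))
        return grad_output * surrogate, None                # no gradient w.r.t. k

# LIF-style usage: spikes are binary, the surrogate supplies training gradients.
u = torch.randn(8, requires_grad=True)                      # membrane potentials
spikes = ArcsineSpike.apply(u, 2.0)
spikes.sum().backward()
print(spikes)
print(u.grad)
```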
https://arxiv.org/abs/2409.11619
Whole Slide Images (WSIs) are critical for various clinical applications, including histopathological analysis. However, current deep learning approaches in this field predominantly focus on individual tumor types, limiting model generalization and scalability. This relatively narrow focus ultimately stems from the inherent heterogeneity in histopathology and the diverse morphological and molecular characteristics of different tumors. To this end, we propose a novel approach for multi-cohort WSI analysis, designed to leverage the diversity of different tumor types. We introduce a Cohort-Aware Attention module, enabling the capture of both shared and tumor-specific pathological patterns, enhancing cross-tumor generalization. Furthermore, we construct an adversarial cohort regularization mechanism to minimize cohort-specific biases through mutual information minimization. Additionally, we develop a hierarchical sample balancing strategy to mitigate cohort imbalances and promote unbiased learning. Together, these form a cohesive framework for unbiased multi-cohort WSI analysis. Extensive experiments on a uniquely constructed multi-cancer dataset demonstrate significant improvements in generalization, providing a scalable solution for WSI classification across diverse cancer types. Our code for the experiments is publicly available at <link>.
https://arxiv.org/abs/2409.11119
Cameras are integral components of many critical intelligent systems. However, a growing threat, known as Electromagnetic Signal Injection Attacks (ESIA), poses a significant risk to these systems, where ESIA enables attackers to remotely manipulate images captured by cameras, potentially leading to malicious actions and catastrophic consequences. Despite the severity of this threat, the underlying reasons for ESIA's effectiveness remain poorly understood, and effective countermeasures are lacking. This paper aims to address these gaps by investigating ESIA from two distinct aspects: pixel loss and color strips. By analyzing these aspects separately on image classification tasks, we gain a deeper understanding of how ESIA can compromise intelligent systems. Additionally, we explore a lightweight solution to mitigate the effects of ESIA while acknowledging its limitations. Our findings provide valuable insights for future research and development in the field of camera security and intelligent systems.
https://arxiv.org/abs/2409.10922
Image classification models, including convolutional neural networks (CNNs), perform well on a variety of classification tasks but struggle under conditions of partial occlusion, i.e., conditions in which objects are partially covered from the view of a camera. Methods to improve performance under occlusion, including data augmentation, part-based clustering, and more inherently robust architectures, including Vision Transformer (ViT) models, have, to some extent, been evaluated on their ability to classify objects under partial occlusion. However, evaluations of these methods have largely relied on images containing artificial occlusion, which are typically computer-generated and therefore inexpensive to label. Additionally, methods are rarely compared against each other, and many methods are compared against early, now outdated, deep learning models. We contribute the Image Recognition Under Occlusion (IRUO) dataset, based on the recently developed Occluded Video Instance Segmentation (OVIS) dataset (arXiv:2102.01558). IRUO utilizes real-world and artificially occluded images to test and benchmark leading methods' robustness to partial occlusion in visual recognition tasks. In addition, we contribute the design and results of a human study using images from IRUO that evaluates human classification performance at multiple levels and types of occlusion. We find that modern CNN-based models show improved recognition accuracy on occluded images compared to earlier CNN-based models, and ViT-based models are more accurate than CNN-based models on occluded images, performing only modestly worse than human accuracy. We also find that certain types of occlusion, including diffuse occlusion, where relevant objects are seen through "holes" in occluders such as fences and leaves, can greatly reduce the accuracy of deep recognition models as compared to humans, especially those with CNN backbones.
https://arxiv.org/abs/2409.10775
We present a novel frequency-based Self-Supervised Learning (SSL) approach that significantly enhances the efficacy of frequency masking for pre-training. Prior work in this direction masks out pre-defined frequencies in the input image and employs a reconstruction loss to pre-train the model. While achieving promising results, such an implementation has two fundamental limitations as identified in our paper. First, using pre-defined frequencies overlooks the variability of image frequency responses. Second, pre-trained with frequency-filtered images, the resulting model needs relatively more data to adapt to naturally looking images during fine-tuning. To address these drawbacks, we propose FOurier transform compression with seLf-Knowledge distillation (FOLK), integrating two dedicated ideas. First, inspired by image compression, we adaptively select the masked-out frequencies based on image frequency responses, creating more suitable SSL tasks for pre-training. Second, we employ a two-branch framework empowered by knowledge distillation, enabling the model to take both the filtered and original images as input, largely reducing the burden of downstream tasks. Our experimental results demonstrate the effectiveness of FOLK in achieving competitive performance to many state-of-the-art SSL methods across various downstream tasks, including image classification, few-shot learning, and semantic segmentation.
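A minimal sketch of image-adaptive frequency masking, the first ingredient described above: compute the per-image frequency response with an FFT and mask the strongest coefficients. The selection rule (dropping the highest-magnitude coefficients) and the mask ratio are assumptions.

```python
# Sketch of adaptive frequency masking driven by the image's own frequency response.
# The "mask the strongest coefficients" rule and the ratio are illustrative assumptions.
import torch

def adaptive_frequency_mask(img: torch.Tensor, mask_ratio: float = 0.3) -> torch.Tensor:
    """img: (C, H, W) float tensor; returns the frequency-filtered image."""
    freq = torch.fft.fft2(img)                          # per-channel 2D FFT
    response = freq.abs().flatten(1)                    # (C, H*W) frequency responses
    n_mask = int(response.shape[1] * mask_ratio)
    idx = response.topk(n_mask, dim=1).indices          # image-adaptive: strongest coefficients
    keep = torch.ones_like(response)
    keep.scatter_(1, idx, 0.0)                          # zero out the selected frequencies
    filtered = freq.flatten(1) * keep
    return torch.fft.ifft2(filtered.view_as(freq)).real

img = torch.rand(3, 224, 224)
print(adaptive_frequency_mask(img).shape)               # torch.Size([3, 224, 224])
```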
https://arxiv.org/abs/2409.10362
Understanding the decisions made by image classification networks is a critical area of research in deep learning. This task is traditionally divided into two distinct approaches: post-hoc methods and intrinsic methods. Post-hoc methods, such as GradCam, aim to interpret the decisions of pre-trained models by identifying regions of the image where the network focuses its attention. However, these methods provide only a high-level overview, making it difficult to fully understand the network's decision-making process. Conversely, intrinsic methods, like prototypical parts models, offer a more detailed understanding of network predictions but are constrained by specific architectures, training methods, and datasets. In this paper, we introduce InfoDisent, a hybrid model that combines the advantages of both approaches. By utilizing an information bottleneck, InfoDisent disentangles the information in the final layer of a pre-trained deep network, enabling the breakdown of classification decisions into basic, understandable atomic components. Unlike standard prototypical parts approaches, InfoDisent can interpret the decisions of pre-trained classification networks and be used for making classification decisions, similar to intrinsic models. We validate the effectiveness of InfoDisent on benchmark datasets such as ImageNet, CUB-200-2011, Stanford Cars, and Stanford Dogs for both convolutional and transformer backbones.
https://arxiv.org/abs/2409.10329
Accurate and robust medical image classification is a challenging task, especially in application domains where available annotated datasets are small and present high imbalance between target classes. Considering that data acquisition is not always feasible, especially for underrepresented classes, our approach introduces a novel synthetic augmentation strategy using class-specific Variational Autoencoders (VAEs) and latent space interpolation to improve discrimination capabilities. By generating realistic, varied synthetic data that fills feature space gaps, we address issues of data scarcity and class imbalance. The method presented in this paper relies on the interpolation of latent representations within each class, thus enriching the training set and improving the model's generalizability and diagnostic accuracy. The proposed strategy was tested on a small dataset of 321 images created to train and validate an automatic method for assessing the quality of cleanliness of esophagogastroduodenoscopy images. By combining real and synthetic data, an increase of over 18% in the accuracy of the most challenging underrepresented class was observed. The proposed strategy not only benefited the underrepresented class but also led to a general improvement in other metrics, including a 6% increase in global accuracy and precision.
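A minimal sketch of the latent-interpolation idea: encode two same-class samples with a (here, toy) VAE and decode convex combinations of their latent codes as synthetic training images. The architecture, latent size, and interpolation weights are illustrative assumptions.

```python
# Sketch of class-specific latent interpolation for synthetic augmentation.
# The encoder/decoder are toy stand-ins for trained class-specific VAEs.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, in_dim: int = 64 * 64, z_dim: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 256), nn.ReLU())
        self.mu, self.logvar = nn.Linear(256, z_dim), nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim), nn.Sigmoid())

    def encode(self, x):
        h = self.enc(x)
        return self.mu(h), self.logvar(h)

    def decode(self, z):
        return self.dec(z)

@torch.no_grad()
def interpolate_augment(vae: TinyVAE, x_a, x_b, steps: int = 4):
    """Generate synthetic samples between two real samples of the same class."""
    z_a, _ = vae.encode(x_a)
    z_b, _ = vae.encode(x_b)
    alphas = torch.linspace(0.2, 0.8, steps).view(-1, 1)
    z_mix = alphas * z_a + (1 - alphas) * z_b        # latent-space interpolation
    return vae.decode(z_mix)                          # (steps, in_dim) synthetic images

vae = TinyVAE()                                       # assume one trained VAE per class
x_a, x_b = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
print(interpolate_augment(vae, x_a, x_b).shape)       # torch.Size([4, 4096])
```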
https://arxiv.org/abs/2409.10286
Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is that the resulting embedding space seems to lack some of the structure of their purely text-based alternatives. For instance, while text embeddings have long been noted to satisfy analogies in embedding space using vector arithmetic, CLIP has no such property. In this paper, we propose an approach to natively train CLIP in a contrastive manner to reason about differences in embedding space. We finetune CLIP so that differences in image embedding space correspond to text descriptions of the image differences, which we synthetically generate with large language models on image-caption paired datasets. We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute (e.g., elephants are larger than cats), which is useful in retrieval or constructing attribute-based classifiers, and improved zero-shot classification performance on many downstream image classification tasks. In addition, our approach enables a new mechanism for inference that we refer to as comparative prompting, where we leverage prior knowledge of text descriptions of differences between classes of interest, achieving even larger performance gains in classification. Finally, we illustrate that the resulting embeddings obey a larger degree of geometric properties in embedding space, such as in text-to-image generation.
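One way the difference-alignment objective could look is sketched below: normalized differences of paired image embeddings are contrasted against embeddings of texts describing those differences. The random tensors stand in for CLIP outputs, and the symmetric InfoNCE-style loss is an assumption about the training objective.

```python
# Sketch of the difference-alignment idea: make (image_emb_a - image_emb_b)
# agree with the embedding of a text describing the difference.
import torch
import torch.nn.functional as F

def difference_alignment_loss(img_a, img_b, diff_text, temperature: float = 0.07):
    """img_a, img_b, diff_text: (B, D) batches of paired embeddings."""
    diff = F.normalize(img_a - img_b, dim=-1)             # image-embedding differences
    text = F.normalize(diff_text, dim=-1)                 # e.g. "a is larger than b"
    logits = diff @ text.t() / temperature                # (B, B) similarity matrix
    targets = torch.arange(diff.shape[0])                 # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

B, D = 8, 512                                             # CLIP ViT-B/32-sized embeddings
img_a, img_b, diff_text = (torch.randn(B, D) for _ in range(3))
print(difference_alignment_loss(img_a, img_b, diff_text))
```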
https://arxiv.org/abs/2409.09721
Vision-language models have recently evolved into versatile systems capable of high performance across a range of tasks, such as document understanding, visual question answering, and grounding, often in zero-shot settings. Comics Understanding, a complex and multifaceted field, stands to greatly benefit from these advances. Comics, as a medium, combine rich visual and textual narratives, challenging AI models with tasks that span image classification, object detection, instance segmentation, and deeper narrative comprehension through sequential panels. However, the unique structure of comics -- characterized by creative variations in style, reading order, and non-linear storytelling -- presents a set of challenges distinct from those in other visual-language domains. In this survey, we present a comprehensive review of Comics Understanding from both dataset and task perspectives. Our contributions are fivefold: (1) We analyze the structure of the comics medium, detailing its distinctive compositional elements; (2) We survey the widely used datasets and tasks in comics research, emphasizing their role in advancing the field; (3) We introduce the Layer of Comics Understanding (LoCU) framework, a novel taxonomy that redefines vision-language tasks within comics and lays the foundation for future work; (4) We provide a detailed review and categorization of existing methods following the LoCU framework; (5) Finally, we highlight current research challenges and propose directions for future exploration, particularly in the context of vision-language models applied to comics. This survey is the first to propose a task-oriented framework for comics intelligence and aims to guide future research by addressing critical gaps in data availability and task definition. A project associated with this survey is available at this https URL.
https://arxiv.org/abs/2409.09502
Adversarial patches present significant challenges to the robustness of deep learning models, making the development of effective defenses critical for real-world applications. This paper introduces DIFFender, a novel DIFfusion-based DeFender framework that leverages the power of a text-guided diffusion model to counter adversarial patch attacks. At the core of our approach is the discovery of the Adversarial Anomaly Perception (AAP) phenomenon, which enables the diffusion model to accurately detect and locate adversarial patches by analyzing distributional anomalies. DIFFender seamlessly integrates the tasks of patch localization and restoration within a unified diffusion model framework, enhancing defense efficacy through their close interaction. Additionally, DIFFender employs an efficient few-shot prompt-tuning algorithm, facilitating the adaptation of the pre-trained diffusion model to defense tasks without the need for extensive retraining. Our comprehensive evaluation, covering image classification and face recognition tasks as well as real-world scenarios, demonstrates DIFFender's robust performance against adversarial attacks. The framework's versatility and generalizability across various settings, classifiers, and attack methodologies mark a significant advancement in adversarial patch defense strategies. Beyond the commonly studied visible domain, we identify another advantage of DIFFender: it readily extends to the infrared domain. Consequently, we demonstrate the flexibility of DIFFender, which can defend against both infrared and visible adversarial patch attacks within a single universal defense framework.
https://arxiv.org/abs/2409.09406
In this paper, we jointly combine image classification and image denoising, aiming to enhance human perception of noisy images captured by edge devices, like low-light security cameras. In such settings, it is important to retain the ability of humans to verify the automatic classification decision, and thus to jointly denoise the image to enhance human perception. Since edge devices have little computational power, we explicitly optimize for efficiency by proposing a novel architecture that integrates the two tasks. Additionally, we adapt a Neural Architecture Search (NAS) method, originally designed to search for classifiers, to instead search for the integrated model while optimizing for a target latency, classification accuracy, and denoising performance. The NAS architectures outperform our manually designed alternatives in both denoising and classification, offering a significant improvement to human perception. Our approach empowers users to construct architectures tailored to domains like medical imaging, surveillance systems, and industrial inspections.
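A minimal sketch of an integrated model of the kind being searched for: a small shared encoder feeding both a classification head and a denoising decoder, trained with a joint loss. Layer sizes and the loss weighting are assumptions; the NAS component is not shown.

```python
# Sketch of an integrated classify-and-denoise model for edge devices.
# Layer sizes and the 0.5 loss weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointDenoiseClassify(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(32, n_classes))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1))

    def forward(self, noisy):
        z = self.encoder(noisy)                 # shared features for both tasks
        return self.classifier(z), self.decoder(z)   # logits, denoised image

model = JointDenoiseClassify()
noisy, clean = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)
labels = torch.randint(0, 10, (4,))
logits, denoised = model(noisy)
loss = F.cross_entropy(logits, labels) + 0.5 * F.mse_loss(denoised, clean)
print(loss)
```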
https://arxiv.org/abs/2409.08943
We propose an approach for anytime continual learning (AnytimeCL) for open vocabulary image classification. The AnytimeCL problem aims to break away from batch training and rigid models by requiring that a system can predict any set of labels at any time and efficiently update and improve when receiving one or more training samples at any time. Despite the challenging goal, we achieve substantial improvements over recent methods. We propose a dynamic weighting between predictions of a partially fine-tuned model and a fixed open vocabulary model that enables continual improvement when training samples are available for a subset of a task's labels. We also propose an attention-weighted PCA compression of training features that reduces storage and computation with little impact to model accuracy. Our methods are validated with experiments that test flexibility of learning and inference. Code is available at this https URL.
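A minimal sketch of the dynamic-weighting idea: blend per-label probabilities from the partially fine-tuned model and the frozen open-vocabulary model, trusting the tuned model more for labels with more training samples. The specific weighting rule below is an assumption.

```python
# Sketch of blending a partially fine-tuned classifier with a frozen
# open-vocabulary model; the counts-based weighting rule is assumed.
import torch

def blended_prediction(p_tuned, p_zeroshot, samples_seen, tau: float = 10.0):
    """p_tuned, p_zeroshot: (B, C) probabilities; samples_seen: (C,) counts."""
    w = samples_seen.float() / (samples_seen.float() + tau)   # 0 -> trust zero-shot model
    w = w.unsqueeze(0)                                         # broadcast over the batch
    blended = w * p_tuned + (1 - w) * p_zeroshot
    return blended / blended.sum(dim=1, keepdim=True)          # renormalize

p_tuned = torch.softmax(torch.randn(4, 5), dim=1)
p_zeroshot = torch.softmax(torch.randn(4, 5), dim=1)
samples_seen = torch.tensor([0, 2, 50, 0, 7])                  # per-label training counts
print(blended_prediction(p_tuned, p_zeroshot, samples_seen).argmax(dim=1))
```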
https://arxiv.org/abs/2409.08518
In the context of few-shot classification, the goal is to train a classifier using a limited number of samples while maintaining satisfactory performance. However, traditional metric-based methods exhibit certain limitations in achieving this objective. These methods typically rely on a single distance value between the query feature and support feature, thereby overlooking the contribution of shallow features. To overcome this challenge, we propose a novel approach in this paper. Our approach utilizes a multi-output embedding network that maps samples into distinct feature spaces. The proposed method extracts feature vectors at different stages, enabling the model to capture both global and abstract features. By utilizing these diverse feature spaces, our model enhances its performance. Moreover, employing a self-attention mechanism improves the refinement of features at each stage, leading to even more robust representations and improved overall performance. Furthermore, assigning learnable weights to each stage significantly improves performance. We conducted comprehensive evaluations on the MiniImageNet and FC100 datasets, specifically in the 5-way 1-shot and 5-way 5-shot scenarios. Additionally, we performed a cross-domain task from MiniImageNet to the CUB dataset, achieving high accuracy in the testing domain. These evaluations demonstrate the efficacy of our proposed method in comparison to state-of-the-art approaches. this https URL
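A minimal sketch of combining per-stage prototype distances with learnable stage weights, as described above. The four-stage feature split and the use of plain Euclidean distances are illustrative assumptions.

```python
# Sketch of multi-stage metric learning: per-stage prototype distances combined
# with learnable stage weights (the four ResNet-style stages are assumed).
import torch
import torch.nn as nn

class MultiStageProtoHead(nn.Module):
    def __init__(self, n_stages: int = 4):
        super().__init__()
        self.stage_weights = nn.Parameter(torch.ones(n_stages))  # learnable per-stage weights

    def forward(self, query_feats, proto_feats):
        """query_feats / proto_feats: lists of per-stage embeddings, (Q, D_s) and (C, D_s)."""
        w = torch.softmax(self.stage_weights, dim=0)
        score = 0.0
        for s, (q, p) in enumerate(zip(query_feats, proto_feats)):
            score = score - w[s] * torch.cdist(q, p)              # closer -> higher score
        return score                                              # (Q, C) logits

head = MultiStageProtoHead()
query = [torch.randn(6, d) for d in (64, 128, 256, 512)]          # shallow -> deep stages
protos = [torch.randn(5, d) for d in (64, 128, 256, 512)]         # 5-way class prototypes
print(head(query, protos).shape)                                  # torch.Size([6, 5])
```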
https://arxiv.org/abs/2409.07989
In the field of medical microscopic image classification (MIC), CNN-based and Transformer-based models have been extensively studied. However, CNNs struggle with modeling long-range dependencies, limiting their ability to fully utilize semantic information in images. Conversely, Transformers are hampered by their quadratic computational complexity. To address these challenges, we propose a model based on the Mamba architecture: Microscopic-Mamba. Specifically, we designed the Partially Selected Feed-Forward Network (PSFFN) to replace the last linear layer of the Visual State Space Module (VSSM), enhancing Mamba's local feature extraction capabilities. Additionally, we introduced the Modulation Interaction Feature Aggregation (MIFA) module to effectively modulate and dynamically aggregate global and local features. We also incorporated a parallel VSSM mechanism to improve inter-channel information interaction while reducing the number of parameters. Extensive experiments have demonstrated that our method achieves state-of-the-art performance on five public datasets. Code is available at this https URL
https://arxiv.org/abs/2409.07896
We propose Vision Token Turing Machines (ViTTM), an efficient, low-latency, memory-augmented Vision Transformer (ViT). Our approach builds on Neural Turing Machines and Token Turing Machines, which were applied to NLP and sequential visual understanding tasks. ViTTMs are designed for non-sequential computer vision tasks such as image classification and segmentation. Our model creates two sets of tokens: process tokens and memory tokens. Process tokens pass through encoder blocks and read from and write to the memory tokens at each encoder block in the network, allowing them to store and retrieve information from memory. By ensuring that there are fewer process tokens than memory tokens, we are able to reduce the inference time of the network while maintaining its accuracy. On ImageNet-1K, the state-of-the-art ViT-B has a median latency of 529.5 ms and 81.0% accuracy, while our ViTTM-B is 56% faster (234.1 ms), has 2.4 times fewer FLOPs, and reaches an accuracy of 82.9%. On ADE20K semantic segmentation, ViT-B achieves 45.65 mIoU at 13.8 frames per second (FPS), whereas our ViTTM-B model achieves 45.17 mIoU at 26.8 FPS (+94%).
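A schematic sketch of one process/memory block: process tokens read from memory, are transformed, and then write back. Using cross-attention for both the read and write operations, as well as the specific sizes, are assumptions about the mechanism.

```python
# Schematic sketch of a process/memory token block with cross-attention
# read and write steps (the exact mechanism and sizes are assumed).
import torch
import torch.nn as nn

class ReadWriteBlock(nn.Module):
    def __init__(self, dim: int = 384, heads: int = 6):
        super().__init__()
        self.read = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.process = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                                  batch_first=True)
        self.write = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, process_tokens, memory_tokens):
        read, _ = self.read(process_tokens, memory_tokens, memory_tokens)
        process_tokens = self.process(process_tokens + read)       # cheap: few tokens
        written, _ = self.write(memory_tokens, process_tokens, process_tokens)
        return process_tokens, memory_tokens + written

block = ReadWriteBlock()
process_tokens = torch.randn(2, 32, 384)    # few process tokens -> lower latency
memory_tokens = torch.randn(2, 196, 384)    # many memory tokens hold information
p, m = block(process_tokens, memory_tokens)
print(p.shape, m.shape)
```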
https://arxiv.org/abs/2409.07613
Foundational models, trained on vast and diverse datasets, have demonstrated remarkable capabilities in generalizing across different domains and distributions for various zero-shot tasks. Our work addresses the challenge of retaining these powerful generalization capabilities when adapting foundational models to specific downstream tasks through fine-tuning. To this end, we introduce a novel approach we call "similarity loss", which can be incorporated into the fine-tuning process of any task. By minimizing the distortion of fine-tuned embeddings from the pre-trained embeddings, our method strikes a balance between task-specific adaptation and preserving broad generalization abilities. We evaluate our approach on two diverse tasks: image classification on satellite imagery and face recognition, focusing on open-class and domain shift scenarios to assess out-of-distribution (OOD) performance. We demonstrate that this approach significantly improves OOD performance while maintaining strong in-distribution (ID) performance.
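A minimal sketch of such a similarity loss under the assumption of a cosine formulation: penalize drift of the fine-tuned embeddings from the frozen pre-trained ones alongside the task loss. The weighting of the auxiliary term is also an assumption.

```python
# Sketch of a "similarity loss" that discourages fine-tuned embeddings from
# drifting away from the frozen pre-trained ones; cosine form and weight assumed.
import torch
import torch.nn.functional as F

def similarity_loss(finetuned_emb, pretrained_emb):
    """Both (B, D); zero when fine-tuned embeddings match the pre-trained ones."""
    return 1.0 - F.cosine_similarity(finetuned_emb, pretrained_emb, dim=-1).mean()

def total_loss(logits, labels, finetuned_emb, pretrained_emb, lam: float = 0.5):
    return F.cross_entropy(logits, labels) + lam * similarity_loss(
        finetuned_emb, pretrained_emb)

logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
finetuned_emb = torch.randn(8, 512, requires_grad=True)
pretrained_emb = torch.randn(8, 512)             # from the frozen foundation model
print(total_loss(logits, labels, finetuned_emb, pretrained_emb))
```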
https://arxiv.org/abs/2409.07582
This work investigates the potential of seam carving as a feature pooling technique within Convolutional Neural Networks (CNNs) for image classification tasks. We propose replacing the traditional max pooling layer with a seam carving operation. Our experiments on the Caltech-UCSD Birds 200-2011 dataset demonstrate that the seam carving-based CNN achieves better performance compared to the model utilizing max pooling, based on metrics such as accuracy, precision, recall, and F1-score. We further analyze the behavior of both approaches through feature map visualizations, suggesting that seam carving might preserve more structural information during the pooling process. Additionally, we discuss the limitations of our approach and propose potential future directions for research.
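A minimal sketch of seam carving used as a width-reducing pooling step on a single 2D feature map: repeatedly remove the lowest-energy vertical seam via dynamic programming. Using the absolute activation as the energy is an assumption; a full pooling layer would also carve rows and handle batches and channels.

```python
# Sketch of seam carving as a pooling operation on one feature map.
# Energy = |activation| is assumed; only vertical seams are carved here.
import numpy as np

def remove_vertical_seam(fmap: np.ndarray) -> np.ndarray:
    """fmap: (H, W); returns (H, W-1) with one minimal-energy vertical seam removed."""
    H, W = fmap.shape
    cost = np.abs(fmap).astype(np.float64)         # per-pixel energy
    for i in range(1, H):                          # DP: cheapest connected path from the top
        left = np.r_[np.inf, cost[i - 1, :-1]]
        right = np.r_[cost[i - 1, 1:], np.inf]
        cost[i] += np.minimum(np.minimum(left, cost[i - 1]), right)
    out = np.empty((H, W - 1), dtype=fmap.dtype)
    j = int(np.argmin(cost[-1]))                   # trace the seam bottom-up
    for i in range(H - 1, -1, -1):
        out[i] = np.delete(fmap[i], j)
        if i > 0:
            lo, hi = max(j - 1, 0), min(j + 1, W - 1)
            j = lo + int(np.argmin(cost[i - 1, lo:hi + 1]))
    return out

def seam_pool(fmap: np.ndarray, target_width: int) -> np.ndarray:
    """Reduce feature-map width by carving seams instead of max pooling."""
    while fmap.shape[1] > target_width:
        fmap = remove_vertical_seam(fmap)
    return fmap

fmap = np.random.rand(8, 8).astype(np.float32)
print(seam_pool(fmap, target_width=4).shape)       # (8, 4)
```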
https://arxiv.org/abs/2409.06311