Zero-shot learning has been extensively investigated in the broader field of visual recognition and has recently attracted significant interest. However, work on zero-shot learning in document image classification remains scarce. Existing studies either focus exclusively on zero-shot inference, or their evaluation does not align with the established criteria of zero-shot evaluation in the visual recognition domain. To address this gap, we provide a comprehensive analysis of document image classification in Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL) settings, with a methodology and evaluation that align with the established practices of the domain. Additionally, we propose zero-shot splits for the RVL-CDIP dataset. Furthermore, we introduce CICA (pronounced 'ki-ka'), a framework that enhances the zero-shot learning capabilities of CLIP. CICA includes a novel 'content module' designed to leverage any generic document-related textual information. The discriminative features extracted by this module are aligned with CLIP's text and image features using a novel 'coupled-contrastive' loss. Our module improves CLIP's ZSL top-1 accuracy by 6.7% and its GZSL harmonic mean by 24% on the RVL-CDIP dataset, while remaining lightweight, adding only 3.3% more parameters to CLIP. Our work sets the direction for future research in zero-shot document classification.
https://arxiv.org/abs/2405.03660
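The abstract does not spell out the form of the 'coupled-contrastive' loss, but the alignment idea can be sketched with pairwise InfoNCE terms that tie a content embedding to both CLIP towers. The `coupled_contrastive` pairing below, the temperature, and the batch shapes are illustrative assumptions, not the paper's definition:

```python
import numpy as np

def info_nce(a, b, tau=0.07):
    """Pairwise InfoNCE: row i of `a` should match row i of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / tau                       # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # matched pairs on the diagonal

def coupled_contrastive(img, txt, content, tau=0.07):
    """Hypothetical coupling: align content features with BOTH CLIP towers."""
    return info_nce(content, img, tau) + info_nce(content, txt, tau)

rng = np.random.default_rng(0)
img, txt, content = (rng.normal(size=(8, 16)) for _ in range(3))
loss = coupled_contrastive(img, txt, content)
```

With perfectly aligned features the loss approaches zero; with random features it sits near `2 * log(batch_size)`.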
Deep neural networks have achieved remarkable results in medical image processing tasks, specifically in classifying and detecting various diseases. However, when confronted with limited data, these networks face a critical vulnerability, often succumbing to overfitting by excessively memorizing the limited information available. This work addresses that challenge by improving the supervised contrastive learning method to reduce the impact of false positives. Unlike most existing methods, which rely predominantly on fully supervised learning, our approach leverages the advantages of self-supervised learning in conjunction with the available labeled data. We evaluate our method on the BreakHis dataset, which consists of breast cancer histopathology images, and demonstrate an increase in classification accuracy of 1.45% at the image level and 1.42% at the patient level compared to the state-of-the-art method. This improvement corresponds to 93.63% absolute accuracy, highlighting our approach's effectiveness in leveraging data properties to learn a more appropriate representation space.
https://arxiv.org/abs/2405.03642
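The abstract does not detail how the contrastive objective is modified to reduce false positives; the sketch below shows the standard supervised contrastive (SupCon) loss that such methods build on. The temperature and toy features are illustrative assumptions:

```python
import numpy as np

def supcon_loss(feats, labels, tau=0.1):
    """Supervised contrastive loss (Khosla et al.): pull together all
    same-label pairs, push apart the rest. Each class needs >= 2 samples."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    n = len(labels)
    sim = feats @ feats.T / tau
    logits = sim - 1e9 * np.eye(n)  # exclude self-similarity from the softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = 0.0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        loss += -np.mean(log_prob[i, pos])  # average over anchor i's positives
    return loss / n

# toy embeddings: two tight clusters, one per class
feats = np.array([[1., 0.], [1., 0.], [1., 0.], [0., 1.], [0., 1.], [0., 1.]])
labels = np.array([0, 0, 0, 1, 1, 1])
loss = supcon_loss(feats, labels)
```

Assigning labels that cut across the clusters raises the loss sharply, which is the property the training signal relies on.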
Zero-shot learning (ZSL) aims to recognize novel classes by transferring shared semantic knowledge (e.g., attributes) from seen classes to unseen classes. Recently, attention-based methods, which align visual features with attributes via a spatial attention mechanism, have exhibited significant progress. However, these methods explore the visual-semantic relationship only in the spatial dimension, which can lead to classification ambiguity when different attributes share similar attention regions, and the semantic relationship between attributes is rarely discussed. To alleviate these problems, we propose a Dual Relation Mining Network (DRMN) that enables more effective visual-semantic interactions and learns the semantic relationship among attributes for knowledge transfer. Specifically, we introduce a Dual Attention Block (DAB) for visual-semantic relationship mining, which enriches visual information through multi-level feature fusion and conducts spatial attention for visual-to-semantic embedding. Moreover, an attribute-guided channel attention is utilized to decouple entangled semantic features. For semantic relationship modeling, we utilize a Semantic Interaction Transformer (SIT) to enhance the generalization of attribute representations across images. Additionally, a global classification branch is introduced as a complement to the human-defined semantic attributes, and its results are combined with those of attribute-based classification. Extensive experiments demonstrate that the proposed DRMN achieves new state-of-the-art performance on three standard ZSL benchmarks, i.e., CUB, SUN, and AwA2.
https://arxiv.org/abs/2405.03613
Acoustic scene classification (ASC) is highly important in the real world. Recently, deep learning-based methods have been widely employed for ASC. However, these methods are currently neither lightweight enough nor satisfactory in performance. To solve these problems, we propose a deep space separable distillation network. First, the network performs high-low frequency decomposition on the log-mel spectrogram, significantly reducing computational complexity while maintaining model performance. Second, we design three lightweight operators for ASC: Separable Convolution (SC), Orthonormal Separable Convolution (OSC), and Separable Partial Convolution (SPC). These operators exhibit highly efficient feature extraction in acoustic scene classification tasks. Experimental results demonstrate that the proposed method achieves a performance gain of 9.8% over currently popular deep learning methods, while also having a smaller parameter count and lower computational complexity.
https://arxiv.org/abs/2405.03567
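The abstract does not define the SC/OSC/SPC operators, but the parameter saving that makes separable convolutions lightweight is easy to verify. The sketch below compares a standard k×k convolution against the classic depthwise-separable construction (depthwise k×k followed by 1×1 pointwise) that such operators build on; the channel sizes are illustrative:

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def separable_conv_params(c_in, c_out, k):
    """Depthwise k x k conv (one filter per input channel)
    followed by a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out

c_in, c_out, k = 64, 128, 3
standard = conv_params(c_in, c_out, k)             # 9 * 64 * 128 = 73728
separable = separable_conv_params(c_in, c_out, k)  # 576 + 8192   = 8768
print(standard, separable)  # roughly an 8.4x parameter reduction
```

The same factorization also cuts multiply-accumulate operations by about the same factor, which is where the "lightweight" claim for separable operators comes from.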
Few-shot and zero-shot text classification aim to recognize samples from novel classes with limited labeled samples or no labeled samples at all. While prevailing methods have shown promising performance by transferring knowledge from seen classes to unseen classes, they are still limited in two ways: (1) inherent dissimilarities among classes make the transformation of features learned from seen classes to unseen classes both difficult and inefficient; (2) the rare labeled novel samples usually cannot provide enough supervision signals for the model to adjust from the source distribution to the target distribution, especially in complicated scenarios. To alleviate these issues, we propose a simple and effective strategy for few-shot and zero-shot text classification. We aim to liberate the model from the confines of seen classes, enabling it to predict unseen categories without training on seen classes. Specifically, to mine more knowledge related to the unseen categories, we utilize a large pre-trained language model to generate pseudo novel samples and select the most representative ones as category anchors. We then convert the multi-class classification task into a binary classification task and use the similarities of query-anchor pairs for prediction, fully leveraging the limited supervision signals. Extensive experiments on six widely used public datasets show that our proposed method significantly outperforms strong baselines on few-shot and zero-shot tasks, even without using any seen class samples.
https://arxiv.org/abs/2405.03565
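As a concrete illustration of the anchor-based prediction step, the sketch below picks a medoid of the generated pseudo samples as each category anchor and scores query-anchor pairs by cosine similarity. The 2-D embeddings, class names, and medoid criterion are illustrative assumptions; the paper's representativeness measure may differ:

```python
import numpy as np

def pick_anchor(embeddings):
    """Pick the most representative pseudo sample (a medoid) as the class
    anchor: the one with the highest mean cosine similarity to the rest."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    mean_sim = (e @ e.T).mean(axis=1)
    return embeddings[int(np.argmax(mean_sim))]

def classify(query, anchors):
    """Binary view of multi-class: score each (query, anchor) pair by cosine
    similarity and return the best-matching class."""
    q = query / np.linalg.norm(query)
    scores = {c: a @ q / np.linalg.norm(a) for c, a in anchors.items()}
    return max(scores, key=scores.get)

# toy pseudo samples per class in an assumed 2-D embedding space
pseudo = {
    "sports": np.array([[0.9, 0.1], [1.0, 0.0], [0.8, 0.2]]),
    "politics": np.array([[0.1, 0.9], [0.0, 1.0], [0.2, 0.8]]),
}
anchors = {c: pick_anchor(s) for c, s in pseudo.items()}
print(classify(np.array([0.95, 0.05]), anchors))  # → sports
```

Because prediction reduces to pairwise similarity scoring, no training on seen classes is required, which is the point of the strategy.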
Multi-label learning (MLL) requires comprehensive multi-semantic annotations that are hard to fully obtain, often resulting in missing-label scenarios. In this paper, we investigate Single Positive Multi-label Learning (SPML), where each image is associated with merely one positive label. Existing SPML methods focus only on designing losses using mechanisms such as hard pseudo-labeling and robust losses, mostly leading to unacceptable false negatives. To address this issue, we first propose a generalized loss framework based on expected risk minimization that provides soft pseudo labels, and we point out that the former losses can be seamlessly converted into our framework. In particular, we design a novel robust loss based on this framework that enjoys flexible coordination between false positives and false negatives, and can additionally deal with the imbalance between positive and negative samples. Extensive experiments show that our approach significantly improves SPML performance and outperforms the vast majority of state-of-the-art methods on all four benchmarks.
https://arxiv.org/abs/2405.03501
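The abstract leaves the generalized loss unspecified; one minimal instance of "soft pseudo labels" is a binary cross-entropy in which each unannotated label receives a soft target γ rather than a hard negative, interpolating away from the assume-negative loss (γ = 0). The form below is a sketch under that assumption, not the paper's loss:

```python
import numpy as np

def spml_soft_bce(probs, observed_pos, gamma=0.1):
    """Soft pseudo-label BCE for single-positive multi-label learning.
    `observed_pos` marks the single annotated positive per sample; every
    other label gets a soft target `gamma` instead of a hard 0, which
    softens the false negatives that hard pseudo-labeling creates."""
    targets = np.where(observed_pos, 1.0, gamma)
    eps = 1e-9
    bce = -(targets * np.log(probs + eps) + (1 - targets) * np.log(1 - probs + eps))
    return bce.mean()

probs = np.array([[0.9, 0.6, 0.1],
                  [0.2, 0.8, 0.3]])
observed = np.array([[True, False, False],
                     [False, True, False]])
hard = spml_soft_bce(probs, observed, gamma=0.0)  # "assume negative" baseline
soft = spml_soft_bce(probs, observed, gamma=0.1)
```

The soft target reduces the penalty on unobserved labels the model confidently predicts positive, exactly the cases where hard pseudo-labeling would manufacture a false negative.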
In lossy image compression, the objective is to achieve minimal signal distortion while compressing images to a specified bit rate. The increasing demand for visual analysis applications, particularly in classification tasks, has emphasized the significance of considering semantic distortion in compressed images. To bridge the gap between image compression and visual analysis, we propose a Rate-Distortion-Classification (RDC) model for lossy image compression, offering a unified framework to optimize the trade-off between rate, distortion, and classification accuracy. The RDC model is extensively analyzed both statistically on a multi-distribution source and experimentally on the widely used MNIST dataset. The findings reveal that the RDC model exhibits desirable properties, including monotonic non-increasing and convex functions, under certain conditions. This work provides insights into the development of human-machine friendly compression methods and Video Coding for Machine (VCM) approaches, paving the way for end-to-end image compression techniques in real-world applications.
https://arxiv.org/abs/2405.03500
Image safety classifiers play an important role in identifying and mitigating the spread of unsafe images online (e.g., images including violence, hateful rhetoric, etc.). At the same time, with the advent of text-to-image models and increasing concerns about the safety of AI models, developers are increasingly relying on image safety classifiers to safeguard their models. Yet, the performance of current image safety classifiers remains unknown for real-world and AI-generated images. To bridge this research gap, in this work, we propose UnsafeBench, a benchmarking framework that evaluates the effectiveness and robustness of image safety classifiers. First, we curate a large dataset of 10K real-world and AI-generated images that are annotated as safe or unsafe based on a set of 11 unsafe categories of images (sexual, violent, hateful, etc.). Then, we evaluate the effectiveness and robustness of five popular image safety classifiers, as well as three classifiers that are powered by general-purpose visual language models. Our assessment indicates that existing image safety classifiers are not comprehensive and effective enough in mitigating the multifaceted problem of unsafe images. Also, we find that classifiers trained only on real-world images tend to have degraded performance when applied to AI-generated images. Motivated by these findings, we design and implement a comprehensive image moderation tool called PerspectiveVision, which effectively identifies 11 categories of real-world and AI-generated unsafe images. The best PerspectiveVision model achieves an overall F1-Score of 0.810 on six evaluation datasets, which is comparable with closed-source and expensive state-of-the-art models like GPT-4V. UnsafeBench and PerspectiveVision can aid the research community in better understanding the landscape of image safety classification in the era of generative AI.
https://arxiv.org/abs/2405.03486
Accurate classification of medical images is essential for modern diagnostics. Advances in deep learning have led clinicians to increasingly use sophisticated models for faster, more accurate decisions, sometimes replacing human judgment. However, model development is costly and repetitive. Neural Architecture Search (NAS) offers a solution by automating the design of deep learning architectures. This paper presents ZO-DARTS+, a differentiable NAS algorithm that improves search efficiency through a novel method of generating sparse probabilities via bi-level optimization. Experiments on five public medical datasets show that ZO-DARTS+ matches the accuracy of state-of-the-art solutions while reducing search time by up to a factor of three.
https://arxiv.org/abs/2405.03462
This work studies ensemble learning for graph neural networks (GNNs) under the popular semi-supervised setting. Ensemble learning has shown superiority in improving the accuracy and robustness of traditional machine learning by combining the outputs of multiple weak learners. However, adopting a similar idea to integrate different GNN models is challenging for two reasons. First, GNNs are notorious for poor inference efficiency, so naively assembling multiple GNN models would deteriorate it further. Second, when GNN models are trained with few labeled nodes, their performance is limited. In this case, a vanilla ensemble approach such as majority voting may be sub-optimal, since most base models, i.e., GNNs, may make wrong predictions. To this end, we propose an efficient ensemble learner, E2GNN, that assembles multiple GNNs in a learnable way by leveraging both labeled and unlabeled nodes. Specifically, we first pre-train different GNN models on a given data scenario according to the labeled nodes. Next, instead of directly combining their outputs for label inference, we train a simple multi-layer perceptron (MLP) model to mimic their predictions on both labeled and unlabeled nodes. The unified MLP model is then deployed to infer labels for unlabeled or new nodes. Since the predictions of different GNN models on unlabeled nodes may be incorrect, we develop a reinforced discriminator to effectively filter out wrongly predicted nodes and boost the performance of the MLP. In doing so, we suggest a principled approach that tackles the inference issues of GNN ensembles while maintaining the merit of ensemble learning: improved performance. Comprehensive experiments over both transductive and inductive settings, across different GNN backbones and 8 benchmark datasets, demonstrate the superiority of E2GNN.
https://arxiv.org/abs/2405.03401
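The distill-then-filter idea can be sketched as follows: average the teachers' class distributions into soft targets for the MLP student, and drop unlabeled nodes where the teachers disagree. The plain agreement filter here stands in for the paper's reinforced discriminator, and all shapes are toy assumptions:

```python
import numpy as np

def soft_targets(teacher_probs, keep_agreement=1.0):
    """Build soft targets for the MLP student from an ensemble of GNN
    teachers. `teacher_probs` has shape (teachers, nodes, classes). Nodes
    where the teachers' argmax votes disagree too much are filtered out;
    this simple agreement filter is only a stand-in for the paper's
    reinforced discriminator."""
    n_teachers, _, n_classes = teacher_probs.shape
    targets = teacher_probs.mean(axis=0)                  # (nodes, classes)
    votes = teacher_probs.argmax(axis=2)                  # (teachers, nodes)
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    agreement = counts.max(axis=0) / n_teachers           # modal-vote fraction
    return targets, agreement >= keep_agreement

# 3 teachers, 4 nodes, 2 classes; the teachers disagree on node 2
t = np.array([
    [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.5, 0.5]],
    [[0.8, 0.2], [0.1, 0.9], [0.4, 0.6], [0.5, 0.5]],
    [[0.7, 0.3], [0.3, 0.7], [0.3, 0.7], [0.5, 0.5]],
])
targets, keep = soft_targets(t, keep_agreement=1.0)
print(keep)  # node 2 is filtered out
```

The MLP student would then be fit on `targets[keep]` and deployed alone, so inference never touches the slow GNN teachers.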
State-of-the-art automated machine learning systems for tabular data often employ cross-validation to ensure that measured performance generalizes to unseen data and that subsequent ensembling does not overfit. However, using k-fold cross-validation instead of holdout validation drastically increases the computational cost of validating a single configuration. While it ensures better generalization and, by extension, better performance, the additional cost is often prohibitive for effective model selection within a time budget. We aim to make model selection with cross-validation more effective, and therefore study early stopping of the cross-validation process during model selection. We investigate the impact of early stopping on random search for two algorithms, MLP and random forest, across 36 classification datasets, and further analyze the impact of the number of folds by considering 3-, 5-, and 10-fold settings. In addition, we investigate the impact of early stopping with Bayesian optimization instead of random search, as well as with repeated cross-validation. Our exploratory study shows that even a simple-to-understand and easy-to-implement method consistently allows model selection to converge faster: in ~94% of all datasets, on average by ~214%. Moreover, stopping cross-validation enables model selection to explore the search space more exhaustively, considering on average +167% configurations within one hour, while also obtaining better overall performance.
https://arxiv.org/abs/2405.03389
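One simple-to-implement variant of early-stopped cross-validation evaluates folds sequentially and aborts once even perfect scores on the remaining folds could not beat the incumbent configuration. The sketch below uses precomputed fold scores as a placeholder for actual train/evaluate calls; the paper's exact stopping rule may differ:

```python
def early_stopped_cv(fold_scores, best_mean, max_score=1.0):
    """Evaluate folds one by one; stop as soon as even perfect scores on
    the remaining folds could not beat the incumbent's mean score."""
    k = len(fold_scores)
    seen = []
    for i, s in enumerate(fold_scores):
        seen.append(s)  # in practice: s = train/evaluate on fold i
        optimistic = (sum(seen) + max_score * (k - i - 1)) / k
        if optimistic < best_mean:
            return sum(seen) / len(seen), i + 1  # stopped early
    return sum(seen) / k, k

# incumbent configuration has mean accuracy 0.90 over 5 folds
mean, folds_used = early_stopped_cv([0.5, 0.55, 0.6, 0.58, 0.52], best_mean=0.90)
print(folds_used)  # → 2
```

Because clearly inferior configurations are abandoned after a fold or two, the time budget is spent exploring more configurations instead of finishing hopeless ones.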
In the field of computer vision, the numerical encoding of 3D surfaces is crucial. It is classical to represent surfaces with their Signed Distance Functions (SDFs) or Unsigned Distance Functions (UDFs). For tasks like representation learning, surface classification, or surface reconstruction, this function can be learned by a neural network, called Neural Distance Function. This network, and in particular its weights, may serve as a parametric and implicit representation for the surface. The network must represent the surface as accurately as possible. In this paper, we propose a method for learning UDFs that improves the fidelity of the obtained Neural UDF to the original 3D surface. The key idea of our method is to concentrate the learning effort of the Neural UDF on surface edges. More precisely, we show that sampling more training points around surface edges allows better local accuracy of the trained Neural UDF, and thus improves the global expressiveness of the Neural UDF in terms of Hausdorff distance. To detect surface edges, we propose a new statistical method based on the calculation of a $p$-value at each point on the surface. Our method is shown to detect surface edges more accurately than a commonly used local geometric descriptor.
https://arxiv.org/abs/2405.03381
Transparency and explainability in image classification are essential for establishing trust in machine learning models and detecting biases and errors. State-of-the-art explainability methods generate saliency maps to show where a specific class is identified, without providing a detailed explanation of the model's decision process. Striving to address such a need, we introduce a post-hoc method that explains the entire feature extraction process of a Convolutional Neural Network. These explanations include a layer-wise representation of the features the model extracts from the input. Such features are represented as saliency maps generated by clustering and merging similar feature maps, to which we associate a weight derived by generalizing Grad-CAM for the proposed methodology. To further enhance these explanations, we include a set of textual labels collected through a gamified crowdsourcing activity and processed using NLP techniques and Sentence-BERT. Finally, we show an approach to generate global explanations by aggregating labels across multiple images.
https://arxiv.org/abs/2405.03301
Classifying a pedestrian into one of the three conveyor states of "elevator," "escalator" and "neither" is fundamental to many applications such as indoor localization and people flow analysis. We estimate, for the first time, the pedestrian conveyor state given the inertial navigation system (INS) readings of the accelerometer, gyroscope and magnetometer sampled from the phone. The problem is challenging because the INS signals of the conveyor state are coupled with, and perturbed by, unpredictable and arbitrary human actions, confusing the decision process. We propose ELESON, a novel, effective and lightweight INS-based deep learning approach to classify whether a pedestrian is on an elevator, an escalator or neither. ELESON utilizes a motion feature extractor to decouple the conveyor state from human action in the feature space, and a magnetic feature extractor to account for the speed difference between elevators and escalators. Given the results of the extractors, it employs an evidential state classifier to estimate the confidence of the pedestrian states. Based on extensive experiments conducted on twenty hours of real pedestrian data, we demonstrate that ELESON significantly outperforms state-of-the-art approaches (where the combined INS signals of both the conveyor state and human actions are processed together), with a 15% classification improvement in F1 score, stronger confidence discriminability with a 10% increase in AUROC (Area Under the Receiver Operating Characteristic), and low computational and memory requirements on smartphones.
https://arxiv.org/abs/2405.03218
Micro-expression recognition (MER) aims to recognize the short and subtle facial movements in micro-expression (ME) video clips, which reveal real emotions. Recent MER methods mostly utilize only special frames from ME video clips, or extract optical flow from these special frames. However, they neglect the relationship between movements and space-time, while facial cues are hidden within these relationships. To solve this issue, we propose Hierarchical Space-Time Attention (HSTA). Specifically, we first process ME video frames and special frames or data in parallel with our cascaded Unimodal Space-Time Attention (USTA) to establish connections between subtle facial movements and specific facial areas. Then, we design Crossmodal Space-Time Attention (CSTA) to achieve higher-quality fusion of crossmodal data. Finally, we hierarchically integrate USTA and CSTA to grasp deeper facial cues. Our model emphasizes temporal modeling without neglecting the processing of special data, and it fuses the contents of different modalities while maintaining their respective uniqueness. Extensive experiments on four benchmarks show the effectiveness of our proposed HSTA. Specifically, compared with the latest method on the CASME3 dataset, it achieves about a 3% score improvement in seven-category classification.
https://arxiv.org/abs/2405.03202
In recent years, dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually return very different retrievals for a pair of paraphrased queries. Such behavior might render the retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval, where a model aims to return similar results given a pair of paraphrased queries. To start with, we collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation for this task. We then hypothesize that the undesired behavior of existing dual-encoder models is due to their text towers, which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate multiple strategies for training a dual-encoder model starting from a language model pretrained on a large text corpus. Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries while maintaining similar zero-shot classification and retrieval accuracy.
https://arxiv.org/abs/2405.03190
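Ranking similarity for paraphrased queries can be quantified in several ways; a minimal stand-in is the top-k overlap between the two retrieval lists, averaged over paraphrase pairs. The metric and toy image ids below are illustrative assumptions, not the paper's exact measure:

```python
def overlap_at_k(rank_a, rank_b, k):
    """Fraction of shared items in the top-k results of two retrieval runs."""
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k

def mean_paraphrase_consistency(pairs, k=5):
    """Average top-k overlap over (query, paraphrase) retrieval pairs."""
    return sum(overlap_at_k(a, b, k) for a, b in pairs) / len(pairs)

# retrieved image ids for each query and its paraphrase
run = [
    (["i1", "i2", "i3", "i4", "i5"], ["i1", "i3", "i2", "i6", "i7"]),
    (["i8", "i9", "i1", "i2", "i3"], ["i8", "i9", "i1", "i2", "i4"]),
]
print(mean_paraphrase_consistency(run, k=5))  # ≈ 0.7
```

A rank-aware statistic such as Kendall's tau over the shared items would additionally penalize reorderings, at the cost of being undefined when the lists barely overlap.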
Many clinical tasks require an understanding of specialized data, such as medical images and genomics, which is not typically found in general-purpose large multimodal models. Building upon Gemini's multimodal models, we develop several models within the new Med-Gemini family that inherit core capabilities of Gemini and are optimized for medical use via fine-tuning with 2D and 3D radiology, histopathology, ophthalmology, dermatology and genomic data. Med-Gemini-2D sets a new standard for AI-based chest X-ray (CXR) report generation based on expert evaluation, exceeding previous best results across two separate datasets by an absolute margin of 1% and 12%, where 57% and 96% of AI reports on normal cases, and 43% and 65% on abnormal cases, are evaluated as "equivalent or better" than the original radiologists' reports. We demonstrate the first ever large multimodal model-based report generation for 3D computed tomography (CT) volumes using Med-Gemini-3D, with 53% of AI reports considered clinically acceptable, although additional research is needed to meet expert radiologist reporting quality. Beyond report generation, Med-Gemini-2D surpasses the previous best performance in CXR visual question answering (VQA) and performs well in CXR classification and radiology VQA, exceeding SoTA or baselines on 17 of 20 tasks. In histopathology, ophthalmology, and dermatology image classification, Med-Gemini-2D surpasses baselines across 18 out of 20 tasks and approaches task-specific model performance. Beyond imaging, Med-Gemini-Polygenic outperforms the standard linear polygenic risk score-based approach for disease risk prediction and generalizes to genetically correlated diseases for which it has never been trained. Although further development and evaluation are necessary in the safety-critical medical domain, our results highlight the potential of Med-Gemini across a wide range of medical tasks.
https://arxiv.org/abs/2405.03162
In the digital age, the prevalence of misleading news headlines poses a significant challenge to information integrity, necessitating robust detection mechanisms. This study explores the efficacy of Large Language Models (LLMs) in distinguishing misleading from non-misleading news headlines. Using a dataset of 60 articles sourced from both reputable and questionable outlets across the health, science & tech, and business domains, we employ three LLMs (ChatGPT-3.5, ChatGPT-4, and Gemini) for classification. Our analysis reveals significant variance in model performance, with ChatGPT-4 demonstrating superior accuracy, especially in cases with unanimous annotator agreement on misleading headlines. The study emphasizes the importance of human-centered evaluation in developing LLMs that can navigate the complexities of misinformation detection, aligning technical proficiency with nuanced human judgment. Our findings contribute to the discourse on AI ethics, emphasizing the need for models that are not only technically advanced but also ethically aligned and sensitive to the subtleties of human interpretation.
https://arxiv.org/abs/2405.03153
Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective. Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on classification tasks. In this paper, we present Lory, the first approach that scales such architectures to autoregressive language model pre-training. Lory introduces two key techniques: (1) a causal segment routing strategy that achieves high efficiency for expert merging operations while preserving the autoregressive nature of language models; (2) a similarity-based data batching method that encourages expert specialization by grouping similar documents in training instances. We pre-train a series of Lory models on 150B tokens from scratch, with up to 32 experts and 30B (1.5B active) parameters. Experimental results show significant performance gains over parameter-matched dense models on both perplexity (+13.9%) and a variety of downstream tasks (+1.5%-11.1%). Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing. We further demonstrate that the trained experts in Lory capture domain-level specialization without supervision. Our work highlights the potential of fully-differentiable MoE architectures for language model pre-training and advocates future research in this area.
https://arxiv.org/abs/2405.03133
Effective image classification hinges on discerning relevant features from both foreground and background elements, with the foreground typically holding the critical information. While humans adeptly classify images with limited exposure, artificial neural networks often struggle with feature selection from rare samples. To address this challenge, we propose a novel method for selecting class-relevant patch embeddings. Our approach involves splitting support and query images into patches, encoding them using a pre-trained Vision Transformer (ViT) to obtain class embeddings and patch embeddings, respectively. Subsequently, we filter patch embeddings using class embeddings to retain only the class-relevant ones. For each image, we calculate the similarity between class embedding and each patch embedding, sort the similarity sequence in descending order, and only retain top-ranked patch embeddings. By prioritizing similarity between the class embedding and patch embeddings, we select top-ranked patch embeddings to be fused with class embedding to form a comprehensive image representation, enhancing pattern recognition across instances. Our strategy effectively mitigates the impact of class-irrelevant patch embeddings, yielding improved performance in pre-trained models. Extensive experiments on popular few-shot classification benchmarks demonstrate the simplicity, efficacy, and computational efficiency of our approach, outperforming state-of-the-art baselines under both 5-shot and 1-shot scenarios.
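The selection step described above, ranking patch embeddings by cosine similarity to the class embedding and keeping only the top-ranked ones, can be sketched in numpy. The mean-pooling fusion at the end is a simplifying assumption; the paper's fusion may differ:

```python
import numpy as np

def select_class_relevant_patches(class_emb, patch_embs, k):
    """Rank patches by cosine similarity to the class embedding, keep the
    top-k, and fuse them with the class embedding into one representation.
    class_emb: (d,) class embedding from a pre-trained ViT
    patch_embs: (n_patches, d) patch embeddings from the same ViT
    """
    c = class_emb / np.linalg.norm(class_emb)
    p = patch_embs / np.linalg.norm(patch_embs, axis=1, keepdims=True)
    sims = p @ c                          # cosine similarity per patch
    top = np.argsort(sims)[::-1][:k]      # sort descending, keep top-k
    # Fuse: simple average of class embedding and retained patches
    # (illustrative choice, not necessarily the paper's fusion).
    fused = np.vstack([patch_embs[top], class_emb[None, :]]).mean(axis=0)
    return fused, top

class_emb = np.array([1.0, 0.0])
patch_embs = np.array([[1.0, 0.0],    # foreground-like, high similarity
                       [0.9, 0.1],    # foreground-like
                       [0.0, 1.0],    # background-like, orthogonal
                       [-1.0, 0.0]])  # background-like, opposite
fused, top = select_class_relevant_patches(class_emb, patch_embs, k=2)
```

Discarding the low-similarity patches is what suppresses class-irrelevant background before fusion, which is the core of the claimed improvement.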
https://arxiv.org/abs/2405.03722