Training a linear classifier or lightweight model on top of pretrained vision model outputs, so-called 'frozen features', leads to impressive performance on a number of downstream few-shot tasks. Currently, frozen features are not modified during training. On the other hand, when networks are trained directly on images, data augmentation is a standard recipe that improves performance with no substantial overhead. In this paper, we conduct an extensive pilot study on few-shot image classification that explores applying data augmentations in the frozen feature space, dubbed 'frozen feature augmentation (FroFA)', covering twenty augmentations in total. Our study demonstrates that adopting a deceptively simple pointwise FroFA, such as brightness, can improve few-shot performance consistently across three network architectures, three large pretraining datasets, and eight transfer datasets.
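The core idea admits a compact sketch. Below is a minimal, hypothetical illustration of a pointwise brightness-style FroFA applied to frozen features before a linear few-shot head; the offset range, feature dimension, and additive form are assumptions for illustration, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class BrightnessFroFA(nn.Module):
    """Pointwise brightness-style augmentation on frozen features (a sketch):
    treat the feature vector like an image and add a random per-sample offset,
    applied during training only."""
    def __init__(self, max_delta: float = 0.2):
        super().__init__()
        self.max_delta = max_delta

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        if not self.training:
            return feats
        # One random offset per sample, broadcast across feature dimensions.
        delta = (torch.rand(feats.shape[0], 1, device=feats.device) * 2 - 1) * self.max_delta
        return feats + delta

# Few-shot head on top of frozen features: augmentation -> linear classifier.
frozen_feats = torch.randn(32, 768)  # stand-in for a frozen ViT's outputs
head = nn.Sequential(BrightnessFroFA(0.2), nn.Linear(768, 10))
logits = head(frozen_feats)
```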
https://arxiv.org/abs/2403.10519
Pre-training image representations on raw text about images enables zero-shot transfer to downstream vision tasks. Through pre-training on millions of samples collected from the internet, multimodal foundation models, such as CLIP, produce state-of-the-art zero-shot results that are often competitive with fully supervised methods without the need for task-specific training. Besides the encouraging performance on classification accuracy, it is reported that these models close the robustness gap by matching the performance of supervised models trained on ImageNet under natural distribution shift. Because robustness is critical to real-world applications, especially safety-critical ones, in this paper we present a comprehensive evaluation based on a large-scale robustness benchmark covering 7 natural distribution shifts, 3 synthetic distribution shifts, and 11 adversarial attacks. We use CLIP as a pilot study. We show that CLIP suffers a significant robustness drop compared to supervised ImageNet models on our benchmark, especially under synthetic distribution shift and adversarial attacks. Furthermore, data overlap analysis suggests that the observed robustness under natural distribution shifts could be attributed, at least in part, to data overlap. In summary, our evaluation shows that a comprehensive assessment of robustness is necessary and that there is a significant need to improve the robustness of zero-shot multimodal models.
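The evaluation protocol reduces to measuring the same zero-shot metric across shifted test sets. A minimal sketch using OpenAI's CLIP package is shown below; the class names are placeholders, and the loaders are assumed to yield images already transformed by `preprocess`.

```python
import torch
import clip  # OpenAI's CLIP package: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["dog", "cat"]  # placeholder label set for the shifted test sets
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

@torch.no_grad()
def zero_shot_accuracy(loader):
    """Top-1 zero-shot accuracy of CLIP on one (possibly shifted) test loader."""
    text_feats = model.encode_text(text)
    text_feats /= text_feats.norm(dim=-1, keepdim=True)
    correct = total = 0
    for images, labels in loader:
        img_feats = model.encode_image(images.to(device))
        img_feats /= img_feats.norm(dim=-1, keepdim=True)
        pred = (img_feats @ text_feats.T).argmax(dim=-1).cpu()
        correct += (pred == labels).sum().item()
        total += labels.numel()
    return correct / total

# Run the same metric over natural/synthetic shift loaders, e.g.:
# for name, loader in {"imagenet-v2": v2_loader, "imagenet-c": c_loader}.items():
#     print(name, zero_shot_accuracy(loader))
```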
https://arxiv.org/abs/2403.10499
Current rock engineering design in drill and blast tunnelling primarily relies on engineers' observational assessments. Measure While Drilling (MWD) data, a high-resolution sensor dataset collected during tunnel excavation, is underutilised, serving mainly for geological visualisation. This study aims to automate the translation of MWD data into actionable metrics for rock engineering. It seeks to link data to specific engineering actions, thus providing critical decision support for geological challenges ahead of the tunnel face. Leveraging a large and geologically diverse dataset of 500,000 drillholes from 15 tunnels, the research introduces models for accurate rock mass quality classification in a real-world tunnelling context. Both conventional machine learning and image-based deep learning are explored to classify MWD data into Q-classes and Q-values, examples of metrics describing the stability of the rock mass, using both tabular and image data. The results indicate that the K-nearest neighbours algorithm, in an ensemble with tree-based models on tabular data, effectively classifies rock mass quality. It achieves a cross-validated balanced accuracy of 0.86 in classifying rock mass into the Q-classes A, B, C, D, E1, and E2, and 0.95 for a binary classification of E versus the rest. Classification using a CNN with MWD images for each blasting round resulted in a balanced accuracy of 0.82 for binary classification. Regressing the Q-value from tabular MWD data with an ensemble model similar to the one used for classification achieved cross-validated R2 and MSE scores of 0.80 and 0.18. High performance in regression and classification boosts confidence in automated rock mass assessment. Applying advanced modelling to this unique dataset demonstrates MWD data's value in improving rock mass classification accuracy and advancing data-driven rock engineering design, reducing manual intervention.
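A sketch of the tabular setup: a soft-voting ensemble combining K-nearest neighbours with tree-based models, scored by cross-validated balanced accuracy. The features, labels, and exact ensemble composition below are stand-ins, not the study's configuration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder tabular MWD features and Q-class labels; the real study uses
# ~500,000 drillholes from 15 tunnels.
X = np.random.rand(1000, 8)          # e.g. penetration rate, torque, water flow, ...
y = np.random.randint(0, 6, 1000)    # Q-classes A, B, C, D, E1, E2 encoded 0..5

ensemble = VotingClassifier(
    estimators=[
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=300, random_state=0)),
    ],
    voting="soft",
)
scores = cross_val_score(ensemble, X, y, cv=5, scoring="balanced_accuracy")
print(f"cross-validated balanced accuracy: {scores.mean():.2f}")
```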
https://arxiv.org/abs/2403.10404
Deep learning (DL) models have emerged as a powerful tool in avian bioacoustics to diagnose environmental health and biodiversity. However, inconsistencies in research pose notable challenges hindering progress in this domain. Reliable DL models need to analyze bird calls flexibly across various species and environments to fully harness the potential of bioacoustics in a cost-effective passive acoustic monitoring scenario. Data fragmentation and opacity across studies complicate a comprehensive evaluation of general model performance. To overcome these challenges, we present the BirdSet benchmark, a unified framework consolidating research efforts with a holistic approach for classifying bird vocalizations in avian bioacoustics. BirdSet harmonizes open-source bird recordings into a curated dataset collection. This unified approach provides an in-depth understanding of model performance and identifies potential shortcomings across different tasks. By establishing baseline results of current models, BirdSet aims to facilitate comparability, guide subsequent data collection, and increase accessibility for newcomers to avian bioacoustics.
https://arxiv.org/abs/2403.10380
How well the heart is functioning can be quantified through measurements of myocardial deformation via echocardiography. Clinical assessment of cardiac function generally focuses on global indices of relative shortening; however, territorial and segmental strain indices have been shown to be abnormal in regions of myocardial disease, such as scar. In this work, we propose a single framework to predict myocardial disease substrates at global, territorial, and segmental levels using regional myocardial strain traces as input to a convolutional neural network (CNN)-based classification algorithm. An anatomically meaningful transformation of the input data from the clinically standard bullseye representation to a multi-channel 2D image is proposed to formulate the task as an image classification problem, thus enabling the use of state-of-the-art neural network configurations. A Fully Convolutional Network (FCN) is trained to detect and localize myocardial scar from regional left ventricular (LV) strain patterns. Simulated regional strain data from a controlled dataset of virtual patients with varying degrees and locations of myocardial scar is used for training and validation. The proposed method successfully detects and localizes the scars in 98% of the 5490 LV segments of the 305 patients in the test set using strain traces only. Because scar is sparse, only 10% of the LV segments in the virtual patient cohort contain scar. Taking this imbalance into account, the class-balanced accuracy is 95%. Performance is reported at the global, territorial, and segmental levels. The proposed method proves successful on the strain traces of the virtual cohort and offers the potential to solve the regional myocardial scar detection problem on the strain traces of real patient cohorts.
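The data-reformatting step can be sketched as follows: strain traces for the 17 AHA segments are stacked into a 2D image (segments × time), so a standard CNN classifier applies. The layout, dimensions, and network below are illustrative assumptions, not the paper's exact FCN.

```python
import torch
import torch.nn as nn

# Hypothetical layout: the 17 AHA segments unrolled from the bullseye into rows
# of a 2D image, time along the columns, so a set of strain traces becomes a
# 1 x 17 x T "image" (extra channels could hold, e.g., strain rate).
batch, segments, timesteps = 8, 17, 64
strain_traces = torch.randn(batch, 1, segments, timesteps)

scar_classifier = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2),  # scar vs. no scar at the global level
)
logits = scar_classifier(strain_traces)  # shape: (batch, 2)
```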
https://arxiv.org/abs/2403.10291
The task of few-shot image classification and segmentation (FS-CS) involves classifying and segmenting target objects in a query image, given only a few examples of the target classes. We introduce the Vision-Instructed Segmentation and Evaluation (VISE) method that transforms the FS-CS problem into the Visual Question Answering (VQA) problem, utilising Vision-Language Models (VLMs), and addresses it in a training-free manner. By enabling a VLM to interact with off-the-shelf vision models as tools, the proposed method is capable of classifying and segmenting target objects using only image-level labels. Specifically, chain-of-thought prompting and in-context learning guide the VLM to answer multiple-choice questions like a human; vision models such as YOLO and Segment Anything Model (SAM) assist the VLM in completing the task. The modular framework of the proposed method makes it easily extendable. Our approach achieves state-of-the-art performance on the Pascal-5i and COCO-20i datasets.
https://arxiv.org/abs/2403.10287
In this paper, we present Pre-CoFactv3, a comprehensive framework comprising Question Answering and Text Classification components for fact verification. Leveraging In-Context Learning, fine-tuned Large Language Models (LLMs), and the FakeNet model, we address the challenges of fact verification. Our experiments explore diverse approaches, comparing different pre-trained LLMs, introducing FakeNet, and implementing various ensemble methods. Notably, our team, Trifecta, secured first place in the AAAI-24 Factify 3.0 Workshop, surpassing the baseline accuracy by 103% and maintaining a 70% lead over the second competitor. This success underscores the efficacy of our approach and its potential contributions to advancing fact verification research.
https://arxiv.org/abs/2403.10281
In today's technology-driven era, the imperative for predictive maintenance and advanced diagnostics extends beyond aviation to encompass the identification of damage, failures, and operational defects in rotating and moving machines. Implementing such services not only curtails maintenance costs but also extends machine lifespan, ensuring heightened operational efficiency. Moreover, it serves as a preventive measure against potential accidents or catastrophic events. The advent of Artificial Intelligence (AI) has revolutionized maintenance across industries, enabling more accurate and efficient prediction and analysis of machine failures, thereby conserving time and resources. Our study examines various machine learning classification techniques, including Support Vector Machine (SVM), Random Forest, Logistic Regression, and a Convolutional Neural Network (CNN)-LSTM-based approach, for predicting and analyzing machine performance. SVM classifies data into different categories based on their positions in a multidimensional space, while Random Forest employs ensemble learning to create multiple decision trees for classification. Logistic Regression predicts the probability of binary outcomes from input data. The primary objective of the study is to assess these algorithms' performance in predicting and analyzing machine performance, considering metrics such as accuracy, precision, recall, and F1 score. The findings will aid maintenance experts in selecting the most suitable machine learning algorithm for effective prediction and analysis of machine performance.
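A minimal sketch of such a comparison with scikit-learn, using synthetic stand-in sensor data (the CNN-LSTM variant is omitted for brevity, and the dataset is a placeholder rather than real machine telemetry):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for machine sensor features (vibration, temperature, ...).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}
for name, model in models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: acc={accuracy_score(y_te, y_pred):.3f} "
          f"prec={precision_score(y_te, y_pred):.3f} "
          f"rec={recall_score(y_te, y_pred):.3f} "
          f"f1={f1_score(y_te, y_pred):.3f}")
```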
https://arxiv.org/abs/2403.10259
In goal-oriented communications, the objective of the receiver is often to apply a Deep Learning model, rather than to reconstruct the original data. In this context, direct learning over compressed data, without any prior decoding, holds promise for enhancing the time-efficient execution of inference models at the receiver. However, conventional entropy-coding methods such as Huffman and arithmetic coding break the data structure, rendering them unsuitable for learning without decoding. In this paper, we propose an alternative approach in which entropy coding is realized with Low-Density Parity Check (LDPC) codes. We hypothesize that Deep Learning models can more effectively exploit the internal code structure of LDPC codes. At the receiver, we leverage a class of Recurrent Neural Networks (RNNs), specifically the Gated Recurrent Unit (GRU), trained for image classification. Our numerical results indicate that classification based on LDPC-coded bit-planes surpasses Huffman and arithmetic coding while necessitating a significantly smaller learning model. This demonstrates the efficiency of classifying directly from LDPC-coded data, eliminating the need for any form of decompression, even partial, prior to applying the learning model.
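The receiver side can be sketched as a GRU classifier consuming the coded bit stream directly. The chunk size, hidden width, and codeword length below are assumptions, and the LDPC encoder itself is out of scope; the codeword is simply treated as a binary sequence.

```python
import torch
import torch.nn as nn

class BitPlaneGRUClassifier(nn.Module):
    """GRU that classifies directly from an (assumed) LDPC-coded bit stream,
    chunked into fixed-width steps with no decoding beforehand."""
    def __init__(self, chunk: int = 64, hidden: int = 128, n_classes: int = 10):
        super().__init__()
        self.chunk = chunk
        self.gru = nn.GRU(input_size=chunk, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, bits: torch.Tensor) -> torch.Tensor:
        # bits: (batch, codeword_len) in {0, 1}; reshape into chunked time steps.
        b, n = bits.shape
        steps = bits.view(b, n // self.chunk, self.chunk).float()
        _, h_n = self.gru(steps)
        return self.head(h_n[-1])  # classify from the final hidden state

codewords = torch.randint(0, 2, (16, 4096))  # placeholder coded bit-planes
logits = BitPlaneGRUClassifier()(codewords)
```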
https://arxiv.org/abs/2403.10202
User Interface (UI) understanding has been an increasingly popular topic over the last few years. So far, the focus has been largely on web and mobile applications. In this paper, we introduce the harder task of computer UI understanding. With the goal of enabling research in this field, we have generated a dataset of videos in which a user performs a sequence of actions, with each frame showing the desktop contents at that point in time. We also present a framework composed of a synthetic sample generation pipeline to augment the dataset with relevant characteristics, and a contrastive learning method to classify images in the videos. We take advantage of the natural conditional, tree-like relationship of the images' characteristics to regularize the learning of the representations by dealing with multiple partial tasks simultaneously. Experimental results show that the proposed framework outperforms previously proposed hierarchical multi-label contrastive losses in fine-grained UI classification.
https://arxiv.org/abs/2403.10170
This paper introduces a novel Functional Graph Convolutional Network (funGCN) framework that combines Functional Data Analysis and Graph Convolutional Networks to address the complexities of multi-task and multi-modal learning in digital health and longitudinal studies. With the growing importance of health solutions to improve health care and social support, ensure healthy lives, and promote well-being at all ages, funGCN offers a unified approach to handle multivariate longitudinal data for multiple entities and ensures interpretability even with small sample sizes. Key innovations include task-specific embedding components that manage different data types, the ability to perform classification, regression, and forecasting, and the creation of a knowledge graph for insightful data interpretation. The efficacy of funGCN is validated through simulation experiments and a real-data application.
https://arxiv.org/abs/2403.10158
Tactile sensing represents a crucial technique that can enhance the performance of robotic manipulators in various tasks. This work presents a novel bioinspired neuromorphic vision-based tactile sensor that uses an event-based camera to quickly capture and convey information about the interactions between robotic manipulators and their environment. The camera in the sensor observes the deformation of a flexible skin manufactured from a cheap and accessible 3D-printed material, while a 3D-printed rigid casing houses the components of the sensor. The sensor is tested in a grasping-stage classification task involving several objects using a data-driven, learning-based approach. The results show that the proposed approach enables the sensor to detect pressing and slip incidents within 2 ms. The fast tactile perception properties of the proposed sensor make it an ideal candidate for safe grasping of different objects in industries that involve high-speed pick-and-place operations.
https://arxiv.org/abs/2403.10120
While fine-tuning is a de facto standard method for training deep neural networks, it still suffers from overfitting when using small target datasets. Previous methods improve fine-tuning performance by maintaining knowledge of the source datasets or introducing regularization terms such as contrastive loss. However, these methods require auxiliary source information (e.g., source labels or datasets) or heavy additional computation. In this paper, we propose a simple method called adaptive random feature regularization (AdaRand). AdaRand helps the feature extractors of training models adaptively change the distribution of feature vectors for downstream classification tasks without auxiliary source information and with reasonable computation costs. To this end, AdaRand minimizes the gap between feature vectors and random reference vectors that are sampled from class-conditional Gaussian distributions. Furthermore, AdaRand dynamically updates the conditional distributions to follow the currently updated feature extractor and to balance the distance between classes in feature space. Our experiments show that AdaRand outperforms other fine-tuning regularization methods, which require auxiliary source information and heavy computation costs.
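The core objective admits a short sketch: sample a reference vector from a class-conditional Gaussian and penalize the distance to it. The dynamic update of the class means and the inter-class balancing are omitted here, so this is a simplified reading of AdaRand rather than the full method; the dimensions and sigma are assumed values.

```python
import torch
import torch.nn.functional as F

def adarand_style_loss(features, labels, class_means, sigma=1.0):
    """Pull features toward random reference vectors drawn from per-class
    Gaussians N(mu_y, sigma^2 I) via a squared-distance penalty."""
    mu = class_means[labels]                 # (batch, dim) class-conditional means
    ref = mu + sigma * torch.randn_like(mu)  # reparameterized Gaussian sample
    return F.mse_loss(features, ref)

# Hypothetical usage inside a fine-tuning step (added to cross-entropy):
feats = torch.randn(32, 256, requires_grad=True)  # extractor outputs
labels = torch.randint(0, 10, (32,))
class_means = torch.randn(10, 256)                # running class-conditional means
loss = adarand_style_loss(feats, labels, class_means)
loss.backward()
```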
https://arxiv.org/abs/2403.10097
Vision Transformer (ViT) has emerged as a prominent backbone for computer vision. For more efficient ViTs, recent works lessen the quadratic cost of the self-attention layer by pruning or fusing redundant tokens. However, these works face a speed-accuracy trade-off caused by the loss of information. Here, we argue that token fusion needs to consider diverse relations between tokens to minimize information loss. In this paper, we propose Multi-criteria Token Fusion (MCTF), which gradually fuses tokens based on multiple criteria (e.g., similarity, informativeness, and the size of fused tokens). Further, we utilize one-step-ahead attention, an improved approach to capturing the informativeness of the tokens. By training the model equipped with MCTF using a token reduction consistency, we achieve the best speed-accuracy trade-off in image classification (ImageNet-1K). Experimental results show that MCTF consistently surpasses previous reduction methods both with and without training. Specifically, DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving performance over the base model (+0.5% and +0.3%, respectively). We also demonstrate the applicability of MCTF to various Vision Transformers (e.g., T2T-ViT, LV-ViT), achieving at least a 31% speedup without performance degradation. Code is available at this https URL.
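As a toy illustration of size-aware token fusion (single criterion only; the paper's multi-criteria scoring and one-step-ahead attention are not reproduced here), the sketch below greedily merges the most similar token pairs with size-weighted averaging:

```python
import torch
import torch.nn.functional as F

def fuse_most_similar(tokens, sizes, r=1):
    """Greedily fuse the r most-similar token pairs by size-weighted averaging.
    tokens: (n, d); sizes: (n,) count of original tokens each entry represents."""
    for _ in range(r):
        x = F.normalize(tokens, dim=-1)
        sim = x @ x.T                       # cosine similarity between tokens
        sim.fill_diagonal_(-float("inf"))   # never fuse a token with itself
        i, j = divmod(sim.argmax().item(), sim.shape[1])
        # Size-weighted merge of token j into token i, then drop j.
        w_i, w_j = sizes[i], sizes[j]
        tokens[i] = (w_i * tokens[i] + w_j * tokens[j]) / (w_i + w_j)
        sizes[i] = w_i + w_j
        keep = [k for k in range(tokens.shape[0]) if k != j]
        tokens, sizes = tokens[keep], sizes[keep]
    return tokens, sizes

# e.g. fuse 16 pairs out of a ViT's 197 tokens:
tokens, sizes = fuse_most_similar(torch.randn(197, 384), torch.ones(197), r=16)
```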
https://arxiv.org/abs/2403.10030
Lifelong person re-identification (LReID) assumes a practical scenario where the model is sequentially trained on continuously incoming datasets while alleviating catastrophic forgetting on the old datasets. However, not only the training datasets but also the gallery images are incrementally accumulated, which requires a huge amount of computation and storage space to extract the features at the inference phase. In this paper, we address this problem by incorporating backward compatibility into LReID for the first time. We train the model on the continuously incoming datasets while maintaining the model's compatibility with previously trained old models, without re-computing the features of the old gallery images. To this end, we devise a cross-model compatibility loss based on contrastive learning with respect to the replay features across all the old datasets. Moreover, we develop a knowledge consolidation method based on part classification to learn a shared representation across different datasets for backward compatibility. We also suggest a more practical methodology for performance evaluation, in which all the gallery and query images are considered together. Experimental results demonstrate that the proposed method achieves significantly higher backward-compatibility performance than existing methods, making it a promising tool for more practical LReID scenarios.
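The compatibility objective can be sketched as a cross-model InfoNCE-style loss over replayed features: a query embedded by the new model must score highest against the stored old-model feature of the same identity, so old gallery features remain usable without re-extraction. Batch construction, the part-classification consolidation, and the temperature value below are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_model_compatibility_loss(new_feats, old_replay_feats, tau=0.1):
    """InfoNCE-style loss aligning new-model features with replayed old-model
    features of the same identities (positives on the diagonal)."""
    q = F.normalize(new_feats, dim=-1)         # (batch, dim), new model
    k = F.normalize(old_replay_feats, dim=-1)  # (batch, dim), old model, same ids
    logits = q @ k.T / tau                     # cross-model similarity matrix
    targets = torch.arange(q.shape[0])         # matching identity per row
    return F.cross_entropy(logits, targets)

loss = cross_model_compatibility_loss(torch.randn(32, 512), torch.randn(32, 512))
```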
https://arxiv.org/abs/2403.10022
Learning from point sets is an essential component in many computer vision and machine learning applications. The native, unordered, and permutation-invariant structure of set data is challenging to model, particularly for point set classification under spatial deformations. Here we propose a framework for classifying point sets experiencing certain types of spatial deformations, with a particular emphasis on datasets featuring affine deformations. Our approach employs the Linear Optimal Transport (LOT) transform to obtain a linear embedding of set-structured data. Utilizing the mathematical properties of the LOT transform, we demonstrate its capacity to accommodate variations in point sets by constructing a convex data space, effectively simplifying point set classification problems. Our method, which employs a nearest-subspace algorithm in the LOT space, is label efficient, non-iterative, and requires no hyper-parameter tuning. It achieves competitive accuracy compared to state-of-the-art methods across various point set classification tasks. Furthermore, our approach exhibits robustness in out-of-distribution scenarios where the training and test distributions vary in terms of deformation magnitude.
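The classification stage can be sketched as a generic nearest-subspace rule: fit a low-dimensional basis per class (here via SVD) and assign a test sample to the class with the smallest reconstruction error. The LOT embedding itself is assumed to have been computed upstream; the data below are random stand-ins.

```python
import numpy as np

def fit_class_subspaces(X_by_class, k=8):
    """Per-class subspace bases via SVD of (assumed) LOT-embedded samples.
    Each X is (n_samples, dim); each basis is (dim, k) with orthonormal columns."""
    return {c: np.linalg.svd(X, full_matrices=False)[2][:k].T
            for c, X in X_by_class.items()}

def nearest_subspace_predict(x, bases):
    """Assign x to the class whose subspace reconstructs it with least error."""
    errs = {c: np.linalg.norm(x - B @ (B.T @ x)) for c, B in bases.items()}
    return min(errs, key=errs.get)

# Toy usage with random stand-ins for LOT embeddings of two classes.
rng = np.random.default_rng(0)
bases = fit_class_subspaces({0: rng.normal(size=(50, 32)),
                             1: rng.normal(size=(50, 32))})
print(nearest_subspace_predict(rng.normal(size=32), bases))
```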
https://arxiv.org/abs/2403.10015
This paper presents a novel Fully Binary Point Cloud Transformer (FBPT) model which has the potential to be widely applied and expanded in the fields of robotics and mobile devices. By compressing the weights and activations of a 32-bit full-precision network to 1-bit binary values, the proposed binary point cloud Transformer network significantly reduces the storage footprint and computational resource requirements of neural network models for point cloud processing tasks, compared to full-precision point cloud networks. However, achieving a fully binary point cloud Transformer network, where all parts except the task-specific modules are binary, poses challenges and bottlenecks in quantizing the activations of Q, K, V and self-attention in the attention module, as they do not adhere to simple probability distributions and can vary with the input data. Furthermore, in our network, the binary attention module suffers a degradation of self-attention due to the uniform distribution that arises after the softmax operation. The primary focus of this paper is on addressing the performance degradation caused by the use of binary point cloud Transformer modules. We propose a novel binarization mechanism called dynamic-static hybridization. Specifically, our approach combines static binarization of the overall network model with fine-grained dynamic binarization of data-sensitive components. Furthermore, we make use of a novel hierarchical training scheme to obtain the optimal model and binarization parameters. These improvements allow the proposed binarization method to outperform binarization methods applied to convolutional neural networks when used in point cloud Transformer structures. To demonstrate the superiority of our algorithm, we conducted experiments on two different tasks: point cloud classification and place recognition.
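A loose sketch of the dynamic-static idea for a single layer: weights receive one static scale from their own statistics, activations receive a per-input dynamic scale, and a straight-through estimator carries gradients. This illustrates the general mechanism under stated assumptions, not the paper's exact FBPT modules.

```python
import torch
import torch.nn.functional as F

class BinaryLinear(torch.nn.Module):
    """Linear layer with statically scaled binary weights and dynamically
    scaled binary activations (straight-through-estimator sketch)."""
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_f, in_f) * 0.01)

    def forward(self, x):
        # Static: one scale per layer, from the full-precision weight statistics.
        w_scale = self.weight.abs().mean()
        w_bin = torch.sign(self.weight) * w_scale
        # Dynamic: per-input scaling so binarization adapts to the data,
        # loosely mirroring dynamic-static hybridization.
        x_scale = x.abs().mean(dim=-1, keepdim=True)
        x_bin = torch.sign(x) * x_scale
        # Straight-through estimator: forward uses binary values, gradients
        # flow through the full-precision tensors.
        w = self.weight + (w_bin - self.weight).detach()
        a = x + (x_bin - x).detach()
        return F.linear(a, w)

out = BinaryLinear(256, 128)(torch.randn(4, 256))
```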
https://arxiv.org/abs/2403.09998
Given unlabelled datasets containing both old and new categories, generalized category discovery (GCD) aims to accurately discover new classes while correctly classifying old classes, leveraging the class concepts learned from labeled samples. Current GCD methods use only a single, visual modality of information, resulting in poor classification of visually similar classes. Though certain classes are visually confused, their text information might be distinct, motivating us to introduce text information into the GCD task. However, the lack of class names for unlabelled data makes it impractical to utilize text information directly. To tackle this challenging problem, in this paper, we propose a Text Embedding Synthesizer (TES) to generate pseudo text embeddings for unlabelled samples. Specifically, our TES leverages the property that CLIP can generate aligned vision-language features, converting visual embeddings into tokens of CLIP's text encoder to generate pseudo text embeddings. Besides, we employ a dual-branch framework in which, through the joint learning and instance consistency of the different modality branches, visual and semantic information mutually enhance each other, promoting the interaction and fusion of the visual and text embedding spaces. Our method unlocks the multi-modal potential of CLIP and outperforms baseline methods by a large margin on all GCD benchmarks, achieving a new state of the art. The code will be released at \url{this https URL}.
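A schematic of the synthesizer: project the visual embedding into a short sequence of pseudo tokens and run them through a text encoder. The stand-in encoder below replaces CLIP's frozen text encoder, and all dimensions and the pooling choice are assumptions.

```python
import torch
import torch.nn as nn

class TextEmbeddingSynthesizer(nn.Module):
    """Map a CLIP visual embedding to a pooled pseudo text embedding via a
    learned projection into token space (sketch with a stand-in encoder)."""
    def __init__(self, vis_dim=512, txt_dim=512, n_tokens=4):
        super().__init__()
        self.proj = nn.Linear(vis_dim, n_tokens * txt_dim)
        self.n_tokens, self.txt_dim = n_tokens, txt_dim
        layer = nn.TransformerEncoderLayer(d_model=txt_dim, nhead=8, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)  # stand-in
        # The paper feeds synthesized tokens through CLIP's actual frozen
        # text encoder instead of this stand-in module.

    def forward(self, visual_emb):
        tokens = self.proj(visual_emb).view(-1, self.n_tokens, self.txt_dim)
        return self.text_encoder(tokens).mean(dim=1)  # pooled pseudo text embedding

pseudo_text = TextEmbeddingSynthesizer()(torch.randn(8, 512))  # (8, 512)
```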
https://arxiv.org/abs/2403.09974
Conventional imaging diagnostics frequently encounter bottlenecks due to manual inspection, which can lead to delays and inconsistencies. Although deep learning offers a pathway to automation and enhanced accuracy, foundational models in computer vision often emphasize global context at the expense of local details, which are vital for medical imaging diagnostics. To address this, we harness the Swin Transformer's capacity to discern extended spatial dependencies within images through its hierarchical framework. Our novel contribution lies in refining local feature representations, orienting them specifically toward the final distribution of the classifier. This method ensures that local features are not only preserved but also enriched with task-specific information, enhancing their relevance and detail at every hierarchical level. By implementing this strategy, our model demonstrates significant robustness and precision, as evidenced by extensive validation on two established benchmarks for Knee OsteoArthritis (KOA) grade classification. These results highlight our approach's effectiveness and its promising implications for the future of medical imaging diagnostics. Our implementation is available at this https URL
https://arxiv.org/abs/2403.09947
Human affective behavior analysis aims to delve into human expressions and behaviors to deepen our understanding of human emotions. Basic expression categories (EXPR) and Action Units (AUs) are two essential components in this analysis, which categorize emotions and break down facial movements into elemental units, respectively. Despite advancements, existing approaches in expression classification and AU detection often necessitate complex models and substantial computational resources, limiting their applicability in everyday settings. In this work, we introduce the first lightweight framework adept at efficiently tackling both expression classification and AU detection. This framework employs a frozen CLIP image encoder alongside a trainable multilayer perceptron (MLP), enhanced with Conditional Value at Risk (CVaR) for robustness and a loss landscape flattening strategy for improved generalization. Experimental results on the Aff-wild2 dataset demonstrate superior performance in comparison to the baseline while maintaining minimal computational demands, offering a practical solution for affective behavior analysis. The code is available at this https URL
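The CVaR component has a particularly compact form: instead of averaging all per-sample losses, average only the worst alpha-fraction. Below is a sketch on top of stand-in frozen CLIP features; the alpha value and head architecture are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cvar_loss(per_sample_losses: torch.Tensor, alpha: float = 0.3) -> torch.Tensor:
    """Conditional Value at Risk: mean over the worst alpha-fraction of losses,
    focusing training on the hardest samples in the batch."""
    k = max(1, int(alpha * per_sample_losses.numel()))
    worst, _ = torch.topk(per_sample_losses, k)
    return worst.mean()

# Frozen CLIP features -> trainable MLP head, trained with CVaR over CE losses.
mlp = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 8))
feats = torch.randn(64, 512)              # stand-in for frozen CLIP features
labels = torch.randint(0, 8, (64,))
ce = F.cross_entropy(mlp(feats), labels, reduction="none")
loss = cvar_loss(ce, alpha=0.3)
loss.backward()
```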
https://arxiv.org/abs/2403.09915