We present FastFit, a method and accompanying Python package designed to provide fast and accurate few-shot classification, especially for scenarios with many semantically similar classes. FastFit utilizes a novel approach that integrates batch contrastive learning with token-level similarity scores. Compared to existing few-shot learning packages, such as SetFit, Transformers, or few-shot prompting of large language models via API calls, FastFit significantly improves the speed and accuracy of multiclass classification across FewMany, our newly curated English benchmark, and multilingual datasets. FastFit demonstrates a 3-20x improvement in training speed, completing training in just a few seconds. The FastFit package is now available on GitHub and PyPI, presenting a user-friendly solution for NLP practitioners.
https://arxiv.org/abs/2404.12365
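As a rough illustration of the token-level similarity scoring mentioned above, a query can be scored against a class by summing each query token's best match among the class's token embeddings. This is a simplified, ColBERT-style sketch with made-up toy embeddings, not FastFit's actual implementation:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def token_level_score(query_tokens, class_tokens):
    """Sum, over query tokens, of each token's best cosine match in the class."""
    return sum(max(cosine(q, c) for c in class_tokens) for q in query_tokens)

# toy 2-d "token embeddings" for a query and two candidate classes
query   = [[1.0, 0.0], [0.0, 1.0]]
class_a = [[1.0, 0.0], [0.7, 0.7]]
class_b = [[-1.0, 0.0], [0.0, -1.0]]
scores = {"A": token_level_score(query, class_a),
          "B": token_level_score(query, class_b)}
best = max(scores, key=scores.get)
```

The query is assigned to the class with the highest aggregate token-level score, which rewards fine-grained matches even between semantically similar classes.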
This paper introduces a new technique to measure the feature dependency of neural network models. The motivation is to better understand a model by querying whether it is using information from human-understandable features, e.g., anatomical shape, volume, or image texture. Our method is based on the principle that if a model is dependent on a feature, then removal of that feature should significantly harm its performance. A targeted feature is "removed" by collapsing the dimension in the data distribution that corresponds to that feature. We perform this by moving data points along the feature dimension to a baseline feature value while staying on the data manifold, as estimated by a deep generative model. Then we observe how the model's performance changes on the modified test data set, with the target feature dimension removed. We test our method on deep neural network models trained on synthetic image data with known ground truth, an Alzheimer's disease prediction task using MRI and hippocampus segmentations from the OASIS-3 dataset, and a cell nuclei classification task using the Lizard dataset.
https://arxiv.org/abs/2404.12341
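The removal principle above can be sketched in a toy form. The paper moves data points along the feature dimension while staying on the data manifold via a deep generative model; this simplified stand-in just overwrites the feature with a baseline value and measures the accuracy drop:

```python
def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

def feature_dependency(model, data, feature_idx, baseline):
    """Accuracy drop after collapsing one feature to a baseline value."""
    collapsed = []
    for x, y in data:
        x2 = list(x)
        x2[feature_idx] = baseline  # naive removal; the paper uses a generative model
        collapsed.append((x2, y))
    return accuracy(model, data) - accuracy(model, collapsed)

# toy classifier that only looks at feature 0
model = lambda x: int(x[0] > 0.5)
data = [([0.9, 0.1], 1), ([0.2, 0.8], 0), ([0.7, 0.3], 1), ([0.1, 0.9], 0)]
drop_f0 = feature_dependency(model, data, 0, baseline=0.0)  # model depends on this
drop_f1 = feature_dependency(model, data, 1, baseline=0.0)  # model ignores this
```

A large drop signals dependence on the collapsed feature; a near-zero drop signals independence.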
Resource-constrained hardware, such as edge devices or cell phones, often relies on cloud servers to provide the required computational resources for inference in deep vision models. However, transferring image and video data from an edge or mobile device to a cloud server requires coding to deal with network constraints. The use of standardized codecs, such as JPEG or H.264, is prevalent and required to ensure interoperability. This paper aims to examine the implications of employing standardized codecs within deep vision pipelines. We find that using JPEG and H.264 coding significantly deteriorates accuracy across a broad range of vision tasks and models. For instance, strong compression rates reduce semantic segmentation accuracy by more than 80% in mIoU. In contrast to previous findings, our analysis extends beyond image and action classification to localization and dense prediction tasks, thus providing a more comprehensive perspective.
https://arxiv.org/abs/2404.12330
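A toy stand-in for the experiment above: coarse quantization plays the role of lossy coding, and a threshold segmenter plays the role of the vision model. This is illustrative only; the paper uses real JPEG/H.264 codecs and deep models:

```python
def quantize(img, step):
    """Stand-in for lossy coding: coarse intensity quantization."""
    return [[(v // step) * step for v in row] for row in img]

def segment(img, thresh=100):
    """Stand-in for a vision model: threshold segmentation."""
    return [[int(v >= thresh) for v in row] for row in img]

def pixel_accuracy(pred, gt):
    pairs = [(p, g) for pr, gr in zip(pred, gt) for p, g in zip(pr, gr)]
    return sum(p == g for p, g in pairs) / len(pairs)

img = [[110, 90], [200, 105]]
gt = segment(img)                                    # labels from the clean image
acc_raw = pixel_accuracy(segment(img), gt)           # no coding
acc_coded = pixel_accuracy(segment(quantize(img, 64)), gt)  # strong "compression"
```

Even this crude stand-in shows the qualitative effect: segmentation accuracy on the coded input falls below accuracy on the raw input.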
Deep models trained on large amounts of data often incorporate implicit biases present during training time. If later such a bias is discovered during inference or deployment, it is often necessary to acquire new data and retrain the model. This behavior is especially problematic in critical areas such as autonomous driving or medical decision-making. In these scenarios, new data is often expensive and hard to come by. In this work, we present a method based on change penalization that takes a pre-trained model and adapts its weights to mitigate a previously detected bias. We achieve this by tuning a zero-initialized copy of a frozen pre-trained network. Our method needs very few examples that contradict the bias, in extreme cases only a single one, to increase performance. Additionally, we propose an early stopping criterion to modify baselines and reduce overfitting. We evaluate our approach on a well-known bias in skin lesion classification and three other datasets from the domain shift literature. We find that our approach works especially well with very few images. Simple fine-tuning combined with our early stopping also leads to performance benefits for a larger number of tuning samples.
https://arxiv.org/abs/2404.12292
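The zero-initialized-copy idea above can be sketched for a linear model: the pre-trained weights stay frozen, and only an additive delta is tuned on the few bias-contradicting examples, with a penalty on how far the delta moves. A minimal sketch with invented numbers, not the paper's exact formulation:

```python
def forward(weights, delta, x):
    """Frozen pre-trained weights plus a tunable, zero-initialized delta."""
    return sum((w + d) * xi for w, d, xi in zip(weights, delta, x))

weights = [1.0, -2.0]            # frozen, "biased" pre-trained weights
delta = [0.0, 0.0]               # zero-initialized copy to be tuned
examples = [([1.0, 1.0], 2.0)]   # a single example contradicting the bias
lr, penalty = 0.1, 0.01
for _ in range(200):
    grad = [0.0, 0.0]
    for x, y in examples:
        err = forward(weights, delta, x) - y
        for i, xi in enumerate(x):
            grad[i] += 2 * err * xi          # squared-error gradient
    grad = [g + 2 * penalty * d for g, d in zip(grad, delta)]  # change penalty
    delta = [d - lr * g for d, g in zip(delta, grad)]
adapted_pred = forward(weights, delta, [1.0, 1.0])
```

The penalty keeps the adapted model close to the pre-trained one, which is what makes learning from a single counterexample feasible without destroying prior knowledge.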
Embedding models are crucial for various natural language processing tasks but can be limited by factors such as limited vocabulary, lack of context, and grammatical errors. This paper proposes a novel approach to improve embedding performance by leveraging large language models (LLMs) to enrich and rewrite input text before the embedding process. By utilizing ChatGPT 3.5 to provide additional context, correct inaccuracies, and incorporate metadata, the proposed method aims to enhance the utility and accuracy of embedding models. The effectiveness of this approach is evaluated on three datasets: Banking77Classification, TwitterSemEval 2015, and Amazon Counterfactual Classification. Results demonstrate significant improvements over the baseline model on the TwitterSemEval 2015 dataset, with the best-performing prompt achieving a score of 85.34 compared to the previous best of 81.52 on the Massive Text Embedding Benchmark (MTEB) Leaderboard. However, performance on the other two datasets was less impressive, highlighting the importance of considering domain-specific characteristics. The findings suggest that LLM-based text enrichment shows promising results for improving embedding performance, particularly in certain domains, thereby avoiding many of the limitations inherent in the embedding process.
https://arxiv.org/abs/2404.12283
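The enrich-then-embed pipeline above might look like the following sketch, where both the "LLM" and the embedding model are stand-in stubs (the paper uses ChatGPT 3.5 via API and real embedding models; the prompt wording here is invented):

```python
def enrich(text, llm):
    """Ask an LLM to add context and fix errors before embedding."""
    prompt = "Rewrite with added context and corrected spelling: " + text
    return llm(prompt)

def embed(text):
    """Stand-in embedding: bag-of-letters counts instead of a neural encoder."""
    vec = [0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    return vec

# a stub "LLM" that appends domain context (a real system would call an API)
stub_llm = lambda prompt: prompt.split(": ", 1)[1] + " (banking support query)"
vector = embed(enrich("card payment declned", stub_llm))
```

The key design choice is that enrichment happens strictly upstream of the embedding model, so any off-the-shelf embedder can benefit without retraining.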
In the big data era, integrating diverse data modalities poses significant challenges, particularly in complex fields like healthcare. This paper introduces a new process model for multimodal Data Fusion for Data Mining, integrating embeddings and the Cross-Industry Standard Process for Data Mining with the existing Data Fusion Information Group model. Our model aims to decrease computational costs, complexity, and bias while improving efficiency and reliability. We also propose "disentangled dense fusion", a novel embedding fusion method designed to optimize mutual information and facilitate dense inter-modality feature interaction, thereby minimizing redundant information. We demonstrate the model's efficacy through three use cases: predicting diabetic retinopathy using retinal images and patient metadata, domestic violence prediction employing satellite imagery, internet, and census data, and identifying clinical and demographic features from radiography images and clinical notes. The model achieved a Macro F1 score of 0.92 in diabetic retinopathy prediction, an R-squared of 0.854 and sMAPE of 24.868 in domestic violence prediction, and a macro AUC of 0.92 and 0.99 for disease prediction and sex classification, respectively, in radiological analysis. These results underscore the Data Fusion for Data Mining model's potential to significantly impact multimodal data processing, promoting its adoption in diverse, resource-constrained settings.
https://arxiv.org/abs/2404.12278
In this study, we introduce DeepLocalization, an innovative framework devised for the real-time localization of actions tailored explicitly for monitoring driver behavior. Utilizing the power of advanced deep learning methodologies, our objective is to tackle the critical issue of distracted driving, a significant factor contributing to road accidents. Our strategy employs a dual approach: leveraging Graph-Based Change-Point Detection for pinpointing actions in time alongside a Video Large Language Model (Video-LLM) for precisely categorizing activities. Through careful prompt engineering, we customize the Video-LLM to adeptly handle the nuances of driving activities, ensuring its classification efficacy even with sparse data. Engineered to be lightweight, our framework is optimized for consumer-grade GPUs, making it vastly applicable in practical scenarios. We subjected our method to rigorous testing on the SynDD2 dataset, a complex benchmark for distracted driving behaviors, where it demonstrated commendable performance, achieving 57.5% accuracy in event classification and 51% in event detection. These outcomes underscore the substantial promise of DeepLocalization in accurately identifying diverse driver behaviors and their temporal occurrences, all within the bounds of limited computational resources.
https://arxiv.org/abs/2404.12258
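The temporal-localization step above can be illustrated with a minimal change-point detector. The paper's detector is graph-based; this sketch, on an invented activity-score series, simply finds the split that minimizes within-segment variance:

```python
def change_point(series):
    """Split index that best separates the series into two flat segments."""
    best_t, best_cost = None, float("inf")
    for t in range(1, len(series)):
        left, right = series[:t], series[t:]
        mean_l = sum(left) / len(left)
        mean_r = sum(right) / len(right)
        cost = (sum((v - mean_l) ** 2 for v in left)
                + sum((v - mean_r) ** 2 for v in right))
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

activity_score = [0.1, 0.2, 0.1, 0.9, 1.0, 0.8]  # toy scores; action starts at frame 3
t = change_point(activity_score)
```

Once a change point is found, the surrounding clip is handed to the classifier (the Video-LLM in the paper) to name the activity.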
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) in computer vision has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we ask whether the fine-tuning performance of extremely simple, small-scale ViTs can also benefit from this pre-training paradigm, a question considerably less studied than the well-established methodology of designing lightweight architectures with sophisticated components. By carefully adapting various typical MIM pre-training methods to this lightweight regime and comparing them with contrastive learning (CL) pre-training on various downstream image classification and dense prediction tasks, we systematically observe different behaviors between MIM and CL with respect to downstream fine-tuning data scales. Furthermore, we analyze the frozen features under linear probing evaluation, as well as the layer representation similarities and attention maps across the obtained models, which clearly show the inferior learning of MIM pre-training at higher layers, leading to unsatisfactory fine-tuning performance on data-insufficient downstream tasks. This finding naturally guides the choice of appropriate distillation strategies during pre-training to solve the above deterioration problem. Extensive experiments on various vision tasks demonstrate the effectiveness of our observation-analysis-solution flow. In particular, our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M parameters) achieves 79.4%/78.9% top-1 accuracy on ImageNet-1K. It also enables SOTA performance on the ADE20K semantic segmentation task (42.8% mIoU) and the LaSOT visual tracking task (66.1% AUC) in the lightweight regime. The latter even surpasses all current SOTA lightweight CPU-realtime trackers.
https://arxiv.org/abs/2404.12210
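At its core, the distillation remedy above amounts to making the lightweight student match the teacher's layer representations. A minimal feature-distillation loss on invented feature vectors, not the paper's full recipe:

```python
def feature_distill_loss(student_feats, teacher_feats):
    """Mean squared error between student and (frozen) teacher layer features."""
    assert len(student_feats) == len(teacher_feats)
    return sum((s - t) ** 2
               for s, t in zip(student_feats, teacher_feats)) / len(student_feats)

teacher = [0.2, -1.0, 0.5]   # toy features from a large pre-trained teacher
student = [0.0, -0.8, 0.9]   # toy features from the lightweight student
loss = feature_distill_loss(student, teacher)
```

Applying such a term to higher layers during pre-training is one way to counter the weak higher-layer learning that the analysis attributes to MIM.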
As recent advances in mobile camera technology have enabled the capture of high-resolution images, such as 4K images, the demand for an efficient deblurring model that handles large motion has increased. In this paper, we discover that the image residual errors, i.e., blur-sharp pixel differences, can be grouped into categories according to their motion blur type and the complexity of their neighboring pixels. Inspired by this, we decompose the deblurring (regression) task into blur pixel discretization (pixel-level blur classification) and discrete-to-continuous conversion (regression with a blur class map) tasks. Specifically, we generate the discretized image residual errors by identifying the blur pixels and then transform them to a continuous form, which is computationally more efficient than naively solving the original regression problem with continuous values. Here, we found that the discretization result, i.e., the blur segmentation map, remarkably exhibits visual similarity with the image residual errors. As a result, our efficient model shows performance comparable to state-of-the-art methods on realistic benchmarks while being up to 10 times more computationally efficient.
https://arxiv.org/abs/2404.12168
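The two-stage decomposition above can be sketched in one dimension: residuals are first discretized to the nearest blur class (classification), then mapped back to continuous values via the class centers (discrete-to-continuous conversion). The class centers and residual values here are invented for illustration:

```python
def discretize(residuals, centers):
    """Pixel-level blur classification: map each residual to its nearest class."""
    return [min(range(len(centers)), key=lambda k: abs(centers[k] - r))
            for r in residuals]

def to_continuous(class_ids, centers):
    """Discrete-to-continuous conversion: look the class centers back up."""
    return [centers[i] for i in class_ids]

centers = [0.0, 0.5, 1.0]               # assumed blur classes
residuals = [0.05, 0.45, 0.95, 0.6]     # toy blur-sharp pixel differences
recovered = to_continuous(discretize(residuals, centers), centers)
```

Classification over a small set of classes is cheaper than full continuous regression, which is where the efficiency gain comes from.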
The delayed access to specialized psychiatric assessments and care for patients at risk of suicidal tendencies in emergency departments creates a notable gap in timely intervention, hindering the provision of adequate mental health support during critical situations. To address this, we present a non-invasive, speech-based approach for automatic suicide risk assessment. For our study, we have collected a novel dataset of speech recordings from $20$ patients from which we extract three sets of features, including wav2vec, interpretable speech and acoustic features, and deep learning-based spectral representations. We proceed by conducting a binary classification to assess suicide risk in a leave-one-subject-out fashion. Our most effective speech model achieves a balanced accuracy of $66.2\,\%$. Moreover, we show that integrating our speech model with a series of patients' metadata, such as the history of suicide attempts or access to firearms, improves the overall result. The metadata integration yields a balanced accuracy of $94.4\,\%$, marking an absolute improvement of $28.2\,\%$, demonstrating the efficacy of our proposed approaches for automatic suicide risk assessment in emergency medicine.
https://arxiv.org/abs/2404.12132
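The leave-one-subject-out evaluation above ensures no subject appears in both training and test data. A minimal sketch with a toy 1-nearest-neighbour stand-in for the speech model (the records and model here are invented):

```python
def leave_one_subject_out(records, fit):
    """records: (subject_id, feature, label); hold out one subject per fold."""
    subjects = sorted({s for s, _, _ in records})
    correct = 0
    for held_out in subjects:
        train = [(x, y) for s, x, y in records if s != held_out]
        test = [(x, y) for s, x, y in records if s == held_out]
        model = fit(train)
        correct += sum(model(x) == y for x, y in test)
    return correct / len(records)

def fit_1nn(train):
    """Toy 1-nearest-neighbour 'model' on a scalar feature."""
    return lambda x: min(train, key=lambda t: abs(t[0] - x))[1]

records = [(1, 0.1, 0), (2, 0.2, 0), (3, 0.9, 1), (4, 1.0, 1)]
acc = leave_one_subject_out(records, fit_1nn)
```

With only 20 patients, this per-subject holdout is what makes the reported balanced accuracies meaningful.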
Change detection (CD) from remote sensing (RS) images using deep learning has been widely investigated in the literature. It is typically regarded as a pixel-wise labeling task that aims to classify each pixel as changed or unchanged. Although per-pixel classification networks in encoder-decoder structures have shown dominance, they still suffer from imprecise boundaries and incomplete object delineation at various scenes. For high-resolution RS images, partly or totally changed objects are more worthy of attention rather than a single pixel. Therefore, we revisit the CD task from the mask prediction and classification perspective and propose MaskCD to detect changed areas by adaptively generating categorized masks from input image pairs. Specifically, it utilizes a cross-level change representation perceiver (CLCRP) to learn multiscale change-aware representations and capture spatiotemporal relations from encoded features by exploiting deformable multihead self-attention (DeformMHSA). Subsequently, a masked-attention-based detection transformer (MA-DETR) decoder is developed to accurately locate and identify changed objects based on masked attention and self-attention mechanisms. It reconstructs the desired changed objects by decoding the pixel-wise representations into learnable mask proposals and making final predictions from these candidates. Experimental results on five benchmark datasets demonstrate that the proposed approach outperforms other state-of-the-art models. Codes and pretrained models are available online (this https URL).
https://arxiv.org/abs/2404.12081
This study employs deep learning techniques to explore four speaker profiling tasks on the TIMIT dataset, namely gender classification, accent classification, age estimation, and speaker identification, highlighting the potential and challenges of multi-task learning versus single-task models. The motivation for this research is twofold: firstly, to empirically assess the advantages and drawbacks of multi-task learning over single-task models in the context of speaker profiling; secondly, to emphasize the undiminished significance of skillful feature engineering for speaker recognition tasks. The findings reveal challenges in accent classification, and multi-task learning is found advantageous for tasks of similar complexity. Non-sequential features are favored for speaker recognition, but sequential ones can serve as starting points for complex models. The study underscores the necessity of meticulous experimentation and parameter tuning for deep learning models.
https://arxiv.org/abs/2404.12077
This paper introduces a novel approach, evolutionary multi-objective optimisation for fairness-aware self-adjusting memory classifiers, designed to enhance fairness in machine learning algorithms applied to data stream classification. With the growing concern over discrimination in algorithmic decision-making, particularly in dynamic data stream environments, there is a need for methods that ensure fair treatment of individuals across sensitive attributes like race or gender. The proposed approach addresses this challenge by integrating the strengths of the self-adjusting memory K-Nearest-Neighbour algorithm with evolutionary multi-objective optimisation. This combination allows the new approach to efficiently manage concept drift in streaming data and leverage the flexibility of evolutionary multi-objective optimisation to maximise accuracy and minimise discrimination simultaneously. We demonstrate the effectiveness of the proposed approach through extensive experiments on various datasets, comparing its performance against several baseline methods in terms of accuracy and fairness metrics. Our results show that the proposed approach maintains competitive accuracy and significantly reduces discrimination, highlighting its potential as a robust solution for fairness-aware data stream classification. Further analyses also confirm the effectiveness of the strategies to trigger evolutionary multi-objective optimisation and adapt classifiers in the proposed approach.
https://arxiv.org/abs/2404.12076
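The two objectives being traded off above can be made concrete with a simple pair of metrics: predictive accuracy and a demographic-parity-style discrimination gap between sensitive groups. The numbers below are invented, and the discrimination measure is one common choice among several:

```python
def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def discrimination(preds, groups):
    """Absolute gap in positive-prediction rate between two sensitive groups."""
    rate = lambda g: (sum(p for p, gr in zip(preds, groups) if gr == g)
                      / groups.count(g))
    return abs(rate(0) - rate(1))

preds  = [1, 0, 1, 1, 0, 0]   # toy classifier outputs
labels = [1, 0, 1, 0, 0, 1]   # toy ground truth
groups = [0, 0, 0, 1, 1, 1]   # toy sensitive attribute
acc = accuracy(preds, labels)
disc = discrimination(preds, groups)
```

An evolutionary multi-objective optimiser would treat (acc, -disc) as the two objectives and search for classifiers on the resulting Pareto front.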
Knowledge of tree species distribution is fundamental to managing forests. New deep learning approaches promise significant accuracy gains for forest mapping, and are becoming a critical tool for mapping multiple tree species at scale. To advance the field, deep learning researchers need large benchmark datasets with high-quality annotations. To this end, we present the PureForest dataset: a large-scale, open, multimodal dataset designed for tree species classification from both Aerial Lidar Scanning (ALS) point clouds and Very High Resolution (VHR) aerial images. Most current public Lidar datasets for tree species classification have low diversity as they only span a small area of a few dozen annotated hectares at most. In contrast, PureForest has 18 tree species grouped into 13 semantic classes, and spans 339 km$^2$ across 449 distinct monospecific forests, and is to date the largest and most comprehensive Lidar dataset for the identification of tree species. By making PureForest publicly available, we hope to provide a challenging benchmark dataset to support the development of deep learning approaches for tree species identification from Lidar and/or aerial imagery. In this data paper, we describe the annotation workflow, the dataset, the recommended evaluation methodology, and establish a baseline performance from both 3D and 2D modalities.
https://arxiv.org/abs/2404.12064
The prevalence of digital media and evolving sociopolitical dynamics have significantly amplified the dissemination of hateful content. Existing studies mainly focus on classifying texts into binary categories, often overlooking the continuous spectrum of offensiveness and hatefulness inherent in the text. In this research, we present an extensive benchmark dataset for Amharic, comprising 8,258 tweets annotated for three distinct tasks: category classification, identification of hate targets, and rating offensiveness and hatefulness intensities. Our study highlights that a considerable majority of tweets belong to the less offensive and less hate intensity levels, underscoring the need for early interventions by stakeholders. The prevalence of ethnic and political hatred targets, with significant overlaps in our dataset, emphasizes the complex relationships within Ethiopia's sociopolitical landscape. We build classification and regression models and investigate the efficacy of models in handling these tasks. Our results reveal that hate and offensive speech cannot be addressed by a simplistic binary classification; instead, they manifest as variables across a continuous range of values. The Afro-XLMR-large model exhibits the best performances, achieving F1-scores of 75.30%, 70.59%, and 29.42% for the category, target, and regression tasks, respectively. The 80.22% correlation coefficient of the Afro-XLMR-large model indicates strong alignment.
https://arxiv.org/abs/2404.12042
Data-free knowledge distillation (DFKD) is a promising approach for addressing issues related to model compression, security, privacy, and transmission restrictions. Although the existing methods exploiting DFKD have achieved inspiring results in coarse-grained classification, in practical applications involving fine-grained classification tasks that require more detailed distinctions between similar categories, sub-optimal results are obtained. To address this issue, we propose an approach called DFKD-FGVC that extends DFKD to fine-grained visual categorization~(FGVC) tasks. Our approach utilizes an adversarial distillation framework with an attention generator, mixed high-order attention distillation, and semantic feature contrast learning. Specifically, we introduce a spatial-wise attention mechanism to the generator to synthesize fine-grained images with more details of discriminative parts. We also utilize the mixed high-order attention mechanism to capture complex interactions among parts and the subtle differences among discriminative features of the fine-grained categories, paying attention to both local features and semantic context relationships. Moreover, we leverage the teacher and student models of the distillation framework to contrast high-level semantic feature maps in the hyperspace, comparing variances of different categories. We evaluate our approach on three widely-used FGVC benchmarks (Aircraft, Cars196, and CUB200) and demonstrate its superior performance.
https://arxiv.org/abs/2404.12037
Semantic scene completion, also known as semantic occupancy prediction, can provide dense geometric and semantic information for autonomous vehicles, which attracts the increasing attention of both academia and industry. Unfortunately, existing methods usually formulate this task as a voxel-wise classification problem and treat each voxel equally in 3D space during training. Because hard voxels have not been paid enough attention, performance in some challenging regions is limited. The 3D dense space typically contains a large number of empty voxels, which are easy to learn but require large amounts of computation because existing models handle all voxels uniformly. Furthermore, the voxels in the boundary region are more challenging to differentiate than those in the interior. In this paper, we propose the HASSC approach to train semantic scene completion models with a hardness-aware design. Global hardness, derived from the network optimization process, is defined for dynamic hard voxel selection. Then, local hardness with geometric anisotropy is adopted for voxel-wise refinement. In addition, a self-distillation strategy is introduced to make the training process stable and consistent. Extensive experiments show that our HASSC scheme can effectively promote the accuracy of the baseline model without incurring extra inference cost. Source code is available at: this https URL.
https://arxiv.org/abs/2404.11958
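The global-hardness idea above reduces to ranking voxels by their training loss and concentrating effort on the hardest ones. A minimal sketch with invented per-voxel losses (the paper derives hardness dynamically from the optimization process):

```python
def select_hard_voxels(voxel_losses, k):
    """Global hardness: pick the k voxels with the largest training loss."""
    ranked = sorted(range(len(voxel_losses)),
                    key=lambda i: voxel_losses[i], reverse=True)
    return sorted(ranked[:k])

voxel_losses = [0.01, 0.90, 0.02, 0.75, 0.03]  # toy per-voxel losses
hard = select_hard_voxels(voxel_losses, 2)     # indices of the hardest voxels
```

The many easy (mostly empty) voxels contribute little loss and can be down-weighted, while the selected hard voxels, typically near object boundaries, receive the extra voxel-wise refinement.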
Foundation models, pre-trained on a large amount of data have demonstrated impressive zero-shot capabilities in various downstream tasks. However, in object detection and instance segmentation, two fundamental computer vision tasks heavily reliant on extensive human annotations, foundation models such as SAM and DINO struggle to achieve satisfactory performance. In this study, we reveal that the devil is in the object boundary, \textit{i.e.}, these foundation models fail to discern boundaries between individual objects. For the first time, we probe that CLIP, which has never accessed any instance-level annotations, can provide a highly beneficial and strong instance-level boundary prior in the clustering results of its particular intermediate layer. Following this surprising observation, we propose $\textbf{Zip}$ which $\textbf{Z}$ips up CL$\textbf{ip}$ and SAM in a novel classification-first-then-discovery pipeline, enabling annotation-free, complex-scene-capable, open-vocabulary object detection and instance segmentation. Our Zip significantly boosts SAM's mask AP on COCO dataset by 12.5% and establishes state-of-the-art performance in various settings, including training-free, self-training, and label-efficient finetuning. Furthermore, annotation-free Zip even achieves comparable performance to the best-performing open-vocabulary object detecters using base annotations. Code is released at this https URL
https://arxiv.org/abs/2404.11957
Despite the progress of Semi-supervised Learning (SSL), existing methods fail to utilize unlabeled data effectively and efficiently. Many pseudo-label-based methods select unlabeled examples based on inaccurate confidence scores from the classifier. Most prior work also uses all available unlabeled data without pruning, making it difficult to handle large amounts of unlabeled data. To address these issues, we propose two methods: Variational Confidence Calibration (VCC) and Influence-Function-based Unlabeled Sample Elimination (INFUSE). VCC is a universal plugin for SSL confidence calibration, using a variational autoencoder to select more accurate pseudo labels based on three types of consistency scores. INFUSE is a data pruning method that constructs a core dataset of unlabeled examples under SSL. Our methods are effective in multiple datasets and settings, reducing classification error rates and saving training time. Together, VCC-INFUSE reduces the error rate of FlexMatch on the CIFAR-100 dataset by 1.08% while saving nearly half of the training time.
https://arxiv.org/abs/2404.11947
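The consistency-score idea behind VCC can be illustrated in a stripped-down form: keep a pseudo-label only when predictions across augmented views agree. This sketch drops the variational autoencoder and the three specific score types from the paper and uses invented view predictions:

```python
def filter_pseudo_labels(view_predictions, min_agreement=1.0):
    """Keep a pseudo-label only when augmented views agree strongly enough."""
    kept = []
    for i, preds in enumerate(view_predictions):
        top = max(set(preds), key=preds.count)       # majority class
        agreement = preds.count(top) / len(preds)    # crude consistency score
        if agreement >= min_agreement:
            kept.append((i, top))
    return kept

# predicted class per augmented view, for three unlabeled examples
views = [[1, 1, 1], [0, 1, 0], [2, 2, 2]]
kept = filter_pseudo_labels(views)
```

Filtering unreliable pseudo-labels before training is what guards against the inaccurate classifier confidence scores the abstract criticizes; INFUSE then further prunes the surviving unlabeled pool.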
Explainable Artificial Intelligence (XAI) poses a significant challenge in providing transparent and understandable insights into complex AI models. Traditional post-hoc algorithms, while useful, often struggle to deliver interpretable explanations. Concept-based models offer a promising avenue by incorporating explicit representations of concepts to enhance interpretability. However, existing research on automatic concept discovery methods is often limited by lower-level concepts, costly human annotation requirements, and a restricted domain of background knowledge. In this study, we explore the potential of a Large Language Model (LLM), specifically GPT-4, by leveraging its domain knowledge and common-sense capability to generate high-level concepts that are meaningful as explanations for humans, for a specific setting of image classification. We use minimal textual object information available in the data via prompting to facilitate this process. To evaluate the output, we compare the concepts generated by the LLM with two other methods: concepts generated by humans and the ECII heuristic concept induction system. Since there is no established metric to determine the human understandability of concepts, we conducted a human study to assess the effectiveness of the LLM-generated concepts. Our findings indicate that while human-generated explanations remain superior, concepts derived from GPT-4 are more comprehensible to humans compared to those generated by ECII.
https://arxiv.org/abs/2404.11875