We propose a fully unsupervised algorithm that detects from electroencephalography (EEG) recordings when a subject actively listens to sound, versus when the sound is ignored. This problem is known as absolute auditory attention decoding (aAAD). We propose an unsupervised discriminative canonical correlation analysis (CCA) model for feature extraction and combine it with an unsupervised classifier called minimally informed linear discriminant analysis (MILDA) for aAAD classification. Remarkably, the proposed unsupervised algorithm performs significantly better than a state-of-the-art supervised model. A key reason is that the unsupervised algorithm can successfully adapt to the non-stationary test data at a low computational cost. This opens the door to analyzing the auditory attention of a subject from EEG signals with a model that automatically tunes itself to the subject, without requiring an arduous supervised training session beforehand.
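The abstract does not spell out the discriminative CCA variant or the MILDA classifier, so the following is only a minimal sketch of the usual CCA front end in auditory attention decoding: project time-lagged EEG and the speech envelope onto correlated components and use per-window correlations as features. It uses standard CCA from scikit-learn as a stand-in; the sampling rate, window length, lag count, and component count are illustrative assumptions.

```python
# Minimal sketch: stimulus-response correlation features via standard CCA,
# as a stand-in for the paper's (unsupervised, discriminative) CCA front end.
import numpy as np
from sklearn.cross_decomposition import CCA

def lag_matrix(x, n_lags):
    """Stack time-lagged copies of a (T, C) signal into (T, C * n_lags)."""
    T, C = x.shape
    out = np.zeros((T, C * n_lags))
    for L in range(n_lags):
        out[L:, L * C:(L + 1) * C] = x[:T - L]
    return out

def cca_correlation_features(eeg, envelope, fs=64, win_sec=10, n_lags=16, n_comp=2):
    """Per-window correlations between CCA-projected EEG and speech envelope."""
    X = lag_matrix(eeg, n_lags)                         # (T, channels * lags)
    Y = lag_matrix(envelope.reshape(-1, 1), n_lags)     # (T, lags)
    cca = CCA(n_components=n_comp).fit(X, Y)            # no attention labels used
    Xc, Yc = cca.transform(X, Y)
    win = fs * win_sec
    feats = []
    for start in range(0, len(Xc) - win + 1, win):
        seg_x, seg_y = Xc[start:start + win], Yc[start:start + win]
        feats.append([np.corrcoef(seg_x[:, k], seg_y[:, k])[0, 1] for k in range(n_comp)])
    return np.asarray(feats)                            # (n_windows, n_comp)
```

These per-window correlation features would then be fed to an unsupervised classifier (MILDA in the paper) to decide whether the sound was attended.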
https://arxiv.org/abs/2504.17724
This paper presents a comprehensive empirical analysis of conformal prediction methods on a challenging aerial image dataset featuring diverse events in unconstrained environments. Conformal prediction is a powerful post-hoc technique that takes the output of any classifier and transforms it into a set of likely labels, providing a statistical guarantee on the coverage of the true label. Unlike evaluations on standard benchmarks, our study addresses the complexities of data-scarce and highly variable real-world settings. We investigate the effectiveness of leveraging pretrained models (MobileNet, DenseNet, and ResNet), fine-tuned with limited labeled data, to generate informative prediction sets. To further evaluate the impact of calibration, we consider two parallel pipelines (with and without temperature scaling) and assess performance using two key metrics: empirical coverage and average prediction set size. This setup allows us to systematically examine how calibration choices influence the trade-off between reliability and efficiency. Our findings demonstrate that even with relatively small labeled samples and simple nonconformity scores, conformal prediction can yield valuable uncertainty estimates for complex tasks. Moreover, our analysis reveals that while temperature scaling is often employed for calibration, it does not consistently lead to smaller prediction sets, underscoring the importance of careful consideration in its application. Furthermore, our results highlight the significant potential of model compression techniques within the conformal prediction pipeline for deployment in resource-constrained environments. Based on our observations, we advocate for future research to delve into the impact of noisy or ambiguous labels on conformal prediction performance and to explore effective model reduction strategies.
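A minimal sketch of the split conformal procedure described above, using the simple nonconformity score 1 − softmax probability of the true class, with and without temperature scaling, and reporting empirical coverage and average set size. The arrays `calib_logits`, `calib_labels`, `test_logits`, `test_labels` are assumed to come from a fine-tuned backbone (e.g., MobileNet, DenseNet, or ResNet); the temperature value shown is illustrative, and the paper's exact scores and pipeline may differ.

```python
# Split conformal prediction with s(x, y) = 1 - softmax(x)_y, plus the two
# evaluation metrics used above: empirical coverage and average set size.
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def conformal_sets(calib_logits, calib_labels, test_logits, alpha=0.1, T=1.0):
    p_cal = softmax(calib_logits, T)
    scores = 1.0 - p_cal[np.arange(len(calib_labels)), calib_labels]
    n = len(scores)
    # Finite-sample-corrected quantile for (1 - alpha) marginal coverage.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, q_level, method="higher")
    p_test = softmax(test_logits, T)
    return (1.0 - p_test) <= q            # boolean matrix: one prediction set per row

def coverage_and_size(sets, test_labels):
    covered = sets[np.arange(len(test_labels)), test_labels].mean()
    return covered, sets.sum(axis=1).mean()

# Usage (illustrative temperature): compare calibrated vs. uncalibrated pipelines.
# sets_raw = conformal_sets(calib_logits, calib_labels, test_logits, T=1.0)
# sets_ts  = conformal_sets(calib_logits, calib_labels, test_logits, T=1.7)
# print(coverage_and_size(sets_raw, test_labels), coverage_and_size(sets_ts, test_labels))
```

Both pipelines keep the same coverage guarantee; what changes with temperature scaling is typically only the set size, which is exactly the trade-off the study measures.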
https://arxiv.org/abs/2504.17655
The proliferation of abusive language in online communications has posed significant risks to the health and wellbeing of individuals and communities. The growing concern regarding online abuse and its consequences necessitates methods for identifying and mitigating harmful content and facilitating continuous monitoring, moderation, and early intervention. This paper presents a taxonomy for distinguishing key characteristics of abusive language within online text. Our approach uses a systematic method for taxonomy development, integrating classification systems of 18 existing multi-label datasets to capture key characteristics relevant to online abusive language classification. The resulting taxonomy is hierarchical and faceted, comprising 5 categories and 17 dimensions. It classifies various facets of online abuse, including context, target, intensity, directness, and theme of abuse. This shared understanding can lead to more cohesive efforts, facilitate knowledge exchange, and accelerate progress in the field of online abuse detection and mitigation among researchers, policy makers, online platform owners, and other stakeholders.
https://arxiv.org/abs/2504.17653
As false information continues to proliferate across social media platforms, effective rumor detection has emerged as a pressing challenge in natural language processing. This paper proposes RAGAT-Mind, a multi-granular modeling approach for Chinese rumor detection, built upon the MindSpore deep learning framework. The model integrates TextCNN for local semantic extraction, bidirectional GRU for sequential context learning, Multi-Head Self-Attention for global dependency focusing, and Bidirectional Graph Convolutional Networks (BiGCN) for structural representation of word co-occurrence graphs. Experiments on the Weibo1-Rumor dataset demonstrate that RAGAT-Mind achieves superior classification performance, attaining 99.2% accuracy and a macro-F1 score of 0.9919. The results validate the effectiveness of combining hierarchical linguistic features with graph-based semantic structures. Furthermore, the model exhibits strong generalization and interpretability, highlighting its practical value for real-world rumor detection applications.
https://arxiv.org/abs/2504.17574
Urban land use classification and mapping are critical for urban planning, resource management, and environmental monitoring. Existing remote sensing techniques often lack precision in complex urban environments due to the absence of ground-level details. Unlike aerial perspectives, street view images provide a ground-level view that captures more human and social activities relevant to land use in complex urban scenes. Existing street view-based methods primarily rely on supervised classification, which is challenged by the scarcity of high-quality labeled data and the difficulty of generalizing across diverse urban landscapes. This study introduces an unsupervised contrastive clustering model for street view images with a built-in geographical prior, to enhance clustering performance. When combined with a simple visual assignment of the clusters, our approach offers a flexible and customizable solution to land use mapping, tailored to the specific needs of urban planners. We experimentally show that our method can generate land use maps from geotagged street view image datasets of two cities. As our methodology relies on the universal spatial coherence of geospatial data ("Tobler's law"), it can be adapted to various settings where street view images are available, to enable scalable, unsupervised land use mapping and updating. The code will be available at this https URL.
https://arxiv.org/abs/2504.17551
The recent global spread of monkeypox, particularly in regions where it has not historically been prevalent, has raised significant public health concerns. Early and accurate diagnosis is critical for effective disease management and control. In response, this study proposes a novel deep learning-based framework for the automated detection of monkeypox from skin lesion images, leveraging the power of transfer learning, dimensionality reduction, and advanced machine learning techniques. We utilize the newly developed Monkeypox Skin Lesion Dataset (MSLD), which includes images of monkeypox, chickenpox, and measles, to train and evaluate our models. The proposed framework employs the Xception architecture for deep feature extraction, followed by Principal Component Analysis (PCA) for dimensionality reduction, and the Natural Gradient Boosting (NGBoost) algorithm for classification. To optimize the model's performance and generalization, we introduce the African Vultures Optimization Algorithm (AVOA) for hyperparameter tuning, ensuring efficient exploration of the parameter space. Our results demonstrate that the proposed AVOA-NGBoost model achieves state-of-the-art performance, with an accuracy of 97.53%, F1-score of 97.72% and an AUC of 97.47%. Additionally, we enhance model interpretability using Grad-CAM and LIME techniques, providing insights into the decision-making process and highlighting key features influencing classification. This framework offers a highly precise and efficient diagnostic tool, potentially aiding healthcare providers in early detection and diagnosis, particularly in resource-constrained environments.
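A rough sketch of the classification pipeline named above (Xception deep features, then PCA, then NGBoost), assuming the `tensorflow`/`keras`, `scikit-learn`, and `ngboost` packages. The image size, PCA dimensionality, class count, and NGBoost settings are illustrative, and the AVOA hyperparameter search used in the paper is not reproduced here.

```python
# Sketch: Xception feature extraction -> PCA -> NGBoost classification.
import tensorflow as tf
from sklearn.decomposition import PCA
from ngboost import NGBClassifier
from ngboost.distns import k_categorical

def extract_features(images):
    """images: float array of shape (N, 299, 299, 3) with values in [0, 255]."""
    backbone = tf.keras.applications.Xception(
        weights="imagenet", include_top=False, pooling="avg")
    x = tf.keras.applications.xception.preprocess_input(images)
    return backbone.predict(x, verbose=0)            # (N, 2048) deep features

def fit_pipeline(train_images, train_labels, n_classes=3, n_components=64):
    # n_classes should match the dataset's label set (illustrative default).
    feats = extract_features(train_images)
    pca = PCA(n_components=n_components).fit(feats)  # dimensionality reduction
    clf = NGBClassifier(Dist=k_categorical(n_classes), verbose=False)
    clf.fit(pca.transform(feats), train_labels)
    return pca, clf

def predict(pca, clf, images):
    return clf.predict(pca.transform(extract_features(images)))
```

In the paper, the NGBoost and PCA hyperparameters are tuned with the African Vultures Optimization Algorithm; the fixed values above merely stand in for that search.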
https://arxiv.org/abs/2504.17540
We propose a novel sample selection method for image classification in the presence of noisy labels. Existing methods typically consider small-loss samples as correctly labeled. However, some correctly labeled samples are inherently difficult for the model to learn and can exhibit high loss similar to mislabeled samples in the early stages of training. Consequently, setting a threshold on per-sample loss to select correct labels results in a trade-off between precision and recall in sample selection: a lower threshold may miss many correctly labeled hard-to-learn samples (low recall), while a higher threshold may include many mislabeled samples (low precision). To address this issue, our goal is to accurately distinguish correctly labeled yet hard-to-learn samples from mislabeled ones, thus alleviating the trade-off dilemma. We achieve this by considering the trends in model prediction confidence rather than relying solely on loss values. Empirical observations show that only for correctly labeled samples, the model's prediction confidence for the annotated labels typically increases faster than for any other classes. Based on this insight, we propose tracking the confidence gaps between the annotated labels and other classes during training and evaluating their trends using the Mann-Kendall Test. A sample is considered potentially correctly labeled if all its confidence gaps tend to increase. Our method functions as a plug-and-play component that can be seamlessly integrated into existing sample selection techniques. Experiments on several standard benchmarks and real-world datasets demonstrate that our method enhances the performance of existing methods for learning with noisy labels.
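A minimal sketch of the selection rule described above: for one sample, track the per-epoch gap between the model's confidence in the annotated label and each other class, run a Mann-Kendall trend test on each gap series, and flag the sample as likely clean only if every gap trends upward. The significance level and the plain (tie-free) variance formula are simplifying assumptions.

```python
# Mann-Kendall trend test on confidence gaps for noisy-label sample selection.
import numpy as np
from scipy.stats import norm

def mann_kendall_increasing(series, alpha=0.05):
    """One-sided Mann-Kendall test for an increasing monotonic trend."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0          # no tie correction
    if s > 0:
        z = (s - 1) / np.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / np.sqrt(var_s)
    else:
        z = 0.0
    return z > norm.ppf(1 - alpha)

def likely_clean(prob_history, annotated_label):
    """prob_history: (n_epochs, n_classes) softmax outputs recorded for one sample."""
    gaps = prob_history[:, [annotated_label]] - prob_history   # gap to every class
    others = [c for c in range(prob_history.shape[1]) if c != annotated_label]
    return all(mann_kendall_increasing(gaps[:, c]) for c in others)
```

Used as a plug-in, `likely_clean` can replace or augment the small-loss criterion inside an existing sample selection routine.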
https://arxiv.org/abs/2504.17474
Multiple instance learning (MIL) is a promising approach for weakly supervised classification in pathology using whole slide images (WSIs). However, conventional MIL methods such as Attention-Based Deep Multiple Instance Learning (ABMIL) typically disregard spatial interactions among patches that are crucial to pathological diagnosis. Recent advancements, such as Transformer-based MIL (TransMIL), have incorporated spatial context and inter-patch relationships. However, it remains unclear whether explicitly modeling patch relationships yields similar performance gains in ABMIL, which relies solely on Multi-Layer Perceptrons (MLPs). In contrast, TransMIL employs Transformer-based layers, introducing a fundamental architectural shift at the cost of substantially increased computational complexity. In this work, we enhance the ABMIL framework by integrating interaction-aware representations to address this question. Our proposed model, Global ABMIL (GABMIL), explicitly captures inter-instance dependencies while preserving computational efficiency. Experimental results on two publicly available datasets for tumor subtyping in breast and lung cancers demonstrate that GABMIL achieves up to a 7 percentage point improvement in AUPRC and a 5 percentage point increase in the Kappa score over ABMIL, with minimal or no additional computational overhead. These findings underscore the importance of incorporating patch interactions within MIL frameworks.
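For orientation, below is a minimal PyTorch sketch of the ABMIL-style gated attention pooling that this work builds on. Per the abstract, GABMIL additionally mixes information across patches ("interaction-aware" representations) before this pooling step; that extension is not shown, and the dimensions are illustrative.

```python
# ABMIL-style gated attention pooling over patch embeddings of one WSI.
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    def __init__(self, in_dim=1024, hid_dim=256, n_classes=2):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hid_dim, 1)
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, patches):                  # patches: (n_patches, in_dim)
        a = self.attn_w(self.attn_v(patches) * self.attn_u(patches))  # (n, 1)
        a = torch.softmax(a, dim=0)              # attention weight per patch
        slide_repr = (a * patches).sum(dim=0)    # attention-weighted bag embedding
        return self.classifier(slide_repr), a

# bag = torch.randn(500, 1024)                  # 500 patch features of one slide
# logits, attention = GatedAttentionMIL()(bag)
```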
https://arxiv.org/abs/2504.17379
Spurious correlations that lead models to correct predictions for the wrong reasons pose a critical challenge for robust real-world generalization. Existing research attributes this issue to group imbalance and addresses it by maximizing group-balanced or worst-group accuracy, which heavily relies on expensive bias annotations. A compromise approach involves predicting bias information using extensively pretrained foundation models, which requires large-scale data and becomes impractical for resource-limited rare domains. To address these challenges, we offer a novel perspective by reframing the spurious correlations as imbalances or mismatches in class-conditional distributions, and propose a simple yet effective robust learning method that eliminates the need for both bias annotations and predictions. With the goal of reducing the mutual information between spurious factors and label information, our method leverages a sample reweighting strategy to achieve class-conditional distribution balancing, which automatically highlights minority groups and classes, effectively dismantling spurious correlations and producing a debiased data distribution for classification. Extensive experiments and analysis demonstrate that our approach consistently delivers state-of-the-art performance, rivaling methods that rely on bias supervision.
https://arxiv.org/abs/2504.17314
Artificial intelligence (AI) systems, particularly those based on deep learning models, have increasingly achieved expert-level performance in medical applications. However, there is growing concern that such AI systems may reflect and amplify human bias, and reduce the quality of their performance in historically under-served populations. The fairness issue has attracted considerable research interest in the medical imaging classification field, yet it remains understudied in the text generation domain. In this study, we investigate the fairness problem in text generation within the medical field and observe significant performance discrepancies across different races, sexes, and age groups, including intersectional groups, various model scales, and different evaluation metrics. To mitigate this fairness issue, we propose an algorithm that selectively optimizes those underperformed groups to reduce bias. The selection rules take into account not only word-level accuracy but also the pathology accuracy to the target reference, while ensuring that the entire process remains fully differentiable for effective model training. Our evaluations across multiple backbones, datasets, and modalities demonstrate that our proposed algorithm enhances fairness in text generation without compromising overall performance. Specifically, the disparities among various groups across different metrics were diminished by more than 30% with our algorithm, while the relative change in text generation accuracy was typically within 2%. By reducing the bias generated by deep learning models, our proposed approach can potentially alleviate concerns about the fairness and reliability of text generation diagnosis in medical domain. Our code is publicly available to facilitate further research at this https URL.
https://arxiv.org/abs/2504.17279
Downsampling layers are crucial building blocks in CNN architectures, which help to increase the receptive field for learning high-level features and reduce the amount of memory/computation in the model. In this work, we study the generalization of the uniform downsampling layer for group-equivariant architectures, e.g., G-CNNs. That is, we aim to downsample signals (feature maps) on general finite groups with anti-aliasing. This involves the following: (a) Given a finite group and a downsampling rate, we present an algorithm to form a suitable choice of subgroup. (b) Given a group and a subgroup, we study the notion of bandlimitedness and propose how to perform anti-aliasing. Notably, our method generalizes the notion of downsampling based on classical sampling theory. When the signal is on a cyclic group, i.e., periodic, our method recovers the standard downsampling of an ideal low-pass filter followed by a subsampling operation. Finally, we conduct experiments on image classification tasks demonstrating that the proposed downsampling operation improves accuracy, better preserves equivariance, and reduces model size when incorporated into G-equivariant networks.
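The cyclic-group special case mentioned above is the classical one, and a minimal numpy sketch makes it concrete: downsample a periodic signal by an integer factor via an ideal low-pass filter in the Fourier domain followed by subsampling. The general finite-group algorithm (subgroup selection and group-level anti-aliasing) is not reproduced here.

```python
# Anti-aliased downsampling on a cyclic group: ideal low-pass + subsample.
import numpy as np

def downsample_cyclic(x, factor):
    """x: 1-D periodic signal whose length is divisible by `factor`."""
    n = len(x)
    assert n % factor == 0, "length must be divisible by the downsampling factor"
    X = np.fft.fft(x)
    # Keep only frequencies strictly below the Nyquist rate of the target grid.
    keep = np.abs(np.fft.fftfreq(n)) < 1.0 / (2 * factor)
    x_lp = np.fft.ifft(X * keep).real        # ideal low-pass filtered signal
    return x_lp[::factor]                    # subsample onto the chosen subgroup

# y = downsample_cyclic(np.sin(2 * np.pi * 3 * np.arange(64) / 64), factor=4)
```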
https://arxiv.org/abs/2504.17258
Diffusion models have shown remarkable progress in various generative tasks such as image and video generation. This paper studies the problem of leveraging pretrained diffusion models for performing discriminative tasks. Specifically, we extend the discriminative capability of pretrained frozen generative diffusion models from the classification task to the more complex object detection task, by "inverting" a pretrained layout-to-image diffusion model. To this end, we propose a gradient-based discrete optimization approach that replaces the heavy prediction enumeration process, and a prior distribution model that makes more accurate use of Bayes' rule. Empirical results show that this method is on par with basic discriminative object detection baselines on the COCO dataset. In addition, our method can greatly speed up the previous diffusion-based method for classification without sacrificing accuracy. Code and models are available at this https URL.
https://arxiv.org/abs/2504.17253
Auscultatory analysis using an electronic stethoscope has attracted increasing attention in the clinical diagnosis of respiratory diseases. Recently, neural networks have been applied to assist in respiratory sound classification with promising results. However, the task remains challenging due to the scarcity of abnormal respiratory sounds. In this paper, we propose a novel architecture, namely Waveform-Logmel audio neural networks (WLANN), which uses both the waveform and the log-mel spectrogram as input features and uses Bidirectional Gated Recurrent Units (Bi-GRU) to model the temporal context of the fused features. Experimental results of our WLANN applied to the SPRSound respiratory dataset show that the proposed framework can effectively distinguish pathological respiratory sound classes, outperforming previous studies with a sensitivity of 90.3% and a total score of 93.6%. Our study demonstrates the high effectiveness of WLANN in the diagnosis of respiratory diseases.
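The abstract only names the ingredients (a raw-waveform branch, a log-mel branch, feature fusion, and a Bi-GRU), so the following is a rough PyTorch sketch of that kind of dual-input composition. Layer sizes, strides, sampling rate, class count, and the fusion scheme are assumptions, not the architecture from the paper.

```python
# Rough dual-branch sketch: waveform CNN + log-mel spectrogram, fused,
# then a bidirectional GRU and a clip-level classification head.
import torch
import torch.nn as nn
import torchaudio

class WaveformLogmelNet(nn.Module):
    def __init__(self, sample_rate=8000, n_mels=64, hidden=128, n_classes=4):
        super().__init__()
        self.logmel = nn.Sequential(
            torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate,
                                                 n_fft=1024, hop_length=256,
                                                 n_mels=n_mels),
            torchaudio.transforms.AmplitudeToDB(),
        )
        # Strided 1-D convs so the waveform branch's frame rate roughly
        # matches the spectrogram's hop-based frame rate (overall stride 256).
        self.wave_branch = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=16, padding=24), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=16, stride=16), nn.ReLU(),
        )
        self.gru = nn.GRU(input_size=n_mels + 64, hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, wav):                        # wav: (batch, samples)
        mel = self.logmel(wav)                     # (batch, n_mels, frames)
        wv = self.wave_branch(wav.unsqueeze(1))    # (batch, 64, frames')
        frames = min(mel.shape[-1], wv.shape[-1])
        fused = torch.cat([mel[..., :frames], wv[..., :frames]], dim=1)
        out, _ = self.gru(fused.transpose(1, 2))   # (batch, frames, 2*hidden)
        return self.head(out.mean(dim=1))          # clip-level logits
```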
https://arxiv.org/abs/2504.17156
As the field of representation learning grows, there has been a proliferation of different loss functions to solve different classes of problems. We introduce a single information-theoretic equation that generalizes a large collection of modern loss functions in machine learning. In particular, we introduce a framework, I-Con, which shows that several broad classes of machine learning methods are precisely minimizing an integrated KL divergence between two conditional distributions: the supervisory and learned representations. This viewpoint exposes a hidden information geometry underlying clustering, spectral methods, dimensionality reduction, contrastive learning, and supervised learning. The framework enables the development of new loss functions by combining successful techniques from across the literature. We not only present a wide array of proofs, connecting over 23 different approaches, but also leverage these theoretical results to create state-of-the-art unsupervised image classifiers that achieve a +8% improvement over the prior state of the art on unsupervised classification on ImageNet-1K. We also demonstrate that I-Con can be used to derive principled debiasing methods which improve contrastive representation learners.
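Schematically, the unifying objective described above can be written as follows; the notation is not fixed by the abstract, so the symbols here are illustrative. For each data point \(i\), a supervisory conditional distribution \(p(\cdot \mid i)\) over other points (or labels) is compared against the distribution \(q_\phi(\cdot \mid i)\) induced by the learned representation, and the loss is the averaged (integrated) KL divergence

\[
\mathcal{L}(\phi) \;=\; \frac{1}{N}\sum_{i=1}^{N} D_{\mathrm{KL}}\!\left( p(\cdot \mid i) \,\middle\|\, q_\phi(\cdot \mid i) \right).
\]

Particular choices of \(p\) and \(q_\phi\) then recover clustering, spectral, dimensionality-reduction, contrastive, and supervised objectives as special cases.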
https://arxiv.org/abs/2504.16929
In recent years, the detection of AI-generated text has become a critical area of research due to concerns about academic integrity, misinformation, and ethical AI deployment. This paper presents COT Fine-tuned, a novel framework for detecting AI-generated text and identifying the specific language model responsible for generating the text. We propose a dual-task approach, where Task A involves classifying text as AI-generated or human-written, and Task B identifies the specific LLM behind the text. The key innovation of our method lies in the use of Chain-of-Thought reasoning, which enables the model to generate explanations for its predictions, enhancing transparency and interpretability. Our experiments demonstrate that COT Fine-tuned achieves high accuracy in both tasks, with strong performance in LLM identification and human-AI classification. We also show that the CoT reasoning process contributes significantly to the model's effectiveness and interpretability.
https://arxiv.org/abs/2504.16913
Most datasets for sentiment analysis lack the context in which an opinion was expressed, which is often crucial for emotion understanding, and are mainly limited to a few emotion categories. Foundation large language models (LLMs) like GPT-4 suffer from over-predicting emotions and are too resource-intensive. We design an LLM-based data synthesis pipeline and leverage a large model, Mistral-7b, to generate training examples for more accessible, lightweight BERT-type encoder models. We focus on enlarging the semantic diversity of examples and propose grounding the generation in a corpus of narratives to produce non-repetitive, story-character-centered utterances with unique contexts over 28 emotion classes. By running 700K inferences in 450 GPU hours, we contribute a dataset of 100K contextual and 300K context-less examples to cover both scenarios. We use it to fine-tune pre-trained encoders, which results in several Emo Pillars models. We show that Emo Pillars models are highly adaptive to new domains when tuned to specific tasks such as GoEmotions, ISEAR, IEMOCAP, and EmoContext, reaching state-of-the-art performance on the first three. We also validate our dataset through statistical analysis and human evaluation, confirming the success of our measures in utterance diversification (although less so for the neutral class) and context personalization, while pointing out the need for improved handling of out-of-taxonomy labels within the pipeline.
https://arxiv.org/abs/2504.16856
We present an open-source, low-cost photogrammetry system for 3D plant modeling and phenotyping. The system uses a structure-from-motion approach to reconstruct 3D representations of the plants via point clouds. Using wheat as an example, we demonstrate how various phenotypic traits can be computed easily from the point clouds. These include standard measurements such as plant height and radius, as well as features that would be more cumbersome to measure by hand, such as leaf angles and convex hull. We further demonstrate the utility of the system through the investigation of specific metrics that may yield objective classifications of erectophile versus planophile wheat canopy architectures.
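A minimal sketch of how the simpler traits mentioned above (height, radius, convex hull) can be read off a reconstructed point cloud with numpy/scipy. It assumes the cloud is in metric units, the z-axis is vertical, and soil/background points have already been removed; it is not the paper's exact implementation, and the input filename is hypothetical.

```python
# Simple phenotypic traits from a plant point cloud.
import numpy as np
from scipy.spatial import ConvexHull

def plant_traits(points):
    """points: (N, 3) array of x, y, z coordinates in metres."""
    z = points[:, 2]
    height = z.max() - z.min()                               # plant height
    centroid_xy = points[:, :2].mean(axis=0)
    radius = np.linalg.norm(points[:, :2] - centroid_xy, axis=1).max()
    hull = ConvexHull(points)                                # 3-D canopy convex hull
    return {"height": height,
            "radius": radius,
            "hull_volume": hull.volume,
            "hull_area": hull.area}

# traits = plant_traits(np.loadtxt("wheat_plant.xyz"))       # hypothetical file
```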
https://arxiv.org/abs/2504.16840
Contrastive Language-Image Pre-training (CLIP) has achieved success on multiple downstream tasks by aligning image and text modalities. However, the nature of global contrastive learning limits CLIP's ability to comprehend compositional concepts, such as relations and attributes. Although recent studies employ global hard negative samples to improve compositional understanding, these methods significantly compromise the model's inherent general capabilities by forcibly distancing textual negative samples from images in the embedding space. To overcome this limitation, we introduce a Decoupled Global-Local Alignment (DeGLA) framework that improves compositional understanding while substantially mitigating losses in general capabilities. To optimize the retention of the model's inherent capabilities, we incorporate a self-distillation mechanism within the global alignment process, aligning the learnable image-text encoder with a frozen teacher model derived from an exponential moving average. This self-distillation constraint effectively mitigates the catastrophic forgetting of pretrained knowledge during fine-tuning. To improve compositional understanding, we first leverage the in-context learning capability of Large Language Models (LLMs) to construct about 2M high-quality negative captions across five types. Subsequently, we propose the Image-Grounded Contrast (IGC) loss and Text-Grounded Contrast (TGC) loss to enhance vision-language compositionality. Extensive experimental results demonstrate the effectiveness of the DeGLA framework. Compared to previous state-of-the-art methods, DeGLA achieves an average enhancement of 3.5% across the VALSE, SugarCrepe, and ARO benchmarks. Concurrently, it obtains an average performance improvement of 13.0% on zero-shot classification tasks across eleven datasets. Our code will be released at this https URL.
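The EMA-teacher self-distillation described above can be sketched as follows in PyTorch: the teacher is a frozen copy of the image-text encoder updated as an exponential moving average of the student, and a distillation term pulls student embeddings toward teacher embeddings. The momentum value, the cosine form of the distillation term, and how it is weighted against the global/local contrastive losses are assumptions, not the paper's exact choices.

```python
# EMA-teacher self-distillation sketch.
import copy
import torch
import torch.nn.functional as F

def make_teacher(student):
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)                    # teacher receives no gradients
    return teacher

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # teacher <- momentum * teacher + (1 - momentum) * student, after each step
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def distillation_loss(student_emb, teacher_emb):
    """Cosine-style distillation between L2-normalized embeddings."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()
```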
https://arxiv.org/abs/2504.16801
The rapid growth of unlabeled time-series data in domains such as wireless communications, radar, biomedical engineering, and the Internet of Things (IoT) has driven advancements in unsupervised learning. This review synthesizes recent progress in applying autoencoders and vision transformers for unsupervised signal analysis, focusing on their architectures, applications, and emerging trends. We explore how these models enable feature extraction, anomaly detection, and classification across diverse signal types, including electrocardiograms, radar waveforms, and IoT sensor data. The review highlights the strengths of hybrid architectures and self-supervised learning, while identifying challenges in interpretability, scalability, and domain generalization. By bridging methodological innovations and practical applications, this work offers a roadmap for developing robust, adaptive models for signal intelligence.
https://arxiv.org/abs/2504.16972
Recent advances in large language models have significantly improved their ability to process long-context input, but practical applications are challenged by increased inference time and resource consumption, particularly in resource-constrained environments. To address these challenges, we propose MOOSComp, a token-classification-based long-context compression method that enhances the performance of a BERT-based compressor by mitigating the over-smoothing problem and incorporating outlier scores. In the training phase, we add an inter-class cosine similarity loss term to penalize excessively similar token representations, thereby improving the token classification accuracy. During the compression phase, we introduce outlier scores to preserve rare but critical tokens that are prone to be discarded in task-agnostic compression. These scores are integrated with the classifier's output, making the compressor more generalizable to various tasks. Superior performance is achieved at various compression ratios on long-context understanding and reasoning benchmarks. Moreover, our method obtains a speedup of 3.3x at a 4x compression ratio on a resource-constrained mobile device.
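A minimal sketch of an inter-class cosine similarity penalty of the kind described above: token representations belonging to different classes (e.g., "keep" vs. "discard") are pushed apart by penalizing the cosine similarity between their class-mean representations, counteracting over-smoothing. The paper's exact formulation and loss weighting may differ.

```python
# Inter-class cosine similarity penalty for token-classification training.
import torch
import torch.nn.functional as F

def inter_class_cosine_loss(hidden_states, token_labels):
    """hidden_states: (n_tokens, dim); token_labels: (n_tokens,) integer class ids."""
    classes = token_labels.unique()
    means = torch.stack([hidden_states[token_labels == c].mean(dim=0) for c in classes])
    means = F.normalize(means, dim=-1)
    sim = means @ means.T                              # pairwise cosine similarities
    off_diag = sim - torch.eye(len(classes), device=sim.device)
    # Penalize only positive similarity between different class means.
    return off_diag.clamp(min=0).sum() / max(len(classes) * (len(classes) - 1), 1)

# total_loss = token_ce_loss + lambda_cos * inter_class_cosine_loss(h, labels)
```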
https://arxiv.org/abs/2504.16786