Multimodal Large Language Models (MLLMs) are increasingly deployed in fine-tuning-as-a-service (FTaaS) settings, where user-submitted datasets adapt general-purpose models to downstream tasks. This flexibility, however, introduces serious security risks, as malicious fine-tuning can implant backdoors into MLLMs with minimal effort. In this paper, we observe that backdoor triggers systematically disrupt cross-modal processing by causing abnormal attention concentration on non-semantic regions--a phenomenon we term attention collapse. Based on this insight, we propose Believe Your Eyes (BYE), a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. BYE operates via a three-stage pipeline: (1) extracting attention maps using the fine-tuned model, (2) computing entropy scores and profiling sensitive layers via bimodal separation, and (3) performing unsupervised clustering to remove suspicious samples. Unlike prior defenses, BYE requires no clean supervision, auxiliary labels, or model modifications. Extensive experiments across various datasets, models, and diverse trigger types validate BYE's effectiveness: it achieves near-zero attack success rates while maintaining clean-task performance, offering a robust and generalizable solution against backdoor threats in MLLMs.
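A minimal sketch of stages (2)-(3), assuming per-sample visual attention maps have already been extracted from the fine-tuned model; the function names, shapes, and the two-cluster decision rule are illustrative assumptions, not the authors' code:

```python
# Score each training sample by the entropy of its attention distribution,
# then split samples into two clusters and drop the suspicious one.
import numpy as np
from sklearn.cluster import KMeans

def attention_entropy(attn_map: np.ndarray) -> float:
    """Shannon entropy of a normalized attention map (any shape)."""
    p = attn_map.flatten()
    p = p / (p.sum() + 1e-12)
    return float(-(p * np.log(p + 1e-12)).sum())

def filter_suspicious(attn_maps: list[np.ndarray]) -> np.ndarray:
    """Return a boolean mask over samples: True = keep."""
    scores = np.array([[attention_entropy(a)] for a in attn_maps])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
    # The cluster with lower mean entropy is treated as backdoored:
    # triggers concentrate attention, so its entropy collapses.
    means = [scores[labels == k].mean() for k in (0, 1)]
    bad = int(np.argmin(means))
    return labels != bad
```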
https://arxiv.org/abs/2505.16916
Deep learning has transformed computer vision but relies heavily on large labeled datasets and computational resources. Transfer learning, particularly fine-tuning pretrained models, offers a practical alternative; however, models pretrained on natural image datasets such as ImageNet may fail to capture domain-specific characteristics in medical imaging. This study introduces an unsupervised learning framework that extracts high-value dermatological features instead of relying solely on ImageNet-based pretraining. We employ a Variational Autoencoder (VAE) trained from scratch on a proprietary dermatological dataset, allowing the model to learn a structured and clinically relevant latent space. This self-supervised feature extractor is then compared to an ImageNet-pretrained backbone under identical classification conditions, highlighting the trade-offs between general-purpose and domain-specific pretraining. Our results reveal distinct learning patterns. The self-supervised model achieves a final validation loss of 0.110 (-33.33%), while the ImageNet-pretrained model stagnates at 0.100 (-16.67%), indicating overfitting. Accuracy trends confirm this: the self-supervised model improves from 45% to 65% (+44.44%) with a near-zero overfitting gap, whereas the ImageNet-pretrained model reaches 87% (+50.00%) but plateaus at 75% (+19.05%), with its overfitting gap increasing to +0.060. These findings suggest that while ImageNet pretraining accelerates convergence, it also amplifies overfitting on non-clinically relevant features. In contrast, self-supervised learning achieves steady improvements, stronger generalization, and superior adaptability, underscoring the importance of domain-specific feature extraction in medical imaging.
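As a rough illustration of the feature extractor described above, a minimal convolutional VAE in PyTorch; the image size (64x64), layer widths, and latent dimension are assumptions for the sketch, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.enc = nn.Sequential(  # 3x64x64 -> 64x16x16
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(), nn.Flatten())
        self.fc_mu = nn.Linear(64 * 16 * 16, latent_dim)
        self.fc_logvar = nn.Linear(64 * 16 * 16, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term plus KL divergence to the unit Gaussian prior.
    rec = F.mse_loss(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld
```

After training, the encoder (without the sampling step) serves as the frozen or fine-tunable feature extractor for the downstream classifier.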
https://arxiv.org/abs/2505.16773
Voice conversion research has recently focused increasingly on improving the zero-shot capabilities of existing methods. Despite remarkable advancements, current architectures still tend to struggle in zero-shot cross-lingual settings and are often unable to generalize to speakers of unseen languages and accents. In this paper, we adopt a simple yet effective approach that combines discrete speech representations from self-supervised models with a non-autoregressive, Diffusion-Transformer-based conditional flow matching speech decoder. We show that this architecture allows us to train a voice conversion model in a purely textless, self-supervised fashion. Our technique works without requiring multiple encoders to disentangle speech features, and our model also excels in zero-shot cross-lingual settings, even for unseen languages.
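A hedged sketch of the conditional flow-matching objective such a decoder is typically trained with: regress a velocity field along straight paths from noise to the target acoustic features. The `model(xt, t, cond)` interface and the feature shapes are assumptions:

```python
import torch

def cfm_loss(model, x1, cond):
    """x1: target acoustic features (B, T, D); cond: discrete-token conditioning."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    xt = (1 - t) * x0 + t * x1                     # point on the straight path
    v_target = x1 - x0                             # constant velocity of the path
    v_pred = model(xt, t.squeeze(), cond)          # DiT-style velocity network
    return ((v_pred - v_target) ** 2).mean()
```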
https://arxiv.org/abs/2505.16691
The learning mechanisms by which humans acquire internal representations of objects are not fully understood. Deep neural networks (DNNs) have emerged as a useful tool for investigating this question, as they acquire internal representations similar to those of humans as a byproduct of optimizing their objective functions. While previous studies have shown that models trained with various learning paradigms - such as supervised, self-supervised, and CLIP - acquire human-like representations, it remains unclear whether their similarity to human representations is primarily at a coarse category level or extends to finer details. Here, we employ an unsupervised alignment method based on Gromov-Wasserstein Optimal Transport to compare human and model object representations at both fine-grained and coarse-grained levels. Unlike conventional representational similarity analysis, this method estimates optimal fine-grained mappings between the representations of individual objects in humans and in models, which we use to assess the extent to which each object's representation in humans is correctly mapped to the representation of the same object in models. Using human similarity judgments of 1,854 objects from the THINGS dataset, we find that models trained with CLIP consistently achieve strong fine- and coarse-grained matching with human object representations. In contrast, self-supervised models showed limited matching at both fine- and coarse-grained levels, but still formed object clusters that reflected human coarse category structure. Our results offer new insights into the role of linguistic information in acquiring precise object representations and the potential of self-supervised learning to capture coarse categorical structures.
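A minimal sketch of the unsupervised alignment step using the POT optimal-transport library; `D_human` and `D_model` are pairwise dissimilarity matrices over the same objects, and the "matching rate" below is one simple way to score whether each object maps to itself:

```python
import numpy as np
import ot  # pip install pot  (https://pythonot.github.io)

def gw_matching_rate(D_human: np.ndarray, D_model: np.ndarray) -> float:
    n = D_human.shape[0]
    p = np.full(n, 1.0 / n)  # uniform weights over objects
    q = np.full(n, 1.0 / n)
    # Transport plan aligning the two dissimilarity structures.
    T = ot.gromov.gromov_wasserstein(D_human, D_model, p, q, "square_loss")
    # Object i is "correctly mapped" if its strongest match is object i.
    return float((T.argmax(axis=1) == np.arange(n)).mean())
```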
https://arxiv.org/abs/2505.16419
Recently, pre-trained models for music information retrieval based on self-supervised learning (SSL) have become popular, showing success in various downstream tasks. However, there is limited research on the specific meaning of the encoded information and its applicability. Exploring these aspects can help us better understand the models' capabilities and limitations, leading to more effective use in downstream tasks. In this study, we analyze the advanced music representation model MusicFM and the newly emerged SSL model MuQ. We focus on three main aspects: (i) validating the advantages of SSL models across multiple downstream tasks, (ii) exploring the specialization of layer-wise information for different tasks, and (iii) comparing performance differences when selecting specific layers. Through this analysis, we reveal insights into the structure and potential applications of SSL models in music information retrieval.
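Layer-wise analyses of this kind are commonly run with linear probes; a minimal sketch, assuming frozen per-layer embeddings have already been extracted from MusicFM or MuQ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_layers(features: dict[int, np.ndarray], labels: np.ndarray):
    """features[l]: layer-l embeddings, shape (n_clips, dim)."""
    scores = {}
    for layer, X in features.items():
        clf = LogisticRegression(max_iter=1000)
        scores[layer] = cross_val_score(clf, X, labels, cv=5).mean()
    return scores  # e.g. {0: 0.61, 1: 0.64, ...}: per-layer task accuracy
```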
https://arxiv.org/abs/2505.16306
In this paper, we propose MADCluster, a novel model-agnostic anomaly detection framework utilizing self-supervised clustering. MADCluster is applicable to various deep learning architectures and addresses the 'hypersphere collapse' problem inherent in existing deep learning-based anomaly detection methods. The core idea is to cluster normal pattern data into a 'single cluster' while simultaneously learning the cluster center and mapping data close to this center. In addition, to improve expressiveness and enable effective single clustering, we propose a new 'One-directed Adaptive loss', whose optimization we prove mathematically. MADCluster consists of three main components: a Base Embedder capturing high-dimensional temporal dynamics, Cluster Distance Mapping, and Sequence-wise Clustering for continuous center updates. Its model-agnostic character comes from the ability to plug various architectures into the Base Embedder. Experiments on four time series benchmark datasets demonstrate that applying MADCluster improves the overall performance of comparative models. In conclusion, the compatibility of MADCluster shows potential for enhancing model performance across various architectures.
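One plausible reading of the single-cluster idea, sketched in PyTorch: embed sequences with an arbitrary Base Embedder, pull embeddings toward a learnable center, and score anomalies by distance to it. The published One-directed Adaptive loss and the sequence-wise center update involve more machinery than shown here:

```python
import torch
import torch.nn as nn

class SingleClusterHead(nn.Module):
    def __init__(self, base_embedder: nn.Module, dim: int):
        super().__init__()
        self.embed = base_embedder          # any architecture (model-agnostic)
        self.center = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        z = self.embed(x)                               # (B, dim)
        return ((z - self.center) ** 2).sum(dim=-1)     # squared distance

def training_loss(dist: torch.Tensor) -> torch.Tensor:
    # Normal training data is mapped close to the (jointly learned) center;
    # at test time, a large `dist` marks a sequence as anomalous.
    return dist.mean()
```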
https://arxiv.org/abs/2505.16223
This paper introduces Meta-PerSER, a novel meta-learning framework that personalizes Speech Emotion Recognition (SER) by adapting to each listener's unique way of interpreting emotion. Conventional SER systems rely on aggregated annotations, which often overlook individual subtleties and lead to inconsistent predictions. In contrast, Meta-PerSER leverages a Model-Agnostic Meta-Learning (MAML) approach enhanced with Combined-Set Meta-Training, Derivative Annealing, and per-layer per-step learning rates, enabling rapid adaptation with only a few labeled examples. By integrating robust representations from pre-trained self-supervised models, our framework first captures general emotional cues and then fine-tunes itself to personal annotation styles. Experiments on the IEMOCAP corpus demonstrate that Meta-PerSER significantly outperforms baseline methods in both seen and unseen data scenarios, highlighting its promise for personalized emotion recognition.
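A compact sketch of the MAML-style inner loop with per-layer, per-step learning rates; `model` stands for a small head over frozen SSL features and `lrs` for learnable positive scalars. This illustrates the general mechanism, not the paper's exact training recipe:

```python
import torch

def inner_adapt(model, loss_fn, support_x, support_y, lrs, steps=3):
    """lrs[step][name]: a learnable positive scalar per parameter tensor."""
    params = {n: p for n, p in model.named_parameters()}
    for s in range(steps):
        preds = torch.func.functional_call(model, params, (support_x,))
        loss = loss_fn(preds, support_y)
        grads = torch.autograd.grad(loss, list(params.values()),
                                    create_graph=True)
        params = {n: p - lrs[s][n] * g           # per-layer, per-step rate
                  for (n, p), g in zip(params.items(), grads)}
    return params  # weights adapted to this listener's few labeled examples
```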
https://arxiv.org/abs/2505.16220
Recent studies have highlighted the potential of discrete tokens derived from self-supervised learning (SSL) models for various speech-related tasks. These tokens serve not only as substitutes for text in language modeling but also as intermediate representations for tasks such as automatic speech recognition (ASR). However, discrete tokens are typically obtained via k-means clustering of SSL features independently of downstream tasks, making them suboptimal for specific applications. This paper proposes the use of differentiable k-means, enabling the joint optimization of tokenization and downstream tasks. This approach enables the fine-tuning of the SSL parameters and the learning of weights for outputs from multiple SSL layers. Experiments were conducted with ASR as a downstream task, and ASR accuracy successfully improved owing to the optimized tokens. The acquired tokens also exhibited greater purity of phonetic information, which proved useful even in speech resynthesis.
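A sketch of one standard way to make k-means tokenization differentiable: soft assignments via a softmax over negative distances, with a straight-through estimator so the forward pass still emits hard token ids. The temperature and the straight-through choice are assumptions here:

```python
import torch
import torch.nn.functional as F

def diff_kmeans_tokens(feats, centroids, tau: float = 1.0):
    """feats: (B, T, D) SSL features; centroids: (K, D) cluster centers."""
    d2 = torch.cdist(feats, centroids.unsqueeze(0)) ** 2   # (B, T, K)
    soft = F.softmax(-d2 / tau, dim=-1)                    # differentiable
    hard = F.one_hot(soft.argmax(-1), centroids.size(0)).float()
    assign = hard + soft - soft.detach()                   # straight-through
    quantized = assign @ centroids                         # (B, T, D)
    return quantized, soft.argmax(-1)                      # features, token ids
```

Because both `feats` and `centroids` receive gradients, the SSL encoder and the codebook can be tuned jointly with the downstream ASR loss.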
https://arxiv.org/abs/2505.16207
Recently, a method was proposed for synthesizing foreign-accented speech using only native speech data, based on discrete tokens obtained from self-supervised learning (SSL) models. Given the limited availability of accented speech data, this method is expected to make it much easier to simulate foreign accents. By using the synthesized accented speech as listening material for humans or as training data for automatic speech recognition (ASR), both will acquire higher robustness against foreign accents. However, the previous method has a fatal flaw: it cannot reproduce duration-related accents. Durational accents are commonly seen when L2 speakers, whose native language has syllable-timed or mora-timed rhythm, speak stress-timed languages such as English. In this paper, we integrate duration modification into the previous method to simulate foreign accents more accurately. Experiments show that the proposed method successfully replicates the durational accents seen in real L2 speech.
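A toy sketch of token-level duration modification: collapse repeated SSL tokens into (token, duration) runs, rescale the durations, and expand back. The uniform scaling policy is a stand-in; the paper's modification rule may differ:

```python
from itertools import groupby

def rescale_durations(tokens: list[int], scale: float) -> list[int]:
    runs = [(t, sum(1 for _ in g)) for t, g in groupby(tokens)]
    out: list[int] = []
    for tok, dur in runs:
        new_dur = max(1, int(dur * scale + 0.5))  # rescale each run's length
        out.extend([tok] * new_dur)
    return out

# Example: stretch every run by 1.5x to mimic a more syllable-timed rhythm.
print(rescale_durations([5, 5, 9, 9, 9, 2], 1.5))
# -> [5, 5, 5, 9, 9, 9, 9, 9, 2, 2]
```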
https://arxiv.org/abs/2505.16191
In this study, we gained insight that contributes to achieving accent-robust ASR using only native speech data. In human perception of non-native speech, a phenomenon known as the "interlanguage speech intelligibility benefit" (ISIB) is observed, where non-native listeners who share the speaker's native language understand the speech better than even native listeners do. Based on the idea that discrete tokens extracted from self-supervised learning (SSL) models represent the human perception of speech, we conducted an analytical study of the robustness of discrete-token-based ASR to non-native speech, varying the language used to train the tokenization, which can be viewed as a technical implementation of ISIB. The results showed that ISIB actually occurred in the discrete-token-based ASR. Since our approach relies only on native speech data to simulate the behavior of human perception, it is expected to be applicable to a wide range of accents for which speech data is scarce.
https://arxiv.org/abs/2505.16182
Scene text detection has seen the emergence of high-performing methods that excel on academic benchmarks. However, these detectors often fail to replicate such success in real-world scenarios. Through extensive experiments, we uncover two key factors contributing to this discrepancy. First, a Fine-tuning Gap, where models leverage the Dataset-Specific Optimization (DSO) paradigm for one domain at the cost of reduced effectiveness in others, leads to inflated performance on academic benchmarks. Second, the suboptimal performance in practical settings is primarily attributed to the long-tailed distribution of texts, where detectors struggle with rare and complex categories such as artistic or overlapped text. Given that the DSO paradigm might undermine the generalization ability of models, we advocate a Joint-Dataset Learning (JDL) protocol to alleviate the Fine-tuning Gap. Additionally, we conduct an error analysis that identifies three major categories and 13 subcategories of challenges in long-tailed scene text, upon which we propose a Long-Tailed Benchmark (LTB). LTB facilitates a comprehensive evaluation of the ability to handle a diverse range of long-tailed challenges. We further introduce MAEDet, a self-supervised learning-based method, as a strong baseline for LTB. The code is available at this https URL.
https://arxiv.org/abs/2505.15649
To overcome the constraints of the underwater environment and improve the accuracy and robustness of underwater target detection models, this paper develops a specialized dataset for underwater target detection and proposes an efficient algorithm for underwater multi-target detection. A self-supervised learning method based on the SimSiam structure is employed to pre-train the underwater target detection network. To address the low detection accuracy caused by the low contrast, mutual occlusion, and dense distribution of underwater targets, a detection model suited to underwater targets is proposed by introducing deformable convolution and dilated convolution, which enlarge the receptive field and allow the model to capture more effective information. In addition, the regression loss function EIoU is introduced, which improves model performance by separately calculating the width and height losses of the predicted box. Experimental results show that the proposed detector improves the accuracy of underwater target detection.
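A sketch of the EIoU regression loss mentioned above: the IoU term plus separate penalties for center distance and for width/height differences, each normalized by the smallest enclosing box. Boxes are assumed to be (x1, y1, x2, y2) tensors of shape (N, 4):

```python
import torch

def eiou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    # Intersection over union
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Smallest enclosing box
    elt = torch.min(pred[:, :2], target[:, :2])
    erb = torch.max(pred[:, 2:], target[:, 2:])
    ew, eh = (erb - elt).unbind(dim=1)
    # Center distance, plus width and height losses computed separately
    c_pred = (pred[:, :2] + pred[:, 2:]) / 2
    c_tgt = (target[:, :2] + target[:, 2:]) / 2
    rho2 = ((c_pred - c_tgt) ** 2).sum(dim=1)
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    loss = (1 - iou
            + rho2 / (ew ** 2 + eh ** 2 + eps)
            + (w_p - w_t) ** 2 / (ew ** 2 + eps)
            + (h_p - h_t) ** 2 / (eh ** 2 + eps))
    return loss.mean()
```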
https://arxiv.org/abs/2505.15518
Recent efforts at scaling computer vision models have established Vision Transformers (ViTs) as the leading architecture. ViTs incorporate weight sharing over image patches as an important inductive bias. In this work, we show that ViTs benefit from incorporating equivariance under the octic group, i.e., reflections and 90-degree rotations, as a further inductive bias. We develop new architectures, octic ViTs, that use octic-equivariant layers and put them to the test on both supervised and self-supervised learning. Through extensive experiments on DeiT-III and DINOv2 training on ImageNet-1K, we show that octic ViTs yield more computationally efficient networks while also improving performance. In particular, we achieve an approximately 40% reduction in FLOPs for ViT-H while simultaneously improving both classification and segmentation results.
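For reference, the octic group is the eight symmetries of the square: four 90-degree rotations, each with and without a reflection. The sketch below only enumerates this group action on images and averages predictions over the orbit (a simple test-time-augmentation baseline); the paper instead builds the equivariance into the layers themselves:

```python
import torch

def octic_orbit(x: torch.Tensor):
    """Yield the 8 octic-group transforms of an image batch (B, C, H, W)."""
    for flip in (False, True):
        base = torch.flip(x, dims=[-1]) if flip else x
        for k in range(4):
            yield torch.rot90(base, k, dims=(-2, -1))

@torch.no_grad()
def d4_averaged_logits(model, x):
    # Invariant prediction by averaging over the group orbit.
    return torch.stack([model(t) for t in octic_orbit(x)]).mean(dim=0)
```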
https://arxiv.org/abs/2505.15441
Chromosome analysis is vital for diagnosing genetic disorders and guiding cancer therapy decisions through the identification of somatic clonal aberrations. However, developing AI models is hindered by the overwhelming complexity and diversity of chromosomal abnormalities, which require extensive annotation efforts, while automated methods remain task-specific and lack generalizability due to the scarcity of comprehensive datasets spanning diverse resource conditions. Here, we introduce CHROMA, a foundation model for cytogenomics designed to overcome these challenges by learning generalizable representations of chromosomal abnormalities. Pre-trained on over 84,000 specimens (~4 million chromosomal images) via self-supervised learning, CHROMA outperforms other methods across all types of abnormalities, even when trained on less labelled data and more imbalanced datasets. By facilitating comprehensive mapping of instability and clonal lesions across various aberration types, CHROMA offers a scalable and generalizable solution for reliable, automated clinical analysis. It reduces the annotation workload for experts, advances precision oncology through the early detection of rare genomic abnormalities, enables broad clinical AI applications, and makes advanced genomic analysis more accessible.
https://arxiv.org/abs/2505.15868
Self-supervised learning has emerged as a powerful paradigm for training deep neural networks, particularly in medical imaging where labeled data is scarce. While current approaches typically rely on synthetic augmentations of single images, we propose VET-DINO, a framework that leverages a unique characteristic of medical imaging: the availability of multiple standardized views from the same study. Using a series of clinical veterinary radiographs from the same patient study, we enable models to learn view-invariant anatomical structures and develop an implied 3D understanding from 2D projections. We demonstrate our approach on a dataset of 5 million veterinary radiographs from 668,000 canine studies. Through extensive experimentation, including view synthesis and downstream task performance, we show that learning from real multi-view pairs leads to superior anatomical understanding compared to purely synthetic augmentations. VET-DINO achieves state-of-the-art performance on various veterinary imaging tasks. Our work establishes a new paradigm for self-supervised learning in medical imaging that leverages domain-specific properties rather than merely adapting natural image techniques.
https://arxiv.org/abs/2505.15248
Lung cancer remains among the deadliest types of cancer in recent decades, and early lung nodule detection is crucial for improving patient outcomes. The limited availability of annotated medical imaging data remains a bottleneck in developing accurate computer-aided diagnosis (CAD) systems. Self-supervised learning can help leverage large amounts of unlabeled data to develop more robust CAD systems. With the recent advent of transformer-based architectures and their ability to generalize to unseen tasks, the healthcare community has made efforts to adapt them to various medical downstream tasks. Thus, we propose a novel "LungNodule-SSM" method, which utilizes self-supervised learning with DINOv2 as a backbone to enhance lung nodule detection and classification without annotated data. Our methodology has two stages: first, the DINOv2 model is pre-trained on unlabeled CT scans to learn robust feature representations; second, these features are fine-tuned using transformer-based architectures for lesion-level detection and accurate lung nodule diagnosis. The proposed method has been evaluated on the challenging LUNA16 dataset, consisting of 888 CT scans, and compared with SOTA methods. Our experimental results show the superiority of the proposed method, with an accuracy of 98.37%, demonstrating its effectiveness in lung nodule detection. The source code, datasets, and pre-processed data can be accessed at this https URL.
https://arxiv.org/abs/2505.15120
We introduce SHEET, a multi-purpose open-source toolkit designed to accelerate subjective speech quality assessment (SSQA) research. SHEET stands for the Speech Human Evaluation Estimation Toolkit, which focuses on data-driven deep neural network-based models trained to predict human-labeled quality scores of speech samples. SHEET provides comprehensive training and evaluation scripts, multi-dataset and multi-model support, as well as pre-trained models accessible via Torch Hub and HuggingFace Spaces. To demonstrate its capabilities, we re-evaluated SSL-MOS, a speech self-supervised learning (SSL)-based SSQA model widely used in recent scientific papers, on an extensive list of speech SSL models. Experiments were conducted on two representative SSQA datasets named BVCC and NISQA, and we identified the optimal speech SSL model, whose performance surpassed the original SSL-MOS implementation and was comparable to state-of-the-art methods.
https://arxiv.org/abs/2505.15061
Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs' strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attempt to mitigate these limitations but typically necessitate a validation set for tuning and yield static MAS designs that lack adaptability during inference. We introduce SELF-MAS, the first self-supervised, inference-time-only framework for automatic MAS design. SELF-MAS employs meta-level design to iteratively generate, evaluate, and refine MAS configurations tailored to each problem instance, without requiring a validation set. Critically, it enables dynamic agent composition and problem decomposition through meta-feedback on solvability and completeness. Experiments across math, graduate-level QA, and software engineering benchmarks, using both closed-source and open-source LLM backbones of varying sizes, demonstrate that SELF-MAS outperforms both manual and automatic MAS baselines, achieving a 7.44% average accuracy improvement over the next strongest baseline while maintaining cost-efficiency. These findings underscore the promise of meta-level self-supervised design for creating effective and adaptive MAS.
https://arxiv.org/abs/2505.14996
Offline goal-conditioned reinforcement learning (GCRL) is a promising approach for pretraining generalist policies on large datasets of reward-free trajectories, akin to the self-supervised objectives used to train foundation models for computer vision and natural language processing. However, scaling GCRL to longer horizons remains challenging due to the combination of sparse rewards and discounting, which obscures the comparative advantages of primitive actions with respect to distant goals. Hierarchical RL methods achieve strong empirical results on long-horizon goal-reaching tasks, but their reliance on modular, timescale-specific policies and subgoal generation introduces significant additional complexity and hinders scaling to high-dimensional goal spaces. In this work, we introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces. We further show that existing hierarchical and bootstrapping-based approaches correspond to specific design choices within our derivation. Across a comprehensive suite of state- and pixel-based locomotion and manipulation benchmarks, our method matches or surpasses state-of-the-art offline GCRL algorithms and scales to complex, long-horizon tasks where prior approaches fail.
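A rough sketch of the advantage-weighted policy extraction this family of methods builds on (AWR-style): weight the log-likelihood of dataset actions by their exponentiated advantage. The goal-conditioned Q and V estimators, the clamp value, and the interfaces are assumptions; the paper's actual derivation additionally bootstraps on subgoal-conditioned policies:

```python
import torch

def awis_policy_loss(policy, q_fn, v_fn, states, actions, goals, beta=1.0):
    """policy(states, goals) is assumed to return a torch.distributions object."""
    with torch.no_grad():
        adv = q_fn(states, actions, goals) - v_fn(states, goals)
        weights = torch.exp(adv / beta).clamp(max=100.0)  # stabilize the weights
    log_prob = policy(states, goals).log_prob(actions).sum(-1)
    return -(weights * log_prob).mean()  # weighted behavioral cloning
```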
https://arxiv.org/abs/2505.14975
Self-Supervised Learning (SSL) has led to considerable progress in Speaker Verification (SV). The standard framework uses same-utterance positive sampling and data augmentation to generate anchor-positive pairs of the same speaker. This is a major limitation, as the strategy primarily encodes channel information from the recording condition shared by the anchor and the positive. We propose a new positive sampling technique to address this bottleneck: Self-Supervised Positive Sampling (SSPS). For a given anchor, SSPS aims to find an appropriate positive, i.e., one of the same speaker identity but a different recording condition, in the latent space using clustering assignments and a memory queue of positive embeddings. SSPS improves SV performance for both SimCLR and DINO, reaching 2.57% and 2.53% EER and outperforming SOTA SSL methods on VoxCeleb1-O. In particular, SimCLR-SSPS achieves a 58% EER reduction by lowering intra-speaker variance, providing performance comparable to DINO-SSPS.
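A minimal sketch of SSPS-style positive selection, assuming a memory queue of recent embeddings annotated with cluster ids (pseudo-speakers) and utterance/recording ids; the fallback rule for empty candidate sets is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_positives(anchor_clusters: np.ndarray,
                     anchor_utt_ids: np.ndarray,
                     queue_emb: np.ndarray,
                     queue_clusters: np.ndarray,
                     queue_utt_ids: np.ndarray) -> np.ndarray:
    """Return one positive embedding per anchor drawn from the memory queue."""
    positives = []
    for c, utt in zip(anchor_clusters, anchor_utt_ids):
        # Same (pseudo-)speaker cluster, but a different utterance/recording,
        # so the shared channel shortcut of same-utterance sampling is broken.
        mask = (queue_clusters == c) & (queue_utt_ids != utt)
        idx = np.flatnonzero(mask)
        pick = rng.choice(idx) if idx.size else rng.integers(len(queue_emb))
        positives.append(queue_emb[pick])
    return np.stack(positives)
```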
https://arxiv.org/abs/2505.14561