Recent advances in medical vision-language models guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image-text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image-text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA achieves performance exceeding state-of-the-art approaches, including Rad-DINO.
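The latent-prediction objective can be sketched in a few lines of NumPy. Everything below (linear "encoders", the mean-pooled context, the shapes) is a hypothetical stand-in for the paper's ViT components, not its actual implementation; the point is only to show where the loss lives.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(patches, W):
    """Toy patch encoder: a linear map standing in for a ViT backbone."""
    return patches @ W

# Hypothetical setup: 16 image patches of dim 8, latent dim 4.
patches = rng.normal(size=(16, 8))
W_ctx = rng.normal(size=(8, 4))   # context encoder (trained)
W_tgt = rng.normal(size=(8, 4))   # target encoder (EMA copy in practice)
W_pred = rng.normal(size=(4, 4))  # predictor head

mask = np.zeros(16, dtype=bool)
mask[5:9] = True  # masked region whose latents must be predicted

# Context encoder sees only visible patches; target encoder embeds the masked ones.
z_ctx = encode(patches[~mask], W_ctx)
z_tgt = encode(patches[mask], W_tgt)

# Predictor maps the pooled context to each masked patch's latent.
z_pred = np.repeat(z_ctx.mean(axis=0, keepdims=True), mask.sum(), axis=0) @ W_pred

# The loss lives entirely in latent space: no pixel reconstruction, no text alignment.
jepa_loss = float(np.mean((z_pred - z_tgt) ** 2))
```

In an actual JEPA the target encoder is an exponential moving average of the context encoder and receives no gradient; the toy above only illustrates the shape of the objective.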
https://arxiv.org/abs/2601.15891
This study evaluates AV-HuBERT's perceptual bio-fidelity by benchmarking its response to incongruent audiovisual stimuli (McGurk effect) against human observers (N=44). Results reveal a striking quantitative isomorphism: AI and humans exhibited nearly identical auditory dominance rates (32.0% vs. 31.8%), suggesting the model captures biological thresholds for auditory resistance. However, AV-HuBERT showed a deterministic bias toward phonetic fusion (68.0%), significantly exceeding human rates (47.7%). While humans displayed perceptual stochasticity and diverse error profiles, the model remained strictly categorical. Findings suggest that current self-supervised architectures mimic multisensory outcomes but lack the neural variability inherent to human speech perception.
https://arxiv.org/abs/2601.15869
This work focuses on national-scale land-use/land-cover (LULC) semantic segmentation using ALOS-2 single-polarization (HH) SAR data over Japan, together with a companion binary water detection task. Building on SAR-W-MixMAE self-supervised pretraining [1], we address common SAR dense-prediction failure modes (boundary over-smoothing, missed thin/slender structures, and rare-class degradation under long-tailed labels) without increasing pipeline complexity. We introduce three lightweight refinements: (i) injecting high-resolution features into multi-scale decoding, (ii) a progressive refine-up head that alternates convolutional refinement and stepwise upsampling, and (iii) an $\alpha$-scale factor that tempers class reweighting within a focal+dice objective. The resulting model yields consistent improvements on the Japan-wide ALOS-2 LULC benchmark, particularly for under-represented classes, and improves water detection across standard evaluation metrics.
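The abstract does not give the exact form of the $\alpha$-scale factor. One plausible reading, sketched here as an assumption rather than the paper's formula, is inverse-frequency class weights tempered by an exponent $\alpha$, where $\alpha=1$ recovers full inverse-frequency reweighting and $\alpha=0$ is uniform:

```python
import numpy as np

def tempered_class_weights(class_freq, alpha=0.5):
    """Inverse-frequency weights tempered by an alpha exponent:
    alpha=1 is full inverse-frequency reweighting, alpha=0 is uniform."""
    w = (1.0 / np.asarray(class_freq, dtype=float)) ** alpha
    return w / w.mean()  # normalize so the average weight stays 1

# Hypothetical long-tailed class frequencies (fractions of labeled pixels).
freq = [0.60, 0.25, 0.10, 0.04, 0.01]
w_full = tempered_class_weights(freq, alpha=1.0)      # aggressive reweighting
w_tempered = tempered_class_weights(freq, alpha=0.5)  # tempered reweighting
```

In this toy example the rare-class/common-class weight ratio drops from 60x at $\alpha=1$ to about 7.7x at $\alpha=0.5$, which is the tempering effect: rare classes are still boosted, but not enough to destabilize training.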
https://arxiv.org/abs/2601.15705
Few-shot recognition in synthetic aperture radar (SAR) imagery remains a critical bottleneck for real-world applications due to extreme data scarcity. A promising strategy involves synthesizing a large dataset with a generative adversarial network (GAN), pre-training a model via self-supervised learning (SSL), and then fine-tuning on the few labeled samples. However, this approach faces a fundamental paradox: conventional GANs themselves require abundant data for stable training, contradicting the premise of few-shot learning. To resolve this, we propose the consistency-regularized generative adversarial network (Cr-GAN), a novel framework designed to synthesize diverse, high-fidelity samples even when trained under these severe data limitations. Cr-GAN introduces a dual-branch discriminator that decouples adversarial training from representation learning. This architecture enables a channel-wise feature interpolation strategy to create novel latent features, complemented by a dual-domain cycle consistency mechanism that ensures semantic integrity. Our Cr-GAN framework is adaptable to various GAN architectures, and its synthesized data effectively boosts multiple SSL algorithms. Extensive experiments on the MSTAR and SRSDD datasets validate our approach, with Cr-GAN achieving a highly competitive accuracy of 71.21% and 51.64%, respectively, in the 8-shot setting, significantly outperforming leading baselines, while requiring only ~5% of the parameters of state-of-the-art diffusion models. Code is available at: this https URL.
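Channel-wise feature interpolation, as we read it, mixes the latent features of two samples along the channel axis to create a novel feature. The sketch below (a random Bernoulli channel mask over hypothetical discriminator feature maps) is our illustration of that idea, not the paper's exact operator:

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_mix(feat_a, feat_b, p=0.5, rng=rng):
    """Create a novel latent feature by swapping a random subset of
    channels from feat_b into feat_a (channel-wise interpolation)."""
    mask = rng.random(feat_a.shape[0]) < p  # one Bernoulli draw per channel
    mixed = feat_a.copy()
    mixed[mask] = feat_b[mask]
    return mixed, mask

# Hypothetical discriminator features: 32 channels of an 8x8 spatial map.
fa = rng.normal(size=(32, 8, 8))
fb = rng.normal(size=(32, 8, 8))
mixed, mask = channel_mix(fa, fb)
```

Each mixed feature inherits some channels from each parent, which is what lets the generator see "new" latent combinations without needing more real data.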
https://arxiv.org/abs/2601.15681
Recent advancements in mobile and wireless networks are unlocking the full potential of robotic autonomy, enabling robots to take advantage of ultra-low latency, high data throughput, and ubiquitous connectivity. However, for robots to navigate and operate seamlessly, efficiently, and reliably, they must have an accurate understanding of both their surrounding environment and the quality of radio signals. Achieving this in highly dynamic and ever-changing environments remains a challenging and largely unsolved problem. In this paper, we introduce MapViT, a two-stage Vision Transformer (ViT)-based framework inspired by the success of the pre-train and fine-tune paradigm for Large Language Models (LLMs). MapViT is designed to predict both environmental changes and expected radio signal quality. We evaluate the framework using a set of representative Machine Learning (ML) models, analyzing their respective strengths and limitations across different scenarios. Experimental results demonstrate that the proposed two-stage pipeline enables real-time prediction, with the ViT-based implementation achieving a strong balance between accuracy and computational efficiency. This makes MapViT a promising solution for energy- and resource-constrained platforms such as mobile robots. Moreover, the geometry foundation model derived from the self-supervised pre-training stage improves data efficiency and transferability, enabling effective downstream predictions even with limited labeled data. Overall, this work lays the foundation for next-generation digital twin ecosystems, and it paves the way for a new class of ML foundation models driving multi-modal intelligence in future 6G-enabled systems.
https://arxiv.org/abs/2601.15578
Most existing time series classification methods adopt a discriminative paradigm that maps input sequences directly to one-hot encoded class labels. While effective, this paradigm struggles to incorporate contextual features and fails to capture semantic relationships among classes. To address these limitations, we propose InstructTime, a novel framework that reformulates time series classification as a multimodal generative task. Specifically, continuous numerical sequences, contextual textual features, and task instructions are treated as multimodal inputs, while class labels are generated as textual outputs by tuned language models. To bridge the modality gap, InstructTime introduces a time series discretization module that converts continuous sequences into discrete temporal tokens, together with an alignment projection layer and a generative self-supervised pre-training strategy to enhance cross-modal representation alignment. Building upon this framework, we further propose InstructTime++, which extends InstructTime by incorporating implicit feature modeling to compensate for the limited inductive bias of language models. InstructTime++ leverages specialized toolkits to mine informative implicit patterns from raw time series and contextual inputs, including statistical feature extraction and vision-language-based image captioning, and translates them into textual descriptions for seamless integration. Extensive experiments on multiple benchmark datasets demonstrate the superior performance of InstructTime++.
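A minimal version of such a time series discretization module is uniform amplitude binning, which turns a continuous sequence into a sequence of discrete temporal token IDs that a language model can consume. The bin count and binning scheme below are assumptions for illustration, not InstructTime's actual tokenizer:

```python
import numpy as np

def discretize(series, n_bins=8):
    """Quantize a continuous series into discrete temporal tokens by
    uniformly binning values between its min and max."""
    lo, hi = series.min(), series.max()
    edges = np.linspace(lo, hi, n_bins + 1)
    # Interior edges only; digitize then yields token IDs in [0, n_bins).
    tokens = np.clip(np.digitize(series, edges[1:-1]), 0, n_bins - 1)
    return tokens

# Toy signal: one period of a sine wave, 50 samples.
sig = np.sin(np.linspace(0, 2 * np.pi, 50))
tokens = discretize(sig, n_bins=8)
```

The resulting integer tokens can share an embedding table with text tokens, which is where the alignment projection layer and generative pre-training described above come in.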
https://arxiv.org/abs/2601.14968
Identifying unique polyps in colon capsule endoscopy (CCE) images is a critical yet challenging task for medical personnel due to the large volume of images, the cognitive load it creates for clinicians, and the ambiguity in labeling specific frames. This paper formulates this problem as a multi-instance learning (MIL) task, where a query polyp image is compared with a target bag of images to determine uniqueness. We employ a multi-instance verification (MIV) framework that incorporates attention mechanisms, such as variance-excited multi-head attention (VEMA) and distance-based attention (DBA), to enhance the model's ability to extract meaningful representations. Additionally, we investigate the impact of self-supervised learning using SimCLR to generate robust embeddings. Experimental results on a dataset of 1912 polyps from 754 patients demonstrate that attention mechanisms significantly improve performance, with DBA L1 achieving the highest test accuracy of 86.26% and a test AUC of 0.928 using a ConvNeXt backbone with SimCLR pretraining. This study underscores the potential of MIL and self-supervised learning in advancing automated analysis of Colon Capsule Endoscopy images, with implications for broader medical imaging applications.
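Distance-based attention with an L1 metric can be read as softmax weights over negative L1 distances between the query embedding and each instance in the bag: the closer an instance is to the query, the more it contributes to the pooled bag representation. The sketch below is that reading with toy shapes, not the paper's exact DBA module:

```python
import numpy as np

def l1_distance_attention(query, bag):
    """Attention over a bag of instance embeddings, weighted by negative
    L1 distance to the query: closer instances get more weight."""
    d = np.abs(bag - query).sum(axis=1)   # L1 distance per instance
    w = np.exp(-(d - d.min()))            # softmax over -d, numerically stable
    w = w / w.sum()
    return w @ bag, w                     # attention-pooled bag embedding

rng = np.random.default_rng(0)
query = rng.normal(size=4)
bag = rng.normal(size=(6, 4))
bag[2] = query  # plant an exact match in the bag
pooled, weights = l1_distance_attention(query, bag)
```

Because instance 2 is an exact match (distance zero), it receives the largest attention weight, which is the behavior a verification model needs for deciding whether a query polyp already appears in the bag.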
https://arxiv.org/abs/2601.14771
The requirement for expert annotations limits the effectiveness of deep learning for medical image analysis. Although 3D self-supervised methods like volume contrast learning (VoCo) are powerful and partially address the labeling scarcity issue, their high computational cost and memory consumption are barriers. We propose 2D-VoCo, an efficient adaptation of the VoCo framework for slice-level self-supervised pre-training that learns spatial-semantic features from unlabeled 2D CT slices via contrastive learning. The pre-trained CNN backbone is then integrated into a CNN-LSTM architecture to classify multi-organ injuries. On the RSNA 2023 Abdominal Trauma dataset, 2D-VoCo pre-training significantly improves mAP, precision, recall, and RSNA score over training from scratch. Our framework provides a practical method to reduce the dependency on labeled data and enhance model performance in clinical CT analysis. We release the code for reproducibility: this https URL
https://arxiv.org/abs/2601.14593
Self-supervised learning is increasingly investigated for low-dose computed tomography (LDCT) image denoising, as it alleviates the dependence on paired normal-dose CT (NDCT) data, which are often difficult to acquire in clinical practice. In this paper, we propose a novel self-supervised training strategy that relies exclusively on LDCT images. We introduce a step-wise blind-spot denoising mechanism that enforces conditional independence in a progressive manner, enabling more fine-grained denoising learning. In addition, we add Gaussian noise to LDCT images, which acts as a regularization and mitigates overfitting. Extensive experiments on the Mayo LDCT dataset demonstrate that the proposed method consistently outperforms existing self-supervised approaches and achieves performance comparable to, or better than, several representative supervised denoising methods.
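A single blind-spot step can be sketched as replacing a fraction of pixels with random neighboring values, so that a network trained to restore them cannot simply copy its input. The neighbor-replacement choice and the idea of shrinking the fraction across steps are our assumptions for illustration; the paper's step-wise mechanism is more elaborate:

```python
import numpy as np

rng = np.random.default_rng(0)

def blind_spot_input(img, frac, rng=rng):
    """Hide a fraction of pixels by replacing each with a random neighbor,
    forcing a denoiser to infer the hidden values from context."""
    h, w = img.shape
    n = int(frac * h * w)
    ys = rng.integers(0, h, n)
    xs = rng.integers(0, w, n)
    dy = rng.integers(-1, 2, n)  # neighbor offsets in {-1, 0, 1}
    dx = rng.integers(-1, 2, n)
    out = img.copy()
    out[ys, xs] = img[np.clip(ys + dy, 0, h - 1), np.clip(xs + dx, 0, w - 1)]
    mask = np.zeros_like(img, dtype=bool)
    mask[ys, xs] = True
    return out, mask

img = rng.normal(size=(16, 16))         # stand-in for an LDCT slice
out, mask = blind_spot_input(img, 0.1)  # one step of a progressive schedule
```

A progressive schedule would call this repeatedly with decreasing `frac`, tightening the conditional-independence constraint step by step, which is the "fine-grained" aspect described above.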
https://arxiv.org/abs/2601.14180
The quality of data augmentation serves as a critical determinant for the performance of contrastive learning in EEG tasks. Although this paradigm is promising for utilizing unlabeled data, static or random augmentation strategies often fail to preserve intrinsic information due to the non-stationarity of EEG signals where statistical properties change over time. To address this, we propose RL-BioAug, a framework that leverages a label-efficient reinforcement learning (RL) agent to autonomously determine optimal augmentation policies. While utilizing only a minimal fraction (10%) of labeled data to guide the agent's policy, our method enables the encoder to learn robust representations in a strictly self-supervised manner. Experimental results demonstrate that RL-BioAug significantly outperforms the random selection strategy, achieving substantial improvements of 9.69% and 8.80% in Macro-F1 score on the Sleep-EDFX and CHB-MIT datasets, respectively. Notably, this agent mainly chose optimal strategies for each task -- for example, Time Masking with a 62% probability for sleep stage classification and Crop & Resize with a 77% probability for seizure detection. Our framework suggests its potential to replace conventional heuristic-based augmentations and establish a new autonomous paradigm for data augmentation. The source code is available at this https URL.
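The abstract does not specify the agent's architecture, so as a toy illustration of the idea, the sketch below uses an epsilon-greedy bandit that scores each augmentation by a reward (e.g. probe accuracy on the small labeled split) and learns to favor the best one. The augmentation names and simulated rewards are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

AUGMENTATIONS = ["time_mask", "crop_resize", "jitter", "permute"]

class EpsGreedyAugAgent:
    """Toy stand-in for the RL agent: an epsilon-greedy bandit keeping a
    running mean reward per augmentation arm."""

    def __init__(self, n_arms, eps=0.1):
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
        self.eps = eps

    def select(self, rng=rng):
        if rng.random() < self.eps:            # explore
            return int(rng.integers(len(self.values)))
        return int(np.argmax(self.values))     # exploit best-so-far

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

agent = EpsGreedyAugAgent(len(AUGMENTATIONS))
# Simulated training: arm 0 ("time_mask") yields the highest reward on average.
for _ in range(500):
    arm = agent.select()
    reward = rng.normal(loc=0.8 if arm == 0 else 0.5, scale=0.05)
    agent.update(arm, reward)
```

After training, the agent's value estimates concentrate on the most useful augmentation, mirroring the task-dependent preferences (Time Masking for sleep staging, Crop & Resize for seizure detection) reported above.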
https://arxiv.org/abs/2601.13964
Current visual representation learning remains bifurcated: vision-language models (e.g., CLIP) excel at global semantic alignment but lack spatial precision, while self-supervised methods (e.g., MAE, DINO) capture intricate local structures yet struggle with high-level semantic context. We argue that these paradigms are fundamentally complementary and can be integrated into a principled multi-task framework, further enhanced by dense spatial supervision. We introduce MTV, a multi-task visual pretraining framework that jointly optimizes a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives. To mitigate the need for manual annotations, we leverage high-capacity "expert" models -- such as Depth Anything V2 and OWLv2 -- to synthesize dense, structured pseudo-labels at scale. Beyond the framework, we provide a systematic investigation into the mechanics of multi-task visual learning, analyzing: (i) the marginal gain of each objective, (ii) task synergies versus interference, and (iii) scaling behavior across varying data and model scales. Our results demonstrate that MTV achieves "best-of-both-worlds" performance, significantly enhancing fine-grained spatial reasoning without compromising global semantic understanding. Our findings suggest that multi-task learning, fueled by high-quality pseudo-supervision, is a scalable path toward more general visual encoders.
https://arxiv.org/abs/2601.13886
The speckle noise inherent in Synthetic Aperture Radar (SAR) imagery significantly degrades image quality and complicates subsequent analysis. Given that SAR speckle is multiplicative and Gamma-distributed, effectively despeckling SAR imagery remains challenging. This paper introduces a novel self-supervised framework for SAR image despeckling based on score-based generative models operating in the transformed log domain. We first transform the data into the log-domain and then convert the speckle noise residuals into an approximately additive Gaussian distribution. This step enables the application of score-based models, which are trained in the transformed domain using a self-supervised objective. This objective allows our model to learn the clean underlying signal by training on further corrupted versions of the input data itself. Consequently, our method exhibits significantly shorter inference times compared to many existing self-supervised techniques, offering a robust and practical solution for SAR image restoration.
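The key transform is easy to demonstrate numerically: unit-mean Gamma speckle is multiplicative in intensity but exactly additive in the log domain, i.e. $\log y = \log x + \log s$. The number of looks below is an assumed value for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Multiplicative Gamma speckle with unit mean: y = x * s, s ~ Gamma(L, 1/L).
L = 4                           # number of looks (an assumed value)
clean = np.full(100_000, 5.0)   # constant clean intensity
speckle = rng.gamma(shape=L, scale=1.0 / L, size=clean.size)
noisy = clean * speckle

# In the log domain the corruption becomes additive and signal-independent:
#   log y = log x + log s
log_noise = np.log(noisy) - np.log(clean)
```

Note that $\log s$ is only approximately Gaussian and has a small negative mean, $\psi(L) - \log L$ (about $-0.13$ for $L=4$); homomorphic despeckling pipelines typically account for this bias when mapping back to the intensity domain.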
https://arxiv.org/abs/2601.14334
Fluid turn-taking remains a key challenge in human-robot interaction. Self-supervised speech representations (S3Rs) have driven many advances, but it remains unclear whether S3R-based turn-taking models rely on prosodic cues, lexical cues, or both. We introduce a vocoder-based approach to control prosody and lexical cues in speech more cleanly than prior work. This allows us to probe the voice-activity projection model, an S3R-based turn-taking model. We find that prediction accuracy on prosody-matched, unintelligible noise is similar to that on clean speech. This reveals that both prosodic and lexical cues support turn-taking, but either can be used in isolation. Hence, future models may only require prosody, providing privacy and potential performance benefits. When either prosodic or lexical information is disrupted, the model exploits the other without further training, indicating they are encoded in S3Rs with limited interdependence. Results are consistent in CPC-based and wav2vec2.0 S3Rs. We discuss our findings and highlight a number of directions for future work. All code is available to support future research.
https://arxiv.org/abs/2601.13835
With the advancement of self-supervised learning (SSL), fine-tuning pretrained SSL models for mean opinion score (MOS) prediction has achieved state-of-the-art performance. However, during fine-tuning, these SSL-based MOS prediction models often suffer from catastrophic forgetting of the pretrained knowledge and tend to overfit the training set, resulting in poor generalization performance. In this study, we propose DistilMOS, a novel method that learns to predict not only MOS but also token IDs obtained by clustering the hidden representations of each layer in the pretrained SSL model. These layer-wise token targets serve as self-distillation signals that enable the MOS prediction model to extract rich internal knowledge from SSL models, enhancing both prediction accuracy and generalization capability. Experimental evaluations demonstrate that our method significantly outperforms standard SSL-based MOS prediction models on both in-domain and out-of-domain evaluations, verifying the effectiveness and practicality of the proposed method.
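The layer-wise token targets amount to a k-means assignment: each frame's hidden vector gets the ID of its nearest cluster centroid. The sketch below shows only that assignment step, with random data standing in for SSL layer activations and an assumed codebook size:

```python
import numpy as np

rng = np.random.default_rng(0)

def assign_token_ids(hidden, centroids):
    """Token IDs = index of the nearest centroid for each frame's hidden
    vector (the k-means assignment step used to build distillation targets)."""
    d = ((hidden[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

# Hypothetical activations from one SSL layer: 200 frames, 16-dim.
hidden = rng.normal(size=(200, 16))
idx = rng.choice(200, size=8, replace=False)
centroids = hidden[idx]  # 8 "token" clusters, seeded from data points
token_ids = assign_token_ids(hidden, centroids)
```

During fine-tuning the MOS model would predict these IDs per layer alongside the MOS score, so that the pretrained layer structure is distilled back into the model rather than forgotten.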
https://arxiv.org/abs/2601.13700
Diffusion models have emerged as state-of-the-art generative methods for image synthesis, yet their potential as general-purpose feature encoders remains underexplored. Trained for denoising and generation without labels, they can be interpreted as self-supervised learners that capture both low- and high-level structure. We show that a frozen diffusion backbone enables strong fine-grained recognition by probing intermediate denoising features across layers and timesteps and training a linear classifier for each pair. We evaluate this in a real-world plankton-monitoring setting with practical impact, using controlled and comparable training setups against established supervised and self-supervised baselines. Frozen diffusion features are competitive with supervised baselines and outperform other self-supervised methods in both balanced and naturally long-tailed settings. Out-of-distribution evaluations on temporally and geographically shifted plankton datasets further show that frozen diffusion features maintain strong accuracy and Macro F1 under substantial distribution shift.
https://arxiv.org/abs/2601.13416
Human-centric visual analysis plays a pivotal role in diverse applications, including surveillance, healthcare, and human-computer interaction. With the emergence of large-scale unlabeled human image datasets, there is an increasing need for a general unsupervised pre-training model capable of supporting diverse human-centric downstream tasks. To achieve this goal, we propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework designed for unsupervised pre-training in human-centric visual tasks. CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels. These multi-level semantic cues are then integrated into the learned visual representations, enriching their expressiveness and generalizability. Recognizing that different downstream tasks demand varying levels of semantic granularity, CLASP incorporates a Prompt-Controlled Mixture-of-Experts (MoE) module. MoE dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability. Furthermore, CLASP employs a multi-task pre-training strategy, where part- and attribute-level pseudo-labels derived from CLIP guide the representation learning process. Extensive experiments across multiple benchmarks demonstrate that CLASP consistently outperforms existing unsupervised pre-training methods, advancing the field of human-centric visual analysis.
https://arxiv.org/abs/2601.13133
Self-supervised pretraining in remote sensing is mostly done using mid-spatial resolution (MR) image datasets due to their high availability. Given the release of high-resolution (HR) datasets, we ask how HR datasets can be included in self-supervised pretraining to enhance MR image representation learning and downstream segmentation performance on MR tasks. We design a spatial affinity component that can be added to existing self-supervised learning frameworks and that uses HR imagery to learn better representations of MR imagery. We test the spatial affinity component on two self-supervised learning frameworks and show that it outperforms models pretrained on HR or MR images alone.
https://arxiv.org/abs/2601.12964
The pre-trained transformer demonstrates remarkable generalization ability in natural image processing. However, directly transferring it to magnetic resonance images faces two key challenges: the inability to adapt to the specificity of medical anatomical structures and the limitations brought about by the privacy and scarcity of medical data. To address these issues, this paper proposes a Self-Supervised Pretrained Transformer (SSPFormer) for MRI images, which effectively learns domain-specific feature representations of medical images by leveraging unlabeled raw imaging data. To tackle the domain gap and data scarcity, we introduce inverse frequency projection masking, which prioritizes the reconstruction of high-frequency anatomical regions to enforce structure-aware representation learning. Simultaneously, to enhance robustness against real-world MRI artifacts, we employ frequency-weighted FFT noise enhancement that injects physiologically realistic noise into the Fourier domain. Together, these strategies enable the model to learn domain-invariant and artifact-robust features directly from raw scans. Through extensive experiments on segmentation, super-resolution, and denoising tasks, the proposed SSPFormer achieves state-of-the-art performance, fully verifying its ability to capture fine-grained MRI image fidelity and adapt to clinical application requirements.
https://arxiv.org/abs/2601.12747
Anomaly detection of multi-temporal modal data in Wireless Sensor Networks (WSNs) can provide an important guarantee for reliable network operation. Existing anomaly detection methods for multi-temporal modal data suffer from insufficient extraction of spatio-temporal correlation features, the high cost of annotating anomaly sample categories, and imbalanced anomaly samples. In this paper, a graph neural network anomaly detection backbone incorporating spatio-temporal correlation features and a multi-task self-supervised training strategy of "pre-training - graph prompting - fine-tuning" are designed for the characteristics of WSN graph-structured data. First, the anomaly detection backbone is built by improving the Mamba model with a multi-scale strategy and an inter-modal fusion method and combining it with a variational graph convolution module, enabling it to fully extract spatio-temporal correlation features in the multi-node, multi-temporal modal scenarios of WSNs. Second, we design a three-subtask "pre-training" method, combining negative-free contrastive learning, prediction, and reconstruction, to learn generic features of WSN data samples from unlabeled data, together with a "graph prompting - fine-tuning" mechanism that guides the pre-trained self-supervised model through parameter fine-tuning, reducing training cost and enhancing detection generalization. The F1 scores obtained on a public dataset and an actually collected dataset reach 91.30% and 92.31%, respectively, providing better detection performance and generalization ability than existing methods.
https://arxiv.org/abs/2601.12745
Contrastive language-audio pretraining (CLAP) has achieved notable success in learning semantically rich audio representations and is widely adopted for various audio-related tasks. However, current CLAP models face several key limitations. First, they are typically trained on relatively small datasets, often comprising a few million audio samples. Second, existing CLAP models are restricted to short, fixed durations, which constrains their usage in real-world scenarios with variable-duration audio. Third, the standard contrastive training objective operates on global representations, which may hinder the learning of dense, fine-grained audio features. To address these challenges, we introduce Scalable Language-Audio Pretraining (SLAP), which scales language-audio pretraining to 109 million audio-text pairs with variable audio durations and incorporates multiple training objectives. SLAP unifies contrastive loss with additional self-supervised and captioning losses in a single-stage training, facilitating the learning of richer dense audio representations. The proposed SLAP model achieves new state-of-the-art performance on audio-text retrieval and zero-shot audio classification tasks, demonstrating its effectiveness across diverse benchmarks.
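The symmetric contrastive (InfoNCE, CLIP-style) objective that SLAP builds on can be written compactly: matching audio-text pairs sit on the diagonal of a cosine-similarity matrix, and the loss pulls them above all mismatched pairs in both directions. Batch size, embedding dimension, and temperature below are illustrative values, not SLAP's:

```python
import numpy as np

def clip_style_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/text embeddings."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature          # cosine similarities / temperature
    # Audio-to-text: each audio's matching text is on the diagonal.
    logp_a = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Text-to-audio: same matrix, transposed.
    logp_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return 0.5 * (-np.mean(np.diag(logp_a)) - np.mean(np.diag(logp_t)))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
perfect = clip_style_loss(emb, emb)                       # aligned pairs
mismatched = clip_style_loss(emb, np.roll(emb, 1, axis=0))  # pairs shifted off-diagonal
```

Because this loss only compares global pooled embeddings, it says nothing about which frames carry which semantics, which is the motivation for SLAP's additional self-supervised and captioning losses on dense features.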
https://arxiv.org/abs/2601.12594