We propose a novel approach for optimizing the graph ratio-cut by modeling the binary assignments as random variables. We provide an upper bound on the expected ratio-cut, as well as an unbiased estimate of its gradient, to learn the parameters of the assignment variables in an online setting. The clustering resulting from our probabilistic approach (PRCut) outperforms the Rayleigh quotient relaxation of the combinatorial problem, its online learning extensions, and several widely used methods. We demonstrate that the PRCut clustering closely aligns with the similarity measure and can perform as well as a supervised classifier when label-based similarities are provided. This novel approach can leverage out-of-the-box self-supervised representations to achieve competitive performance and serve as an evaluation method for the quality of these representations.
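A minimal sketch of an expected ratio-cut objective under independent soft assignments (an illustrative surrogate of our own that optimizes the ratio of expectations; the paper's actual upper bound and unbiased gradient estimator may differ):

```python
import torch

def expected_ratio_cut(W, logits):
    """Deterministic surrogate for the expected ratio-cut under independent
    categorical assignments (illustrative; not the paper's exact bound).

    W      : (n, n) symmetric non-negative similarity matrix
    logits : (n, k) unnormalized assignment scores
    """
    P = torch.softmax(logits, dim=1)            # assignment probabilities
    # E[cut(C_k)] = sum_ij W_ij * P_ik * (1 - P_jk) under independence
    cut = ((W @ P) * (1.0 - P)).sum(dim=0)      # (k,)
    size = P.sum(dim=0).clamp_min(1e-8)         # E[|C_k|]
    return (cut / size).sum()

# toy usage: learn soft assignments with gradient descent
n, k = 200, 4
X = torch.randn(n, 8)
W = torch.exp(-torch.cdist(X, X) ** 2)          # RBF similarity graph
logits = torch.zeros(n, k, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.05)
for _ in range(100):
    opt.zero_grad()
    expected_ratio_cut(W, logits).backward()
    opt.step()
```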
https://arxiv.org/abs/2502.03405
We introduce Metis, a foundation model for unified speech generation. Unlike previous task-specific or multi-task models, Metis follows a pre-training and fine-tuning paradigm. It is pre-trained on large-scale unlabeled speech data using masked generative modeling and then fine-tuned to adapt to diverse speech generation tasks. Specifically, 1) Metis utilizes two discrete speech representations: SSL tokens derived from speech self-supervised learning (SSL) features, and acoustic tokens directly quantized from waveforms. 2) Metis performs masked generative pre-training on SSL tokens, utilizing 300K hours of diverse speech data, without any additional conditions. 3) Through fine-tuning with task-specific conditions, Metis achieves efficient adaptation to various speech generation tasks while supporting multimodal input, even when using limited data and trainable parameters. Experiments demonstrate that Metis can serve as a foundation model for unified speech generation: Metis outperforms state-of-the-art task-specific or multi-task systems across five speech generation tasks, including zero-shot text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech, even with fewer than 20M trainable parameters or 300 times less training data. Audio samples are available at this https URL.
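For intuition, here is a rough sketch of masked generative pre-training on discrete SSL tokens; the architecture sizes, mask schedule, and [MASK]-token handling are assumptions for illustration, not Metis' actual configuration:

```python
import torch
import torch.nn as nn

class MaskedTokenPretrainer(nn.Module):
    """Sketch of masked generative pre-training on discrete SSL tokens."""

    def __init__(self, vocab_size=1024, dim=256, n_layers=4, n_heads=4, max_len=2048):
        super().__init__()
        self.mask_id = vocab_size                        # extra [MASK] token id
        self.embed = nn.Embedding(vocab_size + 1, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens, mask_ratio=0.5):
        # tokens: (B, T) integer SSL token ids
        mask = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
        corrupted = tokens.masked_fill(mask, self.mask_id)
        hidden = self.encoder(self.embed(corrupted) + self.pos[:, :tokens.size(1)])
        logits = self.head(hidden)                       # (B, T, vocab)
        # cross-entropy only on the masked positions
        return nn.functional.cross_entropy(logits[mask], tokens[mask])

# toy usage
model = MaskedTokenPretrainer()
loss = model(torch.randint(0, 1024, (2, 100)))
loss.backward()
```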
https://arxiv.org/abs/2502.03128
Semantic information refers to the meaning conveyed through words, phrases, and contextual relationships within a given linguistic structure. Humans can leverage semantic information, such as familiar linguistic patterns and contextual cues, to reconstruct incomplete or masked speech signals in noisy environments. However, existing speech enhancement (SE) approaches often overlook the rich semantic information embedded in speech, which is crucial for improving intelligibility, speaker consistency, and overall quality of enhanced speech signals. To enrich the SE model with semantic information, we employ language models as efficient semantic learners and propose a comprehensive framework tailored for language model-based speech enhancement, called \textit{GenSE}. Specifically, we approach SE as a conditional language modeling task rather than the continuous signal regression problem defined in existing works. This is achieved by tokenizing speech signals into semantic tokens using a pre-trained self-supervised model and into acoustic tokens using a custom-designed single-quantizer neural codec model. To improve the stability of language model predictions, we propose a hierarchical modeling method that decouples the generation of clean semantic tokens and clean acoustic tokens into two distinct stages. Moreover, we introduce a token chain prompting mechanism during the acoustic token generation stage to ensure timbre consistency throughout the speech enhancement process. Experimental results on benchmark datasets demonstrate that our proposed approach outperforms state-of-the-art SE systems in terms of speech quality and generalization capability.
https://arxiv.org/abs/2502.02942
Head computed tomography (CT) imaging is a widely used imaging modality with multitudes of medical indications, particularly in assessing pathology of the brain, skull, and cerebrovascular system. It is commonly the first-line imaging in neurologic emergencies given its rapidity of image acquisition, safety, cost, and ubiquity. Deep learning models may facilitate detection of a wide range of diseases. However, the scarcity of high-quality labels and annotations, particularly among less common conditions, significantly hinders the development of powerful models. To address this challenge, we introduce FM-CT: a Foundation Model for Head CT, trained using self-supervised learning for generalizable disease detection. Our approach pre-trains a deep learning model on a large, diverse dataset of 361,663 non-contrast 3D head CT scans without the need for manual annotations, enabling the model to learn robust, generalizable features. To investigate the potential of self-supervised learning in head CT, we employ both discrimination with self-distillation and masked image modeling, and we construct our model in 3D rather than at the slice level (2D) to exploit the structure of head CT scans more comprehensively and efficiently. The model's downstream classification performance is evaluated using an internal dataset and three external datasets, encompassing both in-distribution (ID) and out-of-distribution (OOD) data. Our results demonstrate that the self-supervised foundation model significantly improves performance on downstream diagnostic tasks compared to models trained from scratch and previous 3D CT foundation models on scarce annotated datasets. This work highlights the effectiveness of self-supervised learning in medical imaging and sets a new benchmark for head CT image analysis in 3D, enabling broader use of artificial intelligence for head CT-based diagnosis.
https://arxiv.org/abs/2502.02779
Foundational models have emerged as a powerful paradigm in the deep learning field, leveraging their capacity to learn robust representations from large-scale datasets and to transfer effectively to diverse downstream applications such as classification. In this paper, we present Astromer 2, a foundational model specifically designed for extracting light curve embeddings, as an enhanced iteration of our self-supervised model for light curve analysis. This paper highlights the advantages of its pre-trained embeddings, compares its performance with that of its predecessor, Astromer 1, and provides a detailed empirical analysis of its capabilities, offering deeper insights into the model's representations. Astromer 2 is pretrained on 1.5 million single-band light curves from the MACHO survey using a self-supervised learning task that predicts randomly masked observations within sequences. Fine-tuning on a smaller labeled dataset allows us to assess its performance in classification tasks. The quality of the embeddings is measured by the F1 score of an MLP classifier trained on Astromer-generated embeddings. Our results demonstrate that Astromer 2 significantly outperforms Astromer 1 across all evaluated scenarios, including limited datasets of 20, 100, and 500 samples per class. The use of weighted per-sample embeddings, which integrate intermediate representations from Astromer's attention blocks, is particularly impactful. Notably, Astromer 2 achieves a 15% improvement in F1 score on the ATLAS dataset compared to prior models, showcasing robust generalization to new datasets. This enhanced performance, especially with minimal labeled data, underscores the potential of Astromer 2 for more efficient and scalable light curve analysis.
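A small sketch of how "weighted per-sample embeddings" could combine intermediate attention-block outputs via learned layer weights; the mixing scheme and classifier head below are our assumptions, not Astromer 2's exact design:

```python
import torch
import torch.nn as nn

class WeightedLayerEmbedding(nn.Module):
    """Combine per-block representations with learned softmax weights,
    then mean-pool over observed epochs and classify."""

    def __init__(self, n_layers, dim, n_classes):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(n_layers))
        self.classifier = nn.Sequential(
            nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, n_classes))

    def forward(self, hidden_states, obs_mask):
        # hidden_states: (L, B, T, D) outputs of each attention block
        # obs_mask:      (B, T) 1 for observed epochs, 0 for padding
        w = torch.softmax(self.layer_logits, dim=0).view(-1, 1, 1, 1)
        mixed = (w * hidden_states).sum(dim=0)                     # (B, T, D)
        denom = obs_mask.sum(dim=1, keepdim=True).clamp_min(1.0)
        pooled = (mixed * obs_mask.unsqueeze(-1)).sum(dim=1) / denom
        return self.classifier(pooled)

# toy usage: 3 blocks, 16 light curves, 200 epochs, 64-d embeddings, 5 classes
h = torch.randn(3, 16, 200, 64)
m = torch.ones(16, 200)
logits = WeightedLayerEmbedding(3, 64, 5)(h, m)
```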
https://arxiv.org/abs/2502.02717
Effective self-supervised learning (SSL) techniques have been key to unlocking large datasets for representation learning. While many promising methods have been developed using online corpora and captioned photographs, their application to scientific domains, where data encodes highly specialized knowledge, remains in its early stages. We present a self-supervised masked modeling framework for 3D particle trajectory analysis in Time Projection Chambers (TPCs). These detectors produce globally sparse (<1% occupancy) but locally dense point clouds, capturing meter-scale particle trajectories at millimeter resolution. Starting with PointMAE, this work proposes volumetric tokenization to group sparse ionization points into resolution-agnostic patches, as well as an auxiliary energy infilling task to improve trajectory semantics. This approach -- which we call Point-based Liquid Argon Masked Autoencoder (PoLAr-MAE) -- achieves 99.4% track and 97.7% shower classification F-scores, matching those of supervised baselines without any labeled data. While the model learns rich particle trajectory representations, it struggles with sub-token phenomena like overlapping or short-lived particle trajectories. To support further research, we release PILArNet-M -- the largest open LArTPC dataset (1M+ events, 5.2B labeled points) -- to advance SSL in high energy physics (HEP). Project site: this https URL
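A toy sketch of volumetric tokenization, grouping sparse ionization points into per-voxel patches; the voxel size, padding scheme, and (x, y, z, energy) feature layout are illustrative assumptions:

```python
import torch

def volumetric_tokenize(points, energies, voxel_size=16.0, max_pts=32):
    """Group sparse 3D points into fixed-size voxel patches.

    points   : (N, 3) coordinates in detector units
    energies : (N,)   deposited energy per point
    returns  : (V, max_pts, 4) padded patches and a (V, max_pts) validity mask
    """
    vox = torch.floor(points / voxel_size).long()                  # voxel indices
    uniq, inverse = torch.unique(vox, dim=0, return_inverse=True)
    feats = torch.cat([points, energies.unsqueeze(1)], dim=1)      # (N, 4)
    patches = torch.zeros(len(uniq), max_pts, 4)
    mask = torch.zeros(len(uniq), max_pts, dtype=torch.bool)
    for v in range(len(uniq)):
        idx = (inverse == v).nonzero(as_tuple=True)[0][:max_pts]
        patches[v, :len(idx)] = feats[idx]
        mask[v, :len(idx)] = True
    return patches, mask

# toy usage: sparse points in a ~1 m^3 volume at mm-scale coordinates
pts = torch.rand(5000, 3) * 1000.0
patches, mask = volumetric_tokenize(pts, torch.rand(5000))
```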
https://arxiv.org/abs/2502.02558
Ultrasound (US) imaging is clinically invaluable due to its noninvasive and safe nature. However, interpreting US images is challenging, requires significant expertise and time, and is often prone to errors. Deep learning offers assistive solutions such as segmentation. Supervised methods rely on large, high-quality, and consistently labeled datasets, which are challenging to curate. Moreover, these methods tend to underperform on out-of-distribution data, limiting their clinical utility. Self-supervised learning (SSL) has emerged as a promising alternative, leveraging unlabeled data to enhance model performance and generalisability. We introduce a contrastive SSL approach tailored for B-mode US images, incorporating a novel Relation Contrastive Loss (RCL). RCL encourages learning of distinct features by differentiating positive and negative sample pairs through a learnable metric. Additionally, we propose spatial and frequency-based augmentation strategies for representation learning on US images. Our approach significantly outperforms traditional supervised segmentation methods across three public breast US datasets, particularly in data-limited scenarios. Notable improvements on the Dice similarity metric include a 4% increase on 20% and 50% of the BUSI dataset, nearly 6% and 9% improvements on 20% and 50% of the BrEaST dataset, and 6.4% and 3.7% improvements on 20% and 50% of the UDIAT dataset, respectively. Furthermore, we demonstrate superior generalisability on the out-of-distribution UDIAT dataset with performance boosts of 20.6% and 13.6% compared to the supervised baseline using 20% and 50% of the BUSI and BrEaST training data, respectively. Our research highlights that domain-inspired SSL can improve US segmentation, especially under data-limited conditions.
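A hedged sketch of a contrastive loss whose pair similarity comes from a small learnable relation network rather than cosine similarity; the relation module and the InfoNCE-style formulation are our assumptions and may differ from the paper's exact RCL:

```python
import torch
import torch.nn as nn

class RelationContrastiveLoss(nn.Module):
    """Contrastive loss with a learnable pair metric (illustrative)."""

    def __init__(self, dim, temperature=0.1):
        super().__init__()
        self.relation = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.t = temperature

    def forward(self, z1, z2):
        # z1, z2: (B, D) embeddings of two augmented views of the same batch
        B, D = z1.shape
        pairs = torch.cat([z1.unsqueeze(1).expand(B, B, D),
                           z2.unsqueeze(0).expand(B, B, D)], dim=-1)
        scores = self.relation(pairs).squeeze(-1) / self.t   # (B, B) relation scores
        targets = torch.arange(B, device=z1.device)          # positives on the diagonal
        return nn.functional.cross_entropy(scores, targets)

# toy usage
loss = RelationContrastiveLoss(dim=128)(torch.randn(32, 128), torch.randn(32, 128))
loss.backward()
```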
https://arxiv.org/abs/2502.02489
4D medical image interpolation is essential for improving temporal resolution and diagnostic precision in clinical applications. Previous works ignore the problem of distribution shifts, resulting in poor generalization under different distributions. A natural solution would be to adapt the model to a new test distribution, but this cannot be done if the test input comes without a ground truth label. In this paper, we propose a novel test-time training framework which uses self-supervision to adapt the model to a new distribution without requiring any labels. Indeed, before performing frame interpolation on each test video, the model is trained on the same instance using a self-supervised task, such as rotation prediction or image reconstruction. We conduct experiments on two publicly available 4D medical image interpolation datasets, Cardiac and 4D-Lung. The experimental results show that the proposed method achieves strong performance across various evaluation metrics on both datasets, reaching higher peak signal-to-noise ratio values of 33.73 dB on Cardiac and 34.02 dB on 4D-Lung. Our method not only advances 4D medical image interpolation but also provides a template for domain adaptation in other fields such as image segmentation and image registration.
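A compact sketch of the test-time training loop described above: the shared encoder is adapted on a single test video with rotation prediction before interpolation (module shapes, step count, and learning rate are placeholders):

```python
import copy
import torch
import torch.nn as nn

def test_time_adapt(encoder, rot_head, frames, steps=10, lr=1e-4):
    """Adapt a copy of the encoder on one test instance via rotation prediction.

    encoder  : shared feature extractor (also feeding the interpolation head)
    rot_head : small classifier predicting one of 4 rotations
    frames   : (T, C, H, W) frames of the test video
    """
    adapted = copy.deepcopy(encoder)                 # keep source weights intact
    params = list(adapted.parameters()) + list(rot_head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        k = torch.randint(0, 4, (frames.size(0),))   # random 90-degree rotations
        rotated = torch.stack([torch.rot90(f, int(r), dims=(-2, -1))
                               for f, r in zip(frames, k)])
        loss = nn.functional.cross_entropy(rot_head(adapted(rotated)), k)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adapted                                   # use this encoder for interpolation

# toy usage with placeholder modules
enc = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten())
adapted = test_time_adapt(enc, nn.Linear(8, 4), torch.randn(6, 1, 64, 64))
```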
https://arxiv.org/abs/2502.02341
Despite decades of research on data collection and model architectures, current gaze estimation models face significant challenges in generalizing across diverse data domains. While recent advances in self-supervised pre-training have shown remarkable potential for improving model generalization in various vision tasks, their effectiveness in gaze estimation remains unexplored due to the geometric nature of the gaze regression task. We propose UniGaze, which leverages large-scale, in-the-wild facial datasets through self-supervised pre-training for gaze estimation. We carefully curate multiple facial datasets that capture diverse variations in identity, lighting, background, and head poses. By directly applying Masked Autoencoder (MAE) pre-training on normalized face images with a Vision Transformer (ViT) backbone, our UniGaze learns appropriate feature representations within the specific input space required by downstream gaze estimation models. Through comprehensive experiments using challenging cross-dataset evaluation and novel protocols, including leave-one-dataset-out and joint-dataset settings, we demonstrate that UniGaze significantly improves generalization across multiple data domains while minimizing reliance on costly labeled data. The source code and pre-trained models will be released upon acceptance.
https://arxiv.org/abs/2502.02307
Self-supervised learning has been a powerful approach for learning meaningful representations from unlabeled data across various domains, reducing the reliance on large labeled datasets. Inspired by BERT's success in capturing deep bidirectional contexts in natural language processing, similar frameworks have been adapted to other modalities such as audio, with models like BEATs extending the bidirectional training paradigm to audio signals using vector quantization (VQ). However, these frameworks face challenges, notably their dependence on a single codebook for quantization, which may not capture the complex, multifaceted nature of signals. In addition, inefficiencies in codebook utilization lead to underutilized code vectors. To address these limitations, we introduce BRIDLE (Bidirectional Residual Quantization Interleaved Discrete Learning Encoder), a self-supervised encoder pretraining framework that incorporates residual quantization (RQ) into the bidirectional training process, and is generalized for pretraining with audio, image, and video. Using multiple hierarchical codebooks, RQ enables fine-grained discretization in the latent space, enhancing representation quality. BRIDLE involves an interleaved training procedure between the encoder and tokenizer. We evaluate BRIDLE on audio understanding tasks using classification benchmarks, achieving state-of-the-art results, and demonstrate competitive performance on image classification and video classification tasks, showing consistent improvements over traditional VQ methods in downstream performance.
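A minimal residual quantization sketch: each stage quantizes what the previous codebooks could not explain. Straight-through gradients and commitment losses are omitted, and the codebook sizes are illustrative:

```python
import torch
import torch.nn as nn

class ResidualQuantizer(nn.Module):
    """Hierarchical residual quantization over multiple codebooks (sketch)."""

    def __init__(self, dim=64, codebook_size=512, n_stages=4):
        super().__init__()
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, dim)) for _ in range(n_stages)])

    def forward(self, x):
        # x: (B, T, D) continuous encoder features
        residual, quantized, codes = x, torch.zeros_like(x), []
        for cb in self.codebooks:
            dist = torch.cdist(residual, cb.unsqueeze(0).expand(x.size(0), -1, -1))
            idx = dist.argmin(dim=-1)                    # (B, T) nearest code per stage
            chosen = cb[idx]                             # (B, T, D)
            quantized = quantized + chosen
            residual = residual - chosen                 # leftover goes to the next codebook
            codes.append(idx)
        return quantized, torch.stack(codes, dim=-1)     # codes: (B, T, n_stages)

# toy usage
q, codes = ResidualQuantizer()(torch.randn(2, 50, 64))
```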
https://arxiv.org/abs/2502.02118
Navigating densely vegetated environments poses significant challenges for autonomous ground vehicles. Learning-based systems typically use prior and in-situ data to predict terrain traversability but often degrade in performance when encountering out-of-distribution elements caused by rapid environmental changes or novel conditions. This paper presents a novel, lidar-only, online adaptive traversability estimation (TE) method that trains a model directly on the robot using self-supervised data collected through robot-environment interaction. The proposed approach utilises a probabilistic 3D voxel representation to integrate lidar measurements and robot experience, creating a salient environmental model. To ensure computational efficiency, a sparse graph-based representation is employed to update temporally evolving voxel distributions. Extensive experiments with an unmanned ground vehicle in natural terrain demonstrate that the system adapts to complex environments with as little as 8 minutes of operational data, achieving a Matthews Correlation Coefficient (MCC) score of 0.63 and enabling safe navigation in densely vegetated environments. This work examines different training strategies for voxel-based TE methods and offers recommendations for improving adaptability. The proposed method is validated on a robotic platform with limited computational resources (25W GPU), achieving accuracy comparable to offline-trained models while maintaining reliable performance across varied environments.
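As a loose illustration of a probabilistic per-voxel model built from robot experience, the sketch below keeps a Beta-Bernoulli traversability estimate per voxel; this is our own simplification, not the paper's voxel representation or sparse graph update:

```python
from collections import defaultdict

class VoxelTraversabilityMap:
    """Per-voxel Beta-Bernoulli traversability estimate (illustrative)."""

    def __init__(self, voxel_size=0.2, prior=(1.0, 1.0)):
        self.voxel_size = voxel_size
        self.counts = defaultdict(lambda: list(prior))   # voxel -> [success, failure]

    def _key(self, xyz):
        return tuple(int(c // self.voxel_size) for c in xyz)

    def update(self, xyz, traversed_ok):
        # traversed_ok: 1 if the robot crossed the voxel safely, else 0
        a, b = self.counts[self._key(xyz)]
        self.counts[self._key(xyz)] = [a + traversed_ok, b + (1 - traversed_ok)]

    def traversability(self, xyz):
        a, b = self.counts[self._key(xyz)]
        return a / (a + b)                               # posterior mean

# toy usage: label voxels from the robot's own footprint experience
m = VoxelTraversabilityMap()
m.update((1.0, 2.0, 0.1), 1)
m.update((1.0, 2.0, 0.1), 1)
print(round(m.traversability((1.0, 2.0, 0.1)), 2))       # 0.75 with a (1, 1) prior
```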
https://arxiv.org/abs/2502.01987
While traditional self-supervised learning methods improve performance and robustness across various medical tasks, they rely on single-vector embeddings that may not capture fine-grained concepts such as anatomical structures or organs. The ability to identify such concepts and their characteristics without supervision has the potential to improve pre-training methods, and enable novel applications such as fine-grained image retrieval and concept-based outlier detection. In this paper, we introduce ConceptVAE, a novel pre-training framework that detects and disentangles fine-grained concepts from their style characteristics in a self-supervised manner. We present a suite of loss terms and model architecture primitives designed to discretise input data into a preset number of concepts along with their local style. We validate ConceptVAE both qualitatively and quantitatively, demonstrating its ability to detect fine-grained anatomical structures such as blood pools and septum walls from 2D cardiac echocardiographies. Quantitatively, ConceptVAE outperforms traditional self-supervised methods in tasks such as region-based instance retrieval, semantic segmentation, out-of-distribution detection, and object detection. Additionally, we explore the generation of in-distribution synthetic data that maintains the same concepts as the training data but with distinct styles, highlighting its potential for more calibrated data generation. Overall, our study introduces and validates a promising new pre-training technique based on concept-style disentanglement, opening multiple avenues for developing models for medical image analysis that are more interpretable and explainable than black-box approaches.
https://arxiv.org/abs/2502.01335
Large language models have been shown to suffer from reasoning inconsistency issues. That is, they fail more in situations unfamiliar to the training data, even though exact or very similar reasoning paths exist in more common cases that they can successfully solve. Such observations motivate us to propose methods that encourage models to understand the high-level and abstract reasoning processes during training instead of only the final answer. This way, models can transfer the exact solution to similar cases, regardless of their relevance to the pre-training data distribution. In this work, we propose SAL, a self-supervised analogical learning framework. SAL mimics the human analogy process and trains models to explicitly transfer high-quality symbolic solutions from cases that they know how to solve to other rare cases in which they tend to fail more. We show that the resulting models after SAL learning outperform base language models on a wide range of reasoning benchmarks, such as StrategyQA, GSM8K, and HotpotQA, by 2% to 20%. At the same time, we show that our model is more generalizable and controllable through analytical studies.
https://arxiv.org/abs/2502.00996
Graph representation learning has emerged as a cornerstone for tasks like node classification and link prediction, yet prevailing self-supervised learning (SSL) methods face challenges such as computational inefficiency, reliance on contrastive objectives, and representation collapse. Existing approaches often depend on feature reconstruction, negative sampling, or complex decoders, which introduce training overhead and hinder generalization. Further, current techniques that address such limitations fail to account for the contribution of node embeddings to a certain prediction in the absence of labeled nodes. To address these limitations, we propose a novel joint embedding predictive framework for graph SSL that eliminates contrastive objectives and negative sampling while preserving semantic and structural information. Additionally, we introduce a semantic-aware objective term that incorporates pseudo-labels derived from Gaussian Mixture Models (GMMs), enhancing node discriminability by evaluating latent feature contributions. Extensive experiments demonstrate that our framework outperforms state-of-the-art graph SSL methods across benchmarks, achieving superior performance without contrastive loss or complex decoders. Key innovations include (1) a non-contrastive, view-invariant joint embedding predictive architecture, (2) leveraging the single-context, multiple-target relationship between subgraphs, and (3) GMM-based pseudo-label scoring to capture semantic contributions. This work advances graph SSL by offering a computationally efficient, collapse-resistant paradigm that bridges spatial and semantic graph features for downstream tasks. The code for our paper can be found at this https URL
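A short sketch of deriving GMM pseudo-labels and per-node confidence scores from latent node embeddings; how the paper folds these posteriors into its semantic-aware objective term is an assumption here:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_pseudo_labels(embeddings, n_components):
    """Fit a GMM on node embeddings and return hard pseudo-labels,
    per-node confidences, and the full posterior matrix.

    embeddings   : (n_nodes, d) latent node features
    n_components : assumed number of latent semantic groups
    """
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0).fit(embeddings)
    posteriors = gmm.predict_proba(embeddings)            # (n_nodes, n_components)
    hard = posteriors.argmax(axis=1)                       # pseudo-labels
    confidence = posteriors.max(axis=1)                    # per-node contribution weight
    return hard, confidence, posteriors

# toy usage
labels, conf, post = gmm_pseudo_labels(np.random.randn(500, 32), n_components=7)
```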
https://arxiv.org/abs/2502.01684
Electroencephalogram (EEG) provides a non-invasive, highly accessible, and cost-effective solution for Alzheimer's Disease (AD) detection. However, existing methods, whether based on manual feature extraction or deep learning, face two major challenges: the lack of large-scale datasets for robust feature learning and evaluation, and poor detection performance due to inter-subject variations. To address these challenges, we curate an EEG-AD corpus containing 813 subjects, which forms the world's largest EEG-AD dataset to the best of our knowledge. Using this unique dataset, we propose LEAD, the first large foundation model for EEG-based AD detection. Our method encompasses an entire pipeline, from data selection and preprocessing to self-supervised contrastive pretraining, fine-tuning, and key setups such as subject-independent evaluation and majority voting for subject-level detection. We pre-train the model on 11 EEG datasets and perform unified fine-tuning on 5 AD datasets. Our self-supervised pre-training design includes sample-level and subject-level contrasting to extract useful general EEG features. Fine-tuning is performed on the 5 channel-aligned datasets together. The backbone encoder incorporates temporal and channel embeddings to capture features across both temporal and spatial dimensions. Our method demonstrates outstanding AD detection performance, achieving up to a 9.86% increase in F1 score at the sample level and up to a 9.31% increase at the subject level compared to state-of-the-art methods. The results of our model strongly confirm the effectiveness of contrastive pre-training and channel-aligned unified fine-tuning for addressing inter-subject variation. The source code is at this https URL.
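A tiny sketch of the subject-level majority-voting step over per-sample predictions (tie-breaking toward the positive class is our assumption):

```python
from collections import Counter, defaultdict

def subject_level_detection(sample_preds, subject_ids):
    """Aggregate per-sample 0/1 predictions into one decision per subject.

    sample_preds : list of 0/1 predictions, one per EEG sample
    subject_ids  : list of subject identifiers aligned with sample_preds
    """
    by_subject = defaultdict(list)
    for pred, sid in zip(sample_preds, subject_ids):
        by_subject[sid].append(pred)
    decisions = {}
    for sid, preds in by_subject.items():
        counts = Counter(preds)
        decisions[sid] = 1 if counts[1] >= counts[0] else 0
    return decisions

# toy usage: three samples for subject "s01", two for "s02"
print(subject_level_detection([1, 0, 1, 0, 0], ["s01", "s01", "s01", "s02", "s02"]))
# -> {'s01': 1, 's02': 0}
```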
https://arxiv.org/abs/2502.01678
3D Gaussian splatting-based talking head synthesis has recently gained attention for its ability to render high-fidelity images with real-time inference speed. However, since it is typically trained on only a short video that lacks diversity in facial emotions, the resultant talking heads struggle to represent a wide range of emotions. To address this issue, we propose a lip-aligned emotional face generator and leverage it to train our EmoTalkingGaussian model. It is able to manipulate facial emotions conditioned on continuous emotion values (i.e., valence and arousal) while retaining synchronization of lip movements with input audio. Additionally, to achieve accurate lip synchronization for in-the-wild audio, we introduce a self-supervised learning method that leverages a text-to-speech network and a visual-audio synchronization network. We evaluate EmoTalkingGaussian on publicly available videos and obtain better results than the state of the art in terms of image quality (measured in PSNR, SSIM, LPIPS), emotion expression (measured in V-RMSE, A-RMSE, V-SA, A-SA, Emotion Accuracy), and lip synchronization (measured in LMD, Sync-E, Sync-C), respectively.
https://arxiv.org/abs/2502.00654
Foundation models refer to deep learning models pretrained on large unlabeled datasets through self-supervised algorithms. In the Earth science and remote sensing communities, there is growing interest in transforming the use of Earth observation data, including satellite and aerial imagery, through foundation models. Various foundation models have been developed for remote sensing, such as those for multispectral, high-resolution, and hyperspectral images, and have demonstrated superior performance on various downstream tasks compared to traditional supervised models. These models are evolving rapidly, with capabilities to handle multispectral, multitemporal, and multisensor data. Most studies use masked autoencoders in combination with Vision Transformers (ViTs) as the backbone for pretraining. While these models have shown promising performance, ViTs face challenges, such as quadratic computational scaling with input length, which may limit performance on multiband and multitemporal data with long sequences. This research aims to address these challenges by proposing SatMamba, a new pretraining framework that combines masked autoencoders with a State Space Model, offering linear computational scaling. Experiments on high-resolution imagery across various downstream tasks show promising results, paving the way for more efficient foundation models and unlocking the full potential of Earth observation data. The source code is available at this https URL.
https://arxiv.org/abs/2502.00435
Recently, adapting Vision Language Models (VLMs) to zero-shot visual classification by tuning class embeddings with a few prompts (Test-time Prompt Tuning, TPT) or replacing class names with generated visual samples (support-set) has shown promising results. However, TPT cannot avoid the semantic gap between modalities, while the support-set cannot be tuned. To this end, we combine the strengths of both and propose a novel framework, namely TEst-time Support-set Tuning for zero-shot Video Classification (TEST-V). It first dilates the support-set with multiple prompts (Multi-prompting Support-set Dilation, MSD) and then erodes the support-set via learnable weights to mine key cues dynamically (Temporal-aware Support-set Erosion, TSE). Specifically, i) MSD expands the support samples for each class based on multiple prompts enquired from LLMs to enrich the diversity of the support-set. ii) TSE tunes the support-set with factorized learnable weights according to temporal prediction consistency in a self-supervised manner to mine pivotal supporting cues for each class. $\textbf{TEST-V}$ achieves state-of-the-art results across four benchmarks and has good interpretability for the support-set dilation and erosion.
https://arxiv.org/abs/2502.00426
Though end-to-end speech-to-text translation has been a great success, we argue that the cascaded speech-to-text translation model, which is usually criticized for error propagation between the automatic speech recognition (ASR) and machine translation (MT) models, still has its place. In this paper, we explore the benefits of incorporating multiple candidates from ASR and self-supervised speech features into MT. Our analysis reveals that the primary cause of cascading errors stems from the increased divergence between similar samples in the speech domain when mapped to the text domain. By including multiple candidates and self-supervised speech features, our approach allows the machine translation model to choose the right words and ensure precise translation using various speech samples. This strategy minimizes error spread and takes advantage of large ASR and MT datasets, along with pre-trained ASR/MT models, while addressing associated issues.
https://arxiv.org/abs/2502.00377
This paper introduces a three-branch checks-and-balances framework for ethical alignment of Large Language Models (LLMs), inspired by governmental systems. It implements three independent yet interacting components: LLMs as the executive branch for knowledge generation, DIKE as the legislative branch establishing ethical guardrails, and ERIS as the judicial branch for contextual interpretation. The adversarial DIKE-ERIS duality enables adaptation to diverse cultural contexts while upholding consistent ethical principles. This architecture addresses limitations of reinforcement learning from human feedback (RLHF) by providing interpretable, adaptable, and culturally-aware ethical reasoning. Through self-supervised learning and adversarial testing, our framework demonstrates how emotional modeling can guide linguistic behaviors toward ethical outcomes while preserving independence across knowledge generation, ethical oversight, and contextual interpretation.
https://arxiv.org/abs/2502.00136