The slow, iterative nature of sampling remains a major bottleneck for the practical deployment of diffusion and flow-based generative models. While consistency models (CMs) represent a state-of-the-art distillation-based approach for efficient generation, their large-scale application is still limited by two key issues: training instability and inflexible sampling. Existing methods seek to mitigate these problems through architectural adjustments or regularized objectives, yet overlook the critical reliance on trajectory selection. In this work, we first analyze these two limitations: training instability originates from loss divergence induced by an unstable self-supervised term, whereas sampling inflexibility arises from error accumulation. Based on these insights, we propose the Dual-End Consistency Model (DE-CM), which selects vital sub-trajectory clusters to achieve stable and effective training. DE-CM decomposes the PF-ODE trajectory and selects three critical sub-trajectories as optimization targets. Specifically, our approach leverages continuous-time CM objectives to achieve few-step distillation and utilizes flow matching as a boundary regularizer to stabilize the training process. Furthermore, we propose a novel noise-to-noisy (N2N) mapping that can map noise to any point, thereby alleviating error accumulation in the first step. Extensive experimental results show the effectiveness of our method: it achieves a state-of-the-art FID score of 1.70 in one-step generation on the ImageNet 256x256 dataset, outperforming existing CM-based one-step approaches.
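The boundary regularizer mentioned above is a flow-matching objective. As a hedged illustration (not DE-CM's exact formulation, whose path, weighting, and network are not specified here), a conditional flow-matching loss on the linear noise-to-data path can be sketched in NumPy:

```python
import numpy as np

def flow_matching_loss(x0, noise, velocity_pred):
    """Conditional flow-matching loss on the linear noise-to-data path
    x_t = (1 - t) * x0 + t * noise, whose target velocity is noise - x0."""
    target = noise - x0
    return float(np.mean((velocity_pred - target) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))        # clean data batch
noise = rng.standard_normal((4, 8))     # Gaussian prior samples
t = rng.uniform(size=(4, 1))
x_t = (1 - t) * x0 + t * noise          # where a velocity net would be queried
loss = flow_matching_loss(x0, noise, velocity_pred=noise - x0)  # oracle pred
```

With the oracle prediction `noise - x0` the loss is exactly zero, which is the sense in which the term anchors (regularizes) the trajectory boundary.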
https://arxiv.org/abs/2602.10764
Super-resolution (SR) applied to real-world low-resolution (LR) images often results in complex, irregular degradations that stem from the inherent complexity of natural scene acquisition. In contrast to SR artifacts arising from synthetic LR images created under well-defined scenarios, these distortions are highly unpredictable and vary significantly across different real-life contexts. Consequently, assessing the quality of SR images (SR-IQA) obtained from realistic LR inputs remains a challenging and underexplored problem. In this work, we introduce a no-reference SR-IQA approach tailored for such highly ill-posed realistic settings. The proposed method enables domain-adaptive IQA for real-world SR applications, particularly in data-scarce domains. We hypothesize that degradations in super-resolved images are strongly dependent on the underlying SR algorithms, rather than being solely determined by image content. To this end, we introduce a self-supervised learning (SSL) strategy that first pretrains multiple SR-model-oriented representations in a pretext stage. Our contrastive learning framework forms positive pairs from images produced by the same SR model and negative pairs from those generated by different methods, independent of image content. The proposed approach, S3RIQA, further incorporates targeted preprocessing to extract complementary quality information and an auxiliary task to better handle the various degradation profiles associated with different SR scaling factors. To support unsupervised pretext training, we constructed a new dataset, SRMORSS; it includes a wide range of SR algorithms applied to numerous real LR images, addressing a gap in existing datasets. Experiments on real SR-IQA benchmarks demonstrate that S3RIQA consistently outperforms most relevant state-of-the-art metrics.
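The pairing rule described above (same SR model → positive, different model → negative, content ignored) can be sketched as a pair-mask builder; the model names below are hypothetical placeholders, not the methods actually used in SRMORSS:

```python
import numpy as np

def pair_masks(sr_model_ids):
    """Positive/negative masks for model-identity contrastive learning:
    same SR model -> positive pair, different model -> negative pair,
    regardless of image content."""
    ids = np.asarray(sr_model_ids)
    same = ids[:, None] == ids[None, :]
    eye = np.eye(len(ids), dtype=bool)
    return same & ~eye, ~same        # positives (no self-pairs), negatives

# Batch of four SR outputs; model names are hypothetical placeholders.
pos, neg = pair_masks(["modelA", "modelA", "modelB", "modelB"])
```

Such masks would then feed a standard contrastive objective (e.g. an InfoNCE-style loss) in the pretext stage.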
https://arxiv.org/abs/2602.10744
Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq$\Delta$-REPA, a sequence-level control-effect alignment objective that anchors the integrated latent actions to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
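The control-effect alignment idea can be sketched as follows, assuming (for illustration only) that the integrated latent action is a plain sum over the clip and the "semantic effect" is the last-minus-first frozen-encoder feature difference; the paper's exact aggregation and similarity are not specified here:

```python
import numpy as np

def seq_delta_alignment(latent_actions, encoder_feats):
    """Sequence-level control-effect alignment: latent actions integrated
    over a clip are anchored, via cosine similarity, to the frozen
    encoder's feature difference across the clip."""
    integrated = latent_actions.sum(axis=0)        # integrate over time
    delta = encoder_feats[-1] - encoder_feats[0]   # observable effect
    cos = integrated @ delta / (np.linalg.norm(integrated)
                                * np.linalg.norm(delta) + 1e-8)
    return 1.0 - cos                               # 0 when fully aligned

feats = np.stack([np.zeros(4), np.ones(4)])        # toy frozen features
loss_aligned = seq_delta_alignment(np.ones((5, 4)), feats)
loss_opposed = seq_delta_alignment(-np.ones((5, 4)), feats)
```

Because the anchor comes from observed frames rather than from the (unobserved) actions, it gives clips from different contexts a shared coordinate system.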
https://arxiv.org/abs/2602.10104
As autonomous vehicles are rolled out, measures must be taken to ensure their safe operation. To supervise a system that is already in operation, monitoring frameworks are frequently employed. These run continuously online in the background, supervising the system status and recording anomalies. This work proposes an online monitoring framework to detect anomalies in object state representations. A key challenge here is creating a framework for anomaly detection without anomaly labels, which are usually unavailable for unknown anomalies. To address this issue, this work applies a self-supervised embedding method to translate object data into a latent representation space. For this, a JEPA-based self-supervised prediction task is constructed, allowing training without anomaly labels and the creation of rich object embeddings. The resulting expressive JEPA embeddings serve as input for established anomaly detection methods in order to identify anomalies within object state representations. This framework is particularly useful for applications in real-world environments, where new or unknown anomalies may occur during operation for which no labels are available. Experiments performed on the publicly available, real-world nuScenes dataset illustrate the framework's capabilities.
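Downstream, the JEPA embeddings feed "established anomaly detection methods". As one hedged stand-in for such a method, a simple k-nearest-neighbor distance score over embeddings (the embedding values below are synthetic placeholders, not actual JEPA outputs):

```python
import numpy as np

def knn_anomaly_scores(train_emb, test_emb, k=5):
    """Score each test embedding by its mean distance to the k nearest
    training embeddings; larger scores indicate likelier anomalies."""
    d = np.linalg.norm(test_emb[:, None, :] - train_emb[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

rng = np.random.default_rng(1)
normal_emb = rng.normal(size=(200, 16))   # stand-in for JEPA embeddings
inlier = rng.normal(size=(1, 16))
outlier = np.full((1, 16), 8.0)           # far outside the normal cluster
scores = knn_anomaly_scores(normal_emb, np.vstack([inlier, outlier]))
```

Because scoring needs only unlabeled "normal" operating data, it fits the label-free setting the abstract describes.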
https://arxiv.org/abs/2602.09985
Passive acoustic monitoring has become a key strategy in biodiversity assessment, conservation, and behavioral ecology, especially as Internet-of-Things (IoT) devices enable continuous in situ audio collection at scale. While recent self-supervised learning (SSL)-based audio encoders, such as BEATs and AVES, have shown strong performance in bioacoustic tasks, their computational cost and limited robustness to unseen environments hinder deployment on resource-constrained platforms. In this work, we introduce BioME, a resource-efficient audio encoder designed for bioacoustic applications. BioME is trained via layer-to-layer distillation from a high-capacity teacher model, enabling strong representational transfer while reducing the parameter count by 75%. To further improve ecological generalization, the model is pretrained on multi-domain data spanning speech, environmental sounds, and animal vocalizations. A key contribution is the integration of modulation-aware acoustic features via FiLM conditioning, injecting a DSP-inspired inductive bias that enhances feature disentanglement in low-capacity regimes. Across multiple bioacoustic tasks, BioME matches or surpasses the performance of larger models, including its teacher, while being suitable for resource-constrained IoT deployments. For reproducibility, code and pretrained checkpoints are publicly available.
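FiLM conditioning itself is simple: a per-channel scale and shift predicted from the conditioning signal. A minimal sketch, with hand-picked gamma/beta standing in for BioME's modulation-aware feature network:

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise Linear Modulation: per-channel scale and shift,
    broadcast over the time axis."""
    return gamma[None, :] * features + beta[None, :]

# 10 frames x 4 channels, modulated by condition-derived parameters
# (hand-picked here in place of the modulation-feature network).
feats = np.ones((10, 4))
gamma = np.array([2.0, 1.0, 0.5, 0.0])   # per-channel scale
beta = np.array([0.0, 1.0, 0.0, 3.0])    # per-channel shift
out = film(feats, gamma, beta)
```

The DSP-inspired inductive bias enters through *what* predicts gamma and beta (modulation features), not through the FiLM mechanics themselves.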
https://arxiv.org/abs/2602.09970
Urinary bladder cancer surveillance requires tracking tumor sites across repeated interventions, yet the deformable and hollow bladder lacks stable landmarks for orientation. While blood vessels visible during endoscopy offer a patient-specific "vascular fingerprint" for navigation, automated segmentation is challenged by imperfect endoscopic data, including sparse labels, artifacts like bubbles or variable lighting, continuous deformation, and mucosal folds that mimic vessels. State-of-the-art vessel segmentation methods often fail to address these domain-specific complexities. We introduce a Hybrid Attention-Convolution (HAC) architecture that combines a Transformer, which captures a global vessel-topology prior, with a CNN that learns a residual refinement map to precisely recover thin-vessel details. To prioritize structural connectivity, the Transformer is trained on optimized ground truth data that exclude short and terminal branches. Furthermore, to address data scarcity, we employ physics-aware pretraining, a self-supervised strategy using clinically grounded augmentations on unlabeled data. Evaluated on the BlaVeS dataset, consisting of endoscopic video frames, our approach achieves high accuracy (0.94) and superior precision (0.61) and clDice (0.66) compared to state-of-the-art medical segmentation models. Crucially, our method successfully suppresses false positives from mucosal folds that dynamically appear and vanish as the bladder fills and empties during surgery. Hence, HAC provides the reliable structural stability required for clinical navigation.
https://arxiv.org/abs/2602.09949
Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens frequently lose information due to limited code capacity, resulting in noticeably weaker understanding than continuous-feature VLMs. We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.
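The discrete interface referred to above is typically a vector-quantization lookup: each continuous feature is replaced by the id of its nearest codebook entry, which is exactly where limited code capacity can lose information. A toy sketch (codebook values are made up):

```python
import numpy as np

def quantize(features, codebook):
    """Map continuous visual features to discrete token ids by nearest
    codebook entry -- the lossy step where limited code capacity can
    discard information."""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])  # toy codes
ids = quantize(np.array([[0.1, -0.1], [0.9, 1.2]]), codebook)
```

Everything not representable by the chosen entry (here, the 0.1-scale residuals) is discarded, which is the understanding gap Kelix targets.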
https://arxiv.org/abs/2602.09843
Most self-supervised learning (SSL) methods learn continuous visual representations by aligning different views of the same input, offering limited control over how information is structured across representation dimensions. In this work, we frame visual self-supervised learning as a discrete communication process between a teacher and a student network, where semantic information is transmitted through a fixed-capacity binary channel. Rather than aligning continuous features, the student predicts multi-label binary messages produced by the teacher. Discrete agreement is enforced through an element-wise binary cross-entropy objective, while a coding-rate regularization term encourages effective utilization of the constrained channel, promoting structured representations. We further show that periodically reinitializing the projection head strengthens this effect by encouraging embeddings that remain predictive across multiple discrete encodings. Extensive experiments demonstrate consistent improvements over continuous agreement baselines on image classification, retrieval, and dense visual prediction tasks, as well as under domain shift through self-supervised adaptation. Beyond backbone representations, we analyze the learned binary codes and show that they form a compact and informative discrete language, capturing semantic factors reusable across classes.
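A hedged sketch of the discrete-agreement objective: element-wise binary cross-entropy against the teacher's binary message, plus a simple channel-usage penalty standing in for the paper's coding-rate regularizer (the true regularizer is not specified here):

```python
import numpy as np

def discrete_agreement_loss(student_logits, teacher_bits, rate_weight=0.1):
    """Element-wise BCE against the teacher's binary message, plus a
    usage penalty that pushes each bit's mean activation toward 0.5 so
    the fixed-capacity channel is fully used (a simple proxy term)."""
    p = 1.0 / (1.0 + np.exp(-student_logits))
    eps = 1e-9
    bce = -np.mean(teacher_bits * np.log(p + eps)
                   + (1 - teacher_bits) * np.log(1 - p + eps))
    usage = p.mean(axis=0)                      # per-bit activation rate
    return bce + rate_weight * np.mean((usage - 0.5) ** 2)

teacher_bits = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
good = np.where(teacher_bits > 0.5, 6.0, -6.0)  # confident and correct
bad = -good                                     # confident and wrong
```

The student is thus graded bit-by-bit on a fixed-width message rather than on continuous feature distance.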
https://arxiv.org/abs/2602.09764
Digital histopathology whole slide images (WSIs) provide gigapixel-scale high-resolution images that are highly useful for disease diagnosis. However, digital histopathology image analysis faces significant challenges due to the limited training labels, since manually annotating specific regions or small patches cropped from large WSIs requires substantial time and effort. Weakly supervised multiple instance learning (MIL) offers a practical and efficient solution by requiring only bag-level (slide-level) labels, while each bag typically contains multiple instances (patches). Most MIL methods directly use frozen image patch features generated by various image encoders as inputs and primarily focus on feature aggregation. However, feature representation learning for encoder pretraining in MIL settings has largely been neglected. In our work, we propose a novel feature representation learning framework called weakly supervised contrastive learning (WeakSupCon) that incorporates bag-level label information during training. Our method does not rely on instance-level pseudo-labeling, yet it effectively separates patches with different labels in the feature space. Experimental results demonstrate that the image features generated by our WeakSupCon method lead to improved downstream MIL performance compared to self-supervised contrastive learning approaches in three datasets. Our related code is available at this http URL
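The idea of using only bag-level labels in a contrastive objective can be sketched as a supervised-contrastive-style loss whose positives are patches from bags sharing a slide-level label; this is an illustrative stand-in under that assumption, not WeakSupCon's exact objective:

```python
import numpy as np

def weak_supcon_loss(emb, bag_labels, temperature=0.5):
    """Contrastive loss where a patch's positives are all patches whose
    bag carries the same slide-level label (no instance pseudo-labels)."""
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    labels = np.asarray(bag_labels)
    eye = np.eye(len(labels), dtype=bool)
    pos = (labels[:, None] == labels[None, :]) & ~eye
    logits = np.where(eye, -np.inf, sim)        # exclude self-similarity
    log_prob = sim - np.log(np.exp(logits).sum(axis=1))[:, None]
    return float(-np.mean(log_prob[pos]))

# Patches from two bags with different slide-level labels.
emb = np.array([[1.0, 0.0], [0.99, 0.1], [-1.0, 0.0], [-0.99, 0.1]])
loss_consistent = weak_supcon_loss(emb, [0, 0, 1, 1])  # labels match clusters
loss_scrambled = weak_supcon_loss(emb, [0, 1, 0, 1])   # labels ignore clusters
```

The loss is lower when same-label patches sit close in feature space, which is the separation effect the abstract reports.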
https://arxiv.org/abs/2602.09477
Each year, thousands of patients in need of heart transplants face life-threatening wait times due to organ scarcity. While allocation policies aim to maximize population-level outcomes, current approaches often fail to account for the dynamic arrival of organs and the composition of waitlisted candidates, thereby hampering efficiency. The United States is transitioning from rigid, rule-based allocation to more flexible data-driven models. In this paper, we propose a novel framework for non-myopic policy optimization in general online matching relying on potentials, a concept originally introduced for kidney exchange. We develop scalable and accurate ways of learning potentials that are higher-dimensional and more expressive than prior approaches. Our approach is a form of self-supervised imitation learning: the potentials are trained to mimic an omniscient algorithm that has perfect foresight. We focus on the application of heart transplant allocation and demonstrate, using real historical data, that our policies significantly outperform prior approaches -- including the current US status quo policy and the proposed continuous distribution framework -- in optimizing for population-level outcomes. Our analysis and methods come at a pivotal moment in US policy, as the current heart transplant allocation system is under review. We propose a scalable and theoretically grounded path toward more effective organ allocation.
https://arxiv.org/abs/2602.08878
The growing capability of video generation poses escalating security risks, making reliable detection increasingly essential. In this paper, we introduce VideoVeritas, a framework that integrates fine-grained perception and fact-based reasoning. We observe that while current multi-modal large language models (MLLMs) exhibit strong reasoning capacity, their granular perception ability remains limited. To mitigate this, we introduce Joint Preference Alignment and Perception Pretext Reinforcement Learning (PPRL). Specifically, rather than directly optimizing for the detection task, we adopt general spatiotemporal grounding and self-supervised object counting in the RL stage, enhancing detection performance with simple perception pretext tasks. To facilitate robust evaluation, we further introduce MintVid, a lightweight yet high-quality dataset containing 3K videos from 9 state-of-the-art generators, along with a real-world collected subset that has factual errors in content. Experimental results demonstrate that existing methods tend to bias towards either superficial reasoning or mechanical analysis, while VideoVeritas achieves more balanced performance across diverse benchmarks.
https://arxiv.org/abs/2602.08828
Speech Emotion Recognition (SER) is widely deployed in Human-Computer Interaction, yet the high computational cost of conventional models hinders their implementation on resource-constrained edge devices. Spiking Neural Networks (SNNs) offer an energy-efficient alternative due to their event-driven nature; however, their integration with continuous Self-Supervised Learning (SSL) representations is fundamentally challenged by distribution mismatch, where high-dynamic-range embeddings degrade the information coding capacity of threshold-based neurons. To resolve this, we propose Prompt-Tuned Spiking Neural Networks (PTS-SNN), a parameter-efficient neuromorphic adaptation framework that aligns frozen SSL backbones with spiking dynamics. Specifically, we introduce a Temporal Shift Spiking Encoder to capture local temporal dependencies via parameter-free channel shifts, establishing a stable feature basis. To bridge the domain gap, we devise a Context-Aware Membrane Potential Calibration strategy. This mechanism leverages a Spiking Sparse Linear Attention module to aggregate global semantic context into learnable soft prompts, which dynamically regulate the bias voltages of Parametric Leaky Integrate-and-Fire (PLIF) neurons. This regulation effectively centers the heterogeneous input distribution within the responsive firing range, mitigating functional silence or saturation. Extensive experiments on five multilingual datasets (e.g., IEMOCAP, CASIA, EMODB) demonstrate that PTS-SNN achieves 73.34\% accuracy on IEMOCAP, comparable to competitive Artificial Neural Networks (ANNs), while requiring only 1.19M trainable parameters and 0.35 mJ inference energy per sample.
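The calibration mechanism hinges on shifting membrane dynamics with a context-derived bias so inputs land in the responsive firing range. A toy scalar leaky integrate-and-fire step illustrates the effect; the parametric (learnable) parts of PLIF and the prompt machinery are collapsed into fixed numbers here:

```python
def plif_step(inp, v, tau, threshold=1.0, bias=0.0):
    """One step of a (simplified) leaky integrate-and-fire neuron; the
    prompt-derived `bias` shifts the membrane potential so heterogeneous
    inputs land inside the responsive firing range."""
    v = v + (inp + bias - v) / tau        # leaky integration
    spike = 1.0 if v >= threshold else 0.0
    if spike:
        v = 0.0                           # hard reset after a spike
    return spike, v

# Sub-threshold drive: functionally silent without calibration, firing with it.
silent, active, v1, v2 = 0, 0, 0.0, 0.0
for _ in range(20):
    s1, v1 = plif_step(0.5, v1, tau=2.0)            # no bias: v -> 0.5 < 1
    s2, v2 = plif_step(0.5, v2, tau=2.0, bias=1.0)  # calibrated: fires
    silent += int(s1)
    active += int(s2)
```

Without the bias the membrane saturates below threshold (functional silence); with it, the same input drives spiking, which is the mismatch the calibration strategy addresses.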
https://arxiv.org/abs/2602.08240
Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. We validate R&B-EnCoRe across manipulation (Franka Panda in simulation, WidowX in hardware), legged navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving embodiments using various VLA architectures with 1B, 4B, 7B, and 30B parameters. Our approach achieves 28% gains in manipulation success, 101% improvement in navigation scores, and 21% reduction in collision-rate metric over models that indiscriminately reason about all available primitives. R&B-EnCoRe enables models to distill reasoning that is predictive of successful control, bypassing manual annotation engineering while grounding internet-scale knowledge in physical execution.
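Treating reasoning as a latent variable, importance-weighted inference reduces, in its simplest self-normalized form, to weighting sampled reasoning traces by how well they explain the successful actions and keeping the high-weight traces for distillation. A toy sketch with made-up log-likelihoods (not the paper's exact estimator):

```python
import numpy as np

def importance_weights(log_likelihoods):
    """Self-normalized importance weights over sampled reasoning traces:
    each trace is weighted by how well it explains the successful action
    sequence, and high-weight traces seed the refined dataset."""
    log_w = np.asarray(log_likelihoods, dtype=float)
    w = np.exp(log_w - log_w.max())       # subtract max for stability
    return w / w.sum()

# Three candidate traces for one episode (made-up log-likelihoods);
# the second best explains the demonstrated actions.
w = importance_weights([-10.0, -1.0, -5.0])
```

No external reward or verifier appears anywhere: the action likelihood itself does the selecting, matching the abstract's self-supervised framing.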
https://arxiv.org/abs/2602.08167
Recent self-supervised Vision Transformers (ViTs), such as DINOv3, provide rich feature representations for dense vision tasks. This study investigates the intrinsic few-shot semantic segmentation (FSS) capabilities of frozen DINOv3 features through a training-free baseline, FSSDINO, utilizing class-specific prototypes and Gram-matrix refinement. Our results across binary, multi-class, and cross-domain (CDFSS) benchmarks demonstrate that this minimal approach, applied to the final backbone layer, is highly competitive with specialized methods involving complex decoders or test-time adaptation. Crucially, we conduct an Oracle-guided layer analysis, identifying a significant performance gap between the standard last-layer features and globally optimal intermediate representations. We reveal a "Safest vs. Optimal" dilemma: while the Oracle proves higher performance is attainable, matching the results of compute-intensive adaptation methods, current unsupervised and support-guided selection metrics consistently yield lower performance than the last-layer baseline. This characterizes a "Semantic Selection Gap" in Foundation Models, a disconnect where traditional heuristics fail to reliably identify high-fidelity features. Our work establishes the "Last-Layer" as a deceptively strong baseline and provides a rigorous diagnostic of the latent semantic potentials of foundation models. Code is publicly available at this https URL.
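The training-free baseline (class-specific prototypes over frozen features) can be sketched as a masked-average prototype scored by cosine similarity; the Gram-matrix refinement step is omitted, and all feature values below are synthetic:

```python
import numpy as np

def prototype_similarity(query_feats, support_feats, support_mask):
    """Training-free FSS scoring: average masked support features into a
    class prototype, then score query tokens by cosine similarity."""
    proto = support_feats[support_mask].mean(axis=0)
    q = query_feats / np.linalg.norm(query_feats, axis=-1, keepdims=True)
    return q @ (proto / np.linalg.norm(proto))

support = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])  # frozen features
mask = np.array([True, True, False])                      # foreground tokens
query = np.array([[1.0, 0.05], [0.0, 1.0]])
sims = prototype_similarity(query, support, mask)
```

Thresholding the similarity map yields the segmentation; the whole pipeline trains nothing, which is why layer choice alone drives the reported performance gap.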
https://arxiv.org/abs/2602.07550
Boundary representation (B-rep) is the industry standard for computer-aided design (CAD). While deep learning shows promise in processing B-rep models, existing methods suffer from a representation gap: continuous approaches offer analytical precision but are visually abstract, whereas discrete methods provide intuitive clarity at the expense of geometric precision. To bridge this gap, we introduce Brep2Shape, a novel self-supervised pre-training method designed to align abstract boundary representations with intuitive shape representations. Our method employs a geometry-aware task where the model learns to predict dense spatial points from parametric Bézier control points, enabling the network to better understand physical manifolds derived from abstract coefficients. To enhance this alignment, we propose a Dual Transformer backbone with parallel streams that independently encode surface and curve tokens to capture their distinct geometric properties. Moreover, the topology attention is integrated to model the interdependencies between surfaces and curves, thereby maintaining topological consistency. Experimental results demonstrate that Brep2Shape offers significant scalability, achieving state-of-the-art accuracy and faster convergence across various downstream tasks.
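The pre-training task maps parametric Bézier control points to dense spatial points. For a cubic curve this is Bernstein-basis evaluation, which is how such dense targets could be generated (a curve rather than a surface, for brevity, with a made-up control net):

```python
import numpy as np

def bezier_points(control_points, ts):
    """Evaluate a cubic Bezier curve at parameters `ts` via the
    Bernstein basis, turning abstract coefficients into dense points."""
    P = np.asarray(control_points, dtype=float)   # (4, dim)
    t = np.asarray(ts, dtype=float)[:, None]
    basis = np.stack([(1 - t) ** 3,
                      3 * t * (1 - t) ** 2,
                      3 * t ** 2 * (1 - t),
                      t ** 3], axis=1)            # (n, 4, 1)
    return (basis * P[None, :, :]).sum(axis=1)    # (n, dim)

ctrl = [[0.0, 0.0], [1.0, 2.0], [2.0, 2.0], [3.0, 0.0]]  # toy control net
pts = bezier_points(ctrl, np.linspace(0.0, 1.0, 5))
```

Training the network to reproduce such points from the raw coefficients is what ties the abstract parametric representation to the physical manifold it describes.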
https://arxiv.org/abs/2602.07429
Accurate segmentation of brain tissues from MRI scans is critical for neuroscience and clinical applications, but achieving consistent performance across the human lifespan remains challenging due to dynamic, age-related changes in brain appearance and morphology. While prior work has sought to mitigate these shifts by using self-supervised regularization with paired longitudinal data, such data are often unavailable in practice. To address this, we propose \emph{DuMeta++}, a dual meta-learning framework that operates without paired longitudinal data. Our approach integrates: (1) meta-feature learning to extract age-agnostic semantic representations of spatiotemporally evolving brain structures, and (2) meta-initialization learning to enable data-efficient adaptation of the segmentation model. Furthermore, we propose a memory-bank-based class-aware regularization strategy to enforce longitudinal consistency without explicit longitudinal supervision. We theoretically prove the convergence of our DuMeta++, ensuring stability. Experiments on diverse datasets (iSeg-2019, IBIS, OASIS, ADNI) under few-shot settings demonstrate that DuMeta++ outperforms existing methods in cross-age generalization. Code will be available at this https URL.
https://arxiv.org/abs/2602.07174
Fully unsupervised segmentation pipelines naively seek the most salient object, if one is present. As a result, most of the methods reported in the literature deliver non-deterministic partitions that are sensitive to initialization, seed order, and threshold heuristics. We propose PANC, a weakly supervised spectral segmentation framework that uses a minimal set of annotated visual tokens to produce stable, controllable, and reproducible object masks. Building on the TokenCut approach, we augment the token-token affinity graph with a handful of priors coupled to anchor nodes. By manipulating the graph topology, we bias the spectral eigenspace toward partitions that are consistent with the annotations. Our approach preserves the global grouping enforced by dense self-supervised visual features, trading annotated tokens for significant gains in reproducibility, user control, and segmentation quality. Using 5 to 30 annotations per dataset, our training-free method achieves state-of-the-art performance among weakly supervised and unsupervised approaches on standard benchmarks (e.g., DUTS-TE, ECSSD, MS COCO). Moreover, it excels in domains where dense labels are costly or intra-class differences are subtle. We report strong and reliable results on homogeneous, fine-grained, and texture-limited domains, achieving 96.8% (+14.43% over SotA), 78.0% (+0.2%), and 78.8% (+0.37%) average mean intersection-over-union (mIoU) on CrackForest (CFD), CUB-200-2011, and HAM10000 datasets, respectively. For multi-object benchmarks, the framework showcases explicit, user-controllable semantic segmentation.
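The anchor mechanism can be sketched on a toy token graph: strengthening affinities among annotated anchor tokens biases the Fiedler (second Laplacian) eigenvector toward an annotation-consistent cut. This sketch uses an unnormalized Laplacian for simplicity, whereas TokenCut-style methods typically use a normalized cut:

```python
import numpy as np

def anchored_spectral_cut(affinity, anchors, boost=10.0):
    """Bipartition tokens by the sign of the Fiedler vector after
    strengthening edges among annotated anchor tokens, biasing the
    eigenspace toward annotation-consistent partitions."""
    W = affinity.copy()
    for i in anchors:
        for j in anchors:
            if i != j:
                W[i, j] += boost          # couple anchors together
    L = np.diag(W.sum(axis=1)) - W        # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)           # eigh: ascending eigenvalues
    return vecs[:, 1] >= 0.0              # sign of the Fiedler vector

# Toy token graph: two clusters of three tokens, weak cross-cluster ties.
W = np.full((6, 6), 0.05)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
mask = anchored_spectral_cut(W, anchors=[0, 1, 2])
```

Because anchors only reshape the graph, the resulting partition is deterministic given the annotations, avoiding the seed-order sensitivity noted above.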
https://arxiv.org/abs/2602.06912
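The anchored spectral cut described above can be sketched in a few lines. This is a minimal illustration of the idea (our own, not the PANC code), assuming a dense token-token affinity matrix and small lists of annotated foreground/background anchor token indices:

```python
import numpy as np

# Bias a token-token affinity graph with a few annotated "anchor" tokens,
# then bipartition via the Fiedler vector of the normalized graph Laplacian.
def anchored_spectral_cut(W, fg_anchors, bg_anchors, strength=10.0):
    W = W.copy()
    n = W.shape[0]
    # Strongly couple same-class anchors and decouple cross-class anchors,
    # reshaping the eigenspace toward annotation-consistent partitions.
    for group in (fg_anchors, bg_anchors):
        for i in group:
            for j in group:
                if i != j:
                    W[i, j] = strength
    for i in fg_anchors:
        for j in bg_anchors:
            W[i, j] = W[j, i] = 0.0
    d = W.sum(axis=1)
    D = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-8)))
    L = np.eye(n) - D @ W @ D                 # normalized Laplacian
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]                      # second-smallest eigenvector
    if fiedler[list(fg_anchors)].mean() < 0:  # orient foreground positive
        fiedler = -fiedler
    return fiedler > 0

# Two loosely connected token clusters; the anchors fix which side is "object".
W = np.full((6, 6), 0.1)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
mask = anchored_spectral_cut(W, fg_anchors=[0], bg_anchors=[4])
```

Because the priors enter through the graph topology rather than a learned objective, the procedure stays training-free and deterministic for a fixed set of anchors.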
Fetal echocardiography is essential for detecting congenital heart disease (CHD), facilitating pregnancy management, optimized delivery planning, and timely postnatal interventions. Among standard imaging planes, the four-chamber (4CH) view provides comprehensive information for CHD diagnosis, where clinicians carefully inspect the end-diastolic (ED) and end-systolic (ES) phases to evaluate cardiac structure and motion. Automated detection of these cardiac phases is thus a critical component toward fully automated CHD analysis. Yet, in the absence of fetal electrocardiography (ECG), manual identification of ED and ES frames remains a labor-intensive bottleneck. We present ORBIT (Orientation-Robust Beat Inference from Trajectories), a self-supervised framework that identifies cardiac phases without manual annotations across various fetal heart orientations. ORBIT employs registration as a self-supervision task and learns a latent motion trajectory of cardiac deformation, whose turning points capture transitions between cardiac relaxation and contraction, enabling accurate and orientation-robust localization of ED and ES frames across diverse fetal positions. Trained exclusively on normal fetal echocardiography videos, ORBIT achieves consistent performance on both normal (MAE = 1.9 frames for ED and 1.6 for ES) and CHD cases (MAE = 2.4 frames for ED and 2.1 for ES), outperforming existing annotation-free approaches constrained by fixed orientation assumptions. These results highlight the potential of ORBIT to facilitate robust cardiac phase detection directly from 4CH fetal echocardiography.
https://arxiv.org/abs/2602.06761
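The turning-point localization at the heart of ORBIT can be illustrated on a one-dimensional signal. This is a hypothetical sketch (not the ORBIT code), assuming the learned latent motion trajectory has been reduced to a single scalar deformation value per frame:

```python
import numpy as np

# Candidate ED/ES frames sit at the turning points of the deformation
# trajectory, where the motion direction flips between relaxation and
# contraction.
def turning_points(traj):
    traj = np.asarray(traj, dtype=float)
    d = np.diff(traj)                    # frame-to-frame motion direction
    sign = np.sign(d)
    flips = np.where(sign[:-1] * sign[1:] < 0)[0] + 1
    minima = [int(i) for i in flips if traj[i] < traj[i - 1]]  # one phase boundary
    maxima = [int(i) for i in flips if traj[i] > traj[i - 1]]  # the opposite boundary
    return minima, maxima

t = np.linspace(0.0, 2 * np.pi, 41)          # one synthetic cardiac cycle
minima, maxima = turning_points(np.sin(t))   # extremes of the cycle
```

Because only the shape of the trajectory matters, not the image-space direction of motion, this kind of localization is inherently robust to fetal heart orientation.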
Weight initialization plays a crucial role in the optimization behavior and convergence efficiency of neural networks. Most existing initialization methods, such as Xavier and Kaiming initializations, rely on random sampling and do not exploit information from the optimization process itself. We propose a simple yet effective initialization strategy based on self-supervised pre-training with random noise as the target. Instead of directly training the network from random weights, we first pre-train it to fit random noise, which leads to a structured and non-random parameter configuration. We show that this noise-driven pre-training significantly improves convergence speed in subsequent tasks, without requiring additional data or changes to the network architecture. The proposed method is particularly effective for implicit neural representations (INRs) and Deep Image Prior (DIP)-style networks, which are known to exhibit a strong low-frequency bias during optimization. After noise-based pre-training, the network is able to capture high-frequency components much earlier in training, leading to faster and more stable convergence. Although random noise contains no semantic information, it serves as an effective self-supervised signal (owing to its white-spectrum nature) for shaping the initialization of neural networks. Overall, this work demonstrates that noise-based pre-training offers a lightweight and general alternative to traditional random initialization, enabling more efficient optimization of deep neural networks.
https://arxiv.org/abs/2602.06585
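The noise-driven pre-training idea can be sketched with a toy setup; the tiny one-hidden-layer MLP and plain gradient descent below are stand-ins for the paper's INR/DIP-style networks, not the actual configuration:

```python
import numpy as np

# Instead of starting the real task from random weights, the network first
# fits pure random noise; the resulting weights serve as the initialization.
rng = np.random.default_rng(0)

def init(d_in, d_h, d_out):
    return {"W1": rng.normal(0, d_in ** -0.5, (d_in, d_h)), "b1": np.zeros(d_h),
            "W2": rng.normal(0, d_h ** -0.5, (d_h, d_out)), "b2": np.zeros(d_out)}

def forward(p, x):
    h = np.tanh(x @ p["W1"] + p["b1"])
    return h, h @ p["W2"] + p["b2"]

def train(p, x, y, lr=0.1, steps=500):
    for _ in range(steps):
        h, out = forward(p, x)
        err = (out - y) / len(x)                 # MSE gradient (factor folded into lr)
        dh = (err @ p["W2"].T) * (1.0 - h ** 2)  # backprop through tanh
        p["W2"] -= lr * (h.T @ err)
        p["b2"] -= lr * err.sum(axis=0)
        p["W1"] -= lr * (x.T @ dh)
        p["b1"] -= lr * dh.sum(axis=0)
    return p

x = rng.uniform(-1, 1, (128, 2))      # e.g. coordinate inputs of an INR
noise = rng.normal(size=(128, 1))     # white-noise pre-training target
params = init(2, 64, 1)
mse_before = float(((forward(params, x)[1] - noise) ** 2).mean())
params = train(params, x, noise)      # noise-driven pre-training
mse_after = float(((forward(params, x)[1] - noise) ** 2).mean())
# `params` now replaces a fresh random init for the downstream task.
```

Fitting a white-spectrum target forces the network to allocate capacity to high frequencies before the real task begins, which is the mechanism the abstract credits for the faster convergence.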
Chemical exchange saturation transfer (CEST) MRI is a non-invasive imaging modality for detecting metabolites. It offers higher resolution and sensitivity compared to conventional magnetic resonance spectroscopy (MRS). However, quantification of CEST data is challenging because the measured signal results from a complex interplay of many physiological variables. Here, we introduce a transformer-based neural network that fits the parameters of a physical model derived from the Bloch-McConnell equations, such as metabolite concentrations, exchange rates, and relaxation rates, to in-vitro CEST spectra. We show that our self-supervised trained neural network clearly outperforms classical gradient-based solvers.
https://arxiv.org/abs/2602.06574
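The self-supervised objective can be illustrated with a simplified forward model; the two-pool Lorentzian below is a common analytical stand-in for the full Bloch-McConnell equations, not the model used in the paper:

```python
import numpy as np

# The self-supervised objective pushes predicted parameters through the
# forward model and compares against the measured spectrum, so no
# ground-truth parameter labels are needed.
def z_spectrum(offsets, amp_w, width_w, amp_s, width_s, shift_s):
    water = amp_w * width_w ** 2 / (width_w ** 2 + offsets ** 2)
    solute = amp_s * width_s ** 2 / (width_s ** 2 + (offsets - shift_s) ** 2)
    return 1.0 - water - solute          # normalized signal S/S0

def self_supervised_loss(pred_params, offsets, measured):
    return float(((z_spectrum(offsets, *pred_params) - measured) ** 2).mean())

offsets = np.linspace(-6.0, 6.0, 121)     # saturation offsets in ppm
true_params = (0.8, 1.0, 0.05, 0.8, 3.5)  # hypothetical solute pool near 3.5 ppm
measured = z_spectrum(offsets, *true_params)
# Any parameter-predicting network (the paper uses a transformer) can be
# trained against this loss alone, without labeled parameters.
```

Parameters that reproduce the measured spectrum drive the loss to zero, so the measured data itself supplies the training signal.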