We propose activation-based data attribution, a method that traces behavioral changes in post-trained language models to the responsible training datapoints. By computing activation-difference vectors for both test prompts and preference pairs and ranking by cosine similarity, we identify datapoints that cause specific behaviors and validate these attributions causally by retraining with modified data. Clustering behavior-datapoint similarity matrices also enables unsupervised discovery of emergent behaviors. Applying this method to OLMo 2's production DPO training, we surfaced distractor-triggered compliance: a harmful behavior in which the model complies with dangerous requests when benign formatting instructions are appended. Filtering the top-ranked datapoints reduces this behavior by 63%, while switching their labels achieves a 78% reduction. Our method outperforms gradient-based attribution and LLM-judge baselines while being over 10 times cheaper than both. This in-the-wild model organism, emerging from contaminated preference data rather than deliberate injection, provides a realistic benchmark for safety techniques.
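The ranking step at the heart of this method can be sketched in a few lines. The vectors below are toy stand-ins for real activation differences, and the helper names are illustrative, not the authors' code:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_datapoints(behavior_diff, pair_diffs):
    """Rank training pairs by how well their activation-difference
    vectors align with the behavior's activation-difference vector."""
    scored = [(i, cosine(behavior_diff, d)) for i, d in enumerate(pair_diffs)]
    return sorted(scored, key=lambda t: -t[1])

# toy activation-difference vectors (hypothetical, 3-dimensional)
behavior = [1.0, 0.0, 0.5]
pair_diffs = [[0.9, 0.1, 0.4],   # aligned -> likely responsible
              [-1.0, 0.2, 0.0],  # anti-aligned
              [0.0, 1.0, 0.0]]   # orthogonal -> unrelated
ranking = rank_datapoints(behavior, pair_diffs)
```

The top-ranked datapoints are then the candidates for the causal check: filter them (or switch their preference labels) and retrain.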
https://arxiv.org/abs/2602.11079
Super-resolution (SR) applied to real-world low-resolution (LR) images often results in complex, irregular degradations that stem from the inherent complexity of natural scene acquisition. In contrast to SR artifacts arising from synthetic LR images created under well-defined scenarios, these distortions are highly unpredictable and vary significantly across real-life contexts. Consequently, assessing the quality of SR images (SR-IQA) obtained from realistic LR inputs remains a challenging and underexplored problem. In this work, we introduce a no-reference SR-IQA approach tailored for such highly ill-posed realistic settings. The proposed method enables domain-adaptive IQA for real-world SR applications, particularly in data-scarce domains. We hypothesize that degradations in super-resolved images depend strongly on the underlying SR algorithms, rather than being determined solely by image content. To this end, we introduce a self-supervised learning (SSL) strategy that first pretrains multiple SR-model-oriented representations in a pretext stage. Our contrastive learning framework forms positive pairs from images produced by the same SR model and negative pairs from those generated by different methods, independent of image content. The proposed approach, S3RIQA, further incorporates targeted preprocessing to extract complementary quality information and an auxiliary task to better handle the varied degradation profiles associated with different SR scaling factors. To support unsupervised pretext training, we constructed a new dataset, SRMORSS, which applies a wide range of SR algorithms to numerous real LR images and thereby addresses a gap in existing datasets. Experiments on real SR-IQA benchmarks demonstrate that S3RIQA consistently outperforms most relevant state-of-the-art metrics.
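The content-agnostic pair construction can be sketched as follows; the feature vectors and SR-model names are invented for illustration:

```python
def make_contrastive_pairs(items):
    """items: list of (feature, sr_model_id) tuples. Positive pairs come
    from images super-resolved by the same SR model; negative pairs from
    different models. Pairing ignores image content entirely."""
    positives, negatives = [], []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            (fi, mi), (fj, mj) = items[i], items[j]
            (positives if mi == mj else negatives).append((fi, fj))
    return positives, negatives

# hypothetical features tagged with the SR model that produced them
items = [([0.1, 0.2], "esrgan"),
         ([0.3, 0.1], "esrgan"),
         ([0.9, 0.8], "swinir")]
pos, neg = make_contrastive_pairs(items)
```

A contrastive objective (e.g., InfoNCE) would then pull positives together and push negatives apart, so the representation encodes the SR method's degradation signature rather than scene content.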
https://arxiv.org/abs/2602.10744
The task of graph-level anomaly detection (GLAD) is to identify anomalous graphs that deviate significantly from the majority of graphs in a dataset. While deep GLAD methods have shown promising performance, their black-box nature limits their reliability and deployment in real-world applications. Although some recent methods have made attempts to provide explanations for anomaly detection results, they either provide explanations without referencing normal graphs, or rely on abstract latent vectors as prototypes rather than concrete graphs from the dataset. To address these limitations, we propose Prototype-based Graph-Level Anomaly Detection (ProtoGLAD), an interpretable unsupervised framework that provides an explanation for each detected anomaly by explicitly contrasting it with its nearest normal prototype graph. It employs a point-set kernel to iteratively discover multiple normal prototype graphs and their associated clusters from the dataset, then identifies graphs distant from all discovered normal clusters as anomalies. Extensive experiments on multiple real-world datasets demonstrate that ProtoGLAD achieves competitive anomaly detection performance compared to state-of-the-art GLAD methods while providing better human-interpretable prototype-based explanations.
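A minimal sketch of the iterative prototype-and-cluster discovery, using 1-D points as stand-ins for graph embeddings and an inverse-distance similarity as a stand-in for the point-set kernel (thresholds and names are ours, not the paper's):

```python
def sim(a, b):
    """Toy stand-in for a graph kernel: similarity decays with distance."""
    return 1.0 / (1.0 + abs(a - b))

def discover_prototypes(points, min_sim=0.5):
    """Iteratively pick the most central unassigned point as a prototype,
    claim its cluster, and repeat; leftovers are anomaly candidates."""
    unassigned = list(range(len(points)))
    prototypes, clusters = [], []
    while unassigned:
        best = max(unassigned,
                   key=lambda i: sum(sim(points[i], points[j]) for j in unassigned))
        members = [j for j in unassigned if sim(points[best], points[j]) >= min_sim]
        if len(members) <= 1:   # candidate clusters with nothing else -> stop
            break
        prototypes.append(best)
        clusters.append(members)
        unassigned = [j for j in unassigned if j not in members]
    return prototypes, clusters, unassigned

# two tight groups plus one far-off "graph"
points = [0.0, 0.1, 0.2, 5.0, 5.1, 20.0]
prototypes, clusters, anomalies = discover_prototypes(points)
```

Each flagged anomaly can then be explained by contrasting it against its nearest prototype, which is a concrete member of the dataset rather than an abstract latent vector.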
https://arxiv.org/abs/2602.10708
Medical foundation models have shown promise in controlled benchmarks, yet widespread deployment remains hindered by reliance on task-specific fine-tuning. Here, we introduce DermFM-Zero, a dermatology vision-language foundation model trained via masked latent modelling and contrastive learning on over 4 million multimodal data points. We evaluated DermFM-Zero across 20 benchmarks spanning zero-shot diagnosis and multimodal retrieval, achieving state-of-the-art performance without task-specific adaptation. We further evaluated its zero-shot capabilities in three multinational reader studies involving over 1,100 clinicians. In primary care settings, AI assistance enabled general practitioners to nearly double their differential diagnostic accuracy across 98 skin conditions. In specialist settings, the model significantly outperformed board-certified dermatologists in multimodal skin cancer assessment. In collaborative workflows, AI assistance enabled non-experts to surpass unassisted experts while improving management appropriateness. Finally, we show that DermFM-Zero's latent representations are interpretable: sparse autoencoders disentangle clinically meaningful concepts without supervision; these concepts outperform predefined-vocabulary approaches and enable targeted suppression of artifact-induced biases, enhancing robustness without retraining. These findings demonstrate that a foundation model can provide effective, safe, and transparent zero-shot clinical decision support.
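The targeted-suppression idea can be illustrated with a tiny linear sparse autoencoder: encode an embedding into concept activations, zero the concepts linked to an artifact, and decode back, all without touching the base model. The weights and the choice of artifact concept below are fabricated for the example:

```python
def relu(x):
    return [max(0.0, v) for v in x]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def suppress_concepts(x, W_enc, W_dec, artifact_ids):
    """Encode into sparse concept activations, zero the artifact-linked
    concepts, then decode back -- no retraining of the base model."""
    h = relu(matvec(W_enc, x))
    h = [0.0 if i in artifact_ids else v for i, v in enumerate(h)]
    return matvec(W_dec, h)

# toy 2-concept SAE where concepts are axis-aligned (hypothetical weights)
W_enc = [[1.0, 0.0], [0.0, 1.0]]
W_dec = [[1.0, 0.0], [0.0, 1.0]]
embedding = [0.5, 2.0]            # concept 1 plays the artifact direction
cleaned = suppress_concepts(embedding, W_enc, W_dec, artifact_ids={1})
```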
https://arxiv.org/abs/2602.10624
Runtime quantification of vehicle operational intensity is essential for predictive maintenance and condition monitoring in commercial and heavy-duty fleets. Traditional metrics like mileage fail to capture mechanical burden, while unsupervised deep learning models detect statistical anomalies, typically transient surface shocks, but often conflate statistical stability with mechanical rest. We identify this as a critical blind spot: high-load steady states, such as hill climbing with heavy payloads, appear statistically normal yet impose significant drivetrain fatigue. To resolve this, we propose a Dual-Stream Architecture that fuses unsupervised learning for surface anomaly detection with macroscopic physics proxies for cumulative load estimation. This approach leverages low-frequency sensor data to generate a multi-dimensional health vector, distinguishing between dynamic hazards and sustained mechanical effort. Validated on a RISC-V embedded platform, the architecture demonstrates low computational overhead, enabling comprehensive, edge-based health monitoring on resource-constrained ECUs without the latency or bandwidth costs of cloud-based monitoring.
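The dual-stream idea reduces to maintaining two statistics over low-frequency sensor samples: a shock counter for transient anomalies and a physics proxy for sustained load. A toy sketch, with field layout, units, and thresholds invented for illustration:

```python
def health_vector(samples, dt=1.0, g=9.81, shock_thresh=3.0):
    """samples: (vertical_accel_ms2, speed_mps, road_grade, payload_kg).
    Stream 1 counts transient surface shocks (statistical anomalies);
    stream 2 accumulates a physics proxy for drivetrain work against
    gravity, which stays visible even when the signal is statistically
    'calm' -- the blind spot described above."""
    shocks = sum(1 for a, _, _, _ in samples if abs(a) > shock_thresh)
    load_j = sum(m * g * grade * v * dt for _, v, grade, m in samples)
    return {"surface_shocks": shocks, "cumulative_load_j": load_j}

# steady heavy climb: statistically quiet, mechanically demanding
climb = [(0.2, 10.0, 0.08, 20000.0)] * 5
# flat road with one pothole impact: statistically loud, low net load
pothole = [(0.1, 15.0, 0.0, 20000.0), (6.0, 15.0, 0.0, 20000.0)]
hv_climb = health_vector(climb)
hv_pothole = health_vector(pothole)
```

The climb yields zero shocks but a large cumulative load, while the pothole yields the opposite, exactly the two failure signatures a single anomaly score would conflate.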
https://arxiv.org/abs/2602.10432
Recent approaches in music generation rely on disentangled representations, often labeled as structure and timbre or local and global, to enable controllable synthesis. Yet the underlying properties of these embeddings remain underexplored. In this work, we evaluate such disentangled representations in a set of music audio models for controllable generation using a probing-based framework that goes beyond standard downstream tasks. The selected models reflect diverse unsupervised disentanglement strategies, including inductive biases, data augmentations, adversarial objectives, and staged training procedures. We further isolate specific strategies to analyze their effect. Our analysis spans four key axes: informativeness, equivariance, invariance, and disentanglement, which are assessed across datasets, tasks, and controlled transformations. Our findings reveal inconsistencies between intended and actual semantics of the embeddings, suggesting that current strategies fall short of producing truly disentangled representations, and prompting a re-examination of how controllability is approached in music generation.
https://arxiv.org/abs/2602.10058
Reinforcement learning necessitates meticulous reward shaping by specialists to elicit target behaviors, while imitation learning relies on costly task-specific data. In contrast, unsupervised skill discovery can potentially reduce these burdens by learning a diverse repertoire of useful skills driven by intrinsic motivation. However, existing methods exhibit two key limitations: they typically rely on a single policy to master a versatile repertoire of behaviors without modeling the shared structure or distinctions among them, which results in low learning efficiency; moreover, they are susceptible to reward hacking, where the reward signal increases and converges rapidly while the learned skills display insufficient actual diversity. In this work, we introduce an Orthogonal Mixture-of-Experts (OMoE) architecture that prevents diverse behaviors from collapsing into overlapping representations, enabling a single policy to master a wide spectrum of locomotion skills. In addition, we design a multi-discriminator framework in which different discriminators operate on distinct observation spaces, effectively mitigating reward hacking. We evaluated our method on the 12-DOF Unitree A1 quadruped robot, demonstrating a diverse set of locomotion skills. Our experiments demonstrate that the proposed framework boosts training efficiency and yields an 18.3% expansion in state-space coverage compared to the baseline.
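One common way to realize such an orthogonality constraint is to penalize the off-diagonal entries of the experts' Gram matrix, so that no two experts encode the same behavior direction. This sketch is our illustration of that idea, not the paper's exact objective:

```python
import math

def orthogonality_penalty(expert_outputs):
    """Normalize each expert's output vector, then sum the squared
    off-diagonal Gram-matrix entries. Zero means the experts occupy
    mutually orthogonal (non-overlapping) directions."""
    normed = []
    for v in expert_outputs:
        norm = math.sqrt(sum(a * a for a in v))
        normed.append([a / norm for a in v])
    penalty = 0.0
    for i in range(len(normed)):
        for j in range(len(normed)):
            if i != j:
                dot = sum(a * b for a, b in zip(normed[i], normed[j]))
                penalty += dot * dot
    return penalty

orthogonal = orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]])  # distinct skills
collapsed = orthogonality_penalty([[1.0, 0.0], [2.0, 0.0]])   # overlapping
```

Adding such a penalty to the training loss pushes the mixture away from the collapsed case, where two experts would redundantly represent one skill.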
https://arxiv.org/abs/2602.09767
Test-time adaptation (TTA) for large language models (LLMs) updates model parameters at inference time using signals available at deployment. This paper focuses on a common yet under-explored regime: unsupervised, sample-specific TTA, where the model adapts independently for each prompt using only the prompt itself, without gold answers or external supervision. Although appealing, naive unsupervised TTA with a fixed, handcrafted learning rate can be unstable: updates may overfit to prompt-specific statistics, drift from the desired answer distribution, and ultimately degrade generation quality. This failure mode is not surprising, as in this case TTA must adapt to a single prompt within only a few gradient steps, unlike standard training that averages updates over large datasets and long optimization horizons. Therefore, we propose layer-wise dynamic test-time adaptation, a framework which explicitly modulates TTA strength as a function of prompt representation, LLM structure and adaptation step. In our setting, TTA updates only LoRA parameters, and a lightweight hypernetwork predicts per-layer, per-step learning-rate multipliers, enabling fine-grained control. Experiments across various datasets and LLMs consistently show that our method substantially strengthens TTA by learning effective scaling patterns over adaptation steps and transformer layer projections, improving stability while delivering better performance.
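A toy sketch of the per-layer, per-step modulation: a hand-written multiplier schedule stands in for the learned hypernetwork, and the "LoRA parameters" are bare lists of floats. All names and the schedule itself are illustrative assumptions:

```python
def predict_multipliers(n_layers, step, prompt_norm):
    """Stand-in for the lightweight hypernetwork: here, deeper layers,
    later steps, and larger prompt statistics all damp the update.
    (The real multipliers are learned, not fixed like this.)"""
    scale = 1.0 / ((step + 1) * max(prompt_norm, 1.0))
    return [scale / (layer + 1) for layer in range(n_layers)]

def tta_step(lora_params, grads, base_lr, multipliers):
    """Apply one sample-specific update to LoRA parameters only,
    with the base learning rate scaled per layer."""
    return [[w - base_lr * m * g for w, g in zip(ws, gs)]
            for ws, gs, m in zip(lora_params, grads, multipliers)]

lora = [[1.0], [1.0]]            # two layers, one LoRA weight each (toy)
grads = [[1.0], [1.0]]
mults = predict_multipliers(n_layers=2, step=0, prompt_norm=1.0)
updated = tta_step(lora, grads, base_lr=0.1, multipliers=mults)
```

The point of the design is that a single fixed learning rate is replaced by a function of (layer, step, prompt), which is what keeps few-step, single-prompt adaptation from drifting.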
https://arxiv.org/abs/2602.09719
Spectral clustering is known as a powerful technique in unsupervised data analysis. The vast majority of approaches to spectral clustering are driven by a single modality, leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of spectral clustering from a single-modal to a multi-modal regime. In particular, we propose Neural Tangent Kernel Spectral Clustering, which leverages cross-modal alignment in pre-trained vision-language models. By anchoring the neural tangent kernel with positive nouns, i.e., those semantically close to the images of interest, we formulate the affinity between images as a coupling of their visual proximity and semantic overlap. We show that this formulation amplifies within-cluster connections while suppressing spurious ones across clusters, hence encouraging block-diagonal structures. In addition, we present a regularized affinity diffusion mechanism that adaptively ensembles the affinity matrices induced by different prompts. Extensive experiments on 16 benchmarks, including classical, large-scale, fine-grained and domain-shifted datasets, show that our method consistently outperforms the state-of-the-art by a large margin.
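The coupled affinity can be sketched directly from its description: visual proximity multiplied by semantic overlap, where each image's semantics are its similarities to a set of anchor nouns. The embeddings and noun similarities below are invented for illustration:

```python
import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def multimodal_affinity(visual, semantic):
    """A[i][j] couples visual proximity with semantic overlap, where
    semantic[i] holds similarities of image i to K anchor nouns."""
    n = len(visual)
    return [[cos_sim(visual[i], visual[j]) *
             sum(a * b for a, b in zip(semantic[i], semantic[j]))
             for j in range(n)] for i in range(n)]

# toy setup: images 0 and 1 share both appearance and nouns; image 2 differs
visual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
semantic = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
A = multimodal_affinity(visual, semantic)
```

Because both factors must be large for a strong edge, cross-cluster pairs that happen to look alike but share no anchor nouns (or vice versa) get suppressed, which is what pushes the matrix toward block-diagonal structure.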
https://arxiv.org/abs/2602.09586
Unsupervised industrial anomaly detection (UAD) is essential for modern manufacturing inspection, where defect samples are scarce and reliable detection is required. In this paper, we propose HLGFA, a high-low resolution guided feature alignment framework that learns normality by modeling cross-resolution feature consistency between high-resolution and low-resolution representations of normal samples, instead of relying on pixel-level reconstruction. Dual-resolution inputs are processed by a shared frozen backbone to extract multi-level features, and high-resolution representations are decomposed into structure and detail priors to guide the refinement of low-resolution features through conditional modulation and gated residual correction. During inference, anomalies are naturally identified as regions where cross-resolution alignment breaks down. In addition, a noise-aware data augmentation strategy is introduced to suppress nuisance-induced responses commonly observed in industrial environments. Extensive experiments on standard benchmarks demonstrate the effectiveness of HLGFA, achieving 97.9% pixel-level AUROC and 97.5% image-level AUROC on the MVTec AD dataset, outperforming representative reconstruction-based and feature-based methods.
https://arxiv.org/abs/2602.09524
Due to the scarcity of part-of-speech annotated data, existing studies on low-resource languages typically adopt unsupervised approaches for POS tagging. Among these, POS tag projection via word alignment transfers POS tags from a high-resource source language to a low-resource target language based on parallel corpora, making it particularly suitable for low-resource settings. However, this approach relies heavily on parallel corpora, which are often unavailable for many low-resource languages. To overcome this limitation, we propose a fully unsupervised cross-lingual part-of-speech (POS) tagging framework that relies solely on monolingual corpora by leveraging an unsupervised neural machine translation (UNMT) system. The UNMT system first translates sentences from a high-resource language into a low-resource one, thereby constructing pseudo-parallel sentence pairs. Then, we train a POS tagger for the target language following the standard projection procedure based on word alignments. Moreover, we propose a multi-source projection technique to calibrate the projected POS tags on the target side, enabling a more effective POS tagger to be trained. We evaluate our framework on 28 language pairs, covering four source languages (English, German, Spanish and French) and seven target languages (Afrikaans, Basque, Finnish, Indonesian, Lithuanian, Portuguese and Turkish). Experimental results show that our method can achieve performance comparable to the baseline cross-lingual POS tagger trained with parallel sentence pairs, and even exceeds it for certain target languages. Furthermore, our proposed multi-source projection technique further boosts performance, yielding an average improvement of 1.3% over previous methods.
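The projection and multi-source calibration steps can be sketched as follows. The alignments and tags are toy values, and majority voting is one plausible calibration rule, not necessarily the paper's exact one:

```python
from collections import Counter

def project_tags(alignments, src_tags, tgt_len):
    """Copy each aligned source token's POS tag onto its target token.
    alignments: list of (src_index, tgt_index) word-alignment links."""
    tags = [None] * tgt_len
    for s, t in alignments:
        tags[t] = src_tags[s]
    return tags

def multi_source_vote(projections):
    """Calibrate per-position tags by majority vote across the
    projections obtained from several source languages."""
    voted = []
    for position_tags in zip(*projections):
        votes = [t for t in position_tags if t is not None]
        voted.append(Counter(votes).most_common(1)[0][0] if votes else None)
    return voted

# pseudo-parallel pairs from three hypothetical source languages
from_en = project_tags([(0, 0), (1, 1)], ["DET", "NOUN"], 2)
from_de = project_tags([(0, 0), (1, 1)], ["DET", "NOUN"], 2)
from_es = project_tags([(0, 0), (1, 1)], ["PRON", "NOUN"], 2)
consensus = multi_source_vote([from_en, from_de, from_es])
```

The calibrated tags then serve as (noisy) supervision for training the target-language tagger.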
https://arxiv.org/abs/2602.09366
This work presents the largest curation of Southern Resident Killer Whale (SRKW) acoustic data to date, which also contains other marine mammals from their environment. We systematically search all available public archival hydrophone data within the SRKW habitat (over 30 years of audio data). The search uses a weakly-supervised, positive-unlabelled active learning strategy to identify all instances of marine mammals. The resulting transformer-based detectors outperform state-of-the-art detectors on DEEPAL, DCLDE-2026, and two newly introduced expert-annotated datasets in terms of accuracy, energy efficiency, and speed. The detection model has a specificity of 0-28.8% at 95% sensitivity. Our multiclass species classifier obtains a top-1 accuracy of 42.1% (11 train classes, 4 test classes) and our ecotype classifier obtains a top-1 accuracy of 43.0% (4 train classes, 5 test classes) on the DCLDE-2026 dataset. The curation yields 919 hours of SRKW data, 230 hours of Bigg's orca data, 1374 hours of orca data from unlabelled ecotypes, 1501 hours of humpback data, 88 hours of sea lion data, 246 hours of Pacific white-sided dolphin data, and over 784 hours of unspecified marine mammal data. This SRKW dataset is larger than DCLDE-2026, Ocean Networks Canada, and OrcaSound combined. The curated species labels are available under a CC-BY 4.0 license, and the corresponding audio data are available under the licenses of the original owners. The comprehensive nature of this dataset makes it suitable for unsupervised machine translation, habitat usage surveys, and conservation endeavours for this critically endangered ecotype.
https://arxiv.org/abs/2602.09295
As a fundamental data mining task, unsupervised time series anomaly detection (TSAD) aims to build a model for identifying abnormal timestamps without assuming the availability of annotations. A key challenge in unsupervised TSAD is that many anomalies are too subtle to exhibit detectable deviation in any single view (e.g., the time domain), and instead manifest as inconsistencies across multiple views such as time, frequency, and a mixture of resolutions. However, most cross-view methods rely on feature or score fusion and do not enforce analysis-synthesis consistency, meaning the frequency branch is not required to reconstruct the time signal through an inverse transform, and vice versa. In this paper, we present Learnable Fusion of Tri-view Tokens (LEFT), a unified unsupervised TSAD framework that models anomalies as inconsistencies across complementary representations. LEFT learns feature tokens from three views of the same input time series: frequency-domain tokens that embed periodicity information, time-domain tokens that capture local dynamics, and multi-scale tokens that learn abnormal patterns at varying time series granularities. By learning a set of adaptive Nyquist-constrained spectral filters, the original time series is rescaled into multiple resolutions and then encoded, allowing the multi-scale tokens to complement the extracted frequency- and time-domain information. When generating the fused representation, we introduce a novel objective that reconstructs fine-grained targets from coarser multi-scale structure, and put forward a time-frequency cycle consistency constraint to explicitly regularize cross-view agreement. Experiments on real-world benchmarks show that LEFT yields the best detection accuracy against SOTA baselines, while achieving a 5x reduction in FLOPs and an 8x speed-up in training.
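The analysis-synthesis consistency idea can be illustrated with a plain DFT round trip: keep only the strongest frequency components, synthesize the time signal back through the inverse transform, and treat cross-view disagreement as an anomaly signal. This is a deliberate simplification of the paper's learned constraint:

```python
import cmath, math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def cycle_residual(x, keep):
    """Analysis (DFT), sparsify to the `keep` strongest components,
    synthesis (inverse DFT); the residual measures how badly the
    frequency view fails to reproduce the time view."""
    X = dft(x)
    top = set(sorted(range(len(X)), key=lambda k: -abs(X[k]))[:keep])
    recon = idft([X[k] if k in top else 0.0 for k in range(len(X))])
    return sum((a - b) ** 2 for a, b in zip(x, recon))

smooth = [math.sin(2 * math.pi * t / 8) for t in range(8)]  # periodic window
spiky = [0.0] * 8
spiky[3] = 1.0                                              # transient anomaly
```

A clean periodic window round-trips almost perfectly through a sparse spectrum, while a transient spike cannot, so the residual spikes exactly where the views disagree.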
https://arxiv.org/abs/2602.08638
Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent paradigm for enhancing the capabilities (i.e., long-context) of Large Language Models (LLMs). However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming to obtain. In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotation or teacher-model supervision. Specifically, we first replace a few paragraphs in a long document with special placeholders. LLMs are then trained through reinforcement learning to reconstruct the document by correctly identifying and sequencing the missing paragraphs from a set of candidate options. This training paradigm enables the model to capture global narrative coherence, significantly boosting long-context performance. We validate the effectiveness of our method on two widely used benchmarks, RULER and LongBench v2. While acquiring noticeable gains on RULER, our method also achieves a reasonable improvement on LongBench v2 without any manually curated long-context QA data. Furthermore, we conduct extensive ablation studies to analyze the impact of reward design, data curation strategies, training schemes, and data scaling on model performance. We publicly release our code, data, and models.
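The reward for this pretext task is verifiable without any human labels, because the source document itself defines the correct placeholder filling. A sketch with per-position partial credit (one plausible shaping; the paper ablates several reward designs):

```python
def reconstruction_reward(predicted_order, true_order):
    """Verifiable reward for the paragraph-reconstruction pretext task:
    the original document supplies the ground truth, so no teacher
    model or annotator is needed. Partial credit per correct slot."""
    if len(predicted_order) != len(true_order):
        return 0.0
    correct = sum(p == t for p, t in zip(predicted_order, true_order))
    return correct / len(true_order)

# placeholders [_1, _2, _3] must be filled with candidate paragraphs B, A, C
true_order = ["B", "A", "C"]
perfect = reconstruction_reward(["B", "A", "C"], true_order)
partial = reconstruction_reward(["B", "C", "A"], true_order)
```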
https://arxiv.org/abs/2602.08237
Reliable foreign-object anomaly detection and pixel-level localization in conveyor-belt coal scenes are essential for safe and intelligent mining operations. This task is particularly challenging due to the highly unstructured environment: coal and gangue are randomly piled, backgrounds are complex and variable, and foreign objects often exhibit low contrast, deformation, and occlusion, coupling them tightly with their surroundings. These characteristics weaken the stability and regularity assumptions that many anomaly detection methods rely on in structured industrial settings, leading to notable performance degradation. To support evaluation and comparison in this setting, we construct CoalAD, a benchmark for unsupervised foreign-object anomaly detection with pixel-level localization in coal-stream scenes. We further propose a complementary-cue collaborative perception framework that extracts and fuses complementary anomaly evidence from three perspectives: object-level semantic composition modeling, semantic-attribution-based global deviation analysis, and fine-grained texture matching. The fused outputs provide robust image-level anomaly scoring and accurate pixel-level localization. Experiments on CoalAD demonstrate that our method outperforms widely used baselines across the evaluated image-level and pixel-level metrics, and ablation studies validate the contribution of each component. The code is available at this https URL.
https://arxiv.org/abs/2602.07694
Recent self-supervised Vision Transformers (ViTs), such as DINOv3, provide rich feature representations for dense vision tasks. This study investigates the intrinsic few-shot semantic segmentation (FSS) capabilities of frozen DINOv3 features through a training-free baseline, FSSDINO, utilizing class-specific prototypes and Gram-matrix refinement. Our results across binary, multi-class, and cross-domain (CDFSS) benchmarks demonstrate that this minimal approach, applied to the final backbone layer, is highly competitive with specialized methods involving complex decoders or test-time adaptation. Crucially, we conduct an Oracle-guided layer analysis, identifying a significant performance gap between the standard last-layer features and globally optimal intermediate representations. We reveal a "Safest vs. Optimal" dilemma: while the Oracle proves that higher performance is attainable, matching the results of compute-intensive adaptation methods, current unsupervised and support-guided selection metrics consistently yield lower performance than the last-layer baseline. This characterizes a "Semantic Selection Gap" in foundation models: a disconnect where traditional heuristics fail to reliably identify high-fidelity features. Our work establishes the last layer as a deceptively strong baseline and provides a rigorous diagnostic of the latent semantic potential of frozen DINOv3 features. The code is publicly available at this https URL.
https://arxiv.org/abs/2602.07550
Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.
https://arxiv.org/abs/2602.07544
Large language models demonstrate limited capability in proficiency-controlled sentence simplification, particularly when simplifying across large readability gaps. We propose a framework that decomposes complex simplifications into manageable steps through dynamic path planning, semantic-aware exemplar selection, and chain-of-thought generation with conversation history for coherent reasoning. Evaluation on five languages across two benchmarks shows our approach improves simplification effectiveness while reducing computational steps by 22-42%. Human evaluation confirms the fundamental trade-off between simplification effectiveness and meaning preservation. Notably, even human annotators struggle to agree on semantic preservation judgments, highlighting the inherent complexity of this task. Our work shows that while step-by-step simplification improves control, preserving semantic fidelity during extensive simplification remains an open challenge.
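The path-planning component can be sketched as splitting one large readability jump into intermediate targets. The CEFR-style level scale and the fixed step size below are illustrative assumptions; the abstract does not specify either:

```python
# hypothetical readability scale, hardest to easiest when read right-to-left
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def plan_path(src, tgt, max_step=2):
    """Split a large readability gap into intermediate target levels so
    each simplification step spans at most max_step levels."""
    i, j = LEVELS.index(src), LEVELS.index(tgt)
    if j >= i:
        raise ValueError("target must be simpler than source")
    path, cur = [], i
    while cur > j:
        cur = max(cur - max_step, j)
        path.append(LEVELS[cur])
    return path

print(plan_path("C2", "A2"))  # → ['B2', 'A2']
```

Each intermediate level would then drive one round of exemplar selection and chain-of-thought generation, with the conversation history carrying earlier steps forward.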
https://arxiv.org/abs/2602.07499
Fully unsupervised segmentation pipelines naively seek the most salient object, if one is present. As a result, most of the methods reported in the literature deliver non-deterministic partitions that are sensitive to initialization, seed order, and threshold heuristics. We propose PANC, a weakly supervised spectral segmentation framework that uses a minimal set of annotated visual tokens to produce stable, controllable, and reproducible object masks. Building on the TokenCut approach, we augment the token-token affinity graph with a handful of priors coupled to anchor nodes. By manipulating the graph topology, we bias the spectral eigenspace toward partitions that are consistent with the annotations. Our approach preserves the global grouping enforced by dense self-supervised visual features, trading a handful of annotated tokens for significant gains in reproducibility, user control, and segmentation quality. Using 5 to 30 annotations per dataset, our training-free method achieves state-of-the-art performance among weakly supervised and unsupervised approaches on standard benchmarks (e.g., DUTS-TE, ECSSD, MS COCO). In particular, it excels in domains where dense labels are costly or intra-class differences are subtle. We report strong and reliable results on homogeneous, fine-grained, and texture-limited domains, achieving 96.8% (+14.43% over SotA), 78.0% (+0.2%), and 78.8% (+0.37%) average mean intersection-over-union (mIoU) on CrackForest (CFD), CUB-200-2011, and HAM10000 datasets, respectively. For multi-object benchmarks, the framework showcases explicit, user-controllable semantic segmentation.
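An anchor-augmented spectral cut in the spirit of this description can be sketched as follows. The anchor weighting, the median threshold, and the use of the unnormalized Laplacian are assumptions for illustration; TokenCut itself operates on normalized affinities of self-supervised ViT features.

```python
import numpy as np

def anchored_spectral_cut(affinity, fg_idx, bg_idx, anchor_w=5.0):
    """Bias a TokenCut-style spectral partition with annotated tokens.

    Two anchor nodes (foreground / background) are appended to the token
    graph with strong edges to their annotated tokens; the Fiedler vector
    of the augmented Laplacian then yields a partition consistent with
    the annotations instead of an arbitrary salient split.
    """
    n = affinity.shape[0]
    A = np.zeros((n + 2, n + 2))
    A[:n, :n] = affinity
    A[n, fg_idx] = A[fg_idx, n] = anchor_w          # foreground anchor edges
    A[n + 1, bg_idx] = A[bg_idx, n + 1] = anchor_w  # background anchor edges
    L = np.diag(A.sum(axis=1)) - A                  # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    fiedler = vecs[:, 1][:n]                        # second-smallest eigenvector
    labels = (fiedler > np.median(fiedler)).astype(int)
    if labels[fg_idx].mean() < 0.5:                 # orient: annotated fg -> 1
        labels = 1 - labels
    return labels

# toy graph: two token clusters joined by weak edges
A = np.full((6, 6), 0.01)
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)
labels = anchored_spectral_cut(A, fg_idx=[0], bg_idx=[4])
print(labels)  # → [1 1 1 0 0 0]
```

Because the anchors only add edges, the global grouping encoded in the original affinities is preserved; the annotations merely pin which side of the cut each labeled token falls on.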
https://arxiv.org/abs/2602.06912
Inferring the 3D structure from a single image, particularly in occluded regions, remains a fundamental yet unsolved challenge in vision-centric autonomous driving. Existing unsupervised approaches typically train a neural radiance field and treat the network outputs as occupancy probabilities during evaluation, overlooking the inconsistency between training and evaluation protocols. Moreover, the prevalent use of 2D ground truth fails to reveal the inherent ambiguity in occluded regions caused by insufficient geometric constraints. To address these issues, this paper presents a reformulated benchmark for unsupervised monocular 3D occupancy prediction. We first interpret the variables involved in the volume rendering process and identify the most physically consistent representation of the occupancy probability. Building on these analyses, we improve existing evaluation protocols by aligning the newly identified representation with voxel-wise 3D occupancy ground truth, thereby enabling unsupervised methods to be evaluated in a manner consistent with that of supervised approaches. Additionally, to impose explicit constraints in occluded regions, we introduce an occlusion-aware polarization mechanism that incorporates multi-view visual cues to enhance discrimination between occupied and free spaces in these regions. Extensive experiments demonstrate that our approach not only significantly outperforms existing unsupervised approaches but also matches the performance of supervised ones. Our source code and evaluation protocol will be made available upon publication.
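The volume-rendering variables the abstract refers to can be written out explicitly. Which quantity the paper ultimately identifies as the occupancy probability is not stated in the abstract; this sketch simply contrasts the candidates and shows why the rendering weight alone is a poor proxy in occluded regions.

```python
import numpy as np

def ray_occupancy(sigma, delta):
    """Volume-rendering quantities along one ray.

    sigma: per-sample densities; delta: sample spacings.
    alpha_i = 1 - exp(-sigma_i * delta_i)   per-cell opacity, bounded in [0, 1]
    T_i     = prod_{j<i} (1 - alpha_j)      transmittance reaching sample i
    w_i     = T_i * alpha_i                 rendering weight used for compositing
    """
    alpha = 1.0 - np.exp(-sigma * delta)
    T = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    w = T * alpha
    return alpha, T, w

sigma = np.array([0.0, 5.0, 5.0])   # free space, then two occupied cells
alpha, T, w = ray_occupancy(sigma, delta=np.ones(3))
print(np.round(alpha, 3))  # opacity stays near 1 even for the occluded cell
print(np.round(w, 3))      # weights collapse behind the first occluder
```

The toy ray makes the train/evaluation inconsistency concrete: the second occupied cell has opacity near 1 but a rendering weight near 0, so any protocol that reads weights (or raw densities) as occupancy probabilities will systematically misjudge occluded space.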
https://arxiv.org/abs/2602.06488