Geochemical anomaly detection plays a critical role in mineral exploration, as deviations from regional geochemical baselines may indicate mineralization. Existing studies suffer from two key limitations: (1) single-region scenarios, which limit model generalizability; and (2) proprietary datasets, which make results irreproducible. In this work, we introduce \textbf{GeoChemAD}, an open-source benchmark dataset compiled from government-led geological surveys, covering multiple regions, sampling sources, and target elements. The dataset comprises eight subsets representing diverse spatial scales and sampling conditions. To establish strong baselines, we reproduce and benchmark a range of unsupervised anomaly detection methods, including statistical, generative, and transformer-based approaches. Furthermore, we propose \textbf{GeoChemFormer}, a transformer-based framework that leverages self-supervised pretraining to learn target-element-aware geochemical representations for spatial samples. Extensive experiments demonstrate that GeoChemFormer consistently achieves superior and robust performance across all eight subsets, outperforming existing unsupervised methods in both anomaly detection accuracy and generalization capability. The proposed dataset and framework provide a foundation for reproducible research and future development in this direction.
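The simplest class of statistical baselines referred to here can be sketched in a few lines: a median/MAD robust z-score flags samples that deviate from the regional baseline. This is a generic illustration, not the released code; the concentration values and the cutoff of 3 are illustrative.

```python
import numpy as np

def robust_anomaly_scores(concentrations):
    """Robust z-scores via median/MAD: a classical statistical
    baseline for flagging geochemical anomalies."""
    x = np.asarray(concentrations, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    # 1.4826 scales the MAD to the standard deviation under normality
    return np.abs(x - med) / (1.4826 * mad)

# Background samples around 50 ppm, one anomalous high at 400 ppm
ppm = [48.0, 52.0, 50.0, 49.0, 51.0, 400.0, 47.0, 53.0]
scores = robust_anomaly_scores(ppm)
flags = scores > 3.0  # common robust-z cutoff
```

Because the median and MAD are insensitive to the anomalies themselves, such baselines remain usable even when the anomalous fraction is non-negligible, which is why they are a natural reference point for learned detectors.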
https://arxiv.org/abs/2603.13068
Multivariate time series anomalies often manifest as shifts in cross-channel dependencies rather than simple amplitude excursions. In autonomous driving, for instance, a steering command might be internally consistent but decouple from the resulting lateral acceleration. Residual-based detectors can miss such anomalies when flexible sequence models still reconstruct signals plausibly despite altered coordination. We introduce AxonAD, an unsupervised detector that treats multi-head attention query evolution as a short-horizon predictable process. A gradient-updated reconstruction pathway is coupled with a history-only predictor that forecasts future query vectors from past context. This is trained via a masked predictor-target objective against an exponential moving average (EMA) target encoder. At inference, reconstruction error is combined with a tail-aggregated query mismatch score, which measures cosine deviation between predicted and target queries on recent timesteps. This dual approach provides sensitivity to structural dependency shifts while retaining amplitude-level detection. On proprietary in-vehicle telemetry with interval annotations and on the TSB-AD multivariate suite (17 datasets, 180 series) with threshold-free and range-aware metrics, AxonAD improves ranking quality and temporal localization over strong baselines. Ablations confirm that query prediction and combined scoring are the primary drivers of the observed gains. Code is available at this https URL.
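The scoring side of this idea (not the training) can be sketched with numpy: combine a reconstruction error with a cosine mismatch between predicted and target queries, aggregated over the most recent timesteps. The tail length and the mixing weight `alpha` below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def cosine_mismatch(pred_q, target_q):
    # 1 - cosine similarity per timestep (rows are query vectors)
    num = np.sum(pred_q * target_q, axis=-1)
    den = np.linalg.norm(pred_q, axis=-1) * np.linalg.norm(target_q, axis=-1)
    return 1.0 - num / np.maximum(den, 1e-12)

def anomaly_score(recon_err, pred_q, target_q, tail=8, alpha=0.5):
    """Amplitude-level term plus a structural term aggregated
    over the most recent `tail` timesteps."""
    mismatch = cosine_mismatch(pred_q, target_q)
    return recon_err + alpha * float(np.mean(mismatch[-tail:]))
```

When predicted and target queries stay aligned (normal coordination), the structural term is near zero and the score reduces to the reconstruction error; decorrelated queries push the score up even if reconstruction stays plausible.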
https://arxiv.org/abs/2603.12916
Most real-world IoT data analysis tasks, such as clustering and anomaly event detection, are unsupervised and highly susceptible to the presence of outliers. In addition to sporadic scattered outliers caused by factors such as faulty sensor readings, IoT systems often exhibit clustered outliers. These occur when multiple devices or nodes produce similar anomalous measurements, for instance, owing to localized interference, emerging security threats, or regional false alarms, forming micro-clusters. These clustered outliers can be easily mistaken for normal behavior because of their relatively high local density, thereby obscuring the detection of both scattered and contextual anomalies. To address this, we propose a novel outlier detection paradigm that leverages the natural neighboring relationships using graph structures. This facilitates multi-perspective anomaly evaluation by incorporating reference sets at both local and global scales derived from the graph. Our approach enables the effective recognition of scattered outliers without interference from clustered anomalies, while the graph structure simultaneously reflects and isolates clustered outlier groups. Extensive experiments, including comparative performance analysis, ablation studies, validation on downstream clustering tasks, and evaluation of hyperparameter sensitivity, demonstrate the efficacy of the proposed method. The source code is available at this https URL.
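The multi-perspective idea can be illustrated with a toy score: the paper's natural-neighbor graph construction is not specified here, so this sketch substitutes a kNN graph for the local reference set and the overall centroid for the global one, combining both views by rank. All of that is an assumption for illustration only.

```python
import numpy as np

def knn_outlier_scores(X, k=3):
    """Toy multi-perspective score: local view = mean distance to the
    k nearest neighbours (graph edges); global view = distance to the
    dataset centroid. The two views are combined by averaging ranks."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    local = np.sort(D, axis=1)[:, :k].mean(axis=1)
    global_ = np.linalg.norm(X - X.mean(axis=0), axis=1)
    ranks = lambda s: np.argsort(np.argsort(s)) / (len(s) - 1)
    return (ranks(local) + ranks(global_)) / 2.0
```

A dense micro-cluster far from the bulk looks normal to a purely local view (high local density) but anomalous to the global view, which is the failure mode the combined evaluation is meant to cover.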
https://arxiv.org/abs/2603.12847
This paper studies unsupervised cross-domain image retrieval (UCDIR), which aims to retrieve images of the same category across different domains without relying on labeled data. Existing methods typically utilize pseudo-labels, derived from clustering algorithms, as supervisory signals for intra-domain representation learning and cross-domain feature alignment. However, these discrete pseudo-labels often fail to provide accurate and comprehensive semantic guidance. Moreover, the alignment process frequently overlooks the entanglement between domain-specific and semantic information, leading to semantic degradation in the learned representations and ultimately impairing retrieval performance. This paper addresses these limitations by proposing a Text-Phase Synergy Network with Dual Priors (TPSNet). Specifically, we first employ CLIP to generate a set of class-specific prompts per domain, termed domain prompts, serving as a text prior that offers more precise semantic supervision. In parallel, we further introduce a phase prior, represented by domain-invariant phase features, which is integrated into the original image representations to bridge the domain distribution gaps while preserving semantic integrity. Leveraging the synergy of these dual priors, TPSNet significantly outperforms state-of-the-art methods on UCDIR benchmarks.
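The phase prior rests on a well-known property of the Fourier transform: the amplitude spectrum carries much of the domain-specific style (contrast, illumination), while the phase preserves spatial structure. A minimal sketch of extracting phase-only features follows; how these are integrated into the image representations is the paper's contribution and is not reproduced here.

```python
import numpy as np

def phase_features(img):
    """Rebuild an image from its Fourier phase with a flat (unit)
    amplitude spectrum, discarding amplitude-borne style."""
    F = np.fft.fft2(img)
    phase = np.angle(F)
    return np.real(np.fft.ifft2(np.exp(1j * phase)))
```

As a sanity check, a global contrast/brightness change (one crude stand-in for a domain shift) leaves the phase-only reconstruction unchanged, since it only rescales the amplitude spectrum and the DC term.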
https://arxiv.org/abs/2603.12711
Federated Clustering (FC) is an emerging and promising solution for exploring data distribution patterns from distributed and privacy-protected data in an unsupervised manner. Existing FC methods implicitly rely on the assumption that clients have a known number of uniformly sized clusters. However, the true number of clusters is typically unknown, and cluster sizes are naturally imbalanced in real scenarios. Furthermore, the privacy-preserving transmission constraints in federated learning inevitably reduce usable information, making the development of robust and accurate FC extremely challenging. Accordingly, we propose a novel FC framework named Fed-$k^*$-HC, which can automatically determine an optimal number of clusters $k^*$ based on the data distribution explored through hierarchical clustering. To obtain the global data distribution for $k^*$ determination, we let each client generate micro-subclusters. Their prototypes are then uploaded to the server for hierarchical merging. The density-based merging design allows exploring clusters of varying sizes and shapes, and the progressive merging process can self-terminate according to the neighboring relationships among the prototypes to determine $k^*$. Extensive experiments on diverse datasets demonstrate the FC capability of the proposed Fed-$k^*$-HC in accurately determining a proper number of clusters.
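The server-side step can be sketched as greedy single-linkage merging over the uploaded prototypes with a self-terminating distance threshold. The actual method is density-based and driven by neighboring relationships, so the `stop_dist` threshold below is an illustrative stand-in for that termination criterion.

```python
import numpy as np

def merge_prototypes(protos, stop_dist):
    """Greedy single-linkage merging of client prototypes: repeatedly
    merge the closest pair of groups until the smallest inter-group
    distance exceeds `stop_dist`; the surviving group count is k*."""
    P = np.asarray(protos, dtype=float)
    groups = [[i] for i in range(len(P))]
    while len(groups) > 1:
        best, pair = np.inf, None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                d = min(np.linalg.norm(P[i] - P[j])
                        for i in groups[a] for j in groups[b])
                if d < best:
                    best, pair = d, (a, b)
        if best > stop_dist:
            break  # self-termination: no close groups remain
        a, b = pair
        groups[a] += groups.pop(b)
    return len(groups), groups
```

With prototypes drawn from three well-separated clusters and a threshold sitting between the intra- and inter-cluster gaps, the merging stops at exactly three groups without k being specified in advance.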
https://arxiv.org/abs/2603.12684
Weakly Supervised Object Localization (WSOL) models enable joint classification and region-of-interest localization in histology images using only image-class supervision. When deployed in a target domain, distribution shift remains a major cause of performance degradation, especially when models are applied to new organs or institutions with different staining protocols and scanner characteristics. Under stronger cross-domain shifts, WSOL predictions can become biased toward dominant classes, producing highly skewed pseudo-label distributions in the target domain. Source-Free (Unsupervised) Domain Adaptation (SFDA) methods are commonly employed to address domain shift. However, because they rely on self-training, the initial bias is reinforced over training iterations, degrading both classification and localization tasks. We identify this amplification of prediction bias as a primary obstacle to the SFDA of WSOL models in histopathology. This paper introduces SFDA-DeP, a method inspired by machine unlearning that formulates SFDA as an iterative process of identifying and correcting prediction bias. It periodically identifies target images from over-predicted classes and selectively reduces the predictive confidence for uncertain (high-entropy) images, while preserving confident predictions. This process reduces the drift of decision boundaries and bias toward dominant classes. A jointly optimized pixel-level classifier further restores discriminative localization features under distribution shift. Extensive experiments on cross-organ and cross-center histopathology benchmarks (GlaS, CAMELYON-16, CAMELYON-17) with several WSOL models show that SFDA-DeP consistently improves classification and localization over state-of-the-art SFDA baselines. {\small Code: \href{this https URL}{this http URL}}
https://arxiv.org/abs/2603.12468
Understanding freely moving animal behavior is central to neuroscience, where pose estimation and behavioral understanding form the foundation for linking neural activity to natural actions. Yet both tasks still depend heavily on human annotation or unstable unsupervised pipelines, limiting scalability and reproducibility. We present BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding that requires no task-specific fine-tuning and only minimal human labeling, by guiding pretrained Vision-Language Models (VLMs) through detailed, explicit, and verifiable reasoning steps. For pose estimation, we leverage quantum-dot-grounded behavioral data and propose a multi-stage pipeline that integrates temporal, spatial, and cross-view reasoning. This design greatly reduces human annotation effort, exposes low-confidence labels through geometric checks such as reprojection error, and produces labels that can later be filtered, corrected, or used to fine-tune downstream pose models. For behavioral understanding, we propose a pipeline that integrates deep embedded clustering for over-segmented behavior discovery, VLM-based per-clip video captioning, and LLM-based reasoning to merge and semantically label behavioral segments. The behavioral pipeline can operate directly from visual information and does not require keypoints to segment behavior. Together, these components enable scalable, interpretable, and label-light analysis of multi-animal behavior.
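The reprojection-error check mentioned above is a standard geometric test: project the candidate 3-D keypoint into each view with the calibrated camera and measure the pixel distance to the proposed 2-D label. A minimal pinhole-camera sketch (all matrix values in the usage below are illustrative):

```python
import numpy as np

def reprojection_error(K, R, t, X_world, x_pix):
    """Project a 3-D keypoint with a pinhole camera K[R|t] and
    return the pixel distance to its proposed 2-D label."""
    X_cam = R @ X_world + t          # world -> camera coordinates
    uvw = K @ X_cam                  # homogeneous image coordinates
    proj = uvw[:2] / uvw[2]          # perspective division
    return float(np.linalg.norm(proj - x_pix))
```

Labels whose error exceeds a few pixels in one or more views can then be flagged as low-confidence and routed to filtering or correction.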
https://arxiv.org/abs/2603.12176
Unsupervised Concept Extraction aims to extract concepts from a single image; however, existing methods cannot extract composable intrinsic concepts. To address this, this paper introduces a new task called Compositional and Interpretable Intrinsic Concept Extraction (CI-ICE). The CI-ICE task aims to leverage diffusion-based text-to-image models to extract composable object-level and attribute-level concepts from a single image, such that the original concept can be reconstructed through the combination of these concepts. To achieve this goal, we propose a method called HyperExpress, which addresses the CI-ICE task through two core aspects. First, we propose a concept learning approach that leverages the inherent hierarchical modeling capability of hyperbolic space to achieve accurate concept disentanglement while preserving the hierarchical structure and relational dependencies among concepts; second, we introduce a concept-wise optimization method that maps the concept embedding space to maintain complex inter-concept relationships while ensuring concept composability. Our method demonstrates outstanding performance in extracting compositionally interpretable intrinsic concepts from a single image.
https://arxiv.org/abs/2603.11795
Real-world multivariate time series, particularly in critical infrastructure such as electrical power grids, are often corrupted by noise and anomalies that degrade the performance of downstream tasks. Standard data cleaning approaches often rely on disjoint strategies, which involve detecting errors with one model and imputing them with another. Such approaches can fail to capture the full joint distribution of the data and ignore prediction uncertainty. This work introduces Conditional Imputation and Noisy Data Integrity (CINDI), an unsupervised probabilistic framework designed to restore data integrity in complex time series. Unlike fragmented approaches, CINDI unifies anomaly detection and imputation into a single end-to-end system built on conditional normalizing flows. By modeling the exact conditional likelihood of the data, the framework identifies low-probability segments and iteratively samples statistically consistent replacements. This allows CINDI to efficiently reuse learned information while preserving the underlying physical and statistical properties of the system. We evaluate the framework using real-world grid loss data from a Norwegian power distribution operator, though the methodology is designed to generalize to any multivariate time series domain. The results demonstrate that CINDI yields robust performance compared to competitive baselines, offering a scalable solution for maintaining reliability in noisy environments.
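The flag-then-resample loop can be illustrated with a linear-Gaussian stand-in for the conditional model: fit a one-step conditional predictor, flag points whose conditional likelihood is low, and replace them with conditionally consistent values. CINDI itself uses conditional normalizing flows and samples replacements rather than taking the conditional mean, so everything below is a simplified sketch of the loop, not the method.

```python
import numpy as np

def fit_ar1(x):
    # Least-squares fit of x[t] = a*x[t-1] + b + eps, eps ~ N(0, s^2)
    A = np.stack([x[:-1], np.ones(len(x) - 1)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, x[1:], rcond=None)
    resid = x[1:] - (a * x[:-1] + b)
    return a, b, resid.std() + 1e-9

def clean(x, z=4.0):
    """Flag points whose conditional likelihood is low (|residual|
    beyond z standard deviations) and replace them with the
    conditional mean, so later predictions use the repaired value."""
    x = np.asarray(x, dtype=float).copy()
    a, b, s = fit_ar1(x)
    for t in range(1, len(x)):
        pred = a * x[t - 1] + b
        if abs(x[t] - pred) > z * s:
            x[t] = pred  # statistically consistent replacement
    return x
```

Repairing a point before moving on is what lets the one-step model stay usable downstream of an anomaly, a small-scale analogue of the iterative resampling described above.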
https://arxiv.org/abs/2603.11745
In the absence of sense-annotated data, word sense induction (WSI) is a compelling alternative to word sense disambiguation, particularly in low-resource or domain-specific settings. In this paper, we emphasize methodological problems in current WSI evaluation. We propose an evaluation on a SemCor-derived dataset, respecting the original corpus polysemy and frequency distributions. We assess pre-trained embeddings and clustering algorithms across parts of speech, and propose and evaluate an LLM-based WSI method for English. We evaluate data augmentation sources (LLM-generated, corpus, and lexicon), as well as semi-supervised scenarios that use Wiktionary for data augmentation, must-link constraints, and the number of clusters per lemma. We find that no unsupervised method (whether ours or previous) surpasses the strong "one cluster per lemma" heuristic (1cpl). We also show that (i) results and best systems may vary across POS, (ii) LLMs have trouble performing this task, (iii) data augmentation is beneficial, and (iv) capitalizing on Wiktionary does help, surpassing the previous SOTA system on our test set by 3.3\%. WSI is not solved and calls for a better articulation of lexicons and LLMs' lexical semantics capabilities.
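The 1cpl heuristic that unsupervised systems fail to beat is trivial to state: treat every lemma as monosemous and put all its occurrences in a single cluster. A minimal sketch (the `(lemma, context)` instance format is an illustrative assumption):

```python
from collections import defaultdict

def one_cluster_per_lemma(instances):
    """The 1cpl baseline: every occurrence of a lemma goes into one
    cluster, i.e. the lemma is treated as having a single sense."""
    clusters = defaultdict(list)
    for idx, (lemma, _context) in enumerate(instances):
        clusters[lemma].append(idx)
    return dict(clusters)
```

The baseline is strong because corpus sense distributions are highly skewed: for most lemmas one sense dominates, so clustering metrics reward the all-in-one assignment unless induced senses are genuinely accurate.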
https://arxiv.org/abs/2603.11686
Semantic correspondence is essential for handling diverse in-the-wild images lacking explicit correspondence annotations. While recent 2D foundation models offer powerful features, adapting them for unsupervised learning via nearest-neighbor pseudo-labels has key limitations: it operates locally, ignoring structural relationships, and consequently its reliance on 2D appearance fails to resolve geometric ambiguities arising from symmetries or repetitive features. In this work, we address this by reformulating pseudo-label generation as a Fused Gromov-Wasserstein (FGW) problem, which jointly optimizes inter-feature similarity and intra-structural consistency. Our framework, Shape-of-You (SoY), leverages a 3D foundation model to define this intra-structure in the geometric space, resolving the aforementioned ambiguity. However, since FGW is a computationally prohibitive quadratic problem, we approximate it through anchor-based linearization. The resulting probabilistic transport plan provides a structurally consistent but noisy supervisory signal. Thus, we introduce a soft-target loss that dynamically blends guidance from this plan with network predictions, yielding a learning framework robust to this noise. SoY achieves state-of-the-art performance on the SPair-71k and AP-10k datasets, establishing a new benchmark in semantic correspondence without explicit geometric annotations. Code is available at Shape-of-You.
https://arxiv.org/abs/2603.11618
Non-repetitive solid-state LiDAR scanning leads to an extremely sparse measurement regime for detecting airborne UAVs: a small quadrotor at 10-25 m typically produces only 1-2 returns per scan, which is far below the point densities assumed by most existing detection approaches and inadequate for robust multi-target data association. We introduce an unsupervised, LiDAR-only pipeline that addresses both detection and tracking without the need for labeled training data. The detector integrates range-adaptive DBSCAN clustering with a three-stage temporal consistency check and is benchmarked on real-world air-to-air flight data under eight different parameter configurations. The best setup attains 0.891 precision, 0.804 recall, and 0.63 m RMSE, and a systematic minPts sweep verifies that most scans contain at most 1-2 target points, directly quantifying the sparsity regime. For multi-target tracking, we compare deterministic Hungarian assignment with joint probabilistic data association (JPDA), each coupled with Interacting Multiple Model filtering, in four simulated scenarios with increasing levels of ambiguity. JPDA cuts identity switches by 64% with negligible impact on MOTA, demonstrating that probabilistic association is advantageous when UAV trajectories approach one another closely. A two-environment evaluation strategy, combining real-world detection with RTK-GPS ground truth and simulation-based tracking with identity-annotated ground truth, overcomes the limitations of GNSS-only evaluation at inter-UAV distances below 2 m.
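Range-adaptive clustering can be sketched by letting the neighborhood radius grow with sensor range before grouping, since angular beam divergence makes returns sparser at distance. The `eps0` and growth-rate values below are illustrative, and this toy region-growing version omits the minPts noise criterion and the three-stage temporal consistency check described above.

```python
import numpy as np

def adaptive_eps(points, eps0=0.3, k=0.02):
    """Per-point clustering radius that grows with range from the
    sensor (assumed at the origin)."""
    r = np.linalg.norm(points, axis=1)
    return eps0 + k * r

def cluster(points):
    """Tiny DBSCAN-style region growing with the per-point eps above;
    every point is assigned a cluster (no noise label)."""
    pts = np.asarray(points, dtype=float)
    eps = adaptive_eps(pts)
    labels = -np.ones(len(pts), dtype=int)
    cur = 0
    for i in range(len(pts)):
        if labels[i] != -1:
            continue
        stack, labels[i] = [i], cur
        while stack:
            j = stack.pop()
            d = np.linalg.norm(pts - pts[j], axis=1)
            nb = np.where((d <= np.maximum(eps[j], eps)) & (labels == -1))[0]
            labels[nb] = cur
            stack.extend(nb.tolist())
        cur += 1
    return labels
```

With a fixed eps, two returns 0.5 m apart would be treated identically at 1 m and at 20 m; the adaptive radius keeps them separate near the sensor while still grouping them at long range.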
https://arxiv.org/abs/2603.11586
Unsupervised Camouflaged Object Detection (UCOD) remains a challenging task due to the high intrinsic similarity between target objects and their surroundings, as well as the reliance on noisy pseudo-labels that hinder fine-grained texture learning. While existing refinement strategies aim to alleviate label noise, they often overlook intrinsic perceptual cues, leading to boundary overflow and structural ambiguity. In contrast, learning without pseudo-label guidance yields coarse features with significant detail loss. To address these issues, we propose a unified UCOD framework that enhances both the reliability of pseudo-labels and the fidelity of features. Our approach introduces the Multi-Cue Native Perception module, which extracts intrinsic visual priors by integrating low-level texture cues with mid-level semantics, enabling precise alignment between masks and native object information. Additionally, Pseudo-Label Evolution Fusion intelligently refines labels through teacher-student interaction and utilizes depthwise separable convolution for efficient semantic denoising. It also incorporates Spectral Tensor Attention Fusion to effectively balance semantic and structural information through compact spectral aggregation across multi-layer attention maps. Finally, Local Pseudo-Label Refinement plays a pivotal role in local detail optimization by leveraging attention diversity to restore fine textures and enhance boundary fidelity. Extensive experiments on multiple UCOD datasets demonstrate that our method achieves state-of-the-art performance, characterized by superior detail perception, robust boundary alignment, and strong generalization under complex camouflage scenarios.
https://arxiv.org/abs/2603.11521
LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. This input-output gap is typically addressed by training embedding models on paired data with contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model's potential response. Specifically, we add trainable special tokens to the LLM's vocabulary, append them to the input, and optimize them to represent the LLM's response in a fixed-length sequence. Training is guided by the LLM's own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to a 43.2% reduction in harmful content retrieval and a 29.3% improvement in reasoning capabilities for embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.
https://arxiv.org/abs/2603.10913
Learning-based real image dehazing methods have achieved notable progress, yet they still face adaptation challenges in diverse real haze scenes. These challenges mainly stem from the lack of effective unsupervised mechanisms for unlabeled data and the heavy cost of full model fine-tuning. To address these challenges, we propose the haze-to-clear text-directed loss, which leverages CLIP's cross-modal capabilities to reformulate real image dehazing as a semantic alignment problem in latent space, thereby providing explicit unsupervised cross-modal guidance in the absence of reference images. Furthermore, we introduce the Bilevel Layer-positioning LoRA (BiLaLoRA) strategy, which jointly learns the LoRA parameters and automatically searches for the injection layers, enabling targeted adaptation of critical network layers. Extensive experiments demonstrate the superiority of our method over state-of-the-art approaches on multiple real-world dehazing benchmarks. The code is publicly available at this https URL.
https://arxiv.org/abs/2603.10872
Recent advances in Vision-Language-Action (VLA) models have enabled robots to execute increasingly complex tasks. However, VLA models trained through imitation learning struggle to operate reliably in dynamic environments and often fail under Out-of-Distribution (OOD) conditions. To address this issue, we propose Robot-Conditioned Normalizing Flow (RC-NF), a real-time monitoring model for robotic anomaly detection and intervention that ensures the robot's state and the object's motion trajectory align with the task. RC-NF decouples the processing of task-aware robot and object states within the normalizing flow. It requires only positive samples for unsupervised training and calculates accurate robotic anomaly scores during inference through the probability density function. We further present LIBERO-Anomaly-10, a benchmark comprising three categories of robotic anomalies for simulation evaluation. RC-NF achieves state-of-the-art performance across all anomaly types compared to previous methods in monitoring robotic tasks. Real-world experiments demonstrate that RC-NF operates as a plug-and-play module for VLA models (e.g., pi0), providing a real-time OOD signal that enables state-level rollback or task-level replanning when necessary, with a response latency under 100 ms. These results demonstrate that RC-NF noticeably enhances the robustness and adaptability of VLA-based robotic systems in dynamic environments.
https://arxiv.org/abs/2603.11106
With the rapid advancement of AIGC technologies, image forensics faces unprecedented challenges. Traditional methods are incapable of dealing with the increasingly realistic images produced by rapidly evolving image generation techniques. To facilitate the identification of AI-generated images and the attribution of their source models, generative image watermarking and AI-generated image attribution have emerged as key research focuses in recent years. However, existing methods are model-dependent, requiring access to the generative models, and lack generality and scalability to new and unseen generators. To address these limitations, this work presents a new paradigm for AI-generated image attribution by formulating it as an instance retrieval problem instead of a conventional image classification problem. We propose an efficient model-agnostic framework, called Low-bIt-plane-based Deepfake Attribution (LIDA). The input to LIDA is produced by a Low-Bit Fingerprint Generation module, while the training involves Unsupervised Pre-Training followed by Few-Shot Attribution Adaptation. Comprehensive experiments demonstrate that LIDA achieves state-of-the-art performance for both Deepfake detection and image attribution under zero- and few-shot settings. The code is available at this https URL
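Low-bit-plane extraction itself is a one-line bitwise operation on 8-bit images; how these near-invisible residuals feed the retrieval pipeline is the paper's contribution, so the sketch below only shows the fingerprint input (the choice of two planes is illustrative).

```python
import numpy as np

def low_bit_fingerprint(img, n_planes=2):
    """Keep only the lowest `n_planes` bit planes of an 8-bit image:
    the near-invisible residual band where generator fingerprints
    tend to live."""
    img = np.asarray(img, dtype=np.uint8)
    mask = (1 << n_planes) - 1       # e.g. 0b11 for two planes
    return img & mask
```

Because the mask discards the high-order bits, two images with identical scene content but different generators can yield very different fingerprints, which is what makes instance retrieval over this representation plausible.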
https://arxiv.org/abs/2603.10583
Hyperspectral images capture vast amounts of high-dimensional spectral information about a scene, making labeling an intensive task that is resistant to out-of-the-box statistical methods. Unsupervised learning of clusters allows for automated segmentation of the scene, enabling a more rapid understanding of the image. Partitioning the spectral information contained within the data via dictionary learning in Wasserstein space has proven an effective method for unsupervised clustering. However, this approach requires balancing the spectral profiles of the data, which blurs the classes and sacrifices robustness to outliers and noise. In this paper, we suggest improving this approach by utilizing unbalanced Wasserstein barycenters to learn a lower-dimensional representation of the underlying data. Deploying spectral clustering on the learned representation results in an effective approach for the unsupervised learning of labels.
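For intuition, the balanced one-dimensional W2 barycenter has a closed form: average the quantile functions of the input distributions. The unbalanced variant proposed here relaxes the mass-balancing constraint; the sketch below only illustrates the balanced baseline being improved upon, with equal weights assumed.

```python
import numpy as np

def wasserstein_barycenter_1d(samples_list, n_quantiles=100):
    """Balanced W2 barycenter of 1-D empirical distributions with
    equal weights: average their quantile functions (exact in 1-D)."""
    qs = np.linspace(0.0, 1.0, n_quantiles)
    quantiles = [np.quantile(s, qs) for s in samples_list]
    return np.mean(quantiles, axis=0)
```

Averaging quantiles is exactly the "balancing" step whose side effects (blurred classes, outlier sensitivity) motivate the move to unbalanced barycenters: a single outlying profile shifts the averaged quantiles directly.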
https://arxiv.org/abs/2603.10132
In interventional radiology, Cone-Beam Computed Tomography (CBCT) is a helpful imaging modality that provides guidance to practitioners during minimally invasive procedures. CBCT differs from traditional Computed Tomography (CT) due to its limited reconstructed field of view, specific artefacts, and the intra-arterial administration of contrast medium. While CT benefits from abundant publicly available annotated datasets, interventional CBCT data remain scarce and largely unannotated, with existing datasets focused primarily on radiotherapy applications. To address this limitation, we leverage a proprietary collection of unannotated interventional CBCT scans in conjunction with annotated CT data, employing domain adaptation techniques to bridge the modality gap and enhance liver segmentation performance on CBCT. We propose a novel unsupervised domain adaptation (UDA) framework based on the formalism of Margin Disparity Discrepancy (MDD), which improves target domain performance through a reformulation of the original MDD optimization framework. Experimental results on CT and CBCT datasets for liver segmentation demonstrate that our method achieves state-of-the-art performance in UDA, as well as in the few-shot setting.
https://arxiv.org/abs/2603.09932
Action-conditioned video models offer a promising path to building general-purpose robot simulators that can improve directly from data. Yet, despite training on large-scale robot datasets, current state-of-the-art video models still struggle to predict physically consistent robot-object interactions that are crucial in robotic manipulation. To close this gap, we present PlayWorld, a simple, scalable, and fully autonomous pipeline for training high-fidelity video world simulators from interaction experience. In contrast to prior approaches that rely on success-biased human demonstrations, PlayWorld is the first system capable of learning entirely from unsupervised robot self-play, enabling naturally scalable data collection while capturing complex, long-tailed physical interactions essential for modeling realistic object dynamics. Experiments across diverse manipulation tasks show that PlayWorld generates high-quality, physically consistent predictions for contact-rich interactions that are not captured by world models trained on human-collected data. We further demonstrate the versatility of PlayWorld in enabling fine-grained failure prediction and policy evaluation, with up to 40% improvements over human-collected data. Finally, we demonstrate how PlayWorld enables reinforcement learning in the world model, improving policy performance by 65% in success rates when deployed in the real world.
https://arxiv.org/abs/2603.09030