In modern manufacturing, Visual Anomaly Detection (VAD) is essential for automated inspection and consistent product quality. Yet, increasingly dynamic and flexible production environments introduce key challenges: First, frequent product changes in small-batch and on-demand manufacturing require rapid model updates. Second, legacy edge hardware lacks the resources to train and run large AI models. Finally, both anomalous and normal training data are often scarce, particularly for newly introduced product variations. We investigate on-device continual learning for unsupervised VAD with localization, extending PatchCore with online learning for real-world industrial scenarios. The proposed method leverages a lightweight feature extractor and an incremental coreset update mechanism based on k-center selection, enabling rapid, memory-efficient adaptation from limited data while eliminating costly cloud retraining. Evaluations on an industrial use case are conducted using a testbed designed to emulate flexible production with frequent variant changes in a controlled environment. Our method achieves a 12% AUROC improvement over the baseline, an 80% reduction in memory usage, and faster training compared to batch retraining. These results confirm that our method delivers accurate, resource-efficient, and adaptive VAD suitable for dynamic and smart manufacturing.
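The incremental coreset update based on k-center selection can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names and the merge-then-reselect strategy for the incremental update are assumptions.

```python
import numpy as np

def k_center_greedy(features: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Greedy k-center selection: repeatedly pick the point farthest from
    the current coreset, giving good coverage of the feature space."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    selected = [int(rng.integers(n))]
    # distance of every point to its nearest selected center
    d = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(d))
        selected.append(idx)
        d = np.minimum(d, np.linalg.norm(features - features[idx], axis=1))
    return np.array(selected)

def incremental_update(coreset: np.ndarray, new_feats: np.ndarray, k: int) -> np.ndarray:
    """Merge an existing coreset with patch features from a new product
    variant and re-select k centers, keeping memory bounded."""
    pool = np.vstack([coreset, new_feats])
    return pool[k_center_greedy(pool, k)]
```

Because only the current coreset and the new batch are needed, an update of this kind can run on edge hardware without revisiting all historical data.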
https://arxiv.org/abs/2512.13497
LLMs achieve remarkable multi-step reasoning capabilities, yet effectively transferring these skills via post-training distillation remains challenging. Existing data selection methods, ranging from manual curation to heuristics based on length, entropy, or overall loss, fail to capture the causal importance of individual reasoning steps, limiting distillation efficiency. To address this, we propose Attention Influence for Reasoning (AIR), a principled, unsupervised, and training-free framework that leverages mechanistic insights into retrieval heads to select high-value post-training data. AIR first identifies reasoning-critical attention heads of an off-the-shelf model, then constructs a weakened reference model with the influence of those heads disabled, and finally quantifies the resulting loss divergence as the Attention Influence Score. This score enables fine-grained assessment at both the step and sample levels, supporting step-level weighted fine-tuning and global sample selection. Experiments across multiple reasoning benchmarks show that AIR consistently improves reasoning accuracy, surpassing heuristic baselines and effectively isolating the most critical steps and samples. Our work establishes a mechanism-driven, data-efficient approach for reasoning distillation in LLMs.
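The score construction above reduces to a per-step loss divergence between the weakened and original models. A minimal numeric sketch (assuming per-token losses are already computed; the helper names and the softmax weighting scheme are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def attention_influence_scores(loss_orig, loss_weak, step_spans):
    """Per-step Attention Influence Score: mean token-loss divergence
    between the head-ablated (weakened) model and the original model.
    step_spans is a list of (start, end) token ranges, one per step."""
    loss_orig = np.asarray(loss_orig, dtype=float)
    loss_weak = np.asarray(loss_weak, dtype=float)
    return np.array([float(np.mean(loss_weak[s:e] - loss_orig[s:e]))
                     for s, e in step_spans])

def step_weights(scores, temperature=1.0):
    """Softmax-normalized step weights for step-level weighted fine-tuning:
    steps whose loss degrades most under head ablation get more weight."""
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()          # numerical stability
    w = np.exp(z)
    return w / w.sum()
```

Steps whose loss barely changes when the retrieval heads are disabled receive low weight; steps that depend heavily on those heads dominate the fine-tuning objective.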
https://arxiv.org/abs/2512.13279
The problem of depth completion involves predicting a dense depth image from a single sparse depth map and an RGB image. Unsupervised depth completion methods have been proposed for various datasets where ground truth depth data is unavailable and supervised methods cannot be applied. However, these models require auxiliary data to estimate depth values, which is far from real scenarios. Monocular depth estimation (MDE) models can produce a plausible relative depth map from a single image, but no prior work properly combines the sparse depth map with MDE for depth completion; a simple affine transformation of the depth map yields high error, since MDE models are inaccurate at estimating depth differences between objects. We introduce StarryGazer, a domain-agnostic framework that predicts dense depth images from a single sparse depth image and an RGB image without relying on ground-truth depth, by leveraging the power of large MDE models. First, we employ a pre-trained MDE model to produce relative depth images. These images are segmented and randomly rescaled to form synthetic pairs of dense pseudo-ground truth and corresponding sparse depths. A refinement network is trained with the synthetic pairs, incorporating the relative depth maps and RGB images to improve the model's accuracy and robustness. StarryGazer shows superior results over existing unsupervised methods and transformed MDE results on various datasets, demonstrating that our framework exploits the power of MDE models while appropriately fixing errors using sparse depth information.
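The affine-transformation baseline criticized above is simply a least-squares fit of a global scale and shift between the relative depth map and the sparse measurements. A sketch of that baseline (illustrative; the function name is an assumption):

```python
import numpy as np

def affine_align_depth(rel_depth, sparse_depth, mask):
    """Fit scale s and shift t so that s * rel_depth + t matches the
    sparse measurements (where mask is True) in a least-squares sense.
    This is the simple baseline that fails when relative depth
    mis-estimates depth differences between objects."""
    r = rel_depth[mask]
    d = sparse_depth[mask]
    A = np.stack([r, np.ones_like(r)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, d, rcond=None)
    return s * rel_depth + t
```

Because a single (s, t) pair is shared by the whole image, any object whose relative depth is off by a different factor stays wrong, which is the error mode the segment-wise rescaling and refinement network are designed to fix.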
https://arxiv.org/abs/2512.13147
Reinforcement learning with verifiable rewards (RLVR) has proven effective in training large reasoning models (LRMs) by leveraging answer-verifiable signals to guide policy optimization, which, however, suffers from high annotation costs. To alleviate this problem, recent work has explored unsupervised RLVR methods that derive rewards solely from the model's internal consistency, such as through entropy and majority voting. While seemingly promising, these methods often suffer from model collapse in the later stages of training, which may arise from the reinforcement of incorrect reasoning patterns in the absence of external supervision. In this work, we investigate a novel semi-supervised RLVR paradigm that utilizes a small labeled set to guide RLVR training on unlabeled samples. Our key insight is that supervised rewards are essential for stabilizing consistency-based training on unlabeled samples, ensuring that only reasoning patterns verified on labeled instances are incorporated into RL training. Technically, we propose an effective policy optimization algorithm, TraPO, that identifies reliable unlabeled samples by matching their learning trajectory similarity to labeled ones. Building on this, TraPO achieves remarkable data efficiency and strong generalization on six widely used mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). With only 1K labeled and 3K unlabeled samples, TraPO reaches 42.6% average accuracy, surpassing the best unsupervised method trained on 45K unlabeled samples (38.3%). Notably, when using 4K labeled and 12K unlabeled samples, TraPO even outperforms the fully supervised model trained on the full 45K labeled samples on all benchmarks, while using only 10% of the labeled data. The code is available via this https URL.
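The trajectory-matching idea in TraPO can be illustrated with a simple similarity ranking: represent each sample by its learning trajectory (e.g., a per-checkpoint loss or reward curve) and keep the unlabeled samples whose trajectories are most similar to the labeled ones. This is a hedged sketch of the selection criterion only, with cosine similarity to the mean labeled trajectory assumed as the matching rule:

```python
import numpy as np

def select_reliable_unlabeled(traj_labeled, traj_unlabeled, top_k):
    """Rank unlabeled samples by cosine similarity of their learning
    trajectories to the mean labeled trajectory; keep the top_k most
    similar as 'reliable' samples for consistency-based RL training."""
    L = np.asarray(traj_labeled, dtype=float)    # (n_labeled, n_checkpoints)
    U = np.asarray(traj_unlabeled, dtype=float)  # (n_unlabeled, n_checkpoints)
    ref = L.mean(axis=0)
    ref = ref / np.linalg.norm(ref)
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)
    sims = Un @ ref
    return np.argsort(-sims)[:top_k], sims
```

Samples whose trajectories diverge from the labeled ones are excluded, which is the mechanism that keeps unverified reasoning patterns out of the RL updates.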
https://arxiv.org/abs/2512.13106
Change detection (CD) identifies scene changes from multi-temporal observations and is widely used in urban development and environmental monitoring. Most existing CD methods rely on supervised learning, making performance strongly dataset-dependent and incurring high annotation costs; they typically focus on a few predefined categories and generalize poorly to diverse scenes. With the rise of vision foundation models such as SAM2 and CLIP, new opportunities have emerged to relax these constraints. We propose Unified Open-Vocabulary Change Detection (UniVCD), an unsupervised, open-vocabulary change detection method built on frozen SAM2 and CLIP. UniVCD detects category-agnostic changes across diverse scenes and imaging geometries without any labeled data or paired change images. A lightweight feature alignment module is introduced to bridge the spatially detailed representations from SAM2 and the semantic priors from CLIP, enabling high-resolution, semantically aware change estimation while keeping the number of trainable parameters small. On top of this, a streamlined post-processing pipeline is further introduced to suppress noise and pseudo-changes, improving the detection accuracy for objects with well-defined boundaries. Experiments on several public BCD (Binary Change Detection) and SCD (Semantic Change Detection) benchmarks show that UniVCD achieves consistently strong performance and matches or surpasses existing open-vocabulary CD methods in key metrics such as F1 and IoU. The results demonstrate that unsupervised change detection with frozen vision foundation models and lightweight multi-modal alignment is a practical and effective paradigm for open-vocabulary CD. Code and pretrained models will be released at this https URL.
https://arxiv.org/abs/2512.13089
Accurate thyroid nodule segmentation in ultrasound images is critical for diagnosis and treatment planning. However, ambiguous boundaries between nodules and surrounding tissues, size variations, and the scarcity of annotated ultrasound data pose significant challenges for automated segmentation. Existing deep learning models struggle to incorporate contextual information from the thyroid gland and generalize effectively across diverse cases. To address these challenges, we propose SSMT-Net, a Semi-Supervised Multi-Task Transformer-based Network that leverages unlabeled data in an initial unsupervised phase to enhance the feature extraction capability of its Transformer-centric encoder. In the supervised phase, the model jointly optimizes nodule segmentation, gland segmentation, and nodule size estimation, integrating both local and global contextual features. Extensive evaluations on the TN3K and DDTI datasets demonstrate that SSMT-Net outperforms state-of-the-art methods, with higher accuracy and robustness, indicating its potential for real-world clinical applications.
https://arxiv.org/abs/2512.12662
Authorship analysis has traditionally focused on lexical and stylistic cues within text, while higher-level narrative structure remains underexplored, particularly for low-resource languages such as Urdu. This work proposes a graph-based framework that models Urdu novels as character interaction networks to examine whether authorial style can be inferred from narrative structure alone. Each novel is represented as a graph where nodes correspond to characters and edges denote their co-occurrence within narrative proximity. We systematically compare multiple graph representations, including global structural features, node-level semantic summaries, unsupervised graph embeddings, and supervised graph neural networks. Experiments on a dataset of 52 Urdu novels written by seven authors show that learned graph representations substantially outperform hand-crafted and unsupervised baselines, achieving up to 0.857 accuracy under a strict author-aware evaluation protocol.
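The graph construction described here, with nodes for characters and edges weighted by co-occurrence within narrative proximity, can be sketched directly. This is an illustrative sketch under assumed conventions (a token-window definition of proximity and two toy global features); the paper's exact windowing and feature set may differ:

```python
from collections import Counter
from itertools import combinations

def character_graph(tokens, characters, window=20):
    """Build a character co-occurrence graph: nodes are characters, and
    the weight of edge (a, b) counts how often a and b appear within
    `window` tokens of each other in the novel."""
    chars = set(characters)
    positions = [(i, t) for i, t in enumerate(tokens) if t in chars]
    edges = Counter()
    for (i, a), (j, b) in combinations(positions, 2):
        if a != b and abs(i - j) <= window:
            edges[tuple(sorted((a, b)))] += 1
    return dict(edges)

def global_features(edges, characters):
    """Two simple global structural features of the kind usable as a
    hand-crafted baseline for authorship classification."""
    degree = Counter()
    for (a, b), w in edges.items():
        degree[a] += w
        degree[b] += w
    n = len(characters)
    density = len(edges) / (n * (n - 1) / 2) if n > 1 else 0.0
    return {"density": density, "max_degree": max(degree.values(), default=0)}
```

Feature vectors of this kind form the hand-crafted baseline that the learned graph representations are reported to outperform.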
https://arxiv.org/abs/2512.12654
Local field potentials (LFPs) can be routinely recorded alongside spiking activity in intracortical neural experiments, measure a larger complementary spatiotemporal scale of brain activity for scientific inquiry, and can offer practical advantages over spikes, including greater long-term stability, robustness to electrode degradation, and lower power requirements. Despite these advantages, recent neural modeling frameworks have largely focused on spiking activity since LFP signals pose inherent modeling challenges due to their aggregate, population-level nature, often leading to lower predictive power for downstream task variables such as motor behavior. To address this challenge, we introduce a cross-modal knowledge distillation framework that transfers high-fidelity representational knowledge from pretrained multi-session spike transformer models to LFP transformer models. Specifically, we first train a teacher spike model across multiple recording sessions using a masked autoencoding objective with a session-specific neural tokenization strategy. We then align the latent representations of the student LFP model to those of the teacher spike model. Our results show that the Distilled LFP models consistently outperform single- and multi-session LFP baselines in both fully unsupervised and supervised settings, and can generalize to other sessions without additional distillation while maintaining superior performance. These findings demonstrate that cross-modal knowledge distillation is a powerful and scalable approach for leveraging high-performing spike models to develop more accurate LFP models.
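The core alignment step, pulling the student LFP model's latents toward the teacher spike model's latents, is commonly written as a cosine-alignment loss. A minimal sketch under that assumption (the paper may use a different distance; the function name is illustrative):

```python
import numpy as np

def distill_align_loss(student_latents, teacher_latents):
    """Cosine-alignment distillation loss between student (LFP) and
    teacher (spike) latent tokens: mean of (1 - cosine similarity)
    over matched token positions. Lower means better aligned."""
    S = np.asarray(student_latents, dtype=float)
    T = np.asarray(teacher_latents, dtype=float)
    S = S / np.linalg.norm(S, axis=-1, keepdims=True)
    T = T / np.linalg.norm(T, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(S * T, axis=-1)))
```

Only the student receives gradients from this loss in practice; the pretrained teacher stays frozen, which is what lets the distilled LFP model inherit the teacher's representational structure.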
https://arxiv.org/abs/2512.12461
Unsupervised domain adaptation (UDA) enables semantic segmentation models to generalize from a labeled source domain to an unlabeled target domain. However, existing UDA methods still struggle to bridge the domain gap due to cross-domain contextual ambiguity, inconsistent feature representations, and class-wise pseudo-label noise. To address these challenges, we propose Omni-level Masking for Unsupervised Domain Adaptation (OMUDA), a unified framework that introduces hierarchical masking strategies across distinct representation levels. Specifically, OMUDA comprises: 1) a Context-Aware Masking (CAM) strategy that adaptively distinguishes foreground from background to balance global context and local details; 2) a Feature Distillation Masking (FDM) strategy that enhances robust and consistent feature learning through knowledge transfer from pre-trained models; and 3) a Class Decoupling Masking (CDM) strategy that mitigates the impact of noisy pseudo-labels by explicitly modeling class-wise uncertainty. This hierarchical masking paradigm effectively reduces the domain shift at the contextual, representational, and categorical levels, providing a unified solution beyond existing approaches. Extensive experiments on multiple challenging cross-domain semantic segmentation benchmarks validate the effectiveness of OMUDA. Notably, on the SYNTHIA->Cityscapes and GTA5->Cityscapes tasks, OMUDA can be seamlessly integrated into existing UDA methods and consistently achieves state-of-the-art results with an average improvement of 7%.
https://arxiv.org/abs/2512.12303
Recent advances in self-supervised learning (SSL) have shown tremendous potential for learning 3D point cloud representations without human annotations. However, SSL for 3D point clouds still faces critical challenges due to irregular geometry, shortcut-prone reconstruction, and unbalanced semantic distributions. In this work, we propose DOS (Distilling Observable Softmaps), a novel SSL framework that self-distills semantic relevance softmaps only at observable (unmasked) points. This strategy prevents information leakage from masked regions and provides richer supervision than discrete token-to-prototype assignments. To address the challenge of unbalanced semantics in an unsupervised setting, we introduce Zipfian prototypes and incorporate them using a modified Sinkhorn-Knopp algorithm, Zipf-Sinkhorn, which enforces a power-law prior over prototype usage and modulates the sharpness of the target softmap during training. DOS outperforms current state-of-the-art methods on semantic segmentation and 3D object detection across multiple benchmarks, including nuScenes, Waymo, SemanticKITTI, ScanNet, and ScanNet200, without relying on extra data or annotations. Our results demonstrate that observable-point softmap distillation offers a scalable and effective paradigm for learning robust 3D representations.
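A Sinkhorn-Knopp assignment with a power-law (Zipfian) marginal over prototype usage, as described above, can be sketched as follows. This is a schematic reconstruction from the abstract, not the paper's code: the iteration count, temperature, and exact placement of the Zipf marginal are assumptions.

```python
import numpy as np

def zipf_sinkhorn(scores, n_iters=50, alpha=1.0, eps=0.05):
    """Sinkhorn-Knopp balancing of point-to-prototype assignments with a
    Zipfian (power-law) prior over prototype usage in place of the usual
    uniform column marginal. scores: (n_points, n_protos) similarities."""
    Q = np.exp(np.asarray(scores, dtype=float) / eps)
    n, k = Q.shape
    ranks = np.arange(1, k + 1, dtype=float)
    col_marginal = ranks ** (-alpha)          # Zipf prior: p(proto_r) ~ r^-alpha
    col_marginal /= col_marginal.sum()
    row_marginal = np.full(n, 1.0 / n)        # each point contributes equally
    for _ in range(n_iters):
        Q *= (col_marginal / Q.sum(axis=0))[None, :]
        Q *= (row_marginal / Q.sum(axis=1))[:, None]
    return Q / Q.sum(axis=1, keepdims=True)   # per-point soft assignment
```

The standard uniform-marginal Sinkhorn forces equal prototype usage, which clashes with the long-tailed semantics of real scenes; replacing the column marginal with a Zipf distribution lets a few prototypes absorb the dominant classes.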
https://arxiv.org/abs/2512.11465
Transforming bi-dimensional sets of image pixels into mono-dimensional sequences with a Peano scan (PS) is an established technique enabling the use of hidden Markov chains (HMCs) for unsupervised image segmentation. Related Bayesian segmentation methods can compete with hidden Markov field (HMF)-based ones and are much faster. PS has recently been extended to the contextual PS, and some initial experiments have shown the value of the associated HMC model, denoted as HMC-CPS, in image segmentation. Moreover, HMCs have been extended to hidden evidential Markov chains (HEMCs), which are capable of improving HMC-based Bayesian segmentation. In this study, we introduce a new HEMC-CPS model by simultaneously considering the contextual PS and evidential HMCs. We show its effectiveness for Bayesian maximum posterior mode (MPM) segmentation using synthetic and real images. Segmentation is performed in an unsupervised manner, with parameters being estimated using the stochastic expectation-maximization (SEM) method. The new HEMC-CPS model presents potential for the modeling and segmentation of more complex images, such as three-dimensional or multi-sensor multi-resolution images. Finally, the HMC-CPS and HEMC-CPS models are not limited to image segmentation and could be used for any kind of spatially correlated data.
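A space-filling scan of the kind described turns a 2-D pixel grid into a 1-D sequence while preserving spatial locality. As an illustration only (using the closely related Hilbert curve rather than the exact Peano or contextual scan of the paper), the conversion from curve index to grid coordinates is the classic bit-manipulation recurrence:

```python
def hilbert_order(n):
    """Return the (x, y) visiting order of a Hilbert curve on an n x n
    grid (n a power of two): a Peano-type scan that maps 2-D pixels to a
    1-D sequence in which consecutive pixels are spatial neighbors."""
    def d2xy(n, d):
        x = y = 0
        t = d
        s = 1
        while s < n:
            rx = 1 & (t // 2)
            ry = 1 & (t ^ rx)
            if ry == 0:               # rotate quadrant if needed
                if rx == 1:
                    x, y = s - 1 - x, s - 1 - y
                x, y = y, x
            x += s * rx
            y += s * ry
            t //= 4
            s *= 2
        return x, y
    return [d2xy(n, d) for d in range(n * n)]

def scan_image(img):
    """Flatten a 2-D array along the curve, producing the 1-D sequence
    that a hidden Markov chain can then model."""
    n = len(img)
    return [img[y][x] for x, y in hilbert_order(n)]
```

Because consecutive samples in the scanned sequence are adjacent pixels, the Markov dependence along the chain approximates 2-D spatial correlation, which is what makes HMC-based segmentation competitive with HMF-based methods at much lower cost.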
https://arxiv.org/abs/2512.11939
Unsupervised industrial anomaly detection requires accurately identifying defects without labeled data. Traditional autoencoder-based methods often struggle with incomplete anomaly suppression and loss of fine details, as their single-pass decoding fails to effectively handle anomalies of varying severity and scale. We propose a recursive autoencoder architecture (RcAE), which performs reconstruction iteratively to progressively suppress anomalies while refining normal structures. Unlike traditional single-pass models, this recursive design naturally produces a sequence of reconstructions, progressively exposing suppressed abnormal patterns. To leverage these reconstruction dynamics, we introduce a Cross Recursion Detection (CRD) module that tracks inconsistencies across recursion steps, enhancing detection of both subtle and large-scale anomalies. Additionally, we incorporate a Detail Preservation Network (DPN) to recover high-frequency textures typically lost during reconstruction. Extensive experiments demonstrate that our method significantly outperforms existing non-diffusion methods and achieves performance on par with recent diffusion models while using only 10% of their parameters and offering substantially faster inference. These results highlight the practicality and efficiency of our approach for real-world applications.
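The recursive-reconstruction-plus-cross-recursion idea can be sketched generically: run the reconstructor several times, score each pixel both by its residual against the input and by how much successive reconstructions disagree. This is a loose schematic of the control flow only; `reconstruct` is a placeholder for the trained autoencoder, and the aggregation rule is an assumption rather than the paper's CRD module:

```python
import numpy as np

def recursive_anomaly_map(img, reconstruct, n_steps=3):
    """Apply `reconstruct` recursively and aggregate per-step residuals.
    Regions that differ from the input, or that keep changing across
    recursion steps, receive high anomaly scores."""
    recons = []
    x = img
    for _ in range(n_steps):
        x = reconstruct(x)
        recons.append(x)
    residuals = [np.abs(img - r) for r in recons]
    # cross-recursion inconsistency: disagreement between successive passes
    inconsistency = sum(np.abs(a - b) for a, b in zip(recons, recons[1:]))
    return np.maximum.reduce(residuals) + inconsistency
```

A single decoding pass yields only one residual map; the recursion produces a trajectory of reconstructions whose per-step differences expose anomalies that the first pass only partially suppressed.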
https://arxiv.org/abs/2512.11284
Pre-trained image restoration models often fail on real-world, out-of-distribution degradations due to significant domain gaps. Adapting to these unseen domains is challenging, as out-of-distribution data lacks ground truth, and traditional adaptation methods often require complex architectural changes. We propose LEGO (Learning from a Generative Oracle), a practical three-stage framework for post-training domain adaptation without paired data. LEGO converts this unsupervised challenge into a tractable pseudo-supervised one. First, we obtain initial restorations from the pre-trained model. Second, we leverage a frozen, large-scale generative oracle to refine these estimates into high-quality pseudo-ground-truths. Third, we fine-tune the original model using a mixed-supervision strategy combining in-distribution data with these new pseudo-pairs. This approach adapts the model to the new distribution without sacrificing its original robustness or requiring architectural modifications. Experiments demonstrate that LEGO effectively bridges the domain gap, significantly improving performance on diverse real-world benchmarks.
https://arxiv.org/abs/2512.11121
Self-supervised pre-training has revolutionized foundation models for language, individual 2D images, and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation and matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv3, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.
https://arxiv.org/abs/2512.10950
Sparse autoencoders (SAEs) promise a unified approach for mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires that the learned features be both interpretable and steerable. To that end, we introduce two new computationally inexpensive interpretability and steerability metrics and conduct a systematic analysis on LVLMs. Our analysis uncovers two observations; (i) a majority of SAE neurons exhibit either low interpretability or low steerability or both, rendering them ineffective for downstream use; and (ii) due to the unsupervised nature of SAEs, user-desired concepts are often absent in the learned dictionary, thus limiting their practical utility. To address these limitations, we propose Concept Bottleneck Sparse Autoencoders (CB-SAE) - a novel post-hoc framework that prunes low-utility neurons and augments the latent space with a lightweight concept bottleneck aligned to a user-defined concept set. The resulting CB-SAE improves interpretability by +32.1% and steerability by +14.5% across LVLMs and image generation tasks. We will make our code and model weights available.
https://arxiv.org/abs/2512.10805
Unsupervised cell type identification is crucial for uncovering and characterizing heterogeneous populations in single-cell omics studies. Although a range of clustering methods have been developed, most focus exclusively on intrinsic cellular structure and ignore the pivotal role of cell-gene associations, which limits their ability to distinguish closely related cell types. To this end, we propose a Refinement Contrastive Learning framework (scRCL) that explicitly incorporates cell-gene interactions to derive more informative representations. Specifically, we introduce two contrastive distribution alignment components that reveal reliable intrinsic cellular structures by effectively exploiting cell-cell structural relationships. Additionally, we develop a refinement module that integrates gene-correlation structure learning to enhance cell embeddings by capturing underlying cell-gene associations. This module strengthens connections between cells and their associated genes, refining the representation learning to exploit biologically meaningful relationships. Extensive experiments on several single-cell RNA-seq and spatial transcriptomics benchmark datasets demonstrate that our method consistently outperforms state-of-the-art baselines in cell-type identification accuracy. Moreover, downstream biological analyses confirm that the recovered cell populations exhibit coherent gene-expression signatures, further validating the biological relevance of our approach. The code is available at this https URL.
https://arxiv.org/abs/2512.10640
Vision-and-Language Navigation (VLN) requires agents to navigate complex environments by following natural-language instructions. General Scene Adaptation for VLN (GSA-VLN) shifts the focus from zero-shot generalization to continual, environment-specific adaptation, narrowing the gap between static benchmarks and real-world deployment. However, current GSA-VLN frameworks exclude user feedback, relying solely on unsupervised adaptation from repeated environmental exposure. In practice, user feedback offers natural and valuable supervision that can significantly enhance adaptation quality. We introduce a user-feedback-driven adaptation framework that extends GSA-VLN by systematically integrating human interactions into continual learning. Our approach converts user feedback (navigation instructions and corrective signals) into high-quality, environment-aligned training data, enabling efficient and realistic adaptation. A memory-bank warm-start mechanism further reuses previously acquired environmental knowledge, mitigating cold-start degradation and ensuring stable redeployment. Experiments on the GSA-R2R benchmark show that our method consistently surpasses strong baselines such as GR-DUET, improving navigation success and path efficiency. The memory-bank warm start stabilizes early navigation and reduces performance drops after updates. Results under both continual and hybrid adaptation settings confirm the robustness and generality of our framework, demonstrating sustained improvement across diverse deployment conditions.
https://arxiv.org/abs/2512.10322
We introduce Stylized Meta-Album (SMA), a new image classification meta-dataset comprising 24 datasets (12 content datasets and 12 stylized datasets), designed to advance studies on out-of-distribution (OOD) generalization and related topics. Created by applying style transfer to 12 subject classification datasets, SMA provides a diverse and extensive set of 4800 groups, combining various subjects (objects, plants, animals, human actions, textures) with multiple styles. While ideal data collection would capture extensive group diversity, practical constraints often make this infeasible. SMA addresses this by enabling large and configurable group structures through flexible control over styles, subject classes, and domains, allowing datasets to reflect a wide range of real-world benchmark scenarios. This design not only expands group and class diversity, but also opens new methodological directions for evaluating model performance across diverse group and domain configurations (including scenarios with many minority groups, varying group imbalance, and complex domain shifts) and for studying fairness, robustness, and adaptation under a broader range of realistic conditions. To demonstrate SMA's effectiveness, we implemented two benchmarks: (1) a novel OOD generalization and group fairness benchmark leveraging SMA's domain, class, and group diversity to re-evaluate algorithms from existing benchmarks. Our findings reveal that while simple balancing and algorithms utilizing group information remain competitive, as claimed in previous benchmarks, increasing group diversity significantly impacts fairness, altering the superiority and relative rankings of algorithms.
We also propose \textit{Top-M worst group accuracy} as a new hyperparameter tuning metric, which promotes broader fairness during optimization and delivers better final worst-group accuracy under larger group diversity. (2) An unsupervised domain adaptation (UDA) benchmark utilizing SMA's group diversity to evaluate UDA algorithms across more scenarios, offering a more comprehensive benchmark with lower error bars (reduced by 73\% and 28\% in the closed-set and UniDA settings, respectively) compared to existing efforts. These use cases highlight SMA's potential to significantly impact the outcomes of conventional benchmarks.
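The abstract names but does not define \textit{Top-M worst group accuracy}; under the natural reading (average accuracy over the $M$ worst-performing groups, with $M=1$ recovering plain worst-group accuracy), a small sketch of the metric could look like:

```python
import numpy as np

def top_m_worst_group_accuracy(y_true, y_pred, groups, m=3):
    """Average accuracy over the m worst-performing groups.

    Per-group accuracies are computed, sorted ascending, and the
    lowest m are averaged. m=1 recovers standard worst-group
    accuracy; larger m spreads the tuning signal over more of the
    fairness tail, which matters when there are many minority groups.
    """
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    accs = [(y_pred[groups == g] == y_true[groups == g]).mean()
            for g in np.unique(groups)]
    return float(np.mean(sorted(accs)[:m]))
```

This is an illustrative reconstruction, not the paper's reference implementation.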
https://arxiv.org/abs/2512.09773
Multi-participant meetings occur across various domains, such as business negotiations and medical consultations, during which sensitive information like trade secrets, business strategies, and patient conditions is often discussed. Previous research has demonstrated that attackers with mmWave radars outside the room can overhear meeting content by detecting minute speech-induced vibrations on objects. However, these eavesdropping attacks cannot differentiate which speech content comes from which person in a multi-participant meeting, leading to potential misunderstandings and poor decision-making. In this paper, we answer the question ``who speaks what''. By leveraging the spatial diversity introduced by ubiquitous objects, we propose an attack system that enables attackers to remotely eavesdrop on in-person conversations without requiring prior knowledge, such as identities, the number of participants, or seating arrangements. Since participants in in-person meetings are typically seated at different locations, their speech induces distinct vibration patterns on nearby objects. To exploit this, we design a noise-robust unsupervised approach for distinguishing participants by detecting speech-induced vibration differences in the frequency domain. Meanwhile, a deep learning-based framework is explored to combine signals from objects for speech quality enhancement. We validate the proof-of-concept attack on speech classification and signal enhancement through extensive experiments. The experimental results show that our attack can achieve a speech classification accuracy of up to $0.99$ with several participants in a meeting room. Meanwhile, our attack demonstrates consistent speech quality enhancement across all real-world scenarios, including different distances between the radar and the objects.
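The paper's unsupervised frequency-domain separation is not detailed in the abstract; as a heavily simplified stand-in, the sketch below groups vibration snippets by their dominant spectral peak, so snippets driven by different speakers (and hence different vibration spectra) land in different clusters. Function names and the tolerance parameter are illustrative assumptions:

```python
import numpy as np

def dominant_freq(signal, fs):
    """Dominant frequency (Hz) of a vibration snippet, found as the
    peak of the windowed FFT magnitude spectrum (DC bin excluded)."""
    spec = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    spec[0] = 0.0  # ignore the DC component
    return np.fft.rfftfreq(len(signal), 1.0 / fs)[np.argmax(spec)]

def group_by_frequency(snippets, fs, tol=5.0):
    """Greedy unsupervised grouping: snippets whose dominant
    frequencies lie within `tol` Hz of an existing cluster center
    share that cluster; otherwise a new cluster is opened. The
    number of participants is never given in advance."""
    centers, labels = [], []
    for s in snippets:
        f = dominant_freq(s, fs)
        for i, c in enumerate(centers):
            if abs(f - c) <= tol:
                labels.append(i)
                break
        else:
            centers.append(f)
            labels.append(len(centers) - 1)
    return labels
```

A real attack would cluster richer spectral features with noise-robust statistics; a single dominant-frequency peak is only meant to make the mechanism concrete.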
https://arxiv.org/abs/2512.09285
We introduce a framework for converting 3D shapes into compact and editable assemblies of analytic primitives, directly addressing the persistent trade-off between reconstruction fidelity and parsimony. Our approach combines two key contributions: a novel primitive, termed SuperFrustum, and an iterative fitting algorithm, Residual Primitive Fitting (ResFit). SuperFrustum is an analytical primitive that is simultaneously (1) expressive, being able to model various common solids such as cylinders, spheres, cones, and their tapered and bent forms, (2) editable, being compactly parameterized with 8 parameters, and (3) optimizable, with a signed distance field that is differentiable w.r.t. its parameters almost everywhere. ResFit is an unsupervised procedure that interleaves global shape analysis with local optimization, iteratively fitting primitives to the unexplained residual of a shape to discover a parsimonious yet accurate decomposition for each input shape. On diverse 3D benchmarks, our method achieves state-of-the-art results, improving IoU by over 9 points while using nearly half as many primitives as prior work. The resulting assemblies bridge the gap between dense 3D data and human-controllable design, producing high-fidelity and editable shape programs.
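The abstract does not give SuperFrustum's 8-parameter form; as a toy point of reference, here is the signed distance field of a plain conical frustum (one of the solids SuperFrustum reportedly generalizes), ported from a standard capped-cone SDF. Negative values lie inside the solid, positive outside:

```python
import numpy as np

def sd_frustum(p, h, r1, r2):
    """Signed distance from point p = (x, y, z) to a conical frustum
    centered at the origin: half-height h along y, bottom radius r1,
    top radius r2. Setting r1 == r2 gives a cylinder; r2 == 0 gives
    a cone. This is a classic capped-cone SDF, not the paper's
    richer SuperFrustum parameterization."""
    x, y, z = p
    q = np.array([np.hypot(x, z), y])          # radial / axial coords
    k1 = np.array([r2, h])
    k2 = np.array([r2 - r1, 2.0 * h])
    ca = np.array([q[0] - min(q[0], r1 if q[1] < 0 else r2),
                   abs(q[1]) - h])             # distance to the caps
    t = np.clip(np.dot(k1 - q, k2) / np.dot(k2, k2), 0.0, 1.0)
    cb = q - k1 + k2 * t                       # distance to the slanted side
    s = -1.0 if (cb[0] < 0 and ca[1] < 0) else 1.0
    return s * np.sqrt(min(np.dot(ca, ca), np.dot(cb, cb)))
```

Because the field is differentiable in its parameters almost everywhere, a ResFit-style loop could gradient-fit such a primitive to the yet-unexplained residual of a shape and repeat; that outer loop is omitted here.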
https://arxiv.org/abs/2512.09201