Video Anomaly Detection (VAD) focuses on identifying anomalies within videos. Supervised methods require substantial in-domain training data and often struggle to generalize to unseen anomalies. In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies but face challenges in localizing fine-grained visual transitions and diverse events. Therefore, we propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs (MLLMs) through temporal-event reasoning. Specifically, EventVAD first employs dynamic spatiotemporal graph modeling with time-decay constraints to capture event-aware video features. Then, it performs adaptive noise filtering and uses signal-ratio thresholding to detect event boundaries from unsupervised statistical features. The statistical boundary detection module reduces the complexity of processing long videos for MLLMs and improves their temporal reasoning through event consistency. Finally, it utilizes a hierarchical prompting strategy to guide MLLMs through reasoning before making final decisions. We conducted extensive experiments on the UCF-Crime and XD-Violence datasets. The results demonstrate that EventVAD with a 7B MLLM achieves state-of-the-art (SOTA) performance in the training-free setting, outperforming strong baselines that use 7B or larger MLLMs.
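To make the two unsupervised stages concrete, here is a minimal Python sketch, under the assumption that per-frame embeddings from some vision encoder are available, of (i) a frame affinity graph with a time-decay constraint and (ii) a signal-ratio style boundary statistic. The function names, window size, and decay scale are illustrative assumptions, not EventVAD's actual implementation.

```python
import numpy as np

def timedecay_graph(features, tau=5.0):
    """Frame-level affinity graph with a time-decay constraint (a rough stand-in
    for the dynamic spatiotemporal graph; `tau` is an assumed decay scale)."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T                                        # cosine similarity
    t = np.arange(len(f))
    return sim * np.exp(-np.abs(t[:, None] - t[None, :]) / tau)

def boundary_scores(features, win=8):
    """Signal-ratio style boundary statistic: coherence inside the windows on
    either side of a candidate cut, divided by coherence across the cut."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T
    T = len(f)
    scores = np.zeros(T)
    for t in range(win, T - win):
        within = 0.5 * (sim[t - win:t, t - win:t].mean() + sim[t:t + win, t:t + win].mean())
        across = sim[t - win:t, t:t + win].mean()
        scores[t] = within / (abs(across) + 1e-8)
    return scores

# Toy usage: two synthetic "events" whose features shift at frame 60.
rng = np.random.default_rng(0)
base1, base2 = rng.normal(size=64), rng.normal(size=64)
feats = np.concatenate([base1 + 0.3 * rng.normal(size=(60, 64)),
                        base2 + 0.3 * rng.normal(size=(60, 64))])
graph = timedecay_graph(feats)                           # event-aware affinity graph
print("most likely event boundary near frame:", int(boundary_scores(feats).argmax()))
```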
https://arxiv.org/abs/2504.13092
In this paper, an approach for concept extraction from documents using pre-trained large language models (LLMs) is presented. Compared with conventional methods that extract keyphrases summarizing the important information discussed in a document, our approach tackles the more challenging task of extracting all concepts present that are related to the specific domain, not just the important ones. Through comprehensive evaluations on two widely used benchmark datasets, we demonstrate that our method improves the F1 score compared to state-of-the-art techniques. Additionally, we explore the potential of using prompts within these models for unsupervised concept extraction. The extracted concepts are intended to support domain-coverage evaluation of ontologies and facilitate ontology learning, highlighting the effectiveness of LLMs in concept extraction tasks. Our source code and datasets are publicly available at this https URL.
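For readers who want to see the flavor of prompt-based, unsupervised concept extraction, a minimal sketch follows; the prompt wording, the `call_llm` placeholder, and the post-processing are illustrative assumptions and not the authors' pipeline or prompts.

```python
# Minimal sketch of prompt-based unsupervised concept extraction.
# `call_llm` is a placeholder for whatever chat/completions client is available;
# the prompt text and normalization below are illustrative, not the paper's prompts.

PROMPT_TEMPLATE = (
    "You are collecting concepts for an ontology of the domain: {domain}.\n"
    "List every domain-relevant concept mentioned in the passage below, not only "
    "the most important ones. Return one concept per line, lowercase, no numbering.\n\n"
    "Passage:\n{passage}\n"
)

def call_llm(prompt: str) -> str:
    """Placeholder: plug in any pre-trained LLM client here."""
    raise NotImplementedError

def extract_concepts(passage: str, domain: str) -> list[str]:
    raw = call_llm(PROMPT_TEMPLATE.format(domain=domain, passage=passage))
    concepts = {line.strip().lower() for line in raw.splitlines() if line.strip()}
    return sorted(concepts)
```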
https://arxiv.org/abs/2504.12915
Digital pathology, augmented by artificial intelligence (AI), holds significant promise for improving the workflow of pathologists. However, challenges such as the labor-intensive annotation of whole slide images (WSIs), high computational demands, and trust concerns arising from the absence of uncertainty estimation in predictions hinder the practical application of current AI methodologies in histopathology. To address these issues, we present a novel trustful fully unsupervised multi-level segmentation methodology (TUMLS) for WSIs. TUMLS adopts an autoencoder (AE) as a feature extractor to identify the different tissue types within low-resolution training data. It selects representative patches from each identified group based on an uncertainty measure and then performs unsupervised nuclei segmentation in the corresponding higher-resolution space without using any ML algorithms. Crucially, this solution integrates seamlessly into clinicians' workflows, transforming the examination of a whole WSI into a review of concise, interpretable cross-level insights. This integration significantly enhances and accelerates the workflow while ensuring transparency. We evaluated our approach using the UPENN-GBM dataset, where the AE achieved a mean squared error (MSE) of 0.0016. Additionally, nucleus segmentation was assessed on the MoNuSeg dataset, outperforming all unsupervised approaches with an F1 score of 77.46% and a Jaccard score of 63.35%. These results demonstrate the efficacy of TUMLS in advancing the field of digital pathology.
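The selection step can be pictured with a small sketch: cluster autoencoder embeddings of low-resolution patches into tissue groups, then keep a few patches per group using reconstruction error as a stand-in uncertainty measure. The clustering choice, group count, and uncertainty surrogate are assumptions for illustration, not TUMLS's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representative_patches(ae_embeddings, recon_errors, n_groups=4, per_group=3):
    """Group patches by AE embedding (a proxy for tissue type), then pick the
    lowest-uncertainty patches of each group for higher-resolution analysis.

    ae_embeddings: (N, D) latent codes from the autoencoder.
    recon_errors:  (N,) per-patch reconstruction MSE used as an uncertainty surrogate.
    """
    groups = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(ae_embeddings)
    selected = {}
    for g in range(n_groups):
        idx = np.where(groups == g)[0]
        selected[g] = idx[np.argsort(recon_errors[idx])[:per_group]]
    return selected

# Toy usage with random embeddings and errors.
rng = np.random.default_rng(0)
print(select_representative_patches(rng.normal(size=(500, 32)), rng.random(500)))
```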
https://arxiv.org/abs/2504.12718
Existing 3D human pose estimation methods often suffer in performance when applied to cross-scenario inference, due to domain shifts in characteristics such as camera viewpoint, position, posture, and body size. Among these factors, camera viewpoints and locations have been shown to contribute significantly to the domain gap by influencing the global positions of human poses. To address this, we propose a novel framework that explicitly conducts global transformations between pose positions in the camera coordinate systems of the source and target domains. We start with a Pseudo-Label Generation Module that is applied to the 2D poses of the target dataset to generate pseudo-3D poses. Then, a Global Transformation Module leverages a human-centered coordinate system as a novel bridging mechanism to seamlessly align the positional orientations of poses across disparate domains, ensuring consistent spatial referencing. To further enhance generalization, a Pose Augmentor is incorporated to address variations in human posture and body size. This process is iterative, allowing refined pseudo-labels to progressively improve guidance for domain adaptation. Our method is evaluated on various cross-dataset benchmarks, including Human3.6M, MPI-INF-3DHP, and 3DPW. The proposed method outperforms state-of-the-art approaches and even outperforms the target-trained model.
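A minimal sketch of the bridging idea, reduced to its simplest form: move a pose into a human-centered (root-relative) frame and then re-anchor it at a position expressed in the other domain's camera coordinates. The joint count, root index, and the restriction to translation only (no orientation alignment) are simplifying assumptions, not the paper's full Global Transformation Module.

```python
import numpy as np

def to_human_centered(pose_3d, root_idx=0):
    """Express a 3D pose in a human-centered coordinate frame.

    pose_3d: (J, 3) joint positions in camera coordinates.
    The origin is moved to the root joint; the actual bridging frame in the
    paper also handles orientation, which is omitted here.
    """
    return pose_3d - pose_3d[root_idx:root_idx + 1]

def transfer_global_position(source_pose, target_root, root_idx=0):
    """Re-anchor a source-domain pose at a target-domain root position, i.e.,
    a global translation between the two camera coordinate systems."""
    return to_human_centered(source_pose, root_idx) + target_root

# Toy usage with a 17-joint skeleton (Human3.6M-style).
pose = np.random.rand(17, 3)
centered = to_human_centered(pose)
moved = transfer_global_position(pose, target_root=np.array([0.5, 1.0, 4.0]))
print(centered[0], moved[0])
```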
https://arxiv.org/abs/2504.12699
Multi-class Unsupervised Anomaly Detection algorithms (MUAD) are receiving increasing attention due to their relatively low deployment costs and improved training efficiency. However, the real-world effectiveness of MUAD methods is questioned due to limitations in current Industrial Anomaly Detection (IAD) datasets. These datasets contain numerous classes that are unlikely to be produced by the same factory and fail to cover multiple structures or appearances. Additionally, the defects do not reflect real-world characteristics. Therefore, we introduce the Heterogeneous Same-Sort Industrial Anomaly Detection (HSS-IAD) dataset, which contains 8,580 images of metallic-like industrial parts and precise anomaly annotations. These parts exhibit variations in structure and appearance, with subtle defects that closely resemble the base materials. We also provide foreground images for synthetic anomaly generation. Finally, we evaluate popular IAD methods on this dataset under multi-class and class-separated settings, demonstrating its potential to bridge the gap between existing datasets and real factory conditions. The dataset is available at this https URL.
https://arxiv.org/abs/2504.12689
The widespread adoption of diffusion models in image generation has increased the demand for privacy-compliant unlearning. However, due to the high-dimensional nature and complex feature representations of diffusion models, achieving selective unlearning remains challenging, as existing methods struggle to remove sensitive information while preserving the consistency of non-sensitive regions. To address this, we propose an Automatic Dataset Creation Framework based on prompt-based layered editing and training-free local feature removal, constructing the ForgetMe dataset and introducing the Entangled evaluation metric. The Entangled metric quantifies unlearning effectiveness by assessing the similarity and consistency between the target and background regions and supports both paired (Entangled-D) and unpaired (Entangled-S) image data, enabling unsupervised evaluation. The ForgetMe dataset encompasses a diverse set of real and synthetic scenarios, including CUB-200-2011 (Birds), Stanford-Dogs, ImageNet, and a synthetic cat dataset. We apply LoRA fine-tuning on Stable Diffusion to achieve selective unlearning on this dataset and validate the effectiveness of both the ForgetMe dataset and the Entangled metric, establishing them as benchmarks for selective unlearning. Our work provides a scalable and adaptable solution for advancing privacy-preserving generative AI.
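As a rough illustration of what an entanglement-style score can measure, the sketch below compares how much a paired edit changes the target region versus the background; the formula, function names, and normalization are assumptions for illustration and are not the paper's Entangled-D or Entangled-S definitions.

```python
import numpy as np

def entangled_paired(original, edited, target_mask):
    """Toy paired score: the target region should change after unlearning while
    the background stays consistent (higher is better under this toy definition).

    original, edited: (H, W, 3) float images in [0, 1].
    target_mask: (H, W) boolean mask of the region to be forgotten.
    """
    diff = np.abs(original - edited).mean(axis=-1)
    target_change = diff[target_mask].mean()          # want this large
    background_change = diff[~target_mask].mean()     # want this small
    return target_change / (background_change + 1e-8)

# Toy usage: an edit that only alters the masked region scores high.
H = W = 64
orig = np.random.rand(H, W, 3)
mask = np.zeros((H, W), dtype=bool)
mask[16:48, 16:48] = True
edited = orig.copy()
edited[mask] = np.random.rand(mask.sum(), 3)
print(entangled_paired(orig, edited, mask))
```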
https://arxiv.org/abs/2504.12574
While metrics available during pre-training, such as perplexity, correlate well with model performance in scaling-law studies, their predictive capacity at a fixed model size remains unclear, hindering effective model selection and development. To address this gap, we formulate the task of selecting pre-training checkpoints to maximize downstream fine-tuning performance as a pairwise classification problem: predicting which of two LLMs, differing in their pre-training, will perform better after supervised fine-tuning (SFT). We construct a dataset using 50 1B-parameter LLM variants with systematically varied pre-training configurations, e.g., objectives or data, and evaluate them on diverse downstream tasks after SFT. We first conduct a study demonstrating that conventional perplexity is a misleading indicator in this setting. We therefore introduce novel unsupervised and supervised proxy metrics derived from pre-training that successfully reduce the relative performance prediction error rate by over 50%. Despite the inherent complexity of this task, we demonstrate the practical utility of our proposed proxies in specific scenarios, paving the way for more efficient design of pre-training schemes optimized for various downstream tasks.
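The pairwise formulation and its headline metric can be made concrete with a short sketch: given one proxy value per checkpoint and the corresponding post-SFT scores, count how often the proxy orders a pair incorrectly. The variable names and toy numbers are illustrative, and the paper's exact error-rate definition may differ.

```python
import numpy as np

def pairwise_error_rate(proxy_scores, sft_scores):
    """Fraction of checkpoint pairs where the proxy ranks them opposite to their
    post-SFT performance (the 'which of two checkpoints is better' task)."""
    proxy, sft = np.asarray(proxy_scores), np.asarray(sft_scores)
    wrong, total = 0, 0
    n = len(proxy)
    for i in range(n):
        for j in range(i + 1, n):
            if sft[i] == sft[j]:
                continue  # ties carry no ranking information
            total += 1
            if (proxy[i] > proxy[j]) != (sft[i] > sft[j]):
                wrong += 1
    return wrong / max(total, 1)

# Toy usage: hypothetical post-SFT scores vs. negative perplexity as a proxy.
sft = np.array([0.61, 0.72, 0.55, 0.68])
neg_ppl = np.array([-9.1, -9.3, -8.8, -9.2])   # made-up values for illustration
print(pairwise_error_rate(neg_ppl, sft))
```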
https://arxiv.org/abs/2504.12491
Deep learning has provided considerable advancements for multimedia systems, yet the interpretability of deep models remains a challenge. State-of-the-art post-hoc explainability methods, such as GradCAM, provide visual interpretation based on heatmaps but lack conceptual clarity. Prototype-based approaches, like ProtoPNet and PIPNet, offer a more structured explanation but rely on fixed patches, limiting their robustness and semantic consistency. To address these limitations, a part-prototypical concept mining network (PCMNet) is proposed that dynamically learns interpretable prototypes from meaningful regions. PCMNet clusters prototypes into concept groups, creating semantically grounded explanations without requiring additional annotations. Through a joint process of unsupervised part discovery and concept activation vector extraction, PCMNet effectively captures discriminative concepts and makes interpretable classification decisions. Our extensive experiments comparing PCMNet against state-of-the-art methods on multiple datasets show that it can provide a high level of interpretability, stability, and robustness under clean and occluded scenarios.
https://arxiv.org/abs/2504.12197
Homography estimation is a fundamental task in computer vision with applications in diverse fields. Recent advances in deep learning have improved homography estimation, particularly with unsupervised learning approaches, offering increased robustness and generalizability. However, accurately predicting homography, especially in complex motions, remains a challenge. In response, this work introduces a novel method leveraging video coding, particularly by harnessing the inherent motion vectors (MVs) present in videos. We present CodingHomo, an unsupervised framework for homography estimation. Our framework features a Mask-Guided Fusion (MGF) module that identifies and utilizes beneficial features among the MVs, thereby enhancing the accuracy of homography prediction. Additionally, the Mask-Guided Homography Estimation (MGHE) module is presented for eliminating undesired features in the coarse-to-fine homography refinement process. CodingHomo outperforms existing state-of-the-art unsupervised methods, delivering good robustness and generalizability. The code and dataset are available at: this https URL.
https://arxiv.org/abs/2504.12165
A domain (distribution) shift between training and test data often hinders the real-world performance of deep neural networks, necessitating unsupervised domain adaptation (UDA) to bridge this gap. Online source-free UDA has emerged as a solution for practical scenarios where access to source data is restricted and target data is received as a continuous stream. However, the open-world nature of many real-world applications additionally introduces category shifts, meaning that the source and target label spaces may differ. Online source-free universal domain adaptation (SF-UniDA) addresses this challenge. Existing methods mainly rely on self-training with pseudo-labels, yet the relationship between pseudo-labeling and adaptation outcomes has not yet been studied. To bridge this gap, we conduct a systematic analysis through controlled experiments with simulated pseudo-labeling, offering valuable insights into pseudo-labeling for online SF-UniDA. Our findings reveal a substantial gap between the current state-of-the-art and the upper bound of adaptation achieved with perfect pseudo-labeling. Moreover, we show that a contrastive loss enables effective adaptation even with moderate pseudo-label accuracy, while a cross-entropy loss, though less robust to pseudo-label errors, achieves superior results when pseudo-labeling approaches perfection. Lastly, our findings indicate that pseudo-label accuracy is in general more crucial than quantity, suggesting that prioritizing fewer but high-confidence pseudo-labels is beneficial. Overall, our study highlights the critical role of pseudo-labeling in (online) SF-UniDA and provides actionable insights to drive future advancements in the field. Our code is available at this https URL.
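The "fewer but high-confidence pseudo-labels" finding translates naturally into a confidence-filtered self-training step; the sketch below shows one hedged way to do it with a cross-entropy objective (the threshold and loss form are assumptions, not the protocol used in the paper's controlled experiments).

```python
import torch
import torch.nn.functional as F

def select_pseudo_labels(logits, threshold=0.3):
    """Keep only high-confidence pseudo-labels, trading quantity for accuracy
    (the threshold value is an illustrative assumption)."""
    probs = F.softmax(logits, dim=-1)
    conf, pseudo = probs.max(dim=-1)
    keep = conf >= threshold
    return pseudo[keep], keep

def pseudo_label_ce(logits, pseudo, keep):
    """Cross-entropy on the retained samples only; noisy labels hurt this loss
    more than a contrastive objective would, per the paper's analysis."""
    if keep.sum() == 0:
        return logits.new_tensor(0.0)
    return F.cross_entropy(logits[keep], pseudo)

# Toy usage on a random batch from the target stream.
logits = torch.randn(32, 10)
pseudo, keep = select_pseudo_labels(logits)
print("kept:", int(keep.sum()), "loss:", float(pseudo_label_ce(logits, pseudo, keep)))
```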
https://arxiv.org/abs/2504.11992
Fine-tuning vision-language models (VLMs) with large amounts of unlabeled data has recently garnered significant interest. However, a key challenge remains the lack of high-quality pseudo-labeled data. Current pseudo-labeling strategies often struggle with mismatches between semantic and visual information, leading to sub-optimal performance of unsupervised prompt learning (UPL) methods. In this paper, we introduce a simple yet effective approach called Augmenting Discriminative Richness via Diffusions (AiR), which learns a richer, more discriminative representation of each class and thereby facilitates classification. Specifically, our approach includes a pseudo-label generation module that leverages high-fidelity synthetic samples to create an auxiliary classifier, which captures richer visual variation, bridging text-image-pair classification to a more robust image-image-pair classification. Additionally, we exploit the diversity of diffusion-based synthetic samples to enhance prompt learning, providing greater information for semantic-visual alignment. Extensive experiments on five public benchmarks, including RESISC45 and Flowers102, and across three learning paradigms (UL, SSL, and TRZSL) demonstrate that AiR achieves substantial and consistent performance improvements over state-of-the-art unsupervised prompt learning methods.
https://arxiv.org/abs/2504.11930
Unsupervised anomaly detection in hyperspectral images (HSI), which aims to detect unknown targets against backgrounds, is challenging for earth surface monitoring. However, current studies are hindered by steep computational costs due to the high dimensionality of HSI and the dense sampling-based training paradigm, constraining their rapid deployment. Our key observation is that, during training, not all samples within the same homogeneous area are indispensable, whereas ingenious sampling can provide a powerful substitute for reducing costs. Motivated by this, we propose an Asymmetrical Consensus State Space Model (ACMamba) to significantly reduce computational costs without compromising accuracy. Specifically, we design an asymmetrical anomaly detection paradigm that utilizes region-level instances as an efficient alternative to dense pixel-level samples. In this paradigm, a low-cost Mamba-based module is introduced to discover global contextual attributes of regions that are essential for HSI reconstruction. Additionally, we develop a consensus learning strategy from the optimization perspective to simultaneously facilitate background reconstruction and anomaly compression, further alleviating the negative impact of anomaly reconstruction. Theoretical analysis and extensive experiments across eight benchmarks verify the superiority of ACMamba, demonstrating faster speed and stronger performance than the state-of-the-art.
https://arxiv.org/abs/2504.11781
Cross-linguistically, native words and loanwords follow different phonological rules. In English, for example, words of Germanic and Latinate origin exhibit different stress patterns, and a certain syntactic structure is exclusive to Germanic verbs. When viewed as a cognitive model, however, such etymology-based generalizations face challenges in terms of learnability, since the historical origins of words are presumably inaccessible information for general language learners. In this study, we present computational evidence indicating that the Germanic-Latinate distinction in the English lexicon is learnable from the phonotactic information of individual words. Specifically, we performed unsupervised clustering on corpus-extracted words, and the resulting word clusters largely aligned with the etymological distinction. The model-discovered clusters also recovered various linguistic generalizations documented in the previous literature regarding the corresponding etymological classes. Moreover, our findings uncovered previously unrecognized features of the quasi-etymological clusters, offering novel hypotheses for future experimental studies.
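A toy version of the analysis can be run in a few lines: represent each word by character n-gram counts (a crude proxy for phonotactics) and cluster the words without supervision. The word list, feature choice, and cluster count below are illustrative and are not the study's corpus or settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Illustrative word list: five verbs of Germanic origin, five of Latinate origin.
words = ["understand", "forgive", "withdraw", "begin", "overcome",
         "comprehend", "pardon", "retract", "commence", "surmount"]

# Character n-grams as a rough orthographic stand-in for phonotactic features.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = vectorizer.fit_transform(words)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for word, label in zip(words, labels):
    print(label, word)
```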
https://arxiv.org/abs/2504.11770
We study the hard problem of 3D object segmentation in complex point clouds without requiring human labels of 3D scenes for supervision. By relying on the similarity of pretrained 2D features or external signals such as motion to group 3D points into objects, existing unsupervised methods are usually limited to identifying simple objects like cars, or their segmented objects are often inferior due to the lack of objectness in the pretrained features. In this paper, we propose a new two-stage pipeline called GrabS. The core concept of our method is to learn generative and discriminative object-centric priors as a foundation from object datasets in the first stage, and then design an embodied agent to learn to discover multiple objects by querying against the pretrained generative priors in the second stage. We extensively evaluate our method on two real-world datasets and a newly created synthetic dataset, demonstrating remarkable segmentation performance that clearly surpasses all existing unsupervised methods.
https://arxiv.org/abs/2504.11754
Recent advances in Source-Free Unsupervised Video Domain Adaptation (SFUVDA) leverage vision-language models to enhance pseudo-label generation. However, challenges such as noisy pseudo-labels and over-confident predictions limit their effectiveness in adapting well across domains. We propose Co-STAR, a novel framework that integrates curriculum learning with collaborative self-training between a source-trained teacher and a contrastive vision-language model (CLIP). Our curriculum learning approach employs a reliability-based weight function that measures bidirectional prediction alignment between the teacher and CLIP, balancing between confident and uncertain predictions. This function preserves uncertainty for difficult samples, while prioritizing reliable pseudo-labels when the predictions from both models closely align. To further improve adaptation, we propose Adaptive Curriculum Regularization, which modifies the learning priority of samples in a probabilistic, adaptive manner based on their confidence scores and prediction stability, mitigating overfitting to noisy and over-confident samples. Extensive experiments across multiple video domain adaptation benchmarks demonstrate that Co-STAR consistently outperforms state-of-the-art SFUVDA methods. Code is available at: this https URL
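One way to picture the reliability weight is as a monotone function of the disagreement between the teacher's and CLIP's predictive distributions; the sketch below uses a symmetric KL divergence mapped through an exponential, which is an assumed form rather than Co-STAR's exact weight function.

```python
import torch
import torch.nn.functional as F

def reliability_weight(p_teacher, p_clip, temperature=1.0):
    """Toy reliability weight from bidirectional prediction alignment: samples
    where the source-trained teacher and CLIP agree get weights near 1, and
    disagreements are down-weighted.

    p_teacher, p_clip: (N, C) probability distributions over classes.
    """
    kl_tc = F.kl_div(p_clip.clamp_min(1e-8).log(), p_teacher, reduction="none").sum(-1)
    kl_ct = F.kl_div(p_teacher.clamp_min(1e-8).log(), p_clip, reduction="none").sum(-1)
    divergence = 0.5 * (kl_tc + kl_ct)            # symmetric KL
    return torch.exp(-divergence / temperature)   # in (0, 1], 1 = perfect agreement

# Toy usage with random predictions from both models.
p_t = F.softmax(torch.randn(4, 5), dim=-1)
p_c = F.softmax(torch.randn(4, 5), dim=-1)
print(reliability_weight(p_t, p_c))
```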
https://arxiv.org/abs/2504.11669
We propose PartField, a feedforward approach for learning part-based 3D features, which captures the general concept of parts and their hierarchy without relying on predefined templates or text-based names, and can be applied to open-world 3D shapes across various modalities. PartField requires only a 3D feedforward pass at inference time, significantly improving runtime and robustness compared to prior approaches. Our model is trained by distilling 2D and 3D part proposals from a mix of labeled datasets and image segmentations on large unsupervised datasets, via a contrastive learning formulation. It produces a continuous feature field which can be clustered to yield a hierarchical part decomposition. Comparisons show that PartField is up to 20% more accurate and often orders of magnitude faster than other recent class-agnostic part-segmentation methods. Beyond single-shape part decomposition, consistency in the learned field emerges across shapes, enabling tasks such as co-segmentation and correspondence, which we demonstrate in several applications of these general-purpose, hierarchical, and consistent 3D feature fields. Check our Webpage! this https URL
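Conceptually, once a per-point feature field exists, a hierarchical part decomposition can be read out by clustering it at several granularities; the sketch below illustrates that step with agglomerative clustering, which is an assumption for illustration rather than PartField's own grouping procedure.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def hierarchical_parts(point_features, levels=(2, 4, 8)):
    """Turn a continuous per-point feature field into a coarse-to-fine part
    decomposition by clustering at increasing granularity.

    point_features: (N, D) feature vectors, one per 3D point.
    Returns one label array per requested level of the hierarchy.
    """
    return {k: AgglomerativeClustering(n_clusters=k).fit_predict(point_features)
            for k in levels}

# Toy usage on random features for 1,000 points.
labels = hierarchical_parts(np.random.rand(1000, 16))
print({k: len(set(v)) for k, v in labels.items()})
```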
https://arxiv.org/abs/2504.11451
Post-hoc, unsupervised concept-based explanation methods (U-CBEMs) are a promising tool for generating semantic explanations of the decision-making processes in deep neural networks, having applications in both model improvement and understanding. It is vital that the explanation is accurate, or faithful, to the model, yet we identify several limitations of prior faithfulness metrics that inhibit an accurate evaluation; most notably, prior metrics involve only the set of concepts present, ignoring how they may be spatially distributed. We address these limitations with Surrogate Faithfulness (SF), an evaluation method that introduces a spatially-aware surrogate and two novel faithfulness metrics. Using SF, we produce Optimally Faithful (OF) explanations, where concepts are found that maximize faithfulness. Our experiments show that (1) adding spatial-awareness to prior U-CBEMs increases faithfulness in all cases; (2) OF produces significantly more faithful explanations than prior U-CBEMs (30% or higher improvement in error); (3) OF's learned concepts generalize well to out-of-domain data and are more robust to adversarial examples, where prior U-CBEMs struggle.
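The surrogate idea can be sketched compactly: fit a simple model that predicts the explained network's outputs from spatially arranged concept activations, and score faithfulness by how often the two agree. The surrogate choice, feature layout, and evaluation on the fitting set are simplifying assumptions, not SF's actual metrics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def surrogate_faithfulness(concept_maps, model_preds):
    """Toy spatially-aware surrogate: instead of only 'which concepts are
    present', the surrogate sees per-region concept activations and is scored
    by how often it reproduces the explained model's predictions.

    concept_maps: (N, K, R) activations of K concepts over R spatial regions.
    model_preds:  (N,) class predictions of the model being explained.
    """
    N = concept_maps.shape[0]
    X = concept_maps.reshape(N, -1)          # keep the spatial layout as features
    surrogate = LogisticRegression(max_iter=1000).fit(X, model_preds)
    # A proper evaluation would use a held-out split; this is only a sketch.
    return (surrogate.predict(X) == model_preds).mean()

# Toy usage with random activations and predictions.
rng = np.random.default_rng(0)
maps = rng.random((200, 8, 49))              # 8 concepts on a 7x7 grid
preds = rng.integers(0, 3, size=200)
print(surrogate_faithfulness(maps, preds))
```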
https://arxiv.org/abs/2504.10833
Novel view synthesis (NVS) in low-light scenes remains a significant challenge due to degraded inputs characterized by severe noise, low dynamic range (LDR), and unreliable initialization. While recent NeRF-based approaches have shown promising results, most suffer from high computational costs, and some rely on carefully captured or pre-processed data, such as RAW sensor inputs or multi-exposure sequences, which severely limits their practicality. In contrast, 3D Gaussian Splatting (3DGS) enables real-time rendering with competitive visual fidelity; however, existing 3DGS-based methods struggle with low-light sRGB inputs, resulting in unstable Gaussian initialization and ineffective noise suppression. To address these challenges, we propose LL-Gaussian, a novel framework for 3D reconstruction and enhancement from low-light sRGB images, enabling pseudo normal-light novel view synthesis. Our method introduces three key innovations: 1) an end-to-end Low-Light Gaussian Initialization Module (LLGIM) that leverages dense priors from a learning-based MVS approach to generate high-quality initial point clouds; 2) a dual-branch Gaussian decomposition model that disentangles intrinsic scene properties (reflectance and illumination) from transient interference, enabling stable and interpretable optimization; 3) an unsupervised optimization strategy guided by both physical constraints and a diffusion prior to jointly steer decomposition and enhancement. Additionally, we contribute a challenging dataset collected in extreme low-light environments and demonstrate the effectiveness of LL-Gaussian. Compared to state-of-the-art NeRF-based methods, LL-Gaussian achieves up to 2,000 times faster inference and reduces training time to just 2%, while delivering superior reconstruction and rendering quality.
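The dual-branch decomposition in innovation 2) follows a Retinex-style factorization (input is approximately reflectance * illumination). Below is a hedged toy sketch of that factorization with a reconstruction and smoothness objective; the network sizes, losses, and the omission of the diffusion prior and the Gaussian-splatting machinery are all simplifications relative to LL-Gaussian.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBranch(nn.Module):
    """Toy convolutional branch standing in for one half of the dual-branch model."""
    def __init__(self, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, out_channels, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

refl_net, illum_net = TinyBranch(3), TinyBranch(1)

def decomposition_step(low_light):
    """One unsupervised step: reflectance * illumination must reconstruct the
    input, and a smoothness prior keeps illumination low-frequency."""
    reflectance = refl_net(low_light)        # (B, 3, H, W) scene-intrinsic colors
    illumination = illum_net(low_light)      # (B, 1, H, W) per-pixel lighting
    recon_loss = F.l1_loss(reflectance * illumination, low_light)
    smooth_loss = (illumination.diff(dim=-1).abs().mean() +
                   illumination.diff(dim=-2).abs().mean())
    return recon_loss + 0.1 * smooth_loss, reflectance

loss, refl = decomposition_step(torch.rand(2, 3, 64, 64))
print(loss.item())
```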
https://arxiv.org/abs/2504.10331
We present a new self-supervised deep-learning-based Ghost Imaging (GI) reconstruction method, which provides unparalleled reconstruction performance for noisy acquisitions among unsupervised methods. We present the supporting mathematical framework and results from theoretical and real data use cases. Self-supervision removes the need for clean reference data while offering strong noise reduction. This provides the necessary tools for addressing signal-to-noise ratio concerns for GI acquisitions in emerging and cutting-edge low-light GI scenarios. Notable examples include micro- and nano-scale x-ray emission imaging, e.g., x-ray fluorescence imaging of dose-sensitive samples. Their applications include in-vivo and in-operando case studies for biological samples and batteries.
https://arxiv.org/abs/2504.10288
Characterization of atomic-scale materials traditionally requires human experts with months to years of specialized training. Even for trained human operators, accurate and reliable characterization remains challenging when examining newly discovered materials such as two-dimensional (2D) structures. This bottleneck drives demand for fully autonomous experimentation systems capable of comprehending research objectives without requiring large training datasets. In this work, we present ATOMIC (Autonomous Technology for Optical Microscopy & Intelligent Characterization), an end-to-end framework that integrates foundation models to enable fully autonomous, zero-shot characterization of 2D materials. Our system integrates a vision foundation model (i.e., the Segment Anything Model), large language models (i.e., ChatGPT), unsupervised clustering, and topological analysis to automate microscope control, sample scanning, image segmentation, and intelligent analysis through prompt engineering, eliminating the need for additional training. When analyzing typical MoS2 samples, our approach achieves 99.7% segmentation accuracy for single-layer identification, which is equivalent to that of human experts. In addition, the integrated model is able to detect grain boundary slits that are challenging to identify with the human eye. Furthermore, the system retains robust accuracy despite variable conditions including defocus, color temperature fluctuations, and exposure variations. It is applicable to a broad spectrum of common 2D materials, including graphene, MoS2, WSe2, and SnSe, regardless of whether they were fabricated via chemical vapor deposition or mechanical exfoliation. This work demonstrates the use of foundation models to achieve autonomous analysis, establishing a scalable and data-efficient characterization paradigm that fundamentally transforms the approach to nanoscale materials research.
https://arxiv.org/abs/2504.10281