We introduce Perception Encoder (PE), a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods: language alignment for multimodal language modeling and spatial alignment for dense prediction. Together with the core contrastive checkpoint, our PE family of models achieves state-of-the-art performance on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, depth estimation, and tracking. To foster further research, we are releasing our models, code, and a novel dataset of synthetically and human-annotated videos.
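The claim that the most general embeddings sit in intermediate layers suggests a simple probing setup. Below is a minimal sketch using a generic torchvision ViT as a stand-in for PE; the model choice, hook placement, and probed layer are illustrative assumptions, not the released PE code:

```python
# Sketch: register forward hooks on every transformer block and probe a middle layer,
# in the spirit of "the best embeddings are hidden in the middle of the network".
import torch
import torchvision.models as models

vit = models.vit_b_16(weights=None).eval()   # stand-in encoder, not PE itself
features = {}

def hook(idx):
    def fn(module, inp, out):
        features[idx] = out                  # (B, num_tokens, dim) token embeddings
    return fn

for i, block in enumerate(vit.encoder.layers):
    block.register_forward_hook(hook(i))

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    vit(x)

mid = features[len(vit.encoder.layers) // 2]  # probe a middle layer
print(mid.shape)  # intermediate tokens an alignment head could be trained on
```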
https://arxiv.org/abs/2504.13181
This study presents a detailed comparison of the RF-DETR object detection base model and YOLOv12 object detection model configurations for detecting greenfruits in a complex orchard environment marked by label ambiguity, occlusions, and background blending. A custom dataset was developed featuring both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruit) annotations to assess model performance under dynamic real-world conditions. The RF-DETR object detection model, utilizing a DINOv2 backbone and deformable attention, excelled in global context modeling, effectively identifying partially occluded or ambiguous greenfruits. In contrast, YOLOv12 leveraged CNN-based attention for enhanced local feature extraction, optimizing it for computational efficiency and edge deployment. RF-DETR achieved the highest mean Average Precision (mAP@50) of 0.9464 in single-class detection, proving its superior ability to localize greenfruits in cluttered scenes. Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR consistently outperformed it in complex spatial scenarios. For multi-class detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to differentiate between occluded and non-occluded fruits, while YOLOv12L scored highest in mAP@50:95 with 0.6622, indicating better classification in detailed occlusion contexts. Analysis of training dynamics highlighted RF-DETR's swift convergence, particularly in single-class settings where it plateaued within 10 epochs, demonstrating the efficiency of transformer-based architectures in adapting to dynamic visual data. These findings validate RF-DETR's effectiveness for precision agricultural applications, with YOLOv12 suited for fast-response scenarios. Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs
https://arxiv.org/abs/2504.13099
Video Anomaly Detection (VAD) focuses on identifying anomalies within videos. Supervised methods require substantial amounts of in-domain training data and often struggle to generalize to unseen anomalies. In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies but face challenges in localizing fine-grained visual transitions and diverse events. Therefore, we propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs (MLLMs) through temporal-event reasoning. Specifically, EventVAD first employs dynamic spatiotemporal graph modeling with time-decay constraints to capture event-aware video features. Then, it performs adaptive noise filtering and applies signal-ratio thresholding to detect event boundaries from unsupervised statistical features. The statistical boundary-detection module reduces the complexity of processing long videos for MLLMs and improves their temporal reasoning through event consistency. Finally, it utilizes a hierarchical prompting strategy to guide MLLMs in reasoning before reaching final decisions. We conducted extensive experiments on the UCF-Crime and XD-Violence datasets. The results demonstrate that EventVAD with a 7B MLLM achieves state-of-the-art (SOTA) performance in training-free settings, outperforming strong baselines that use 7B or larger MLLMs.
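The statistical boundary-detection step lends itself to a compact illustration. The following is a hedged sketch of signal-ratio thresholding over per-frame features; the window sizes, baseline estimate, and threshold are assumptions, not the paper's exact procedure:

```python
# Sketch: smooth a per-frame change signal, then declare an event boundary wherever
# the ratio of local signal energy to a running baseline exceeds a threshold.
import numpy as np

def detect_event_boundaries(frame_feats, win=8, ratio_thresh=2.0):
    # frame_feats: (T, D) per-frame embeddings from any visual encoder
    diffs = np.linalg.norm(np.diff(frame_feats, axis=0), axis=1)       # frame-to-frame change
    kernel = np.ones(win) / win
    smooth = np.convolve(diffs, kernel, mode="same")                   # simple noise filtering
    baseline = np.convolve(smooth, np.ones(4 * win) / (4 * win), mode="same") + 1e-6
    ratio = smooth / baseline                                          # signal ratio
    return np.where(ratio > ratio_thresh)[0] + 1                       # boundary frame indices

feats = np.random.randn(300, 512)            # toy 300-frame video features
print(detect_event_boundaries(feats)[:10])
```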
https://arxiv.org/abs/2504.13092
Eradicating poverty is the first goal in the United Nations Sustainable Development Goals. However, aporophobia -- the societal bias against people living in poverty -- constitutes a major obstacle to designing, approving and implementing poverty-mitigation policies. This work presents an initial step towards operationalizing the concept of aporophobia to identify and track harmful beliefs and discriminative actions against poor people on social media. In close collaboration with non-profits and governmental organizations, we conduct data collection and exploration. Then we manually annotate a corpus of English tweets from five world regions for the presence of (1) direct expressions of aporophobia, and (2) statements referring to or criticizing aporophobic views or actions of others, to comprehensively characterize the social media discourse related to bias and discrimination against the poor. Based on the annotated data, we devise a taxonomy of categories of aporophobic attitudes and actions expressed through speech on social media. Finally, we train several classifiers and identify the main challenges for automatic detection of aporophobia in social networks. This work paves the way towards identifying, tracking, and mitigating aporophobic views on social media at scale.
https://arxiv.org/abs/2504.13085
Echocardiography is crucial for cardiovascular disease detection but relies heavily on experienced sonographers. Echocardiography probe guidance systems, which provide real-time movement instructions for acquiring standard plane images, offer a promising solution for AI-assisted or fully autonomous scanning. However, developing effective machine learning models for this task remains challenging, as they must grasp heart anatomy and the intricate interplay between probe motion and visual signals. To address this, we present EchoWorld, a motion-aware world modeling framework for probe guidance that encodes anatomical knowledge and motion-induced visual dynamics, while effectively leveraging past visual-motion sequences to enhance guidance precision. EchoWorld employs a pre-training strategy inspired by world modeling principles, where the model predicts masked anatomical regions and simulates the visual outcomes of probe adjustments. Built upon this pre-trained model, we introduce a motion-aware attention mechanism in the fine-tuning stage that effectively integrates historical visual-motion data, enabling precise and adaptive probe guidance. Trained on more than one million ultrasound images from over 200 routine scans, EchoWorld effectively captures key echocardiographic knowledge, as validated by qualitative analysis. Moreover, our method significantly reduces guidance errors compared to existing visual backbones and guidance frameworks, excelling in both single-frame and sequential evaluation protocols. Code is available at this https URL.
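A rough sketch of how a motion-aware attention layer might condition on past visual-motion pairs is shown below; the dimensions, additive motion embedding, and single attention block are illustrative assumptions rather than EchoWorld's actual architecture:

```python
# Sketch: attend from the current frame feature to a history of (visual, probe-motion)
# pairs and regress the next probe adjustment.
import torch
import torch.nn as nn

class MotionAwareAttention(nn.Module):
    def __init__(self, dim=256, motion_dim=6, heads=4):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, dim)   # embed 6-DoF probe motion
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, motion_dim)          # predicted probe adjustment

    def forward(self, cur_feat, past_feats, past_motions):
        # cur_feat: (B, 1, dim); past_feats: (B, T, dim); past_motions: (B, T, 6)
        mem = past_feats + self.motion_proj(past_motions)   # motion-conditioned memory
        fused, _ = self.attn(cur_feat, mem, mem)
        return self.head(fused.squeeze(1))

m = MotionAwareAttention()
out = m(torch.randn(2, 1, 256), torch.randn(2, 10, 256), torch.randn(2, 10, 6))
print(out.shape)  # (2, 6) guidance signal
```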
https://arxiv.org/abs/2504.13065
Anomaly detection is a crucial task in computer vision, yet collecting real-world defect images is inherently difficult due to the rarity and unpredictability of anomalies. Consequently, researchers have turned to synthetic methods for training data augmentation. However, existing synthetic strategies (e.g., naive cut-and-paste or inpainting) overlook the underlying physical causes of defects, leading to inconsistent, low-fidelity anomalies that hamper model generalization to real-world complexities. In this thesis, we introduce a novel pipeline that generates synthetic anomalies under Math-Physics model guidance, refines them via a Coarse-to-Fine approach, and employs a bi-level optimization strategy with a Synthesis Quality Estimator (SQE). By incorporating physical modeling of cracks, corrosion, and deformation, our method produces realistic defect masks, which are subsequently enhanced in two stages. The first stage (npcF) enforces PDE-based consistency to achieve a globally coherent anomaly structure, while the second stage (npcF++) further improves local fidelity using wavelet transforms and boundary synergy blocks. Additionally, we leverage SQE-driven weighting, ensuring that high-quality synthetic samples receive greater emphasis during training. To validate our approach, we conducted comprehensive experiments on three widely adopted industrial anomaly detection benchmarks: MVTec AD, VisA, and BTAD. Across these datasets, the proposed pipeline achieves state-of-the-art (SOTA) results in both image-AUROC and pixel-AUROC, confirming the effectiveness of our MaPhC2F and BiSQAD.
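The SQE-driven weighting can be illustrated with a short sketch: per-sample quality scores rescale the training loss so that high-quality synthetic anomalies dominate the gradient. The form below is an assumption for illustration; the paper's BiSQAD bi-level scheme is more involved:

```python
# Sketch: weight each synthetic sample's loss by its Synthesis Quality Estimator score.
import torch
import torch.nn.functional as F

def sqe_weighted_loss(logits, targets, sqe_scores):
    # logits: (B, C), targets: (B,), sqe_scores: (B,) in [0, 1] from a quality estimator
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    weights = sqe_scores / (sqe_scores.sum() + 1e-8)   # emphasize high-quality synthetics
    return (weights * per_sample).sum()

logits = torch.randn(8, 2, requires_grad=True)
loss = sqe_weighted_loss(logits, torch.randint(0, 2, (8,)), torch.rand(8))
loss.backward()
print(float(loss))
```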
https://arxiv.org/abs/2504.12970
The growing influence of video content as a medium for communication and misinformation underscores the urgent need for effective tools to analyze claims in multilingual and multi-topic settings. Existing efforts in misinformation detection largely focus on written text, leaving a significant gap in addressing the complexity of spoken text in video transcripts. We introduce ViClaim, a dataset of 1,798 annotated video transcripts across three languages (English, German, Spanish) and six topics. Each sentence in the transcripts is labeled with one of three claim-related categories: fact-check-worthy, fact-non-check-worthy, or opinion. We developed a custom annotation tool to facilitate the highly complex annotation process. Experiments with state-of-the-art multilingual language models demonstrate strong performance in cross-validation (macro F1 up to 0.896) but reveal challenges in generalization to unseen topics, particularly for distinct domains. Our findings highlight the complexity of claim detection in video transcripts. ViClaim offers a robust foundation for advancing misinformation detection in video-based communication, addressing a critical gap in multimodal analysis.
https://arxiv.org/abs/2504.12882
Large pretrained vision foundation models have shown significant potential in various vision tasks. However, for industrial anomaly detection, the scarcity of real defect samples poses a critical challenge in leveraging these models. While 2D anomaly generation has significantly advanced with established generative models, the adoption of 3D sensors in industrial manufacturing has made leveraging 3D data for surface quality inspection an emerging trend. In contrast to 2D techniques, 3D anomaly generation remains largely unexplored, limiting the potential of 3D data in industrial quality inspection. To address this gap, we propose a novel yet simple 3D anomaly generation method, 3D-PNAS, based on Perlin noise and surface parameterization. Our method generates realistic 3D surface anomalies by projecting the point cloud onto a 2D plane, sampling multi-scale noise values from a Perlin noise field, and perturbing the point cloud along its normal direction. Through comprehensive visualization experiments, we demonstrate how key parameters, including noise scale, perturbation strength, and octaves, provide fine-grained control over the generated anomalies, enabling the creation of diverse defect patterns from pronounced deformations to subtle surface variations. Additionally, our cross-category experiments show that the method produces consistent yet geometrically plausible anomalies across different object types, adapting to their specific surface characteristics. We also provide a comprehensive codebase and visualization toolkit to facilitate future research.
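A compact sketch of the generation idea follows: project points onto a 2D parameter plane, sample multi-octave smooth noise, and displace points along their normals. Smooth value noise stands in for a true Perlin field here, and parameterizing by the x-y footprint is an illustrative simplification, not the released 3D-PNAS code:

```python
# Sketch: multi-octave smooth noise on a 2D parameterization drives normal-direction
# displacement of a point cloud, producing synthetic surface anomalies.
import numpy as np

def smooth_noise(uv, grid=8, octaves=3, seed=0):
    # uv: (N, 2) coordinates in [0, 1]; returns one multi-octave noise value per point
    rng = np.random.default_rng(seed)
    total = np.zeros(len(uv))
    for o in range(octaves):
        res = grid * (2 ** o)
        lattice = rng.random((res + 1, res + 1))
        xy = uv * res
        x0 = np.clip(np.floor(xy[:, 0]).astype(int), 0, res - 1)
        y0 = np.clip(np.floor(xy[:, 1]).astype(int), 0, res - 1)
        fx, fy = xy[:, 0] - x0, xy[:, 1] - y0
        # bilinear interpolation of the coarse lattice -> smooth field
        v = (lattice[x0, y0] * (1 - fx) * (1 - fy) + lattice[x0 + 1, y0] * fx * (1 - fy)
             + lattice[x0, y0 + 1] * (1 - fx) * fy + lattice[x0 + 1, y0 + 1] * fx * fy)
        total += v / (2 ** o)                         # higher octaves add finer detail
    return total / sum(1.0 / 2 ** o for o in range(octaves))

def perturb_along_normals(points, normals, strength=0.02):
    # points, normals: (N, 3); parameterize the surface by its normalized x-y footprint
    uv = (points[:, :2] - points[:, :2].min(axis=0)) / (np.ptp(points[:, :2], axis=0) + 1e-8)
    noise = smooth_noise(uv) - 0.5                    # zero-centered displacement field
    return points + strength * noise[:, None] * normals

pts = np.random.rand(1000, 3)
nrm = np.tile([0.0, 0.0, 1.0], (1000, 1))
print(perturb_along_normals(pts, nrm).shape)          # (1000, 3) perturbed surface points
```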
https://arxiv.org/abs/2504.12856
Recent advances in industrial anomaly detection have highlighted the need for deeper logical anomaly analysis, where unexpected relationships among objects, counts, and spatial configurations must be identified and explained. Existing approaches often rely on large-scale external reasoning modules or elaborate pipeline designs, hindering practical deployment and interpretability. To address these limitations, we introduce a new task, Reasoning Logical Anomaly Detection (RLAD), which extends traditional anomaly detection by incorporating logical reasoning. We propose a new framework, LAD-Reasoner, a customized tiny multimodal language model built on Qwen2.5-VL 3B. Our approach leverages a two-stage training paradigm that first employs Supervised Fine-Tuning (SFT) for fine-grained visual understanding, followed by Group Relative Policy Optimization (GRPO) to refine logical anomaly detection and enforce coherent, human-readable reasoning. Crucially, reward signals are derived from both the detection accuracy and the structural quality of the outputs, obviating the need for building chain of thought (CoT) reasoning data. Experiments on the MVTec LOCO AD dataset show that LAD-Reasoner, though significantly smaller, matches the performance of Qwen2.5-VL-72B in accuracy and F1 score, and further excels in producing concise and interpretable rationales. This unified design reduces reliance on large models and complex pipelines, while offering transparent and interpretable insights into logical anomaly detection. Code and data will be released.
https://arxiv.org/abs/2504.12749
We present MaskMark, a simple, efficient, and flexible framework for image watermarking. MaskMark has two variants: MaskMark-D, which supports global watermark embedding, watermark localization, and local watermark extraction for applications such as tamper detection, and MaskMark-ED, which focuses on local watermark embedding and extraction with enhanced robustness in small regions, enabling localized image protection. Built upon the classical Encoder-Distortion-Decoder training paradigm, MaskMark-D introduces a simple masking mechanism during the decoding stage to support both global and local watermark extraction. A mask is applied to the watermarked image before extraction, allowing the decoder to focus on selected regions and learn local extraction. A localization module is also integrated into the decoder to identify watermark regions during inference, reducing interference from irrelevant content and improving accuracy. MaskMark-ED extends this design by incorporating the mask into the encoding stage as well, guiding the encoder to embed the watermark in designated local regions for enhanced robustness. Comprehensive experiments show that MaskMark achieves state-of-the-art performance in global watermark extraction, local watermark extraction, watermark localization, and multi-watermark embedding. It outperforms all existing baselines, including the recent leading local watermarking model WAM, while preserving high visual quality in the watermarked images. MaskMark is also flexible: by adjusting the distortion layer, it can adapt to different robustness requirements with just a few steps of fine-tuning. Moreover, our approach is efficient and easy to optimize, requiring only 20 hours on a single A6000 GPU, just 1/15 the computational cost of WAM.
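The decoding-stage masking mechanism is easy to picture with a toy example: a binary mask zeroes out everything outside the region of interest before the decoder runs, so one decoder can serve both global and local extraction. The tiny decoder and 48-bit message head below are placeholders, not the released MaskMark models:

```python
# Sketch: apply a binary region mask to the watermarked image before decoding.
import torch
import torch.nn as nn

decoder = nn.Sequential(                              # stand-in watermark decoder
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 48),  # 48-bit message head
)

watermarked = torch.rand(1, 3, 256, 256)
mask = torch.zeros(1, 1, 256, 256)
mask[..., 64:160, 64:160] = 1.0                       # region of interest (e.g. suspected local edit)

local_bits = decoder(watermarked * mask)              # local extraction: decoder sees only the region
global_bits = decoder(watermarked)                    # global extraction: full image
print(local_bits.shape, global_bits.shape)
```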
https://arxiv.org/abs/2504.12739
The significant achievements of pre-trained models leveraging large volumes of data in NLP and 2D vision inspire us to explore the potential of extensive data pre-training for 3D perception in autonomous driving. Toward this goal, this paper proposes to utilize massive unlabeled data from heterogeneous datasets to pre-train 3D perception models. We introduce a self-supervised pre-training framework that learns effective 3D representations from scratch on unlabeled data, combined with a prompt-adapter-based domain adaptation strategy to reduce dataset bias. The approach significantly improves model performance on downstream tasks such as 3D object detection, BEV segmentation, 3D object tracking, and occupancy prediction, and shows a steady performance increase as the training data volume scales up, demonstrating the potential to continually benefit 3D perception models for autonomous driving. We will release the source code to inspire further investigations in the community.
https://arxiv.org/abs/2504.12709
Multi-class Unsupervised Anomaly Detection algorithms (MUAD) are receiving increasing attention due to their relatively low deployment costs and improved training efficiency. However, the real-world effectiveness of MUAD methods is questioned due to limitations in current Industrial Anomaly Detection (IAD) datasets. These datasets contain numerous classes that are unlikely to be produced by the same factory and fail to cover multiple structures or appearances. Additionally, the defects do not reflect real-world characteristics. Therefore, we introduce the Heterogeneous Same-Sort Industrial Anomaly Detection (HSS-IAD) dataset, which contains 8,580 images of metallic-like industrial parts and precise anomaly annotations. These parts exhibit variations in structure and appearance, with subtle defects that closely resemble the base materials. We also provide foreground images for synthetic anomaly generation. Finally, we evaluate popular IAD methods on this dataset under multi-class and class-separated settings, demonstrating its potential to bridge the gap between existing datasets and real factory conditions. The dataset is available at this https URL.
https://arxiv.org/abs/2504.12689
Arabidopsis is a widely used model plant for gaining basic knowledge of plant physiology and development. Live imaging is an important technique for visualizing and quantifying elemental processes in plant development. To uncover novel theories underlying plant growth and cell division, accurate cell tracking on live imaging is of utmost importance. The commonly used cell tracking software, TrackMate, adopts a tracking-by-detection approach, applying the Laplacian of Gaussian (LoG) for blob detection and a Linear Assignment Problem (LAP) tracker for tracking. However, these methods do not perform well when cells are densely arranged. To alleviate these problems, we propose an accurate tracking method based on a Genetic Algorithm (GA) that uses knowledge of Arabidopsis root cellular patterns and spatial relationships among volumes. Our method can be described as coarse-to-fine: we first conduct relatively easy line-level tracking of cell nuclei, then perform more complicated nuclear tracking based on the known linear arrangement of cell files and the spatial relationships among nuclei. Our method has been evaluated on a long-duration live imaging dataset of Arabidopsis root tips, and with minor manual rectification, it accurately tracks nuclei. To the best of our knowledge, this research represents the first successful attempt to address a long-standing problem in time-lapse microscopy of the root meristem by proposing an accurate tracking method for Arabidopsis root nuclei.
https://arxiv.org/abs/2504.12676
Recent advances in deep learning, particularly frequency dynamic convolution (FDY conv), have significantly improved sound event detection (SED) by enabling frequency-adaptive feature extraction. However, FDY conv relies on temporal average pooling, which treats all temporal frames equally, limiting its ability to capture transient sound events such as alarm bells, door knocks, and speech plosives. To address this limitation, we propose temporal attention pooling frequency dynamic convolution (TFD conv) to replace temporal average pooling with temporal attention pooling (TAP). TAP adaptively weights temporal features through three complementary mechanisms: time attention pooling (TA) for emphasizing salient features, velocity attention pooling (VA) for capturing transient changes, and conventional average pooling for robustness to stationary signals. Ablation studies show that TFD conv improves average PSDS1 by 3.02% over FDY conv with only a 14.8% increase in parameter count. Classwise ANOVA and Tukey HSD analysis further demonstrate that TFD conv significantly enhances detection performance for transient-heavy events, outperforming existing FDY conv models. Notably, TFD conv achieves a maximum PSDS1 score of 0.456, surpassing previous state-of-the-art SED systems. We also explore the compatibility of TAP with other FDY conv variants, including dilated FDY conv (DFD conv), partial FDY conv (PFD conv), and multi-dilated FDY conv (MDFD conv). Among these, the integration of TAP with MDFD conv achieves the best result with a PSDS1 score of 0.459, validating the complementary strengths of temporal attention and multi-scale frequency adaptation. These findings establish TFD conv as a powerful and generalizable framework for enhancing both transient sensitivity and overall feature robustness in SED.
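A minimal sketch of temporal attention pooling as described, combining time attention, velocity attention over frame-to-frame differences, and plain averaging, is given below; the layer sizes and equal-weight fusion are assumptions rather than the paper's exact module:

```python
# Sketch: three complementary poolings over the time axis of a (B, C, T) feature map.
import torch
import torch.nn as nn

class TemporalAttentionPooling(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ta = nn.Conv1d(channels, channels, 1)   # scores salient frames
        self.va = nn.Conv1d(channels, channels, 1)   # scores transient changes

    def forward(self, x):                            # x: (B, C, T)
        ta_w = torch.softmax(self.ta(x), dim=-1)
        vel = torch.diff(x, dim=-1, prepend=x[..., :1])   # frame-to-frame "velocity"
        va_w = torch.softmax(self.va(vel), dim=-1)
        ta_pool = (ta_w * x).sum(-1)                 # emphasizes salient features
        va_pool = (va_w * x).sum(-1)                 # emphasizes transients
        avg_pool = x.mean(-1)                        # robustness to stationary signals
        return (ta_pool + va_pool + avg_pool) / 3

pool = TemporalAttentionPooling(64)
print(pool(torch.randn(2, 64, 156)).shape)           # (2, 64)
```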
https://arxiv.org/abs/2504.12670
This paper presents a novel autonomous drone-based smoke plume tracking system capable of navigating and tracking plumes in highly unsteady atmospheric conditions. The system integrates advanced hardware, software, and a comprehensive simulation environment to ensure robust performance in controlled and real-world settings. The quadrotor, equipped with a high-resolution imaging system and an advanced onboard computing unit, performs precise maneuvers while accurately detecting and tracking dynamic smoke plumes under fluctuating conditions. Our software implements a two-phase flight operation: descending into the smoke plume upon detection and continuously monitoring the smoke movement during in-plume tracking. Leveraging Proportional-Integral-Derivative (PID) control and a Proximal Policy Optimization (PPO)-based Deep Reinforcement Learning (DRL) controller enables adaptation to plume dynamics. Unreal Engine simulation evaluates performance under various smoke-wind scenarios, from steady flow to complex, unsteady fluctuations, showing that while the PID controller performs adequately in simpler scenarios, the DRL-based controller excels in more challenging environments. Field tests corroborate these findings. This system opens new possibilities for drone-based monitoring in areas such as wildfire management and air quality assessment. The successful integration of DRL for real-time decision-making advances autonomous drone control for dynamic environments.
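The PID-controlled phase can be illustrated with a toy one-dimensional loop; the gains and the crude plant model below are illustrative only, whereas the paper's controller acts on the real quadrotor state:

```python
# Sketch: a PID loop driving a plume-center offset toward zero.
class PID:
    def __init__(self, kp, ki, kd, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_err) / self.dt
        self.prev_err = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID(kp=0.8, ki=0.1, kd=0.2)
offset = 2.0                        # plume-center offset in the image (arbitrary units)
for _ in range(200):
    cmd = pid.step(offset)          # velocity command toward the plume center
    offset -= 0.05 * cmd            # crude plant model for the sketch
print(round(offset, 3))             # offset driven toward zero by the controller
```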
https://arxiv.org/abs/2504.12664
This technical report introduces a targeted improvement to the StreamPETR framework, specifically aimed at enhancing velocity estimation, a critical factor influencing the overall NuScenes Detection Score (NDS). While StreamPETR exhibits strong 3D bounding box detection performance, as reflected by its high mean Average Precision, our analysis identified velocity estimation as a substantial bottleneck when evaluated on the NuScenes dataset. To overcome this limitation, we propose a customized positional embedding strategy tailored to enhance temporal modeling capabilities. Experimental evaluations conducted on the NuScenes test set demonstrate that our improved approach achieves a state-of-the-art NDS of 70.86% using the ViT-L backbone, setting a new benchmark for camera-only 3D object detection.
https://arxiv.org/abs/2504.12643
Building change detection remains challenging for urban development, disaster assessment, and military reconnaissance. While foundation models such as the Segment Anything Model (SAM) show strong segmentation capabilities, SAM is limited in building change detection due to domain gap issues. Existing adapter-based fine-tuning approaches struggle with imbalanced building distributions, resulting in poor detection of subtle changes and inaccurate edge extraction. Additionally, bi-temporal misalignment in change detection, typically addressed by optical flow, remains vulnerable to background noise; this hampers the detection of building changes and compromises both detection accuracy and edge recognition. To tackle these challenges, we propose a new SAM-based network with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping (FAEWNet) for building change detection. FAEWNet utilizes the SAM encoder to extract rich visual features from remote sensing images. To guide SAM in focusing on specific ground objects in remote sensing scenes, we propose a Distribution-Aware Fourier Aggregated Adapter to aggregate task-oriented change information. This adapter not only effectively addresses the domain gap issue, but also attends to the distribution of changed buildings. Furthermore, to mitigate noise interference and misalignment in height offset estimation, we design a novel flow module that refines building edge extraction and enhances the perception of changed buildings. Our state-of-the-art results on the LEVIR-CD, S2Looking and WHU-CD datasets highlight the effectiveness of FAEWNet. The code is available at this https URL.
https://arxiv.org/abs/2504.12619
Purpose: The operating room (OR) is a complex environment where optimizing workflows is critical to reduce costs and improve patient outcomes. The use of computer vision approaches for the automatic recognition of perioperative events enables identification of bottlenecks for OR optimization. However, privacy concerns limit the use of computer vision for automated event detection from OR videos, making privacy-preserving approaches necessary for OR workflow analysis. Methods: We propose a two-stage pipeline for privacy-preserving OR video analysis and event detection. In the first stage, we leverage vision foundation models for depth estimation and semantic segmentation to generate de-identified Digital Twins (DT) of the OR from conventional RGB videos. In the second stage, we employ the SafeOR model, a fused two-stream approach that processes segmentation masks and depth maps for OR event detection. We evaluate this method on an internal dataset of 38 simulated surgical trials with five event classes. Results: Our results indicate that this DT-based approach to OR event detection achieves performance on par with, and sometimes better than, raw RGB video-based models on detecting OR events. Conclusion: DTs enable privacy-preserving OR workflow analysis, facilitate the sharing of de-identified data across institutions, and can potentially enhance model generalizability by mitigating domain-specific appearance differences.
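A hedged sketch of a fused two-stream classifier over the de-identified inputs (semantic masks plus depth maps) is shown below; the channel counts, backbone depth, and late-fusion choice are assumptions, not the SafeOR implementation:

```python
# Sketch: two lightweight convolutional streams (segmentation, depth) fused for event classification.
import torch
import torch.nn as nn

class TwoStreamOR(nn.Module):
    def __init__(self, num_classes=5, num_seg_classes=16):
        super().__init__()
        def stream(in_ch):
            return nn.Sequential(nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.seg_stream = stream(num_seg_classes)     # one-hot semantic masks
        self.depth_stream = stream(1)                 # depth map
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, seg, depth):
        fused = torch.cat([self.seg_stream(seg), self.depth_stream(depth)], dim=1)
        return self.classifier(fused)

model = TwoStreamOR()
print(model(torch.rand(2, 16, 128, 128), torch.rand(2, 1, 128, 128)).shape)  # (2, 5)
```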
https://arxiv.org/abs/2504.12552
Timely and accurate detection of hurricane debris is critical for effective disaster response and community resilience. While post-disaster aerial imagery is readily available, robust debris segmentation solutions applicable across multiple disaster regions remain limited. Developing a generalized solution is challenging due to varying environmental and imaging conditions that alter debris' visual signatures across different regions, further compounded by the scarcity of training data. This study addresses these challenges by fine-tuning pre-trained foundational vision models, achieving robust performance with a relatively small, high-quality dataset. Specifically, this work introduces an open-source dataset comprising approximately 1,200 manually annotated aerial RGB images from Hurricanes Ian, Ida, and Ike. To mitigate human biases and enhance data quality, labels from multiple annotators are strategically aggregated and visual prompt engineering is employed. The resulting fine-tuned model, named fCLIPSeg, achieves a Dice score of 0.70 on data from Hurricane Ida -- a disaster event entirely excluded during training -- with virtually no false positives in debris-free areas. This work presents the first event-agnostic debris segmentation model requiring only standard RGB imagery during deployment, making it well-suited for rapid, large-scale post-disaster impact assessments and recovery planning.
https://arxiv.org/abs/2504.12542
Event cameras promise a paradigm shift in vision sensing with their low latency, high dynamic range, and asynchronous nature of events. Unfortunately, the scarcity of high-quality labeled datasets hinders their widespread adoption in deep learning-driven computer vision. To mitigate this, several simulators have been proposed to generate synthetic event data for training models for detection and estimation tasks. However, the fundamentally different sensor design of event cameras compared to traditional frame-based cameras poses a challenge for accurate simulation. As a result, most simulated data fail to mimic data captured by real event cameras. Inspired by existing work on using deep features for image comparison, we introduce event quality score (EQS), a quality metric that utilizes activations of the RVT architecture. Through sim-to-real experiments on the DSEC driving dataset, it is shown that a higher EQS implies improved generalization to real-world data after training on simulated events. Thus, optimizing for EQS can lead to developing more realistic event camera simulators, effectively reducing the simulation gap. EQS is available at this https URL.
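The spirit of a deep-feature quality score can be sketched generically: compare activations of simulated and real event representations under a shared feature extractor. EQS itself uses RVT activations; the placeholder CNN below is an assumption for illustration:

```python
# Sketch: cosine similarity between deep features of simulated and real event frames.
import torch
import torch.nn as nn
import torch.nn.functional as F

extractor = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(4), nn.Flatten())   # stand-in for RVT features

def feature_similarity(sim_events, real_events):
    # inputs: (B, 2, H, W) event frames (positive/negative polarity channels)
    with torch.no_grad():
        f_sim, f_real = extractor(sim_events), extractor(real_events)
    return F.cosine_similarity(f_sim, f_real, dim=1).mean()   # higher = closer to real

score = feature_similarity(torch.rand(4, 2, 64, 64), torch.rand(4, 2, 64, 64))
print(float(score))
```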
https://arxiv.org/abs/2504.12515