Accurate counting of surgical instruments in Operating Rooms (OR) is a critical prerequisite for ensuring patient safety during surgery. Despite recent progress in large vision-language models and agentic AI, accurately counting such instruments remains highly challenging, particularly in dense scenarios where instruments are tightly clustered. To address this problem, we introduce Chain-of-Look, a novel visual reasoning framework that mimics the sequential human counting process by enforcing a structured visual chain, rather than relying on classic object detection, which is unordered. This visual chain guides the model to count along a coherent spatial trajectory, improving accuracy in complex scenes. To further enforce the physical plausibility of the visual chain, we introduce a neighboring loss function, which explicitly models the spatial constraints inherent to densely packed surgical instruments. We also present SurgCount-HD, a new dataset comprising 1,464 high-density surgical instrument images. Extensive experiments demonstrate that our method outperforms state-of-the-art counting approaches (e.g., CountGD, REC) as well as Multimodal Large Language Models (e.g., Qwen, ChatGPT) on the challenging task of dense surgical instrument counting.
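The abstract does not give the exact form of the neighboring loss; a minimal sketch of one plausible formulation (the hinge form, the spacing budget `tau`, and the function name are all assumptions here) penalizes consecutive points of the predicted visual chain that sit further apart than densely packed neighbors plausibly could:

```python
import numpy as np

def neighboring_loss(chain, tau=1.0):
    """Hinge penalty on the gap between consecutive points of a visual chain.

    chain: (N, 2) array of instrument centers, in counting order.
    tau:   maximum plausible spacing between neighbors; gaps beyond tau
           are penalized linearly.
    """
    diffs = np.diff(chain, axis=0)        # (N-1, 2) steps along the chain
    gaps = np.linalg.norm(diffs, axis=1)  # Euclidean spacing of neighbors
    return float(np.maximum(gaps - tau, 0.0).mean())

# Tightly packed chain: every neighbor within tau -> zero loss.
tight = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]])
# One large jump of length 3 -> hinge penalty (3 - 1), averaged over 2 gaps.
loose = np.array([[0.0, 0.0], [0.5, 0.0], [3.5, 0.0]])

print(neighboring_loss(tight))  # 0.0
print(neighboring_loss(loose))  # 1.0
```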
https://arxiv.org/abs/2602.11024
Accurate facial landmark detection under occlusion remains challenging, especially for human-like faces with large appearance variation and rotation-driven self-occlusion. Existing detectors typically localize landmarks while handling occlusion implicitly, without predicting the per-point visibility that downstream applications can benefit from. We present OccFace, an occlusion-aware framework for universal human-like faces, including humans, stylized characters, and other non-human designs. OccFace adopts a unified dense 100-point layout and a heatmap-based backbone, and adds an occlusion module that jointly predicts landmark coordinates and per-point visibility by combining local evidence with cross-landmark context. Visibility supervision mixes manual labels with landmark-aware masking that derives pseudo visibility from mask-heatmap overlap. We also create an occlusion-aware evaluation suite reporting NME on visible vs. occluded landmarks and benchmarking visibility with Occ AP, F1@0.5, and ROC-AUC, together with a dataset annotated with 100-point landmarks and per-point visibility. Experiments show improved robustness under external occlusion and large head rotations, especially on occluded regions, while preserving accuracy on visible landmarks.
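The landmark-aware masking step can be illustrated with a toy sketch: pseudo-visibility is derived from how much of a landmark's heatmap mass falls inside an occlusion mask. The exact overlap statistic and threshold used by OccFace are not specified in the abstract, so both are assumptions here:

```python
import numpy as np

def pseudo_visibility(heatmap, occ_mask, thresh=0.5):
    """Derive a per-landmark visibility pseudo-label from mask-heatmap overlap.

    heatmap:  (H, W) non-negative landmark heatmap.
    occ_mask: (H, W) binary occlusion mask (1 = occluded pixel).
    Returns 1 if most of the heatmap mass falls outside the mask, else 0.
    """
    total = heatmap.sum()
    if total == 0:
        return 0
    occluded_frac = (heatmap * occ_mask).sum() / total
    return int(occluded_frac < thresh)

h = np.zeros((4, 4)); h[1, 1] = 1.0       # landmark response at (1, 1)
mask = np.zeros((4, 4)); mask[:, 2:] = 1  # right half of the face occluded
print(pseudo_visibility(h, mask))   # 1: heatmap mass lies outside the mask

h2 = np.zeros((4, 4)); h2[1, 3] = 1.0     # landmark response inside the mask
print(pseudo_visibility(h2, mask))  # 0
```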
https://arxiv.org/abs/2602.10728
With the increasing availability of high-resolution remote sensing and aerial imagery, oriented object detection has become a key capability for geographic information updating, maritime surveillance, and disaster response. However, it remains challenging due to cluttered backgrounds, severe scale variation, and large orientation changes. Existing approaches largely improve performance through multi-scale feature fusion with feature pyramid networks or contextual modeling with attention, but they often lack explicit foreground modeling and do not leverage geometric orientation priors, which limits feature discriminability. To overcome these limitations, we propose FGAA-FPN, a Foreground-Guided Angle-Aware Feature Pyramid Network for oriented object detection. FGAA-FPN is built on a hierarchical functional decomposition that accounts for the distinct spatial resolution and semantic abstraction across pyramid levels, thereby strengthening multi-scale representations. Concretely, a Foreground-Guided Feature Modulation module learns foreground saliency under weak supervision to enhance object regions and suppress background interference in low-level features. In parallel, an Angle-Aware Multi-Head Attention module encodes relative orientation relationships to guide global interactions among high-level semantic features. Extensive experiments on DOTA v1.0 and DOTA v1.5 demonstrate that FGAA-FPN achieves state-of-the-art results, reaching 75.5% and 68.3% mAP, respectively.
https://arxiv.org/abs/2602.10710
Self-driving cars hold significant potential to reduce traffic accidents, alleviate congestion, and enhance urban mobility. However, developing reliable AI systems for autonomous vehicles remains a substantial challenge. Over the past decade, multi-task learning has emerged as a powerful approach to address complex problems in driving perception. Multi-task networks offer several advantages, including increased computational efficiency, real-time processing capabilities, optimized resource utilization, and improved generalization. In this study, we present AurigaNet, an advanced multi-task network architecture designed to push the boundaries of autonomous driving perception. AurigaNet integrates three critical tasks: object detection, lane detection, and drivable area instance segmentation. The system is trained and evaluated using the BDD100K dataset, renowned for its diversity in driving conditions. Key innovations of AurigaNet include its end-to-end instance segmentation capability, which significantly enhances both accuracy and efficiency in path estimation for autonomous vehicles. Experimental results demonstrate that AurigaNet achieves an 85.2% IoU in drivable area segmentation, outperforming its closest competitor by 0.7%. In lane detection, AurigaNet achieves a remarkable 60.8% IoU, surpassing other models by more than 30%. Furthermore, the network achieves an mAP@0.5:0.95 of 47.6% in traffic object detection, exceeding the next leading model by 2.9%. Additionally, we validate the practical feasibility of AurigaNet by deploying it on embedded devices such as the Jetson Orin NX, where it demonstrates competitive real-time performance. These results underscore AurigaNet's potential as a robust and efficient solution for autonomous driving perception systems. The code can be found at this https URL.
https://arxiv.org/abs/2602.10660
Recently, image processing has advanced rapidly and has been applied in many fields, including health, industry, and transportation. In the transportation sector, object detection is widely used to improve safety, for example, in traffic safety and at passenger crossings in train stations. Some accidents occur in the train crossing area at the station, such as passengers carelessly stepping over the yellow line, so further safety measures need to be developed. Additional technology is required to reduce the number of accidents. This paper focuses on passenger detection applications at train stations using YOLOX and Edge AI accelerator hardware. The performance of the Hailo-8 AI accelerator is compared with that of the Jetson Orin Nano. The experimental results show that the Hailo-8 AI hardware accelerator achieves higher accuracy than the Jetson Orin Nano (an improvement of over 12%) and lower latency (a reduction of 20 ms).
https://arxiv.org/abs/2602.10593
Deploying vision foundation models typically relies on efficient adaptation strategies, whereas conventional full fine-tuning suffers from prohibitive costs and low efficiency. While delta-tuning has proven effective in boosting the performance and efficiency of LLMs during adaptation, its advantages cannot be directly transferred to the fine-tuning pipeline of vision foundation models. To push the boundaries of adaptation efficiency for vision tasks, we propose an adapter with Complex Linear Projection Optimization (CoLin). For architecture, we design a novel low-rank complex adapter that introduces only about 1% parameters to the backbone. For efficiency, we theoretically prove that low-rank composite matrices suffer from severe convergence issues during training, and address this challenge with a tailored loss. Extensive experiments on object detection, segmentation, image classification, and rotated object detection (remote sensing scenario) demonstrate that CoLin outperforms both full fine-tuning and classical delta-tuning approaches with merely 1% parameters for the first time, providing a novel and efficient solution for deployment of vision foundation models. We release the code on this https URL.
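As a rough illustration of the idea (not CoLin's actual architecture, whose composition and parameterization the abstract does not detail), a low-rank complex update added alongside a frozen linear layer could look like the following; folding only the real part of the complex projection back into the activation path is an assumption made here for the sketch:

```python
import numpy as np

def colin_adapter(x, W, A, B):
    """Frozen weight W plus a rank-r *complex* low-rank update; the real
    part of the complex projection is added back to the real activation
    path. Illustrative only: the paper's exact composition may differ."""
    delta = (x.astype(complex) @ A) @ B  # (n, r) -> (n, d), complex-valued
    return x @ W + delta.real

d, r = 16, 2                             # rank-2 factors add only O(d*r)
rng = np.random.default_rng(0)           # parameters next to W's d*d entries
x = rng.normal(size=(4, d))
W = rng.normal(size=(d, d))
A = (rng.normal(size=(d, r)) + 1j * rng.normal(size=(d, r))) / d
B = (rng.normal(size=(r, d)) + 1j * rng.normal(size=(r, d))) / d
out = colin_adapter(x, W, A, B)
print(out.shape)  # (4, 16)
```

With zeroed factors the adapter reduces exactly to the frozen layer, which is the usual sanity check for residual-style adapters.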
https://arxiv.org/abs/2602.10513
Recent advances in generative image models have enabled the creation of highly realistic political deepfakes, posing risks to information integrity, public trust, and democratic processes. While automated deepfake detectors are increasingly deployed in moderation and investigative pipelines, most existing systems provide only point predictions and fail to indicate when outputs are unreliable, which is an operationally critical limitation in high-stakes political contexts. This work investigates conditional, uncertainty-aware political deepfake detection using stochastic convolutional neural networks within an empirical, decision-oriented reliability framework. Rather than treating uncertainty as a purely Bayesian construct, it is evaluated through observable criteria, including calibration quality, proper scoring rules, and its alignment with prediction errors under both global and confidence-conditioned analyses. A politically focused binary image dataset is constructed via deterministic metadata filtering from a large public real-synthetic corpus. Two pretrained CNN backbones (ResNet-18 and EfficientNet-B4) are fully fine-tuned for classification. Deterministic inference is compared with single-pass stochastic prediction, Monte Carlo dropout with multiple forward passes, temperature scaling, and ensemble-based uncertainty surrogates. Evaluation reports ROC-AUC, thresholded confusion matrices, calibration metrics, and generator-disjoint out-of-distribution performance. Results demonstrate that calibrated probabilistic outputs and uncertainty estimates enable risk-aware moderation policies. A systematic confidence-band analysis further clarifies when uncertainty provides operational value beyond predicted confidence, delineating both the benefits and limitations of uncertainty-aware deepfake detection in political settings.
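Of the calibration baselines mentioned, temperature scaling is simple enough to sketch: a single scalar T is fit on held-out validation logits to minimize NLL, softening overconfident probabilities. The grid-search fit and the synthetic overconfident data below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Negative log-likelihood of labels under temperature-scaled logits."""
    p = softmax(logits / T)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(logits, labels):
    """Grid-search the single scalar T on held-out logits (T=1: unscaled)."""
    grid = np.linspace(0.5, 5.0, 91)
    return float(grid[np.argmin([nll(logits, labels, T) for T in grid])])

# Synthetic overconfident validation set: large margins, yet ~15% errors.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
logits = np.zeros((500, 2))
logits[np.arange(500), labels] = 6.0
logits += rng.normal(scale=4.0, size=logits.shape)

T = fit_temperature(logits, labels)
print(T > 1.0)  # overconfident model -> fitted temperature above 1
```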
https://arxiv.org/abs/2602.10343
This study introduces a new object detection dataset of pedestrians using mobility aids, named PMMA. The dataset was collected in an outdoor environment, where volunteers used wheelchairs, canes, and walkers, resulting in nine pedestrian categories: ordinary pedestrians; cane users; two types of walker users (walking or resting); and five wheelchair-related categories, comprising wheelchair users, people pushing empty wheelchairs, and three categories for occupied wheelchairs being pushed (the entire pushing group, the pusher, and the person seated in the wheelchair). To establish a benchmark, seven object detection models (Faster R-CNN, CenterNet, YOLOX, DETR, Deformable DETR, DINO, and RT-DETR) and three tracking algorithms (ByteTrack, BOT-SORT, and OC-SORT) were implemented under the MMDetection framework. Experimental results show that YOLOX, Deformable DETR, and Faster R-CNN achieve the best detection performance, while the differences among the three trackers are relatively small. The PMMA dataset is publicly available at this https URL, and the video processing and model training code is available at this https URL.
https://arxiv.org/abs/2602.10259
Stop-and-go waves, as a major form of freeway traffic congestion, cause severe and long-lasting adverse effects, including reduced traffic efficiency, increased driving risks, and higher vehicle emissions. Amongst the highway traffic management strategies, jam-absorption driving (JAD), in which a dedicated vehicle performs "slow-in" and "fast-out" maneuvers before being captured by a stop-and-go wave, has been proposed as a potential method for preventing the propagation of such waves. However, most existing JAD strategies remain impractical mainly due to the lack of discussion regarding implementation vehicles and operational conditions. Inspired by real-world observations of police-car swerving behavior, this paper first introduces a Single-Vehicle Two-Detector Jam-Absorption Driving (SVDD-JAD) problem, and then proposes a practical JAD strategy that transforms such behavior into a maneuver capable of suppressing the propagation of an isolated stop-and-go wave. Five key parameters that significantly affect the proposed strategy, namely, JAD speed, inflow traffic speed, wave width, wave speed, and in-wave speed, are identified and systematically analyzed. Using a SUMO-based simulation as an illustrative example, we further demonstrate how these parameters can be measured in practice with two stationary roadside traffic detectors. The results show that the proposed JAD strategy successfully suppresses the propagation of a stop-and-go wave, without triggering a secondary wave. This paper is expected to take a significant step toward making JAD practical, advancing it from a theoretical concept to a feasible and implementable strategy. To promote reproducibility in the transportation domain, we have also open-sourced all the code on our GitHub repository this https URL.
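One of the five parameters, wave speed, illustrates why two stationary detectors suffice: the propagation speed falls out of the detector positions and the times at which each observes the slowdown. This is a simplified sketch of that measurement; the paper's actual estimation procedure may differ:

```python
def wave_speed(x_up, t_up, x_down, t_down):
    """Propagation speed of a stop-and-go wave from the times its head
    passes two stationary detectors (positions in km, times in hours).
    A negative value means the wave travels upstream, against traffic."""
    return (x_down - x_up) / (t_down - t_up)

# Detectors 1 km apart; the downstream detector sees the slowdown 4 minutes
# before the upstream one, i.e. the wave moves backwards at roughly the
# 15 km/h typical of stop-and-go waves.
print(wave_speed(10.0, 4.0 / 60.0, 11.0, 0.0))  # ≈ -15.0
```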
https://arxiv.org/abs/2602.10234
Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems. Existing post-hoc detectors typically rely on model confidence scores or likelihood estimates in feature space, often under restrictive distributional assumptions. In this work, we introduce a third paradigm and formulate OOD detection from a diversity perspective. We propose the Vendi Novelty Score (VNS), an OOD detector based on the Vendi Scores (VS), a family of similarity-based diversity metrics. VNS quantifies how much a test sample increases the VS of the in-distribution feature set, providing a principled notion of novelty that does not require density modeling. VNS is linear-time, non-parametric, and naturally combines class-conditional (local) and dataset-level (global) novelty signals. Across multiple image classification benchmarks and network architectures, VNS achieves state-of-the-art OOD detection performance. Remarkably, VNS retains this performance when computed using only 1% of the training data, enabling deployment in memory- or access-constrained settings.
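The Vendi Score has a compact definition (the exponential of the entropy of the eigenvalues of the normalized similarity kernel), so the novelty idea can be sketched directly. The cosine kernel and the toy data here are assumptions, not the paper's exact setup:

```python
import numpy as np

def vendi_score(X):
    """Vendi Score: exp(entropy of eigenvalues of the normalized kernel),
    i.e. the effective number of distinct items under a cosine kernel."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    lam = np.linalg.eigvalsh(Xn @ Xn.T / len(X))  # eigenvalues sum to 1
    lam = lam[lam > 1e-12]                        # drop numerical zeros
    return float(np.exp(-(lam * np.log(lam)).sum()))

def vendi_novelty(X_train, x):
    """VNS sketch: how much adding x raises the diversity of the train set."""
    return vendi_score(np.vstack([X_train, x])) - vendi_score(X_train)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 64))                 # in-distribution feature set
near_dup = X[0] + 1e-3 * rng.normal(size=64)  # barely novel: near-duplicate
outlier = 10.0 * np.ones(64)                  # clearly novel: far direction
print(vendi_novelty(X, near_dup) < vendi_novelty(X, outlier))  # True
```

A set of two identical items has a Vendi Score of 1 (one effective item), which is the standard sanity check for the metric.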
https://arxiv.org/abs/2602.10062
Monitoring leftover products provides valuable insights that can be used to optimize future production. This is especially important for German bakeries because freshly baked goods have a very short shelf life. Automating this process can reduce labor costs, improve accuracy, and streamline operations. We propose automating this process using an object detection model to identify baked goods from images. However, the large diversity of German baked goods makes fully supervised training prohibitively expensive and limits scalability. Although open-vocabulary detectors (e.g., OWLv2, Grounding DINO) offer flexibility, we demonstrate that they are insufficient for our task. While motivated by bakeries, our work addresses the broader challenges of deploying computer vision in industries where tasks are specialized and annotated datasets are scarce. We compile dataset splits with varying supervision levels, covering 19 classes of baked goods. We propose two training workflows to train an object detection model with limited supervision. First, we combine OWLv2 and Grounding DINO localization with image-level supervision to train the model in a weakly supervised manner. Second, we improve viewpoint robustness by fine-tuning on video frames annotated using Segment Anything 2 as a pseudo-label propagation model. Using these workflows, we train YOLOv11 for our detection task due to its favorable speed-accuracy tradeoff. Relying solely on image-level supervision, the model achieves a mean Average Precision (mAP) of 0.91. Fine-tuning with pseudo-labels raises model performance by 19.3% under non-ideal deployment conditions. Combining these workflows trains a model that surpasses our fully supervised baseline model under non-ideal deployment conditions, despite relying only on image-level supervision.
https://arxiv.org/abs/2602.09979
Large Language Models (LLMs) are increasingly deployed to automatically label and analyze educational dialogue at scale, yet current pipelines lack reliable ways to detect when models are wrong. We investigate whether reasoning generated by LLMs can be used to predict the correctness of a model's own predictions. We analyze 30,300 teacher utterances from classroom dialogue, each labeled by multiple state-of-the-art LLMs with an instructional move construct and an accompanying reasoning. Using human-verified ground-truth labels, we frame the task as predicting whether a model's assigned label for a given utterance is correct. We encode LLM reasoning using Term Frequency-Inverse Document Frequency (TF-IDF) and evaluate five supervised classifiers. A Random Forest classifier achieves an F1 score of 0.83 (Recall = 0.854), successfully identifying most incorrect predictions and outperforming baselines. Training specialist detectors for specific instructional move constructs further improves performance on difficult constructs, indicating that error detection benefits from construct-specific linguistic cues. Using the Linguistic Inquiry and Word Count (LIWC) framework, we examine four linguistic markers of correctness: Causation, Differentiation, Tentativeness, and Insight. Correct predictions exhibit grounded causal language (e.g., because, therefore), while incorrect reasoning is substantially more likely to rely on epistemic hedging (e.g., might, could) and performative metacognition (e.g., think, realize). Syntactic complexity does not distinguish correct from incorrect reasoning, and longer reasoning is not more reliable. These findings demonstrate that reasoning-based error detection offers a practical and scalable approach to quality control in automated educational dialogue analysis.
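The TF-IDF encoding at the heart of the detector is standard; a dependency-free sketch follows (scikit-learn's TfidfVectorizer, likely used in practice, applies extra smoothing and normalization, so exact values differ):

```python
import math
from collections import Counter

def tfidf(docs):
    """Plain TF-IDF: tf = raw count, idf = log(N / df). Library variants
    (e.g. scikit-learn's TfidfVectorizer) add smoothing, so values differ."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))  # document freq
    vocab = sorted(df)
    vecs = []
    for doc in tokenized:
        tf = Counter(doc)
        vecs.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vecs

reasons = [  # two toy "LLM reasoning" strings
    "the teacher asks because the student is confused",
    "the teacher might be probing or could be checking",
]
vocab, vecs = tfidf(reasons)
# "the" occurs in both docs, so idf = log(2/2) = 0 and its weight vanishes;
# the causal marker "because" is unique to doc 0 and keeps a nonzero weight.
print(vecs[0][vocab.index("because")] > 0, vecs[0][vocab.index("the")] == 0)
# True True
```

These vectors are what a downstream classifier such as a Random Forest would consume to predict label correctness.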
https://arxiv.org/abs/2602.09832
This paper presents an Internet of Things (IoT) application that utilizes an AI classifier for fast-object detection using the frame difference method. This method, with its shorter duration, is the most efficient and suitable for fast-object detection in IoT systems, which require energy-efficient applications, compared to end-to-end methods. We have implemented this technique on three edge devices: the AMD Alveo U50, Jetson Orin Nano, and Hailo-8 AI Accelerator, and on four models, spanning artificial neural networks and transformer architectures. We examined various classes, including birds, cars, trains, and airplanes. Using the frame difference method, the MobileNet model consistently has high accuracy, low latency, and high energy efficiency. YOLOX consistently shows the lowest accuracy, lowest latency, and lowest efficiency. The experimental results show that the proposed algorithm improves the average accuracy by 28.314%, the average efficiency by 3.6 times, and reduces the average latency by 39.305% compared to the end-to-end method. Of all these classes, the fastest objects are trains and airplanes. Experiments show that the accuracy for trains and airplanes is lower than for other categories. So, in tasks that require fast detection and accurate results, end-to-end methods can fail badly because they cannot handle fast-object detection. To improve computational efficiency, we designed our proposed method as a lightweight detection algorithm. It is well suited for applications in IoT systems, especially those that require fast-moving object detection and higher accuracy.
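The frame difference method itself is straightforward; a minimal sketch of the gating idea, where the expensive classifier runs only on regions that changed between frames (the threshold value and function interface are illustrative assumptions):

```python
import numpy as np

def motion_regions(prev, curr, thresh=25):
    """Frame differencing: flag pixels whose intensity changed by > thresh.
    Returns the bounding box (x0, y0, x1, y1) of changed pixels, or None if
    the scene is static, so the AI classifier only runs when something moved."""
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    ys, xs = np.nonzero(diff > thresh)
    if len(ys) == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

prev = np.zeros((64, 64), dtype=np.uint8)
curr = prev.copy()
curr[10:20, 30:40] = 200                      # a fast object enters the frame

print(motion_regions(prev, prev))  # None: nothing moved, skip the classifier
print(motion_regions(prev, curr))  # (30, 10, 39, 19): crop to classify
```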
https://arxiv.org/abs/2602.09515
Modern image generators produce strikingly realistic images, where only artifacts like distorted hands or warped objects reveal their synthetic origin. Detecting these artifacts is essential: without detection, we cannot benchmark generators or train reward models to improve them. Current detectors fine-tune VLMs on tens of thousands of labeled images, but this is expensive to repeat whenever generators evolve or new artifact types emerge. We show that pretrained VLMs already encode the knowledge needed to detect artifacts - with the right scaffolding, this capability can be unlocked using only a few hundred labeled examples per artifact category. Our system, ArtifactLens, achieves state-of-the-art on five human artifact benchmarks (the first evaluation across multiple datasets) while requiring orders of magnitude less labeled data. The scaffolding consists of a multi-component architecture with in-context learning and text instruction optimization, with novel improvements to each. Our methods generalize to other artifact types - object morphology, animal anatomy, and entity interactions - and to the distinct task of AIGC detection.
https://arxiv.org/abs/2602.09475
End-to-end autonomous driving systems have achieved significant progress, yet their adversarial robustness remains largely underexplored. In this work, we conduct a closed-loop evaluation of state-of-the-art autonomous driving agents under black-box adversarial threat models in CARLA. Specifically, we consider three representative attack vectors on the visual perception pipeline: (i) a physics-based blur attack induced by acoustic waves, (ii) an electromagnetic interference attack that distorts captured images, and (iii) a digital attack that adds ghost objects as carefully crafted bounded perturbations on images. Our experiments on two advanced agents, Transfuser and Interfuser, reveal severe vulnerabilities to such attacks, with driving scores dropping by up to 99% in the worst case, raising valid safety concerns. To help mitigate such threats, we further propose a lightweight Attack Detection model for Autonomous Driving systems (AD$^2$) based on attention mechanisms that capture spatial-temporal consistency. Comprehensive experiments across multi-camera inputs on CARLA show that our detector achieves superior detection capability and computational efficiency compared to existing approaches.
https://arxiv.org/abs/2602.10160
This work presents the largest curation of Southern Resident Killer Whale (SRKW) acoustic data to date, also containing other marine mammals in their environment. We systematically search all available public archival hydrophone data within the SRKW habitat (over 30 years of audio data). The search consists of a weakly-supervised, positive-unlabelled, active learning strategy to identify all instances of marine mammals. The resulting transformer-based detectors outperform state-of-the-art detectors on the DEEPAL, DCLDE-2026, and two newly introduced expert-annotated datasets in terms of accuracy, energy efficiency, and speed. The detection model has a specificity of 0-28.8% at 95% sensitivity. Our multiclass species classifier obtains a top-1 accuracy of 42.1% (11 train classes, 4 test classes) and our ecotype classifier obtains a top-1 accuracy of 43.0% (4 train classes, 5 test classes) on the DCLDE-2026 dataset. We yield 919 hours of SRKW data, 230 hours of Bigg's orca data, 1374 hours of orca data from unlabelled ecotypes, 1501 hours of humpback data, 88 hours of sea lion data, 246 hours of Pacific white-sided dolphin data, and over 784 hours of unspecified marine mammal data. This SRKW dataset is larger than DCLDE-2026, Ocean Networks Canada, and OrcaSound combined. The curated species labels are available under CC-BY 4.0 license, and the corresponding audio data are available under the licenses of the original owners. The comprehensive nature of this dataset makes it suitable for unsupervised machine translation, habitat usage surveys, and conservation endeavours for this critically endangered ecotype.
https://arxiv.org/abs/2602.09295
The growing volume of video-based news content has heightened the need for transparent and reliable methods to extract on-screen information. Yet the variability of graphical layouts, typographic conventions, and platform-specific design patterns renders manual indexing impractical. This work presents a comprehensive framework for automatically detecting and extracting personal names from broadcast and social-media-native news videos. It introduces a curated and balanced corpus of annotated frames capturing the diversity of contemporary news graphics and proposes an interpretable, modular extraction pipeline designed to operate under deterministic and auditable conditions. The pipeline is evaluated against a contrasting class of generative multimodal methods, revealing a clear trade-off between deterministic auditability and stochastic inference. The underlying detector achieves 95.8% mAP@0.5, demonstrating operationally robust performance for graphical element localisation. While generative systems achieve marginally higher raw accuracy (F1: 84.18% vs 77.08%), they lack the transparent data lineage required for journalistic and analytical contexts. The proposed pipeline delivers balanced precision (79.9%) and recall (74.4%), avoids hallucination, and provides full traceability across each processing stage. Complementary user findings indicate that 59% of respondents report difficulty reading on-screen names in fast-paced broadcasts, underscoring the practical relevance of the task. The results establish a methodologically rigorous and interpretable baseline for hybrid multimodal information extraction in modern news media.
https://arxiv.org/abs/2602.09154
With the increasing integration of robots into daily life, human-robot interaction has become more complex and multifaceted. A critical component of this interaction is Interactive Visual Grounding (IVG), through which robots must interpret human intentions and resolve ambiguity. Existing IVG models generally lack a mechanism for determining when to ask clarification questions, as they rely implicitly on their learned representations. CLUE addresses this gap by converting the VLM's cross-modal attention into an explicit, spatially grounded signal for deciding when to ask. We extract text-to-image attention maps and pass them to a lightweight CNN to detect referential ambiguity, while a LoRA-fine-tuned decoder conducts the dialog and emits grounding location tokens. We train on a real-world interactive IVG dataset, and on a mixed ambiguity set for the detector. With InViG-only supervision, our model surpasses a state-of-the-art method while using parameter-efficient fine-tuning. Similarly, the ambiguity detector outperforms prior baselines. Overall, CLUE turns the internal cross-modal attention of a VLM into an explicit, spatially grounded signal for deciding when to ask. The data and code are publicly available at: this http URL
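A minimal sketch (not the authors' code) of the ambiguity-detection idea in CLUE: a text-to-image attention map, averaged over text tokens, is scored by a lightweight CNN that decides whether to ask a clarification question. All shapes, module names, and the patch-grid size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AmbiguityCNN(nn.Module):
    """Lightweight CNN that maps an attention map to an ambiguity logit."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),   # tolerate any input grid size
            nn.Flatten(),
            nn.Linear(8 * 4 * 4, 1),   # logit: is the referring expression ambiguous?
        )

    def forward(self, attn_map: torch.Tensor) -> torch.Tensor:
        # attn_map: (B, H, W) text-to-image attention, averaged over text tokens
        return self.net(attn_map.unsqueeze(1)).squeeze(-1)

detector = AmbiguityCNN()
attn = torch.rand(2, 24, 24)  # e.g. VLM attention over a 24x24 image-patch grid
ask_prob = torch.sigmoid(detector(attn))
print(ask_prob.shape)  # torch.Size([2])
```

A diffuse attention map (probability mass spread over several candidate objects) is the cue such a detector would learn to associate with "ask before grounding".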
https://arxiv.org/abs/2602.08999
AI-text detectors face a critical robustness challenge: adversarial paraphrasing attacks that preserve semantics while evading detection. We introduce StealthRL, a reinforcement learning framework that stress-tests detector robustness under realistic adversarial conditions. StealthRL trains a paraphrase policy against a multi-detector ensemble using Group Relative Policy Optimization (GRPO) with LoRA adapters on Qwen3-4B, optimizing a composite reward that balances detector evasion with semantic preservation. We evaluate six attack settings (M0-M5) against three detector families (RoBERTa, FastDetectGPT, and Binoculars) at the security-relevant 1% false positive rate operating point. StealthRL achieves near-zero detection (0.001 mean TPR@1%FPR), reduces mean AUROC from 0.74 to 0.27, and attains a 99.9% attack success rate. Critically, attacks transfer to a held-out detector family not seen during training, revealing shared architectural vulnerabilities rather than detector-specific brittleness. We additionally conduct LLM-based quality evaluation via Likert scoring, analyze detector score distributions to explain why evasion succeeds, and provide per-detector AUROC with bootstrap confidence intervals. Our results expose significant robustness gaps in current AI-text detection and establish StealthRL as a principled adversarial evaluation protocol. Code and evaluation pipeline are publicly available at this https URL.
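An illustrative sketch of the composite reward balancing evasion and semantic preservation (the weights, the averaging over the ensemble, and all names are assumptions, not StealthRL's exact formulation):

```python
# The paraphrase policy is rewarded for pushing ensemble detector scores down
# while keeping the paraphrase semantically close to the source text.
def composite_reward(detector_scores, semantic_sim, w_evade=0.5, w_sem=0.5):
    """detector_scores: per-detector P(AI-written), each in [0, 1];
    semantic_sim: similarity between source and paraphrase embeddings."""
    evasion = 1.0 - sum(detector_scores) / len(detector_scores)  # ensemble mean
    return w_evade * evasion + w_sem * semantic_sim

# A paraphrase that fools two of three detectors with modest semantic drift:
r = composite_reward([0.05, 0.10, 0.60], semantic_sim=0.92)
print(round(r, 3))
```

Under GRPO, several sampled paraphrases of the same source are scored this way and the policy is updated toward the ones with above-group-average reward.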
https://arxiv.org/abs/2602.08934
Recent query-based 3D object detection methods using camera and LiDAR inputs have shown strong performance, but existing query initialization strategies, such as random sampling or BEV heatmap-based sampling, often result in inefficient query usage and reduced accuracy, particularly for occluded or crowded objects. To address this limitation, we propose ALIGN (Advanced query initialization with LiDAR and Image GuidaNce), a novel approach for occlusion-robust, object-aware query initialization. Our model consists of three key components: (i) Occlusion-aware Center Estimation (OCE), which integrates LiDAR geometry and image semantics to estimate object centers accurately; (ii) Adaptive Neighbor Sampling (ANS), which generates object candidates from LiDAR clustering and supplements each object by sampling spatially and semantically aligned points around it; and (iii) Dynamic Query Balancing (DQB), which adaptively balances queries between foreground and background regions. Our extensive experiments on the nuScenes benchmark demonstrate that ALIGN consistently improves performance across multiple state-of-the-art detectors, achieving gains of up to +0.9 mAP and +1.2 NDS, particularly in challenging scenes with occlusions or dense crowds. Our code will be publicly available upon publication.
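A minimal sketch of the idea behind Dynamic Query Balancing: split a fixed query budget between foreground candidates and background regions according to how many object candidates LiDAR clustering produced. The budget and minimum-background floor below are illustrative assumptions, not ALIGN's actual values.

```python
# Crowded scenes keep more queries on objects; sparse scenes release
# queries back to background coverage instead of wasting them.
def balance_queries(num_candidates: int, total_queries: int = 900,
                    min_background: int = 100) -> tuple[int, int]:
    """Return (foreground_queries, background_queries) under a fixed budget."""
    foreground = min(num_candidates, total_queries - min_background)
    background = total_queries - foreground
    return foreground, background

print(balance_queries(850))  # crowded scene: (800, 100)
print(balance_queries(40))   # sparse scene:  (40, 860)
```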
https://arxiv.org/abs/2512.18187