Non-violent street robberies (snatch-and-run) are difficult to detect automatically because they are brief, subtle, and often indistinguishable from benign human interactions in unconstrained surveillance footage. This paper presents a hybrid, pose-driven approach for detecting snatch-and-run events that combines real-time perception with an interpretable classification stage suitable for edge deployment. The system uses a YOLO-based pose estimator to extract body keypoints for each tracked person and computes kinematic and interaction features describing hand speed, arm extension, proximity, and relative motion between an aggressor-victim pair. A Random Forest classifier is trained on these descriptors, and a temporal hysteresis filter is applied to stabilize frame-level predictions and reduce spurious alarms. We evaluate the method on a staged dataset and on a disjoint test set collected from internet videos, demonstrating promising generalization across different scenes and camera viewpoints. Finally, we implement the complete pipeline on an NVIDIA Jetson Nano and report real-time performance, supporting the feasibility of proactive, on-device robbery detection.
https://arxiv.org/abs/2604.14329
This paper introduces a novel approach for image and video orientation estimation by leveraging depth distribution in natural images. The proposed method estimates the orientation based on the depth distribution across different quadrants of the image, providing a robust framework for orientation estimation suited for applications such as virtual reality (VR), augmented reality (AR), autonomous navigation, and interactive surveillance systems. To further enhance fine-scale perceptual alignment, we incorporate depth gradient consistency (DGC) and horizontal symmetry analysis (HSA), enabling precise orientation correction. This hybrid strategy effectively exploits depth cues to support spatial coherence and perceptual stability in immersive visual content. Qualitative and quantitative evaluations demonstrate the robustness and accuracy of the proposed approach, outperforming existing techniques across diverse scenarios.
https://arxiv.org/abs/2604.13995
Accurate vehicle counting through video surveillance is crucial for efficient traffic management. However, achieving high counting accuracy while ensuring computational efficiency remains a challenge. To address this, we propose a fully automated, video-based vehicle counting framework designed to optimize both computational efficiency and counting accuracy. Our framework operates in two distinct phases: \textit{estimation} and \textit{prediction}. In the estimation phase, the optimal region of interest (ROI) is automatically determined using a novel combination of three models based on detection scores, tracking scores, and vehicle density. This adaptive approach ensures compatibility with any detection and tracking method, enhancing the framework's versatility. In the prediction phase, vehicle counting is efficiently performed within the estimated ROI. We evaluated our framework on benchmark datasets like UA-DETRAC, GRAM, CDnet 2014, and ATON. Results demonstrate exceptional accuracy, with most videos achieving 100\% accuracy, while also enhancing computational efficiency, making processing up to four times faster than full-frame processing. The framework outperforms existing techniques, especially in complex multi-road scenarios, demonstrating robustness and superior accuracy. These advancements make it a promising solution for real-time traffic monitoring.
https://arxiv.org/abs/2604.12470
Pretrained detectors perform well on benchmarks but often suffer performance degradation in real-world deployments due to distribution gaps between training data and target environments. COCO-like benchmarks emphasize category diversity rather than instance density, causing detectors trained under per-class sparsity to struggle in dense, single- or few-class scenes such as surveillance and traffic monitoring. In fixed-camera environments, the quasi-static background provides a stable, label-free prior that can be exploited at inference to suppress spurious detections. To address the issue, we propose Background Embedding Memory (BEM), a lightweight, training-free, weight-frozen module that can be attached to pretrained detectors during inference. BEM estimates clean background embeddings, maintains a prototype memory, and re-scores detection logits with an inverse-similarity, rank-weighted penalty, effectively reducing false positives while maintaining recall. Empirically, background-frame cosine similarity correlates negatively with object count and positively with Precision-Confidence AUC (P-AUC), motivating its use as a training-free control signal. Across YOLO and RT-DETR families on LLVIP and simulated surveillance streams, BEM consistently reduces false positives while preserving real-time performance. Our code is available at this https URL
https://arxiv.org/abs/2604.11714
Computer-aided detection (CADe) of early neoplasia in Barrett's esophagus is a low-prevalence surveillance problem in which clinically relevant findings are rare. Although many CADe systems report strong performance on balanced or enriched datasets, their behavior under realistic prevalence remains insufficiently characterized. The RARE25 challenge addresses this gap by introducing a large-scale, prevalence-aware benchmark for neoplasia detection. It includes a public training set and a hidden test set reflecting real-world incidence. Methods were evaluated using operating-point-specific metrics emphasizing high sensitivity and accounting for prevalence. Eleven teams from seven countries submitted approaches using diverse architectures, pretraining, ensembling, and calibration strategies. While several methods achieved strong discriminative performance, positive predictive values remained low, highlighting the difficulty of low-prevalence detection and the risk of overestimating clinical utility when prevalence is ignored. All methods relied on fully supervised classification despite the dominance of normal findings, indicating a lack of prevalence-agnostic approaches such as anomaly detection or one-class learning. By releasing a public dataset and a reproducible evaluation framework, RARE25 aims to support the development of CADe systems robust to prevalence shift and suitable for clinical surveillance workflows.
https://arxiv.org/abs/2604.11171
Robotic fleets can be extremely efficient when working concurrently and collaboratively, e.g., for delivery, surveillance, search and rescue. However, it can be demanding or even impractical for an operator to directly control each robot. Thus, autonomy of the fleet and its online interaction with the operator are both essential, particularly in dynamic and partially unknown environments. The operator might need to add new tasks, cancel some tasks, change priorities and modify planning results. How to design the procedure for these interactions and efficient algorithms to fulfill these needs have been mostly neglected in the related literature. Thus, this work proposes a human-centric coordination and supervision scheme (HECTOR) for large-scale robotic fleets under continual and uncertain temporal tasks. It consists of three hierarchical layers: (I) the bidirectional and multimodal protocol of online human-fleet interaction, where the operator interacts with and supervises the whole fleet; (II) the rolling assignment of currently-known tasks to teams within a certain horizon, and (III) the dynamic coordination within a team given the detected subtasks during online execution. The overall mission can be as general as temporal logic formulas over collaborative actions. Such hierarchical structure allows human interaction and supervision at different granularities and triggering conditions, to both improve computational efficiency and reduce human effort. Extensive human-in-the-loop simulations are performed over heterogeneous fleets under various temporal tasks and environmental uncertainties.
https://arxiv.org/abs/2604.10892
Adversarial examples present significant challenges to the security of Deep Neural Network (DNN) applications. Specifically, there are patch-based and texture-based attacks that are usually used to craft physical-world adversarial examples, posing real threats to security-critical applications such as person detection in surveillance and autonomous systems, because those attacks are physically realizable. Existing defense mechanisms face challenges in the adaptive attack setting, i.e., the attacks are specifically designed against them. In this paper, we propose Adversarial Spectrum Defense (ASD), a defense mechanism that leverages spectral decomposition via Discrete Wavelet Transform (DWT) to analyze adversarial patterns across multiple frequency scales. The multi-resolution and localization capability of DWT enables ASD to capture both high-frequency (fine-grained) and low-frequency (spatially pervasive) perturbations. By integrating this spectral analysis with the off-the-shelf Adversarial Training (AT) model, ASD provides a comprehensive defense strategy against both patch-based and texture-based adversarial attacks. Extensive experiments demonstrate that ASD+AT achieved state-of-the-art (SOTA) performance against various attacks, outperforming the APs of previous defense methods by 21.73%, in the face of strong adaptive adversaries specifically designed against ASD. Code available at this https URL .
https://arxiv.org/abs/2604.10715
Gait recognition is a biometric modality that identifies individuals from their characteristic walking patterns. Unlike conventional biometric traits, gait can be acquired at a distance and without active subject cooperation, making it suitable for surveillance and public safety applications. Nevertheless, silhouette-based temporal models remain sensitive to long sequences, observation noise, and appearance-related covariates. Recurrent architectures often struggle to preserve information from earlier frames and are inherently sequential to optimize, whereas transformer-based models typically require greater computational resources and larger training sets and may be sensitive to irregular sequence lengths and noisy inputs. These limitations reduce robustness under clothing variation, carrying conditions, and view changes, while also hindering the joint modeling of local gait cycles and longer-term motion trends. To address these challenges, we introduce a Temporal Kolmogorov-Arnold Network (TKAN) for gait recognition. The proposed model replaces fixed edge weights with learnable one-dimensional functions and incorporates a two-level memory mechanism consisting of short-term RKAN sublayers and a gated long-term pathway. This design enables efficient modeling of both cycle-level dynamics and broader temporal context while maintaining a compact backbone. Experiments on the CASIA-B dataset indicate that the proposed CNN+TKAN framework achieves strong recognition performance under the reported evaluation setting.
https://arxiv.org/abs/2604.09990
We introduce ACCIDENT, a benchmark dataset for traffic accident detection in CCTV footage, designed to evaluate models in supervised (IID and OOD) and zero-shot settings, reflecting both data-rich and data-scarce scenarios. The benchmark consists of a curated set of 2,027 real and 2,211 synthetic clips annotated with the accident time, spatial location, and high-level collision type. We define three core tasks: (i) temporal localization of the accident, (ii) its spatial localization, and (iii) collision type classification. Each task is evaluated using custom metrics that account for the uncertainty and ambiguity inherent in CCTV footage. In addition to the benchmark, we provide a diverse set of baselines, including heuristic, motion-aware, and vision-language approaches, and show that ACCIDENT is challenging. You can access the ACCIDENT at: this https URL
https://arxiv.org/abs/2604.09819
Pose-based Video Anomaly Detection (VAD) has gained significant attention for its privacy-preserving nature and robustness to environmental variations. However, traditional frame-level evaluations treat video as a collection of isolated frames, fundamentally misaligned with how anomalies manifest and are acted upon in the real world. In operational surveillance systems, what matters is not the flagging of individual frames, but the reliable detection, localization, and reporting of a coherent anomalous event, a contiguous temporal episode with an identifiable onset and duration. Frame-level metrics are blind to this distinction, and as a result, they systematically overestimate model performance for any deployment that requires actionable, event-level alerts. In this work, we propose a shift toward an event-centric perspective in VAD. We first audit widely used VAD benchmarks, including SHT[19], CHAD[6], NWPUC[4], and HuVAD[25], to characterize their event structure. We then introduce two strategies for temporal event localization: a score-refinement pipeline with hierarchical Gaussian smoothing and adaptive binarization, and an end-to-end Dual-Branch Model that directly generates event-level detections. Finally, we establish the first event-based evaluation standard for VAD by adapting Temporal Action Localization metrics, including tIoU-based event matching and multi-threshold F1 evaluation. Our results quantify a substantial performance gap: while all SoTA models achieve frame-level AUC-ROC exceeding 52% on the NWPUC[4], their event-level localization precision falls below 10% even at a minimal tIoU=0.2, with an average event-level F1 of only 0.11 across all thresholds. The code base for this work is available at this https URL.
https://arxiv.org/abs/2604.09327
Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the effectiveness of supervised skeleton-based methods, their reliance on exhaustive annotation limits generalization to novel actions. Zero-Shot Skeleton Action Recognition (ZSAR) emerges as a promising paradigm, yet it faces challenges due to the spectral bias of diffusion models, which oversmooth high-frequency dynamics. Here, we propose Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction to address these challenges. Our approach effectively recovers fine-grained motion details, achieving state-of-the-art performance on NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets. Code has been made available at this https URL. Project homepage: this https URL
https://arxiv.org/abs/2604.09063
Computer vision, a core domain of artificial intelligence (AI), is the field that enables the computational analysis, understanding, and generation of visual data. Despite being historically rooted in military funding and increasingly deployed in warfare, the field tends to position itself as a neutral, purely technical endeavor, failing to engage in discussions about its dual-use applications. Yet it has been reported that computer vision systems are being systematically weaponized to assist in technologies that inflict harm, such as surveillance or warfare. Expanding on these concerns, we study the extent to which computer vision research is being used in the military and surveillance domains. We do so by collecting a dataset of tech companies with financial ties to the field's central research exchange platform: conferences. Conference sponsorship, we argue, not only serves as strong evidence of a company's investment in the field but also provides a privileged position for shaping its trajectory. By investigating sponsors' activities, we reveal that 44% of them have a direct connection with military or surveillance applications. We extend our analysis through two case studies in which we discuss the opportunities and limitations of sponsorship as a means for uncovering technological weaponization.
https://arxiv.org/abs/2604.07803
Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent MLLM-based methods have shown promising open-world generalization, but still suffer from three major limitations: inefficiency for practical deployment, lack of streaming processing adaptation, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module to structure user prompts for reducing hallucination, an Inter-frame-matched Intra-frame Token Merging module to compress redundant visual tokens, a Hybrid Streaming Memory module for efficient causal inference, and a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores. In addition, this paper introduces OpenDef-Bench, a new benchmark with clean surveillance videos and diverse natural anomaly definitions for evaluating performance under varying conditions. Extensive experiments show that ESOM achieves real-time efficiency on a single GPU and state-of-the-art performance in anomaly temporal localization, classification, and description generation. The code and benchmark will be released at this https URL.
https://arxiv.org/abs/2604.07772
In gender-restrictive and surveilled contexts, where access to formal education may be restricted for women, pursuing education involves safety and privacy risks. When women are excluded from schools and universities, they often turn to online self-learning and generative AI (GenAI) to pursue their educational and career aspirations. However, we know little about what safe and accountable GenAI support is required in the context of surveillance, household responsibilities, and the absence of learning communities. We present a remote participatory design study with 20 women in Afghanistan, informed by a recruitment survey (n = 140), examining how participants envision GenAI for learning and employability. Participants describe using GenAI less as an information source and more as an always-available peer, mentor, and source of career guidance that helps compensate for the absence of learning communities. At the same time, they emphasize that this companionship is constrained by privacy and surveillance risks, contextually unrealistic and culturally unsafe support, and direct-answer interactions that can undermine learning by creating an illusion of progress. Beyond eliciting requirements, envisioning the future with GenAI through participatory design was positively associated with significant increases in participants' aspirations (p=.01), perceived agency (p=.01), and perceived avenues (p=.03). These outcomes show that accountable and safe GenAI is not only about harm reduction but can also actively enable women to imagine and pursue viable learning and employment futures. Building on this, we translate participants' proposals into accountability-focused design directions that center on safety-first interaction and user control, context-grounded support under constrained resources, and offer pedagogically aligned assistance that supports genuine learning rather than quick answers.
https://arxiv.org/abs/2604.07253
We present the Surveillance Forgery Image Test Range (SurFITR), a dataset for surveillance-style image forgery detection and localisation, in response to recent advances in open-access image generation models that raise concerns about falsifying visual evidence. Existing forgery models, trained on datasets with full-image synthesis or large manipulated regions in object-centric images, struggle to generalise to surveillance scenarios. This is because tampering in surveillance imagery is typically localised and subtle, occurring in scenes with varied viewpoints, small or occluded subjects, and lower visual quality. To address this gap, SurFITR provides a large collection of forensically valuable imagery generated via a multimodal LLM-powered pipeline, enabling semantically aware, fine-grained editing across diverse surveillance scenes. It contains over 137k tampered images with varying resolutions and edit types, generated using multiple image editing models. Extensive experiments show that existing detectors degrade significantly on SurFITR, while training on SurFITR yields substantial improvements in both in-domain and cross-domain performance. SurFITR is publicly available on GitHub.
https://arxiv.org/abs/2604.07101
Psychological stress is clinically relevant in cardio-oncology, yet it is typically assessed only through patient-reported outcome measures (PROMs) and is rarely integrated into continuous cardiotoxicity surveillance. We estimate perceived stress in an elderly, multicenter breast cancer cohort (CARDIOCARE) using multimodal wearable data from a smartwatch (physical activity and sleep) and a chest-worn ECG sensor. Wearable streams are transformed into heterogeneous visual representations, yielding a weakly supervised setting in which a single Perceived Stress Scale (PSS) score corresponds to many unlabeled windows. A lightweight pretrained mixture-of-experts backbone (Tiny-BioMoE) embeds each representation into 192-dimensional vectors, which are aggregated via attention-based multiple instance learning (MIL) to predict PSS at month 3 (M3) and month 6 (M6). Under leave-one-subject-out (LOSO) evaluation, predictions showed moderate agreement with questionnaire scores (M3: R^2=0.24, Pearson r=0.42, Spearman rho=0.48; M6: R^2=0.28, Pearson r=0.49, Spearman rho=0.52), with global RMSE/MAE of 6.62/6.07 at M3 and 6.13/5.54 at M6.
https://arxiv.org/abs/2604.06990
Physical adversarial attacks are increasingly studied in settings that resemble deployed surveillance systems rather than isolated image benchmarks. In these settings, person detection, multi-object tracking, visible--infrared sensing, and the practical form of the attack carrier all matter at once. This changes how the literature should be read. A perturbation that suppresses a detector in one frame may have limited practical effect if identity is recovered over time; an RGB-only result may say little about night-time systems that rely on visible and thermal inputs together; and a conspicuous patch can imply a different threat model from a wearable or selectively activated carrier. This paper reviews physical attacks from that surveillance-oriented viewpoint. Rather than attempting a complete catalogue of all physical attacks in computer vision, we focus on the technical questions that become central in surveillance: temporal persistence, sensing modality, carrier realism, and system-level objective. We organize prior work through a four-part taxonomy and discuss how recent results on multi-object tracking, dual-modal visible--infrared evasion, and controllable clothing reflect a broader change in the field. We also summarize evaluation practices and unresolved gaps, including distance robustness, camera-pipeline variation, identity-level metrics, and activation-aware testing. The resulting picture is that surveillance robustness cannot be judged reliably from isolated per-frame benchmarks alone; it has to be examined as a system problem unfolding over time, across sensors, and under realistic physical deployment constraints.
https://arxiv.org/abs/2604.06865
Extracting vehicle information from surveillance images is essential for intelligent transportation systems, enabling applications such as traffic monitoring and criminal investigations. While Automatic License Plate Recognition (ALPR) is widely used, Fine-Grained Vehicle Classification (FGVC) offers a complementary approach by identifying vehicles based on attributes such as color, make, model, and type. Although there have been advances in this field, existing studies often assume well-controlled conditions, explore limited attributes, and overlook FGVC integration with ALPR. To address these gaps, we introduce UFPR-VeSV, a dataset comprising 24,945 images of 16,297 unique vehicles with annotations for 13 colors, 26 makes, 136 models, and 14 types. Collected from the Military Police of Paraná (Brazil) surveillance system, the dataset captures diverse real-world conditions, including partial occlusions, nighttime infrared imaging, and varying lighting. All FGVC annotations were validated using license plate information, with text and corner annotations also being provided. A qualitative and quantitative comparison with established datasets confirmed the challenging nature of our dataset. A benchmark using five deep learning models further validated this, revealing specific challenges such as handling multicolored vehicles, infrared images, and distinguishing between vehicle models that share a common platform. Additionally, we apply two optical character recognition models to license plate recognition and explore the joint use of FGVC and ALPR. The results highlight the potential of integrating these complementary tasks for real-world applications. The UFPR-VeSV dataset is publicly available at: this https URL.
从监控图像中提取车辆信息对于智能交通系统至关重要,可支持交通监控和刑事调查等应用。虽然自动车牌识别(ALPR)已广泛应用,但细粒度车辆分类(FGVC)通过识别颜色、品牌、型号和类型等属性提供了互补方法。尽管该领域已有进展,现有研究常假设条件受控、探索的属性有限,且忽视了FGVC与ALPR的整合。为弥补这些不足,我们推出UFPR-VeSV数据集,该数据集包含16,297辆独特车辆的24,945张图像,标注了13种颜色、26个品牌、136种型号和14种类型。数据集源自巴西巴拉那州军事警察监控系统,涵盖了多种真实世界场景,包括部分遮挡、夜间红外成像和光照变化。所有FGVC标注均通过车牌信息验证,同时提供了文本和角点标注。与现有数据集的定性和定量比较证实了本数据集的挑战性。使用五种深度学习模型的基准测试进一步验证了这一点,揭示了处理多色车辆、红外图像以及区分同平台车型等具体挑战。此外,我们应用两种光学字符识别模型进行车牌识别,并探索了FGVC与ALPR的联合使用。结果凸显了整合这些互补任务在现实应用中的潜力。UFPR-VeSV数据集已公开提供,地址为:此https链接。
https://arxiv.org/abs/2604.05271
Protest-related social media data are valuable for understanding collective action but inherently high-risk due to concerns surrounding surveillance, repression, and individual privacy. Contemporary AI systems can identify individuals, infer sensitive attributes, and cross-reference visual information across platforms, enabling surveillance that poses risks to protesters and bystanders. In such contexts, large foundation models trained on protest imagery risk memorizing and disclosing sensitive information, leading to cross-platform identity leakage and retroactive participant identification. Existing approaches to automated protest analysis do not provide a holistic pipeline that integrates privacy risk assessment, downstream analysis, and fairness considerations. To address this gap, we propose a responsible computing framework for analyzing collective protest dynamics while reducing risks to individual privacy. Our framework replaces sensitive protest imagery with well-labeled synthetic reproductions using conditional image synthesis, enabling analysis of collective patterns without direct exposure of identifiable individuals. We demonstrate that our approach produces realistic and diverse synthetic imagery while balancing downstream analytical utility with reductions in privacy risk. We further assess demographic fairness in the generated data, examining whether synthetic representations disproportionately affect specific subgroups. Rather than offering absolute privacy guarantees, our method adopts a pragmatic, harm-mitigating approach that enables socially sensitive analysis while acknowledging residual risks.
https://arxiv.org/abs/2604.05256
We describe a zero-shot pipeline developed for the ACCIDENT @ CVPR 2026 challenge. The challenge requires predicting when, where, and what type of traffic accident occurs in surveillance video, without labeled real-world training data. Our method separates the problem into three independent modules. The first module localizes the collision in time by running peak detection on z-score normalized frame-difference signals. The second module finds the impact location by computing the weighted centroid of cumulative dense optical flow magnitude maps using the Farneback algorithm. The third module classifies collision type by measuring cosine similarity between CLIP image embeddings of frames near the detected peak and text embeddings built from multi-prompt natural language descriptions of each collision category. No domain-specific fine-tuning is involved; the pipeline processes each video using only pre-trained model weights. Our implementation is publicly available as a Kaggle notebook.
https://arxiv.org/abs/2604.09685