In the realm of Virtual Reality (VR) and Human-Computer Interaction (HCI), real-time emotion recognition shows promise for supporting individuals with Autism Spectrum Disorder (ASD) in improving social skills. This task imposes a strict latency-accuracy trade-off, with motion-to-photon (MTP) latency kept below 140 ms to maintain contingency. However, most off-the-shelf Deep Learning models prioritize accuracy over the strict timing constraints of commodity hardware. As a first step toward accessible VR therapy, we benchmark State-of-the-Art (SOTA) models for Zero-Shot Facial Expression Recognition (FER) on virtual characters using the UIBVFED dataset. We evaluate Medium and Nano variants of YOLO (v8, v11, and v12) for face detection, alongside general-purpose Vision Transformers including CLIP and SigLIP. Results on CPU-only inference demonstrate that while face detection on stylized avatars is robust (100% accuracy), a "Latency Wall" exists in the classification stage. The YOLOv11n architecture offers the optimal balance for detection (~54 ms). However, general-purpose Transformers like CLIP and SigLIP fail to achieve viable accuracy (<23%) or speed (>150 ms) for real-time loops. This study highlights the necessity for lightweight, domain-specific architectures to enable accessible, real-time AI in therapeutic settings.
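As a concrete illustration of the latency budget above, here is a minimal sketch of a CPU-only timing loop of the kind such a benchmark implies. It assumes the `ultralytics` package; the checkpoint name is a placeholder for a face-detection model, not the paper's artifact.

```python
# Hedged sketch: measure CPU-only face-detection latency against a
# 140 ms motion-to-photon budget.
import time
import numpy as np
from ultralytics import YOLO

MTP_BUDGET_MS = 140.0

model = YOLO("yolo-face.pt")  # hypothetical face-detection weights
frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a VR frame

latencies = []
for _ in range(50):
    t0 = time.perf_counter()
    model.predict(frame, device="cpu", verbose=False)
    latencies.append((time.perf_counter() - t0) * 1000.0)

p95 = float(np.percentile(latencies, 95))
print(f"p95 detection latency: {p95:.1f} ms "
      f"({'within' if p95 < MTP_BUDGET_MS else 'exceeds'} the 140 ms budget)")
```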
https://arxiv.org/abs/2601.15914
Glass surfaces, ubiquitous in both daily life and professional environments, present a potential threat to vision-based systems such as robot and drone navigation. To address this challenge, recent studies have shown significant interest in Video Glass Surface Detection (VGSD). We observe that objects in the reflection (or transmission) layer appear farther away than the glass surface itself. Consequently, in video motion scenarios, the notable reflected (or transmitted) objects on the glass surface move slower than objects in non-glass regions within the same spatial plane, and this motion inconsistency can effectively reveal the presence of glass surfaces. Based on this observation, we propose a novel network, named MVGD-Net, for detecting glass surfaces in videos by leveraging motion inconsistency cues. MVGD-Net features three novel modules: the Cross-scale Multimodal Fusion Module (CMFM), which integrates extracted spatial features and estimated optical flow maps, and the History Guided Attention Module (HGAM) and Temporal Cross Attention Module (TCAM), both of which further enhance temporal features. A Temporal-Spatial Decoder (TSD) is also introduced to fuse the spatial and temporal features for generating the glass region mask. Furthermore, for training our network, we also propose a large-scale dataset, which comprises 312 diverse glass scenarios with a total of 19,268 frames. Extensive experiments demonstrate that our MVGD-Net outperforms relevant state-of-the-art methods.
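The motion-inconsistency cue itself is easy to prototype. A minimal sketch, using plain Farneback optical flow as a stand-in for the paper's learned flow: regions moving much slower than their surroundings are candidate glass regions.

```python
import cv2
import numpy as np

def slow_motion_mask(prev_bgr, curr_bgr, ratio=0.5):
    prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)  # per-pixel motion magnitude
    # Pixels moving much slower than the frame-wide median are suspicious.
    return (mag < ratio * np.median(mag)).astype(np.uint8) * 255
```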
https://arxiv.org/abs/2601.13715
The increasing adoption of smart classroom technologies in higher education has mainly focused on automating attendance, with limited attention given to students' emotional and cognitive engagement during lectures. This limits instructors' ability to identify disengagement and adapt teaching strategies in real time. This paper presents SCASED (Smart Classroom Attendance System with Emotion Detection), an IoT-based system that integrates automated attendance tracking with facial emotion recognition to support classroom engagement monitoring. The system uses a Raspberry Pi camera and OpenCV for face detection, and a fine-tuned MobileNetV2 model to classify four learning-related emotional states: engagement, boredom, confusion, and frustration. A session-based mechanism is implemented to manage attendance and emotion monitoring by recording attendance once per session and performing continuous emotion analysis thereafter. Attendance and emotion data are visualized through a cloud-based dashboard to provide instructors with insights into classroom dynamics. Experimental evaluation using the DAiSEE dataset achieved an emotion classification accuracy of 89.5%. The results show that integrating attendance data with emotion analytics can provide instructors with additional insight into classroom dynamics and support more responsive teaching practices.
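The session mechanism described is straightforward to express in code. A minimal sketch, where `detect_face`, `identify`, and `classify_emotion` stand in for the OpenCV detector, the identity matcher, and the fine-tuned MobileNetV2 head (all assumptions, not the paper's exact interfaces):

```python
EMOTIONS = ["engagement", "boredom", "confusion", "frustration"]

class Session:
    def __init__(self):
        self.attendance = set()  # student IDs seen this session
        self.emotion_log = []    # (student_id, emotion) records

    def process_frame(self, frame, detect_face, identify, classify_emotion):
        for face in detect_face(frame):
            sid = identify(face)
            if sid is None:
                continue
            self.attendance.add(sid)  # attendance recorded once per session
            # Emotion analysis continues for every subsequent frame.
            self.emotion_log.append((sid, classify_emotion(face)))
```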
https://arxiv.org/abs/2601.08049
Data scarcity and distribution shift pose major challenges for masked face detection and recognition. We propose a two-step generative data augmentation framework that combines rule-based mask warping with unpaired image-to-image translation using GANs, enabling the generation of realistic masked-face samples beyond purely synthetic transformations. Compared to rule-based warping alone, the proposed approach yields consistent qualitative improvements and complements existing GAN-based masked face generation methods such as IAMGAN. We introduce a non-mask preservation loss and stochastic noise injection to stabilize training and enhance sample diversity. Experimental observations highlight the effectiveness of the proposed components and suggest directions for future improvements in data-centric augmentation for face recognition tasks.
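One plausible formulation of the non-mask preservation loss, a hedged sketch rather than the paper's exact objective: penalize changes outside the mask region so the generator edits only the mask area. The binary `mask` is assumed to come from the rule-based warping step.

```python
import torch

def non_mask_preservation_loss(generated, source, mask):
    # generated, source: (B, 3, H, W); mask: (B, 1, H, W) in {0, 1}
    keep = 1.0 - mask  # pixels that must be preserved
    return (keep * (generated - source).abs()).sum() / keep.sum().clamp(min=1.0)
```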
https://arxiv.org/abs/2512.15774
Data augmentation is crucial for improving the robustness of face detection systems, especially under challenging conditions such as occlusion, illumination variation, and complex environments. Traditional copy-paste augmentation often produces unrealistic composites due to inaccurate foreground extraction, inconsistent scene geometry, and mismatched background semantics. To address these limitations, we propose Depth Copy Paste, a multimodal, depth-aware augmentation framework that generates diverse and physically consistent face detection training samples by copying full-body person instances and pasting them into semantically compatible scenes. Our approach first employs BLIP and CLIP to jointly assess semantic and visual coherence, enabling automatic retrieval of the most suitable background images for the given foreground person. To ensure high-quality foreground masks that preserve facial details, we integrate SAM3 for precise segmentation and Depth-Anything to extract only the non-occluded visible person regions, preventing corrupted facial textures from being used in augmentation. For geometric realism, we introduce a depth-guided sliding-window placement mechanism that searches over the background depth map to identify paste locations with optimal depth continuity and scale alignment. The resulting composites exhibit natural depth relationships and improved visual plausibility. Extensive experiments show that Depth Copy Paste provides more diverse and realistic training data, leading to significant performance improvements in downstream face detection tasks compared with traditional copy-paste and depth-free augmentation methods.
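The sliding-window placement search can be sketched directly. The scoring below (depth variance for continuity plus a depth-gap term for scale agreement) is our assumption, not the paper's exact criterion.

```python
import numpy as np

def best_paste_location(bg_depth, win_h, win_w, person_depth, stride=16):
    """Slide a window over the background depth map and return the
    top-left corner with the best depth continuity / scale match."""
    best, best_score = None, np.inf
    H, W = bg_depth.shape
    for y in range(0, H - win_h, stride):
        for x in range(0, W - win_w, stride):
            patch = bg_depth[y:y + win_h, x:x + win_w]
            # Low variance = flat depth (continuity); small mean gap = scale fit.
            score = patch.var() + abs(patch.mean() - person_depth)
            if score < best_score:
                best, best_score = (y, x), score
    return best
```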
https://arxiv.org/abs/2512.11683
Ensuring the authenticity of video content remains challenging as DeepFake generation becomes increasingly realistic and robust against detection. Most existing detectors implicitly assume temporally consistent and clean facial sequences, an assumption that rarely holds in real-world scenarios where compression artifacts, occlusions, and adversarial attacks destabilize face detection and often lead to invalid or misdetected faces. To address these challenges, we propose a Laplacian-Regularized Graph Convolutional Network (LR-GCN) that robustly detects DeepFakes from noisy or unordered face sequences, while being trained only on clean facial data. Our method constructs an Order-Free Temporal Graph Embedding (OF-TGE) that organizes frame-wise CNN features into an adaptive sparse graph based on semantic affinities. Unlike traditional methods constrained by strict temporal continuity, OF-TGE captures intrinsic feature consistency across frames, making it resilient to shuffled, missing, or heavily corrupted inputs. We further impose a dual-level sparsity mechanism on both graph structure and node features to suppress the influence of invalid faces. Crucially, we introduce an explicit Graph Laplacian Spectral Prior that acts as a high-pass operator in the graph spectral domain, highlighting structural anomalies and forgery artifacts, which are then consolidated by a low-pass GCN aggregation. This sequential design effectively realizes a task-driven spectral band-pass mechanism that suppresses background information and random noise while preserving manipulation cues. Extensive experiments on FF++, Celeb-DFv2, and DFDC demonstrate that LR-GCN achieves state-of-the-art performance and significantly improved robustness under severe global and local disruptions, including missing faces, occlusions, and adversarially perturbed face detections.
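The high-pass intuition behind the Laplacian prior can be seen on a toy graph. A minimal sketch (illustrative only, not the paper's exact prior): the normalized Laplacian L = I - D^(-1/2) A D^(-1/2) attenuates signals that are smooth over the affinity graph and concentrates its response on anomalous nodes, which a subsequent low-pass GCN aggregation then consolidates.

```python
import numpy as np

def normalized_laplacian(A):
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0.1],
              [0, 0, 0.1, 0]])       # node 3 is weakly attached (an "invalid" face)
X = np.array([1.0, 1.0, 1.0, 5.0])  # feature that is smooth except at node 3
print(normalized_laplacian(A) @ X)  # high-pass response peaks at the anomaly
```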
https://arxiv.org/abs/2512.07498
This thesis presents novel contributions in two primary areas: advancing the efficiency of generative models, particularly normalizing flows, and applying generative models to solve real-world computer vision challenges. The first part introduces significant improvements to normalizing flow architectures through six key innovations: 1) invertible 3x3 convolution layers with mathematically proven necessary and sufficient conditions for invertibility; 2) a more efficient Quad-coupling layer; 3) a fast and efficient parallel inversion algorithm for kxk convolutional layers; 4) a fast and efficient backpropagation algorithm for the inverse of convolution; 5) Inverse-Flow, which uses the inverse of convolution for the forward pass and is trained with the proposed backpropagation algorithm; and 6) Affine-StableSR, a compact and efficient super-resolution model that leverages pre-trained weights and normalizing flow layers to reduce parameter count while maintaining performance. The second part presents: 1) an automated quality assessment system for agricultural produce using Conditional GANs to address class imbalance, data scarcity, and annotation challenges, achieving good accuracy in seed purity testing; 2) an unsupervised geological mapping framework utilizing stacked autoencoders for dimensionality reduction, showing improved feature extraction compared to conventional methods; 3) a privacy-preserving method for autonomous driving datasets based on face detection and image inpainting; 4) Stable Diffusion-based image inpainting for replacing detected faces and license plates, advancing privacy-preserving techniques and ethical considerations in the field; and 5) an adapted diffusion model for art restoration that effectively handles multiple types of degradation through unified fine-tuning.
https://arxiv.org/abs/2512.04039
Detecting deepfake images is crucial in combating misinformation. We present a lightweight, generalizable binary classification model based on EfficientNet-B6, fine-tuned with transformation techniques to address severe class imbalances. By leveraging robust preprocessing, oversampling, and optimization strategies, our model achieves high accuracy, stability, and generalization. While incorporating Fourier transform-based phase and amplitude features showed minimal impact, our proposed framework helps non-experts to effectively identify deepfake images, making significant strides toward accessible and reliable deepfake detection.
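The Fourier features the paper tested (and found to have minimal impact) are simple to reproduce as extra input channels. A minimal sketch of one plausible construction:

```python
import numpy as np

def fourier_channels(gray):  # gray: (H, W) float array
    spec = np.fft.fftshift(np.fft.fft2(gray))
    amplitude = np.log1p(np.abs(spec))  # log-amplitude tames dynamic range
    phase = np.angle(spec)              # phase in [-pi, pi]
    return np.stack([amplitude, phase], axis=0)  # (2, H, W) extra channels
```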
https://arxiv.org/abs/2511.19187
Glass surfaces are ubiquitous in daily life, typically appearing colorless, transparent, and lacking distinctive features. These characteristics make glass surface detection a challenging computer vision task. Existing glass surface detection methods always rely on boundary cues (e.g., window and door frames) or reflection cues to locate glass surfaces, but they fail to fully exploit the intrinsic properties of the glass itself for accurate localization. We observed that in most real-world scenes, the illumination intensity in front of the glass surface differs from that behind it, which results in variations in the reflections visible on the glass surface. Specifically, when standing on the brighter side of the glass and applying a flash towards the darker side, existing reflections on the glass surface tend to disappear. Conversely, when standing on the darker side and applying a flash towards the brighter side, distinct reflections will appear on the glass surface. Based on this phenomenon, we propose NFGlassNet, a novel method for glass surface detection that leverages the reflection dynamics present in flash/no-flash imagery. Specifically, we propose a Reflection Contrast Mining Module (RCMM) for extracting reflections, and a Reflection Guided Attention Module (RGAM) for fusing features from reflection and glass surface for accurate glass surface detection. To train our network, we also construct a dataset consisting of 3.3K no-flash/flash image pairs captured from various scenes with corresponding ground-truth annotations. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods. Our code, model, and dataset will be available upon acceptance of the manuscript.
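The core cue is the per-pixel change in reflections between the two exposures. A crude, hedged stand-in for the Reflection Contrast Mining Module:

```python
import cv2
import numpy as np

def reflection_contrast(no_flash_bgr, flash_bgr, thresh=30):
    a = cv2.cvtColor(no_flash_bgr, cv2.COLOR_BGR2GRAY).astype(np.int16)
    b = cv2.cvtColor(flash_bgr, cv2.COLOR_BGR2GRAY).astype(np.int16)
    # Where reflections appear or vanish under flash, |a - b| is large.
    diff = cv2.GaussianBlur(np.abs(a - b).astype(np.uint8), (9, 9), 0)
    return (diff > thresh).astype(np.uint8) * 255  # candidate glass regions
```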
https://arxiv.org/abs/2511.16887
This paper addresses data quality issues in multimodal emotion recognition in conversation (MERC) through systematic quality control and multi-stage transfer learning. We implement a quality control pipeline for the MELD and IEMOCAP datasets that validates speaker identity, audio-text alignment, and face detection. We leverage transfer learning from speaker and face recognition, assuming that identity-discriminative embeddings capture not only stable acoustic and facial traits but also person-specific patterns of emotional expression. We employ RecoMadeEasy(R) engines for extracting 512-dimensional speaker and face embeddings, fine-tune MPNet-v2 for emotion-aware text representations, and adapt these features through emotion-specific MLPs trained on unimodal datasets. MAMBA-based trimodal fusion achieves 64.8% accuracy on MELD and 74.3% on IEMOCAP. These results show that combining identity-based audio and visual embeddings with emotion-tuned text representations on a quality-controlled subset of data yields consistent competitive performance for multimodal emotion recognition in conversation and provides a basis for further improvement on challenging, low-frequency emotion classes.
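A hedged sketch of the fusion stage's shape: 512-d speaker and face embeddings plus a 768-d MPNet-v2 text embedding, adapted per modality and concatenated for classification. The paper uses MAMBA-based fusion; a plain MLP stands in here, and the 7-class default matches MELD's label set.

```python
import torch
import torch.nn as nn

class TrimodalClassifier(nn.Module):
    def __init__(self, n_emotions=7):
        super().__init__()
        self.audio = nn.Sequential(nn.Linear(512, 256), nn.ReLU())
        self.face = nn.Sequential(nn.Linear(512, 256), nn.ReLU())
        self.text = nn.Sequential(nn.Linear(768, 256), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(768, 256), nn.ReLU(),
                                  nn.Linear(256, n_emotions))

    def forward(self, a, f, t):
        # Adapt each modality, concatenate (3 x 256 = 768), then classify.
        z = torch.cat([self.audio(a), self.face(f), self.text(t)], dim=-1)
        return self.head(z)
```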
https://arxiv.org/abs/2511.14969
Understanding and mitigating flicker effects caused by rapid variations in light intensity is critical for enhancing the performance of event cameras in diverse environments. This paper introduces an innovative autonomous mechanism for tuning the biases of event cameras, effectively addressing flicker across a wide frequency range (25 Hz to 500 Hz). Unlike traditional methods that rely on additional hardware or software for flicker filtering, our approach leverages the event camera's inherent bias settings. Utilizing a simple Convolutional Neural Network (CNN), the system identifies instances of flicker in the spatial domain and dynamically adjusts specific biases to minimize its impact. The efficacy of this autobiasing system was robustly tested using a face detector framework under both well-lit and low-light conditions, as well as across various frequencies. The results demonstrated significant improvements: enhanced YOLO confidence metrics for face detection, and an increased percentage of frames capturing detected faces. Moreover, the average gradient, which serves as an indicator of flicker presence through edge detection, decreased by 38.2 percent in well-lit conditions and by 53.6 percent in low-light conditions. These findings underscore the potential of our approach to significantly improve the functionality of event cameras in a range of adverse lighting scenarios.
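The closed-loop structure of such an autobiasing mechanism can be sketched as below. The `camera.get_bias`/`set_bias` interface and the bias name are hypothetical placeholders, not a specific SDK's API, and the adjustment rule is our assumption.

```python
def autobias_loop(camera, flicker_cnn, step=5, max_iters=20, target=0.2):
    """Nudge camera biases until the CNN's flicker score drops below target."""
    for _ in range(max_iters):
        frame = camera.get_event_frame()  # accumulated event frame
        score = flicker_cnn(frame)        # flicker probability in [0, 1]
        if score < target:
            break
        # Hypothetical low-pass bias: raising it suppresses fast flicker.
        camera.set_bias("bias_fo", camera.get_bias("bias_fo") + step)
```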
https://arxiv.org/abs/2511.02180
Face Recognition (FR) models have been shown to be vulnerable to adversarial examples that subtly alter benign facial images, both exposing blind spots in these systems and offering a means of protecting user privacy. End-to-end FR systems first obtain preprocessed faces from diverse facial imagery prior to computing the similarity of the deep feature embeddings. Whilst face preprocessing is a critical component of FR systems, and hence of adversarial attacks against them, we observe that this preprocessing is often overlooked in blackbox settings. Our study investigates the transferability of several out-of-the-box state-of-the-art adversarial attacks against FR when applied against different preprocessing techniques used in a blackbox setting. We observe that the choice of face detection model can degrade the attack success rate by up to 78%, whereas the choice of interpolation method during downsampling has relatively minimal impact. Furthermore, we find that the requirement for facial preprocessing even degrades attack strength in a whitebox setting, due to the unintended interaction of produced noise vectors against face detection models. Based on these findings, we propose a preprocessing-invariant method using input transformations that improves the transferability of the studied attacks by up to 27%. Our findings highlight the importance of preprocessing in FR systems, and the need for its consideration towards improving the adversarial generalisation of facial adversarial examples.
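A hedged sketch in the spirit of the input-transformation idea (an EOT-style approximation, our assumption rather than the paper's exact method): average gradients over randomly resized and re-interpolated copies so the perturbation survives unknown downsampling choices.

```python
import random
import torch
import torch.nn.functional as F

def robust_grad(loss_fn, x_adv, n=8):
    """Gradient of the attack loss averaged over random preprocessing views."""
    x_adv = x_adv.clone().requires_grad_(True)
    total = 0.0
    for _ in range(n):
        size = random.choice([112, 128, 160])
        mode = random.choice(["bilinear", "bicubic", "nearest"])
        kwargs = {} if mode == "nearest" else {"align_corners": False}
        view = F.interpolate(x_adv, size=(size, size), mode=mode, **kwargs)
        total = total + loss_fn(view)
    grad, = torch.autograd.grad(total / n, x_adv)
    return grad
```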
https://arxiv.org/abs/2510.17169
Extreme amodal detection is the task of inferring the 2D location of objects that are not fully visible in the input image but are visible within an expanded field-of-view. This differs from amodal detection, where the object is partially visible within the input image, but is occluded. In this paper, we consider the sub-problem of face detection, since this class provides motivating applications involving safety and privacy, but do not tailor our method specifically to this class. Existing approaches rely on image sequences so that missing detections may be interpolated from surrounding frames or make use of generative models to sample possible completions. In contrast, we consider the single-image task and propose a more efficient, sample-free approach that makes use of the contextual cues from the image to infer the presence of unseen faces. We design a heatmap-based extreme amodal object detector that addresses the problem of efficiently predicting a lot (the out-of-frame region) from a little (the image) with a selective coarse-to-fine decoder. Our method establishes strong results for this new task, even outperforming less efficient generative approaches.
https://arxiv.org/abs/2510.06791
In recent years, parametric representations of point clouds have been widely applied in tasks such as memory-efficient mapping and multi-robot collaboration. Highly adaptive models, like spline surfaces or quadrics, are computationally expensive in detection or fitting. In contrast, real-time methods, such as Gaussian mixture models or planes, have low degrees of freedom, making it difficult to achieve high accuracy with few primitives. To tackle this problem, a multi-model parametric representation with real-time surface detection and fitting is proposed. Specifically, a Gaussian mixture model is first employed to segment the point cloud into multiple clusters. Then, flat clusters are selected and merged into planes or curved surfaces. Planes can be easily fitted and delimited by a 2D voxel-based boundary description method. Surfaces with curvature are fitted by B-spline surfaces, and the same boundary description method is employed. Through evaluations on multiple public datasets, the proposed surface detection exhibits greater robustness than the state-of-the-art approach, with a 3.78-fold improvement in efficiency. Meanwhile, this representation achieves a 2-fold gain in accuracy over Gaussian mixture models, operating at 36.4 fps on a low-power onboard computer.
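The plane-fitting step for a flat cluster reduces to linear least squares. A minimal sketch (the voxel-based boundary description is omitted; only the fit is shown):

```python
import numpy as np

def fit_plane(points):  # points: (N, 3) array from one flat cluster
    """Fit z = a*x + b*y + c by least squares; return params and residuals."""
    A = np.c_[points[:, 0], points[:, 1], np.ones(len(points))]
    coeffs, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    residuals = points[:, 2] - A @ coeffs
    return tuple(coeffs), np.abs(residuals)  # (a, b, c), per-point error
```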
https://arxiv.org/abs/2509.14773
Fake news detection is an important and challenging task for defending online information integrity. Existing state-of-the-art approaches typically extract news semantic clues, such as writing patterns that include emotional words, stylistic features, etc. However, detectors tuned solely to such semantic clues can easily fall into surface detection patterns, which can shift rapidly in dynamic environments, leading to limited performance in the evolving news landscape. To address this issue, this paper investigates a novel perspective by incorporating news intent into fake news detection, bridging intents and semantics together. The core insight is that by considering news intents, one can deeply understand the inherent thoughts behind news deception, rather than the surface patterns within words alone. To achieve this goal, we propose Graph-based Intent-Semantic Joint Modeling (InSide) for fake news detection, which models deception clues from both semantic and intent signals via graph-based joint learning. Specifically, InSide reformulates news semantic and intent signals into heterogeneous graph structures, enabling long-range context interaction through entity guidance and capturing both holistic and implementation-level intent via coarse-to-fine intent modeling. To achieve better alignment between semantics and intents, we further develop a dynamic pathway-based graph alignment strategy for effective message passing and aggregation across these signals by establishing a common space. Extensive experiments on four benchmark datasets demonstrate the superiority of the proposed InSide compared to state-of-the-art methods.
https://arxiv.org/abs/2509.01660
Sensitive Information Exposure (SIEx) vulnerabilities (CWE-200) remain a persistent and under-addressed threat across software systems, often leading to serious security breaches. Existing detection tools rarely target the diverse subcategories of CWE-200 or provide context-aware analysis of code-level data flows. Aims: This paper presents SIExVulTS, a novel vulnerability detection system that integrates transformer-based models with static analysis to identify and verify sensitive information exposure in Java applications. Method: SIExVulTS employs a three-stage architecture: (1) an Attack Surface Detection Engine that uses sentence embeddings to identify sensitive variables, strings, comments, and sinks; (2) an Exposure Analysis Engine that instantiates CodeQL queries aligned with the CWE-200 hierarchy; and (3) a Flow Verification Engine that leverages GraphCodeBERT to semantically validate source-to-sink flows. We evaluate SIExVulTS using three curated datasets, including real-world CVEs, a benchmark set of synthetic CWE-200 examples, and labeled flows from 31 open-source projects. Results: The Attack Surface Detection Engine achieved an average F1 score greater than 93%, the Exposure Analysis Engine achieved an F1 score of 85.71%, and the Flow Verification Engine increased precision from 22.61% to 87.23%. Moreover, SIExVulTS successfully uncovered six previously unknown CVEs in major Apache projects. Conclusions: The results demonstrate that SIExVulTS is effective and practical for improving software security against sensitive data exposure, addressing limitations of existing tools in detecting and verifying CWE-200 vulnerabilities.
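The attack-surface idea, matching identifiers against sensitive concepts via sentence embeddings, can be sketched with the public sentence-transformers API. The seed list, model choice, and threshold are illustrative assumptions, not the paper's configuration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
SEEDS = ["password", "secret key", "credit card number", "auth token"]
seed_emb = model.encode(SEEDS, convert_to_tensor=True)

def is_sensitive(identifier, threshold=0.5):
    """Flag an identifier whose embedding is close to a sensitive seed phrase."""
    emb = model.encode(identifier, convert_to_tensor=True)
    return util.cos_sim(emb, seed_emb).max().item() >= threshold

print(is_sensitive("dbPassword"), is_sensitive("loopCounter"))
```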
https://arxiv.org/abs/2508.19472
Face mask detection has become increasingly important recently, particularly during the COVID-19 pandemic. Many face detection models have been deployed in IoT-based smart entryways; however, IoT development for face mask detection remains limited. This paper proposes a two-factor authentication system for smart entryway access control on a Raspberry Pi platform, combining facial recognition with passcode verification, along with an automation process that alerts the owner and activates the surveillance system when a stranger is detected and allows the owner to control the system remotely via Telegram. The system employs the Local Binary Patterns Histograms (LBPH) algorithm for full-face recognition and a modified LBPH algorithm for occluded face detection. On average, the system achieved an accuracy of approximately 70%, a precision of approximately 80%, and a recall of approximately 83.26% across all tested users. The results indicate that the system is capable of conducting face recognition and mask detection, automating remote-control operations to register users, locking or unlocking the door, and notifying the owner. In the user acceptance test, participants expressed high acceptance of the system for future use.
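For reference, the standard LBPH recognition step is available in opencv-contrib-python's `cv2.face` module. A minimal sketch (the paper's modified LBPH variant for occluded faces is not reproduced here; the training crops below are random stand-ins):

```python
import cv2
import numpy as np

recognizer = cv2.face.LBPHFaceRecognizer_create()
faces = [np.random.randint(0, 255, (100, 100), dtype=np.uint8)]  # stand-in crops
labels = np.array([0])
recognizer.train(faces, labels)

label, confidence = recognizer.predict(faces[0])  # lower confidence = closer match
print(label, confidence)
```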
https://arxiv.org/abs/2508.13617
Face Recognition Systems that operate in unconstrained environments capture images under varying conditions,such as inconsistent lighting, or diverse face poses. These challenges require including a Face Detection module that regresses bounding boxes and landmark coordinates for proper Face Alignment. This paper shows the effectiveness of Object Generation Attacks on Face Detection, dubbed Face Generation Attacks, and demonstrates for the first time a Landmark Shift Attack that backdoors the coordinate regression task performed by face detectors. We then offer mitigations against these vulnerabilities.
https://arxiv.org/abs/2508.00620
Face detection is a crucial component in many AI-driven applications such as surveillance, biometric authentication, and human-computer interaction. However, real-world conditions like low-resolution imagery present significant challenges that degrade detection performance. In this study, we systematically investigate the impact of input resolution on the accuracy and robustness of three prominent deep learning-based face detectors: YOLOv11, YOLOv12, and MTCNN. Using the WIDER FACE dataset, we conduct extensive evaluations across multiple image resolutions (160x160, 320x320, and 640x640) and assess each model's performance using metrics such as precision, recall, mAP50, mAP50-95, and inference time. Results indicate that YOLOv11 outperforms YOLOv12 and MTCNN in terms of detection accuracy, especially at higher resolutions, while YOLOv12 exhibits slightly better recall. MTCNN, although competitive in landmark localization, lags in real-time inference speed. Our findings provide actionable insights for selecting resolution-aware face detection models suitable for varying operational constraints.
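A resolution sweep of this kind is direct to run with the `ultralytics` API. A hedged sketch, where "yolo11n.pt" is the stock checkpoint name and "widerface.yaml" is a placeholder for a WIDER FACE-format dataset config, not an artifact shipped with the paper:

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # swap in a YOLOv12 or MTCNN wrapper to compare
for imgsz in (160, 320, 640):
    metrics = model.val(data="widerface.yaml", imgsz=imgsz, verbose=False)
    print(imgsz,
          f"mAP50={metrics.box.map50:.3f}",
          f"mAP50-95={metrics.box.map:.3f}")
```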
https://arxiv.org/abs/2507.23341
AI-generated face detectors trained via supervised learning typically rely on synthesized images from specific generators, limiting their generalization to emerging generative techniques. To overcome this limitation, we introduce a self-supervised method based on bi-level optimization. In the inner loop, we pretrain a vision encoder only on photographic face images using a set of linearly weighted pretext tasks: classification of categorical exchangeable image file format (EXIF) tags, ranking of ordinal EXIF tags, and detection of artificial face manipulations. The outer loop then optimizes the relative weights of these pretext tasks to enhance the coarse-grained detection of manipulated faces, serving as a proxy task for identifying AI-generated faces. In doing so, it aligns self-supervised learning more closely with the ultimate goal of AI-generated face detection. Once pretrained, the encoder remains fixed, and AI-generated faces are detected either as anomalies under a Gaussian mixture model fitted to photographic face features or by a lightweight two-layer perceptron serving as a binary classifier. Extensive experiments demonstrate that our detectors significantly outperform existing approaches in both one-class and binary classification settings, exhibiting strong generalization to unseen generators.
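The one-class detection stage lends itself to a short sketch: fit a Gaussian mixture to frozen-encoder features of photographic faces, then flag low-likelihood samples as AI-generated. Feature extraction is abstracted away; the component count and threshold below are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_detector(real_features, n_components=8):
    """real_features: (N, D) encoder features of photographic faces only."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(real_features)
    # Threshold at the 1st percentile of real-face log-likelihoods (~1% FPR).
    tau = np.percentile(gmm.score_samples(real_features), 1)
    return lambda feats: gmm.score_samples(feats) < tau  # True = flagged as generated
```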
https://arxiv.org/abs/2507.22824