With the rapid development of deep generative models, forged facial images are massively exploited for illegal activities. Although existing synthetic face detection methods have achieved significant progress, they suffer from the inherent limitation of overconfidence due to their reliance on the Softmax activation function. Thus, these methods often lead to unreliable predictions when encountering unknown Out-of-Distribution (OOD) images, and cannot ascertain the model's uncertainty in its prediction. Meanwhile, most existing methods require massive high-quality annotated data, which greatly limits their practicability across diverse scenarios. To address these limitations, we propose EMSFD (Evidence-based decision Modeling for Synthetic Face Detection with uncertainty-driven active learning), an approach designed to enhance detection reliability and generalizability. Specifically, EMSFD models class evidence using the Dirichlet distribution and explicitly incorporates model uncertainty into the prediction process. Furthermore, during training, the estimated uncertainty is exploited to prioritize more informative samples from the unlabeled pool for annotation, thereby reducing labeling cost and improving model generalization. Extensive experimental evaluations demonstrate that our method enhances the interpretability of synthetic face detection. Meanwhile, our method yields a 15% increase in accuracy compared to existing state-of-the-art (SOTA) baselines, which demonstrates the superior detection performance and generalizability of our approach. Our code is available at: this https URL.
https://arxiv.org/abs/2605.09935
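The EMSFD abstract above does not spell out how evidence is turned into uncertainty. The following is a minimal sketch assuming the common subjective-logic formulation of evidential deep learning, where non-negative evidence e_k gives Dirichlet parameters alpha_k = e_k + 1 and uncertainty u = K / sum(alpha). The function names `dirichlet_opinion` and `select_for_annotation` are hypothetical, and the top-uncertainty selection rule is an assumption rather than the authors' acquisition strategy.

```python
from typing import List, Sequence, Tuple


def dirichlet_opinion(evidence: Sequence[float]) -> Tuple[List[float], List[float], float]:
    """Map non-negative class evidence to (belief masses, expected probabilities, uncertainty).

    Assumes alpha_k = e_k + 1; this parameterisation is an assumption, not taken from the abstract.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)                       # Dirichlet strength S
    belief = [e / strength for e in evidence]   # b_k = e_k / S
    probs = [a / strength for a in alpha]       # expected class probability alpha_k / S
    uncertainty = num_classes / strength        # u = K / S, so sum(b) + u = 1
    return belief, probs, uncertainty


def select_for_annotation(pool_evidence: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Return indices of the `budget` most uncertain unlabeled samples (ties broken by index)."""
    scored = [(dirichlet_opinion(ev)[2], idx) for idx, ev in enumerate(pool_evidence)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [idx for _, idx in scored[:budget]]


if __name__ == "__main__":
    # Two classes (real, fake): strong evidence, no evidence, weak balanced evidence.
    pool = [[8.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
    for ev in pool:
        print(ev, dirichlet_opinion(ev))
    print("annotate first:", select_for_annotation(pool, budget=2))
```

With zero evidence the uncertainty is exactly 1 and the expected probabilities are uniform, which is the behavior a Softmax output cannot express on its own.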
Optical neural networks are emerging as powerful machine learning and information processing tools because of their potential advantages in speed and energy efficiency. The training methods of these physical models, however, remain underexplored compared to their digital counterparts and are leading to suboptimal performance. This paper reports a pre-training-driven approach that leads to snapshot image denoising with substantially improved quality. We demonstrated effective free-space optical denoising by a diffractive network optimized by a two-step process including (1) pre-training using a massive dataset of 3.45 million diverse but simple images and (2) fine-tuning with the corresponding task-specific datasets. Compared to conventional Fourier-domain filtering and directly trained diffractive networks, such a transfer learning process exhibited prominent advantages for denoising images degraded by severe noise, peak signal-to-noise ratio (PSNR) below 8 dB, while preserving fine image features and improving the PSNR to above 18 dB. Importantly, the same pre-trained optical network could be consistently fine-tuned to process degraded images from highly diverse styles ranging from handwritten digits (MNIST) and chest X-rays (ChestMNIST) to CIFAR-10 images and human faces (CelebA). We further demonstrated the critical role of our optical denoisers in vision-based applications, including face detection, plate recognition, and localization of UAVs in noisy conditions.
https://arxiv.org/abs/2605.07810
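The optical denoising abstract above reports quality as PSNR (below 8 dB for the noisy input, above 18 dB after denoising). As a reminder of what that metric measures, here is a small sketch of the standard PSNR definition on flat pixel lists scaled to [0, 1]. The function name `psnr` and the toy pixel values are illustrative assumptions and do not come from the paper.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], test: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    clean = [0.0, 0.0, 0.0, 0.0]
    noisy = [0.1, 0.1, 0.1, 0.1]   # MSE = 0.01
    print(round(psnr(clean, noisy), 6))

    # A 10 dB PSNR gain (for example 8 dB -> 18 dB) corresponds to a 10x lower MSE.
    print(10 ** ((18 - 8) / 10))
```

Because PSNR is logarithmic in the mean squared error, the reported improvement from below 8 dB to above 18 dB implies at least a tenfold reduction in mean squared error.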
Now that convolutional neural networks (CNNs) have demonstrated their powerful problem-solving ability in image processing, efforts have been made to reconstruct detailed face shapes from 2D face images or videos. However, making full use of a CNN requires a large amount of labeled training data. Coarse morphable face models have been used to synthesize labeled data, but it is hard for them to generate photo-realistic data with details such as wrinkles. In this project, we present a pipeline that reconstructs a 3D human face model from a single RGB image. The pipeline includes face detection, landmark detection, regression of 3DMM parameters, and soft rendering. Mentor: Zhipeng Fan (Email: zf606@nyu.edu). Code Repository: this https URL reconstruction. Code Reference: this https URL pytorch
https://arxiv.org/abs/2605.03996
Explicit reconstruction constraints derived from the decoupled representation are further imposed to suppress abnormal channel amplification and chromatic noise. Experiments on LOLv2-Real, MIT-Adobe FiveK, and LSRW show that the proposed method achieves competitive or superior quantitative and visual performance, reaching 29.71 dB PSNR and 0.89 SSIM on LOLv2-Real. DarkFace experiments further indicate improved downstream face detection under low-light conditions. Code and pretrained models are available at: this https URL.
https://arxiv.org/abs/2605.02627
The release of GPT-image-2 by OpenAI marks a watershed moment in AI-generated imagery: the boundary between photographic reality and synthetic content has never been more difficult to discern. We introduce the GPT-Image-2 Twitter Dataset, the first published dataset of GPT-image-2 generated images, sourced from publicly available Twitter/X posts in the immediate aftermath of the model's April 21, 2026 release. Leveraging the Twitter API v2 and a multi-stage curation pipeline spanning multilingual text heuristics (English, Japanese, and Chinese), browser-automated Twitter "Made with AI" badge verification, and model name variant matching, we curate 10,217 confirmed GPT-image-2 images from 27,662 collected records over a six-day window. We characterize the dataset across four analyses: CLIP-based zero-shot subject taxonomy, OCR text legibility (82.0% of images contain detectable text), face detection (59.2% of images, 22,583 total faces), and semantic clustering (137 CLIP ViT-L/14 clusters). A key negative result is that C2PA content credentials are systematically stripped by Twitter's CDN on upload, rendering cryptographic provenance verification infeasible for social-media-sourced AI images. The dataset and all curation code are released publicly.
https://arxiv.org/abs/2604.25370
Recent advances in Multimodal Large Language Models (MLLMs) have created new opportunities for facial expression recognition (FER), moving it beyond pure label prediction toward reasoning-based affect understanding. However, existing MLLM-based FER methods still follow a passive paradigm: they rely on externally prepared facial inputs and perform single-pass reasoning over fixed visual evidence, without the capability for active facial perception. To address this limitation, we propose ActFER, an agentic framework that reformulates FER as active visual evidence acquisition followed by multimodal reasoning. Specifically, ActFER dynamically invokes tools for face detection and alignment, selectively zooms into informative local regions, and reasons over facial Action Units (AUs) and emotions through a visual Chain-of-Thought. To realize such behavior, we further develop Utility-Calibrated GRPO (UC-GRPO), a reinforcement learning algorithm tailored to agentic FER. UC-GRPO uses AU-grounded multi-level verifiable rewards to densify supervision, query-conditional contrastive utility estimation to enable sample-aware dynamic credit assignment for local inspection, and emotion-aware EMA calibration to reduce noisy utility estimates while capturing emotion-wise inspection tendencies. This algorithm enables ActFER to learn both when local inspection is beneficial and how to reason over the acquired evidence. Comprehensive experiments show that ActFER trained with UC-GRPO consistently outperforms passive MLLM-based FER baselines and substantially improves AU prediction accuracy.
https://arxiv.org/abs/2604.08990
As the Internet of Things (IoT) becomes deeply embedded in daily life, users are increasingly concerned about privacy leakage, especially from video data. Since frame-by-frame protection in large-scale video analytics (e.g., smart communities) introduces significant latency, a more efficient solution is to selectively protect frames containing privacy objects (e.g., faces). Existing object detectors require fully decoded videos or per-frame processing in compressed videos, leading to decoding overhead or reduced accuracy. Therefore, we propose ComPrivDet, an efficient method for detecting privacy objects in compressed video by reusing I-frame inference results. By identifying the presence of new objects through compressed-domain cues, ComPrivDet either skips P- and B-frame detections or efficiently refines them with a lightweight detector. ComPrivDet maintains 99.75% accuracy in private face detection and 96.83% in private license plate detection while skipping over 80% of inferences. It averages 9.84% higher accuracy with 75.95% lower latency than existing compressed-domain detection methods.
https://arxiv.org/abs/2604.03640
Responding to one's name is among the earliest-emerging social orienting behaviors and is one of the most prominent aspects in the detection of Autism Spectrum Disorder (ASD). Typically developing children exhibit near-reflexive orienting to their name, whereas children with ASD often demonstrate reduced frequency, increased latency, or atypical patterns of response. In this study, we examine differential responsiveness to quantify name-calling stimuli delivered by both human agents and NAO, a humanoid robot widely employed in socially assistive interventions for autism. The analysis focuses on multiple behavioral parameters, including eye contact, response latency, head and facial orientation shifts, and duration of sustained interest. Video-based computational methods were employed, incorporating face detection, eye region tracking, and spatio-temporal facial analysis, to obtain fine-grained measures of children's responses. By comparing neurotypical and neuroatypical groups under controlled human-robot conditions, this work aims to understand how the source and modality of social cues affect attentional dynamics in name-calling contexts. The findings advance both the theoretical understanding of social orienting deficits in autism and the applied development of robot-assisted assessment tools.
https://arxiv.org/abs/2603.22759
When visual evidence is ambiguous, vision models must decide whether to interpret face-like patterns as meaningful. Face pareidolia, the perception of faces in non-face objects, provides a controlled probe of this behavior. We introduce a representation-level diagnostic framework that analyzes detection, localization, uncertainty, and bias across class, difficulty, and emotion in face pareidolia images. Under a unified protocol, we evaluate six models spanning four representational regimes: vision-language models (VLMs; CLIP-B/32, CLIP-L/14, LLaVA-1.5-7B), pure vision classification (ViT), general object detection (YOLOv8), and face detection (RetinaFace). Our analysis reveals three mechanisms of interpretation under ambiguity. VLMs exhibit semantic overactivation, systematically pulling ambiguous non-human regions toward the Human concept, with LLaVA-1.5-7B producing the strongest and most confident over-calls, especially for negative emotions. ViT instead follows an uncertainty-as-abstention strategy, remaining diffuse yet largely unbiased. Detection-based models achieve low bias through conservative priors that suppress pareidolia responses even when localization is controlled. These results show that behavior under ambiguity is governed more by representational choices than score thresholds, and that uncertainty and bias are decoupled: low uncertainty can signal either safe suppression, as in detectors, or extreme over-interpretation, as in VLMs. Pareidolia therefore provides a compact diagnostic and a source of ambiguity-aware hard negatives for probing and improving the semantic robustness of vision-language systems. Code will be released upon publication.
https://arxiv.org/abs/2603.03989
Vision disorders significantly impact millions of lives, altering how visual information is processed and perceived. In this work, a computational framework was developed using the BrokenEyes system to simulate five common eye disorders (age-related macular degeneration, cataract, glaucoma, refractive errors, and diabetic retinopathy) and to analyze their effects on neural-like feature representations in deep learning models. Leveraging a combination of human and non-human datasets, models trained under normal and disorder-specific conditions revealed critical disruptions in feature maps, particularly for cataract and glaucoma, which align with known neural processing challenges in these conditions. Evaluation metrics such as activation energy and cosine similarity quantified the severity of these distortions, providing insights into the interplay between degraded visual inputs and learned representations.
https://arxiv.org/abs/2602.23212
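The BrokenEyes abstract above names cosine similarity and activation energy as its distortion metrics without defining them. The sketch below uses the standard cosine similarity between two flattened feature maps and, as a labeled assumption, takes "activation energy" to be the mean squared activation. The function names and the toy feature values are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened feature maps (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("feature maps must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def activation_energy(feature_map: Sequence[float]) -> float:
    """Mean squared activation; this definition is an assumption, not taken from the abstract."""
    return sum(v * v for v in feature_map) / len(feature_map)


if __name__ == "__main__":
    normal = [0.9, 0.1, 0.7, 0.3]      # features from an unimpaired input (toy values)
    degraded = [0.5, 0.4, 0.5, 0.4]    # features from a simulated-disorder input (toy values)
    print("similarity:", cosine_similarity(normal, degraded))
    print("energy ratio:", activation_energy(degraded) / activation_energy(normal))
```

A similarity near 1 would indicate that the simulated disorder leaves the learned representation largely intact, while lower values indicate stronger distortion.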
Event cameras record luminance changes with microsecond resolution, but converting their sparse, asynchronous output into dense tensors that neural networks can exploit remains a core challenge. Conventional histograms or globally-decayed time-surface representations apply fixed temporal parameters across the entire image plane, which in practice creates a trade-off between preserving spatial structure during still periods and retaining sharp edges during rapid motion. We introduce Locally Adaptive Decay Surfaces (LADS), a family of event representations in which the temporal decay at each location is modulated according to local signal dynamics. Three strategies are explored, based on event rate, Laplacian-of-Gaussian response, and high-frequency spectral energy. These adaptive schemes preserve detail in quiescent regions while reducing blur in regions of dense activity. Extensive experiments on the public data show that LADS consistently improves both face detection and facial landmark accuracy compared to standard non-adaptive representations. At 30 Hz, LADS achieves higher detection accuracy and lower landmark error than either baseline, and at 240 Hz it mitigates the accuracy decline typically observed at higher frequencies, sustaining 2.44 % normalized mean error for landmarks and 0.966 mAP50 in face detection. These high-frequency results even surpass the accuracy reported in prior works operating at 30 Hz, setting new benchmarks for event-based face analysis. Moreover, by preserving spatial structure at the representation stage, LADS supports the use of much lighter network architectures while still retaining real-time performance. These results highlight the importance of context-aware temporal integration for neuromorphic vision and point toward real-time, high-frequency human-computer interaction systems that exploit the unique advantages of event cameras.
https://arxiv.org/abs/2602.23101
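To make the idea behind the LADS abstract above concrete, here is a rough sketch of an exponential time surface whose decay constant varies per pixel with the local event rate. Everything specific in it is an assumption for illustration: the 3x3 neighborhood, the mapping `tau = tau_max / (1 + local_count / rate_scale)`, the clamp to `tau_min`, and all parameter values. The paper's actual modulation functions, including the Laplacian-of-Gaussian and spectral variants, are not given in the abstract.

```python
import math
from typing import List, Optional, Sequence, Tuple

Event = Tuple[int, int, float]  # (x, y, timestamp in seconds); polarity omitted for brevity


def adaptive_decay_surface(
    events: Sequence[Event],
    width: int,
    height: int,
    t_ref: float,
    tau_min: float = 0.005,
    tau_max: float = 0.1,
    rate_scale: float = 5.0,
) -> List[List[float]]:
    """Time surface exp(-(t_ref - t_last) / tau) with a per-pixel tau driven by local event count."""
    last_t: List[List[Optional[float]]] = [[None] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for x, y, t in events:
        if t > t_ref:
            continue  # ignore events after the reference time
        count[y][x] += 1
        if last_t[y][x] is None or t > last_t[y][x]:
            last_t[y][x] = t

    surface = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            t_last = last_t[y][x]
            if t_last is None:
                continue  # no event seen at this pixel
            # Event count in the 3x3 neighborhood serves as the local activity measure.
            local = sum(
                count[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            )
            # Busy regions get a short tau (less blur); quiet regions keep a long tau.
            tau = max(tau_min, tau_max / (1.0 + local / rate_scale))
            surface[y][x] = math.exp(-(t_ref - t_last) / tau)
    return surface


if __name__ == "__main__":
    quiet = [(0, 0, 0.0)]          # a single isolated event
    busy = [(4, 4, 0.0)] * 20      # a burst of events at one pixel
    s = adaptive_decay_surface(quiet + busy, width=5, height=5, t_ref=0.05)
    print("quiet pixel:", s[0][0], "busy pixel:", s[4][4])
```

In this toy run both pixels last fired at the same time, yet the isolated event retains a higher value than the burst pixel, which is the intended contrast with a single global decay constant.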
Understanding human emotions from multimodal signals poses a significant challenge in affective computing and human-robot interaction. While multimodal large language models (MLLMs) have excelled in general vision-language tasks, their capabilities in emotional reasoning remain limited. The field currently suffers from a scarcity of large-scale datasets with high-quality, descriptive emotion annotations and lacks standardized benchmarks for evaluation. Our preliminary framework, Emotion-LLaMA, pioneered instruction-tuned multimodal learning for emotion reasoning but was restricted by explicit face detectors, implicit fusion strategies, and low-quality training data with limited scale. To address these limitations, we present Emotion-LLaMAv2 and the MMEVerse benchmark, establishing an end-to-end pipeline together with a standardized evaluation setting for emotion recognition and reasoning. Emotion-LLaMAv2 introduces three key advances. First, an end-to-end multiview encoder eliminates external face detection and captures nuanced emotional cues via richer spatial and temporal multiview tokens. Second, a Conv Attention pre-fusion module is designed to enable simultaneous local and global multimodal feature interactions external to the LLM backbone. Third, a perception-to-cognition curriculum instruction tuning scheme within the LLaMA2 backbone unifies emotion recognition and free-form emotion reasoning. To support large-scale training and reproducible evaluation, MMEVerse aggregates twelve publicly available emotion datasets, including IEMOCAP, MELD, DFEW, and MAFW, into a unified multimodal instruction format. The data are re-annotated via a multi-agent pipeline involving Qwen2 Audio, Qwen2.5 VL, and GPT 4o, producing 130k training clips and 36k testing clips across 18 evaluation benchmarks.
https://arxiv.org/abs/2601.16449
In the realm of Virtual Reality (VR) and Human-Computer Interaction (HCI), real-time emotion recognition shows promise for supporting individuals with Autism Spectrum Disorder (ASD) in improving social skills. This task requires a strict latency-accuracy trade-off, with motion-to-photon (MTP) latency kept below 140 ms to maintain contingency. However, most off-the-shelf Deep Learning models prioritize accuracy over the strict timing constraints of commodity hardware. As a first step toward accessible VR therapy, we benchmark State-of-the-Art (SOTA) models for Zero-Shot Facial Expression Recognition (FER) on virtual characters using the UIBVFED dataset. We evaluate Medium and Nano variants of YOLO (v8, v11, and v12) for face detection, alongside general-purpose Vision Transformers including CLIP, SigLIP, and this http URL. Results on CPU-only inference demonstrate that while face detection on stylized avatars is robust (100% accuracy), a "Latency Wall" exists in the classification stage. The YOLOv11n architecture offers the optimal balance for detection (~54 ms). However, general-purpose Transformers like CLIP and SigLIP fail to achieve viable accuracy (<23%) or speed (>150 ms) for real-time loops. This study highlights the necessity for lightweight, domain-specific architectures to enable accessible, real-time AI in therapeutic settings.
https://arxiv.org/abs/2601.15914
Glass surfaces, ubiquitous in both daily life and professional environments, present a potential threat to vision-based systems such as robot and drone navigation. To address this challenge, recent studies have shown significant interest in Video Glass Surface Detection (VGSD). We observe that objects in the reflection (or transmission) layer appear farther from the glass surfaces. Consequently, in video motion scenarios, the notable reflected (or transmitted) objects on the glass surface move slower than objects in non-glass regions within the same spatial plane, and this motion inconsistency can effectively reveal the presence of glass surfaces. Based on this observation, we propose a novel network, named MVGD-Net, for detecting glass surfaces in videos by leveraging motion inconsistency cues. Our MVGD-Net features three novel modules: the Cross-scale Multimodal Fusion Module (CMFM), which integrates extracted spatial features and estimated optical flow maps, and the History Guided Attention Module (HGAM) and Temporal Cross Attention Module (TCAM), both of which further enhance temporal features. A Temporal-Spatial Decoder (TSD) is also introduced to fuse the spatial and temporal features for generating the glass region mask. Furthermore, to train our network, we also propose a large-scale dataset, which comprises 312 diverse glass scenarios with a total of 19,268 frames. Extensive experiments demonstrate that our MVGD-Net outperforms relevant state-of-the-art methods.
https://arxiv.org/abs/2601.13715
The increasing adoption of smart classroom technologies in higher education has mainly focused on automating attendance, with limited attention given to students' emotional and cognitive engagement during lectures. This limits instructors' ability to identify disengagement and adapt teaching strategies in real time. This paper presents SCASED (Smart Classroom Attendance System with Emotion Detection), an IoT-based system that integrates automated attendance tracking with facial emotion recognition to support classroom engagement monitoring. The system uses a Raspberry Pi camera and OpenCV for face detection, and a finetuned MobileNetV2 model to classify four learning-related emotional states: engagement, boredom, confusion, and frustration. A session-based mechanism is implemented to manage attendance and emotion monitoring by recording attendance once per session and performing continuous emotion analysis thereafter. Attendance and emotion data are visualized through a cloud-based dashboard to provide instructors with insights into classroom dynamics. Experimental evaluation using the DAiSEE dataset achieved an emotion classification accuracy of 89.5%. The results show that integrating attendance data with emotion analytics can provide instructors with additional insight into classroom dynamics and support more responsive teaching practices.
https://arxiv.org/abs/2601.08049
Data scarcity and distribution shift pose major challenges for masked face detection and recognition. We propose a two-step generative data augmentation framework that combines rule-based mask warping with unpaired image-to-image translation using GANs, enabling the generation of realistic masked-face samples beyond purely synthetic transformations. Compared to rule-based warping alone, the proposed approach yields consistent qualitative improvements and complements existing GAN-based masked face generation methods such as IAMGAN. We introduce a non-mask preservation loss and stochastic noise injection to stabilize training and enhance sample diversity. Experimental observations highlight the effectiveness of the proposed components and suggest directions for future improvements in data-centric augmentation for face recognition tasks.
https://arxiv.org/abs/2512.15774
Data augmentation is crucial for improving the robustness of face detection systems, especially under challenging conditions such as occlusion, illumination variation, and complex environments. Traditional copy paste augmentation often produces unrealistic composites due to inaccurate foreground extraction, inconsistent scene geometry, and mismatched background semantics. To address these limitations, we propose Depth Copy Paste, a multimodal and depth aware augmentation framework that generates diverse and physically consistent face detection training samples by copying full body person instances and pasting them into semantically compatible scenes. Our approach first employs BLIP and CLIP to jointly assess semantic and visual coherence, enabling automatic retrieval of the most suitable background images for the given foreground person. To ensure high quality foreground masks that preserve facial details, we integrate SAM3 for precise segmentation and Depth-Anything to extract only the non occluded visible person regions, preventing corrupted facial textures from being used in augmentation. For geometric realism, we introduce a depth guided sliding window placement mechanism that searches over the background depth map to identify paste locations with optimal depth continuity and scale alignment. The resulting composites exhibit natural depth relationships and improved visual plausibility. Extensive experiments show that Depth Copy Paste provides more diverse and realistic training data, leading to significant performance improvements in downstream face detection tasks compared with traditional copy paste and depth free augmentation methods.
https://arxiv.org/abs/2512.11683
Ensuring the authenticity of video content remains challenging as DeepFake generation becomes increasingly realistic and robust against detection. Most existing detectors implicitly assume temporally consistent and clean facial sequences, an assumption that rarely holds in real-world scenarios where compression artifacts, occlusions, and adversarial attacks destabilize face detection and often lead to invalid or misdetected faces. To address these challenges, we propose a Laplacian-Regularized Graph Convolutional Network (LR-GCN) that robustly detects DeepFakes from noisy or unordered face sequences, while being trained only on clean facial data. Our method constructs an Order-Free Temporal Graph Embedding (OF-TGE) that organizes frame-wise CNN features into an adaptive sparse graph based on semantic affinities. Unlike traditional methods constrained by strict temporal continuity, OF-TGE captures intrinsic feature consistency across frames, making it resilient to shuffled, missing, or heavily corrupted inputs. We further impose a dual-level sparsity mechanism on both graph structure and node features to suppress the influence of invalid faces. Crucially, we introduce an explicit Graph Laplacian Spectral Prior that acts as a high-pass operator in the graph spectral domain, highlighting structural anomalies and forgery artifacts, which are then consolidated by a low-pass GCN aggregation. This sequential design effectively realizes a task-driven spectral band-pass mechanism that suppresses background information and random noise while preserving manipulation cues. Extensive experiments on FF++, Celeb-DFv2, and DFDC demonstrate that LR-GCN achieves state-of-the-art performance and significantly improved robustness under severe global and local disruptions, including missing faces, occlusions, and adversarially perturbed face detections.
https://arxiv.org/abs/2512.07498
This thesis presents novel contributions in two primary areas: advancing the efficiency of generative models, particularly normalizing flows, and applying generative models to solve real-world computer vision challenges. The first part introduces significant improvements to normalizing flow architectures through six key innovations: 1) development of invertible 3x3 convolution layers with mathematically proven necessary and sufficient conditions for invertibility; 2) introduction of a more efficient Quad-coupling layer; 3) design of a fast and efficient parallel inversion algorithm for kxk convolutional layers; 4) a fast and efficient backpropagation algorithm for the inverse of convolution; 5) Inverse-Flow, which uses the inverse of convolution for the forward pass and is trained with the proposed backpropagation algorithm; and 6) Affine-StableSR, a compact and efficient super-resolution model that leverages pre-trained weights and normalizing flow layers to reduce parameter count while maintaining performance. The second part presents five applications: 1) an automated quality assessment system for agricultural produce using Conditional GANs to address class imbalance, data scarcity, and annotation challenges, achieving good accuracy in seed purity testing; 2) an unsupervised geological mapping framework utilizing stacked autoencoders for dimensionality reduction, showing improved feature extraction compared to conventional methods; 3) a privacy-preserving method for autonomous driving datasets based on face detection and image inpainting; 4) Stable Diffusion-based image inpainting that replaces detected faces and license plates, advancing privacy-preserving techniques and ethical considerations in the field; and 5) an adapted diffusion model for art restoration that effectively handles multiple types of degradation through unified fine-tuning.
https://arxiv.org/abs/2512.04039
Detecting deepfake images is crucial in combating misinformation. We present a lightweight, generalizable binary classification model based on EfficientNet-B6, fine-tuned with transformation techniques to address severe class imbalances. By leveraging robust preprocessing, oversampling, and optimization strategies, our model achieves high accuracy, stability, and generalization. While incorporating Fourier transform-based phase and amplitude features showed minimal impact, our proposed framework helps non-experts to effectively identify deepfake images, making significant strides toward accessible and reliable deepfake detection.
https://arxiv.org/abs/2511.19187