When visual evidence is ambiguous, vision models must decide whether to interpret face-like patterns as meaningful. Face pareidolia, the perception of faces in non-face objects, provides a controlled probe of this behavior. We introduce a representation-level diagnostic framework that analyzes detection, localization, uncertainty, and bias across class, difficulty, and emotion in face pareidolia images. Under a unified protocol, we evaluate six models spanning four representational regimes: vision-language models (VLMs; CLIP-B/32, CLIP-L/14, LLaVA-1.5-7B), pure vision classification (ViT), general object detection (YOLOv8), and face detection (RetinaFace). Our analysis reveals three mechanisms of interpretation under ambiguity. VLMs exhibit semantic overactivation, systematically pulling ambiguous non-human regions toward the Human concept, with LLaVA-1.5-7B producing the strongest and most confident over-calls, especially for negative emotions. ViT instead follows an uncertainty-as-abstention strategy, remaining diffuse yet largely unbiased. Detection-based models achieve low bias through conservative priors that suppress pareidolia responses even when localization is controlled. These results show that behavior under ambiguity is governed more by representational choices than score thresholds, and that uncertainty and bias are decoupled: low uncertainty can signal either safe suppression, as in detectors, or extreme over-interpretation, as in VLMs. Pareidolia therefore provides a compact diagnostic and a source of ambiguity-aware hard negatives for probing and improving the semantic robustness of vision-language systems. Code will be released upon publication.
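The "semantic pull toward the Human concept" that the abstract describes can be illustrated with a toy zero-shot probe. The sketch below uses the standard CLIP-style scoring rule (softmax over cosine similarities between an image embedding and prompt embeddings); the embeddings themselves are random stand-ins, not real CLIP outputs, and the `human_pull` score is a hypothetical bias measure, not the paper's exact diagnostic.

```python
import numpy as np

def zero_shot_probs(img_emb, text_embs):
    """Softmax over cosine similarities: the usual CLIP-style zero-shot
    scoring rule (toy stand-in; no real CLIP model is used here)."""
    img = img_emb / np.linalg.norm(img_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                    # one cosine similarity per prompt
    e = np.exp(sims - sims.max())
    return e / e.sum()

# Hypothetical 64-d embeddings for prompts ["a human face", "an object"].
rng = np.random.default_rng(0)
human_prompt = rng.normal(size=64)
object_prompt = rng.normal(size=64)

# A pareidolia image whose embedding drifts toward the Human direction,
# mimicking the semantic overactivation the abstract reports for VLMs.
pareidolia_img = 0.6 * human_prompt + 0.4 * object_prompt

probs = zero_shot_probs(pareidolia_img, np.stack([human_prompt, object_prompt]))
human_pull = probs[0] - probs[1]        # > 0: the Human concept over-attracts
```

A detector-like model in this framing would instead assign low probability to both prompts; the point of the probe is that a confident `human_pull` and low uncertainty can coexist, matching the paper's decoupling claim.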
https://arxiv.org/abs/2603.03989
Vision disorders significantly impact millions of lives, altering how visual information is processed and perceived. In this work, a computational framework was developed using the BrokenEyes system to simulate five common eye disorders (age-related macular degeneration, cataract, glaucoma, refractive errors, and diabetic retinopathy) and to analyze their effects on neural-like feature representations in deep learning models. Leveraging a combination of human and non-human datasets, models trained under normal and disorder-specific conditions revealed critical disruptions in feature maps, particularly for cataract and glaucoma, which align with known neural processing challenges in these conditions. Evaluation metrics such as activation energy and cosine similarity quantified the severity of these distortions, providing insights into the interplay between degraded visual inputs and learned representations.
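The two evaluation metrics named in the abstract can be sketched directly; the exact metric definitions below (mean squared activation, flattened cosine similarity) are assumed forms, and the "degraded" feature map is a synthetic stand-in for a cataract-affected response.

```python
import numpy as np

def activation_energy(fmap):
    """Mean squared activation of a feature map (assumed metric form)."""
    return float(np.mean(fmap ** 2))

def feature_cosine(fmap_a, fmap_b):
    """Cosine similarity between two flattened feature maps."""
    a, b = fmap_a.ravel(), fmap_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy feature maps: a "normal" response and an attenuated, noisier one
# standing in for a cataract-degraded input (illustrative only).
rng = np.random.default_rng(1)
normal = rng.normal(size=(8, 8))
degraded = 0.5 * normal + 0.1 * rng.normal(size=(8, 8))

drop = activation_energy(normal) - activation_energy(degraded)  # energy loss
sim = feature_cosine(normal, degraded)                          # structural overlap
```

A large energy drop with still-high cosine similarity would indicate attenuation without structural disruption; both falling together would indicate the kind of critical feature-map disruption the abstract reports.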
https://arxiv.org/abs/2602.23212
Event cameras record luminance changes with microsecond resolution, but converting their sparse, asynchronous output into dense tensors that neural networks can exploit remains a core challenge. Conventional histograms or globally-decayed time-surface representations apply fixed temporal parameters across the entire image plane, which in practice creates a trade-off between preserving spatial structure during still periods and retaining sharp edges during rapid motion. We introduce Locally Adaptive Decay Surfaces (LADS), a family of event representations in which the temporal decay at each location is modulated according to local signal dynamics. Three strategies are explored, based on event rate, Laplacian-of-Gaussian response, and high-frequency spectral energy. These adaptive schemes preserve detail in quiescent regions while reducing blur in regions of dense activity. Extensive experiments on public datasets show that LADS consistently improves both face detection and facial landmark accuracy compared to standard non-adaptive representations. At 30 Hz, LADS achieves higher detection accuracy and lower landmark error than either baseline, and at 240 Hz it mitigates the accuracy decline typically observed at higher frequencies, sustaining 2.44% normalized mean error for landmarks and 0.966 mAP50 in face detection. These high-frequency results even surpass the accuracy reported in prior works operating at 30 Hz, setting new benchmarks for event-based face analysis. Moreover, by preserving spatial structure at the representation stage, LADS supports the use of much lighter network architectures while still retaining real-time performance. These results highlight the importance of context-aware temporal integration for neuromorphic vision and point toward real-time, high-frequency human-computer interaction systems that exploit the unique advantages of event cameras.
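The event-rate variant of the idea can be sketched as follows: an exponential-decay time surface whose per-pixel time constant shrinks where the local event rate is high (sharp edges under fast motion) and grows in quiescent regions (structure preserved). The decay range, the 3x3 box estimate of local rate, and the event format are all illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def adaptive_time_surface(events, shape, t_now, tau_min=1e-3, tau_max=5e-2):
    """Exponential-decay time surface with a locally adaptive time constant:
    high local event rate -> small tau (fast decay), quiet region -> large
    tau (slow decay). Parameter values are illustrative assumptions."""
    last_t = np.full(shape, -np.inf)     # timestamp of last event per pixel
    count = np.zeros(shape)
    for x, y, t in events:               # events: (x, y, timestamp) tuples
        last_t[y, x] = t
        count[y, x] += 1
    # Crude 3x3 box sum as a local event-rate estimate (stand-in for the
    # paper's event-rate strategy).
    pad = np.pad(count, 1)
    local = sum(pad[i:i + shape[0], j:j + shape[1]]
                for i in range(3) for j in range(3))
    rate = local / (local.max() + 1e-9)           # normalized to [0, 1]
    tau = tau_max - (tau_max - tau_min) * rate    # high rate -> fast decay
    surface = np.exp((last_t - t_now) / tau)
    surface[np.isinf(last_t)] = 0.0               # pixels with no events
    return surface

# A busy cluster near (2, 2) and a single quiet event at (5, 5).
events = [(2, 2, 0.00), (2, 2, 0.01), (2, 3, 0.012), (5, 5, 0.0)]
s = adaptive_time_surface(events, (8, 8), t_now=0.02)
```

Note the inversion relative to a global decay: the quiet pixel retains a strong value (its structure survives still periods), while the busy cluster decays quickly (its edges stay temporally sharp).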
https://arxiv.org/abs/2602.23101
Understanding human emotions from multimodal signals poses a significant challenge in affective computing and human-robot interaction. While multimodal large language models (MLLMs) have excelled in general vision-language tasks, their capabilities in emotional reasoning remain limited. The field currently suffers from a scarcity of large-scale datasets with high-quality, descriptive emotion annotations and lacks standardized benchmarks for evaluation. Our preliminary framework, Emotion-LLaMA, pioneered instruction-tuned multimodal learning for emotion reasoning but was restricted by explicit face detectors, implicit fusion strategies, and low-quality training data with limited scale. To address these limitations, we present Emotion-LLaMAv2 and the MMEVerse benchmark, establishing an end-to-end pipeline together with a standardized evaluation setting for emotion recognition and reasoning. Emotion-LLaMAv2 introduces three key advances. First, an end-to-end multiview encoder eliminates external face detection and captures nuanced emotional cues via richer spatial and temporal multiview tokens. Second, a Conv Attention pre-fusion module is designed to enable simultaneous local and global multimodal feature interactions external to the LLM backbone. Third, a perception-to-cognition curriculum instruction tuning scheme within the LLaMA2 backbone unifies emotion recognition and free-form emotion reasoning. To support large-scale training and reproducible evaluation, MMEVerse aggregates twelve publicly available emotion datasets, including IEMOCAP, MELD, DFEW, and MAFW, into a unified multimodal instruction format. The data are re-annotated via a multi-agent pipeline involving Qwen2-Audio, Qwen2.5-VL, and GPT-4o, producing 130k training clips and 36k testing clips across 18 evaluation benchmarks.
https://arxiv.org/abs/2601.16449
In the realm of Virtual Reality (VR) and Human-Computer Interaction (HCI), real-time emotion recognition shows promise for supporting individuals with Autism Spectrum Disorder (ASD) in improving social skills. This task requires a strict latency-accuracy trade-off, with motion-to-photon (MTP) latency kept below 140 ms to maintain contingency. However, most off-the-shelf Deep Learning models prioritize accuracy over the strict timing constraints of commodity hardware. As a first step toward accessible VR therapy, we benchmark State-of-the-Art (SOTA) models for Zero-Shot Facial Expression Recognition (FER) on virtual characters using the UIBVFED dataset. We evaluate Medium and Nano variants of YOLO (v8, v11, and v12) for face detection, alongside general-purpose Vision Transformers including CLIP, SigLIP, and this http URL. Our results on CPU-only inference demonstrate that while face detection on stylized avatars is robust (100% accuracy), a "Latency Wall" exists in the classification stage. The YOLOv11n architecture offers the optimal balance for detection (~54 ms). However, general-purpose Transformers like CLIP and SigLIP fail to achieve viable accuracy (<23%) or speed (>150 ms) for real-time loops. This study highlights the necessity for lightweight, domain-specific architectures to enable accessible, real-time AI in therapeutic settings.
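The latency-budget framing above can be checked with a generic timing harness: measure a stage's median wall-clock latency and compare it to the 140 ms MTP budget. The harness below is a hypothetical sketch (the stage function is a placeholder, not any of the benchmarked models), using only the standard library.

```python
import time

def stage_latency_ms(fn, inputs, warmup=2, runs=10):
    """Median wall-clock latency of a pipeline stage in milliseconds.
    A few warmup calls are discarded first so caches and JITs settle."""
    for x in inputs[:warmup]:
        fn(x)
    times = []
    for x in inputs[:runs]:
        t0 = time.perf_counter()
        fn(x)
        times.append((time.perf_counter() - t0) * 1000.0)
    times.sort()
    return times[len(times) // 2]

def dummy_stage(x):
    """Placeholder for a detector or classifier call (hypothetical)."""
    return sum(i * i for i in range(1000))

MTP_BUDGET_MS = 140.0                       # budget cited in the abstract
ms = stage_latency_ms(dummy_stage, list(range(12)))
within_budget = ms < MTP_BUDGET_MS
```

In the study's terms, YOLOv11n detection (~54 ms) would pass such a check, while a CLIP-class classifier (>150 ms on CPU) would hit the "Latency Wall" even before accuracy is considered.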
https://arxiv.org/abs/2601.15914
Glass surfaces, ubiquitous in both daily life and professional environments, present a potential threat to vision-based systems such as robot and drone navigation. To address this challenge, recent studies have shown significant interest in Video Glass Surface Detection (VGSD). We observe that objects in the reflection (or transmission) layer appear farther away than the glass surface itself. Consequently, in video motion scenarios, the notable reflected (or transmitted) objects on the glass surface move slower than objects in non-glass regions within the same spatial plane, and this motion inconsistency can effectively reveal the presence of glass surfaces. Based on this observation, we propose a novel network, named MVGD-Net, for detecting glass surfaces in videos by leveraging motion inconsistency cues. Our MVGD-Net features three novel modules: the Cross-scale Multimodal Fusion Module (CMFM), which integrates extracted spatial features and estimated optical flow maps, plus the History Guided Attention Module (HGAM) and Temporal Cross Attention Module (TCAM), both of which further enhance temporal features. A Temporal-Spatial Decoder (TSD) is also introduced to fuse the spatial and temporal features for generating the glass region mask. Furthermore, to train our network, we construct a large-scale dataset comprising 312 diverse glass scenarios with a total of 19,268 frames. Extensive experiments demonstrate that our MVGD-Net outperforms relevant state-of-the-art methods.
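The motion-inconsistency cue itself is simple to sketch: content seen through or reflected in glass moves slower than same-plane surroundings, so the deviation of optical-flow magnitude from its local neighborhood flags candidate glass pixels. The sketch below is only this cue, not the full MVGD-Net; the window size and toy flow field are illustrative assumptions.

```python
import numpy as np

def motion_inconsistency(flow_mag, window=3):
    """Per-pixel absolute deviation of optical-flow magnitude from its
    local neighborhood mean; large deviations mark motion-inconsistent
    (candidate glass) regions."""
    h, w = flow_mag.shape
    pad = window // 2
    padded = np.pad(flow_mag, pad, mode="edge")
    local_mean = np.zeros_like(flow_mag)
    for i in range(window):
        for j in range(window):
            local_mean += padded[i:i + h, j:j + w]
    local_mean /= window * window
    return np.abs(flow_mag - local_mean)

# Toy flow field: uniform camera-induced motion of 2 px/frame, with a
# slower-moving reflected patch (behind glass) in the center.
flow = np.full((10, 10), 2.0)
flow[4:7, 4:7] = 0.8
cue = motion_inconsistency(flow)
```

With this local formulation the cue fires at the boundary between the slow reflected patch and its surroundings, which is where the motion inconsistency is actually observable.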
https://arxiv.org/abs/2601.13715
The increasing adoption of smart classroom technologies in higher education has mainly focused on automating attendance, with limited attention given to students' emotional and cognitive engagement during lectures. This limits instructors' ability to identify disengagement and adapt teaching strategies in real time. This paper presents SCASED (Smart Classroom Attendance System with Emotion Detection), an IoT-based system that integrates automated attendance tracking with facial emotion recognition to support classroom engagement monitoring. The system uses a Raspberry Pi camera and OpenCV for face detection, and a finetuned MobileNetV2 model to classify four learning-related emotional states: engagement, boredom, confusion, and frustration. A session-based mechanism is implemented to manage attendance and emotion monitoring by recording attendance once per session and performing continuous emotion analysis thereafter. Attendance and emotion data are visualized through a cloud-based dashboard to provide instructors with insights into classroom dynamics. Experimental evaluation using the DAiSEE dataset achieved an emotion classification accuracy of 89.5%. The results show that integrating attendance data with emotion analytics can provide instructors with additional insight into classroom dynamics and support more responsive teaching practices.
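The session-based mechanism described above (attendance recorded once per session, emotion observations accumulated continuously) can be sketched as a small data structure. The class and method names below are hypothetical; only the four emotion states come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Session:
    """Minimal sketch of SCASED's session mechanism (API is assumed):
    attendance is idempotent per student per session, while emotion
    observations accumulate for the dashboard."""
    attendance: set = field(default_factory=set)
    emotions: Counter = field(default_factory=Counter)
    STATES = ("engagement", "boredom", "confusion", "frustration")

    def mark_attendance(self, student_id: str) -> bool:
        if student_id in self.attendance:
            return False                # already counted this session
        self.attendance.add(student_id)
        return True

    def log_emotion(self, state: str) -> None:
        if state not in self.STATES:
            raise ValueError(f"unknown state: {state}")
        self.emotions[state] += 1

s = Session()
first = s.mark_attendance("alice")      # recorded
repeat = s.mark_attendance("alice")     # ignored within the same session
for obs in ["engagement", "engagement", "confusion"]:
    s.log_emotion(obs)
```

The `emotions` counter is what a cloud dashboard of the kind described would aggregate per session and per class.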
https://arxiv.org/abs/2601.08049
Data scarcity and distribution shift pose major challenges for masked face detection and recognition. We propose a two-step generative data augmentation framework that combines rule-based mask warping with unpaired image-to-image translation using GANs, enabling the generation of realistic masked-face samples beyond purely synthetic transformations. Compared to rule-based warping alone, the proposed approach yields consistent qualitative improvements and complements existing GAN-based masked face generation methods such as IAMGAN. We introduce a non-mask preservation loss and stochastic noise injection to stabilize training and enhance sample diversity. Experimental observations highlight the effectiveness of the proposed components and suggest directions for future improvements in data-centric augmentation for face recognition tasks.
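The non-mask preservation loss can be sketched as an L1 penalty restricted to pixels outside the mask region, so the generator is discouraged from altering anything but the masked lower face. This is an assumed formulation of the idea, not necessarily the paper's exact loss.

```python
import numpy as np

def non_mask_preservation_loss(generated, source, mask_region):
    """L1 penalty on pixels outside the mask area only (assumed form of
    the non-mask preservation loss): edits inside the mask are free,
    edits elsewhere are penalized."""
    outside = ~mask_region
    return float(np.abs(generated[outside] - source[outside]).mean())

rng = np.random.default_rng(0)
src = rng.random((16, 16, 3))
mask = np.zeros((16, 16), dtype=bool)
mask[8:, :] = True                       # lower half: the synthetic mask

gen = src.copy()
gen[8:, :, :] = rng.random((8, 16, 3))   # edits confined to the mask region
loss = non_mask_preservation_loss(gen, src, mask)

gen_violating = gen.copy()
gen_violating[0, 0, 0] += 0.5            # an edit outside the mask
loss_violating = non_mask_preservation_loss(gen_violating, src, mask)
```

A generator that only warps the mask area incurs zero loss; any leakage into the surrounding face is penalized in proportion to its magnitude.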
https://arxiv.org/abs/2512.15774
Data augmentation is crucial for improving the robustness of face detection systems, especially under challenging conditions such as occlusion, illumination variation, and complex environments. Traditional copy-paste augmentation often produces unrealistic composites due to inaccurate foreground extraction, inconsistent scene geometry, and mismatched background semantics. To address these limitations, we propose Depth Copy Paste, a multimodal and depth-aware augmentation framework that generates diverse and physically consistent face detection training samples by copying full-body person instances and pasting them into semantically compatible scenes. Our approach first employs BLIP and CLIP to jointly assess semantic and visual coherence, enabling automatic retrieval of the most suitable background images for the given foreground person. To ensure high-quality foreground masks that preserve facial details, we integrate SAM3 for precise segmentation and Depth-Anything to extract only the non-occluded visible person regions, preventing corrupted facial textures from being used in augmentation. For geometric realism, we introduce a depth-guided sliding-window placement mechanism that searches over the background depth map to identify paste locations with optimal depth continuity and scale alignment. The resulting composites exhibit natural depth relationships and improved visual plausibility. Extensive experiments show that Depth Copy Paste provides more diverse and realistic training data, leading to significant performance improvements in downstream face detection tasks compared with traditional copy-paste and depth-free augmentation methods.
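The depth-guided placement search can be sketched as a sliding window over the background depth map, scoring each candidate by how closely its mean depth matches the foreground's depth and how flat (continuous) it is. The scoring weights and window size below are assumptions, a simplified stand-in for the paper's mechanism.

```python
import numpy as np

def best_paste_location(depth_map, fg_depth, win=4):
    """Slide a win x win window over the background depth map and pick
    the location minimizing |mean depth - foreground depth| + depth
    variance (equal weighting is an assumption)."""
    h, w = depth_map.shape
    best, best_score = None, np.inf
    for y in range(h - win + 1):
        for x in range(w - win + 1):
            patch = depth_map[y:y + win, x:x + win]
            score = abs(patch.mean() - fg_depth) + patch.var()
            if score < best_score:
                best, best_score = (y, x), score
    return best

# Toy background: far wall at depth 5, with a flat nearer region at depth 2.
depth = np.full((8, 8), 5.0)
depth[2:6, 2:6] = 2.0
loc = best_paste_location(depth, fg_depth=2.0)   # should land on the flat near region
```

A real implementation would additionally scale the pasted person by the ratio of foreground to background depth, which is what "scale alignment" refers to above.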
https://arxiv.org/abs/2512.11683
Ensuring the authenticity of video content remains challenging as DeepFake generation becomes increasingly realistic and robust against detection. Most existing detectors implicitly assume temporally consistent and clean facial sequences, an assumption that rarely holds in real-world scenarios where compression artifacts, occlusions, and adversarial attacks destabilize face detection and often lead to invalid or misdetected faces. To address these challenges, we propose a Laplacian-Regularized Graph Convolutional Network (LR-GCN) that robustly detects DeepFakes from noisy or unordered face sequences, while being trained only on clean facial data. Our method constructs an Order-Free Temporal Graph Embedding (OF-TGE) that organizes frame-wise CNN features into an adaptive sparse graph based on semantic affinities. Unlike traditional methods constrained by strict temporal continuity, OF-TGE captures intrinsic feature consistency across frames, making it resilient to shuffled, missing, or heavily corrupted inputs. We further impose a dual-level sparsity mechanism on both graph structure and node features to suppress the influence of invalid faces. Crucially, we introduce an explicit Graph Laplacian Spectral Prior that acts as a high-pass operator in the graph spectral domain, highlighting structural anomalies and forgery artifacts, which are then consolidated by a low-pass GCN aggregation. This sequential design effectively realizes a task-driven spectral band-pass mechanism that suppresses background information and random noise while preserving manipulation cues. Extensive experiments on FF++, Celeb-DFv2, and DFDC demonstrate that LR-GCN achieves state-of-the-art performance and significantly improved robustness under severe global and local disruptions, including missing faces, occlusions, and adversarially perturbed face detections.
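The role of the Graph Laplacian Spectral Prior as a high-pass operator can be illustrated on a toy frame graph: applying L = D - A maps a smooth (consistent) node signal to zero while amplifying nodes that deviate from their neighbors. This illustrates the spectral behavior only, not the LR-GCN layer itself.

```python
import numpy as np

def laplacian_filter(adjacency, features):
    """Apply the unnormalized graph Laplacian L = D - A to node features.
    L is a high-pass operator on the graph: constant signals map to zero,
    while nodes that disagree with their neighbors keep large responses."""
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    return laplacian @ features

# A 4-node path graph of frames; node 2 carries an anomalous feature value.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
smooth = np.ones(4)                       # mutually consistent frames
anomalous = np.array([1.0, 1.0, 4.0, 1.0])  # one tampered frame sticks out

out_smooth = laplacian_filter(A, smooth)     # ~zero everywhere
out_anom = laplacian_filter(A, anomalous)    # peaks at the anomalous node
```

Following this high-pass step with low-pass GCN aggregation, as the abstract describes, yields the band-pass behavior: structural anomalies are highlighted first, then consolidated while background and noise are smoothed away.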
https://arxiv.org/abs/2512.07498
This thesis presents novel contributions in two primary areas: advancing the efficiency of generative models, particularly normalizing flows, and applying generative models to solve real-world computer vision challenges. The first part introduces significant improvements to normalizing flow architectures through six key innovations: 1) development of invertible 3x3 convolution layers with mathematically proven necessary and sufficient conditions for invertibility; 2) introduction of a more efficient Quad-coupling layer; 3) design of a fast and efficient parallel inversion algorithm for kxk convolutional layers; 4) a fast and efficient backpropagation algorithm for the inverse of convolution; 5) Inverse-Flow, which uses the inverse of convolution for the forward pass and is trained with the proposed backpropagation algorithm; and 6) Affine-StableSR, a compact and efficient super-resolution model that leverages pre-trained weights and normalizing flow layers to reduce parameter count while maintaining performance. The second part presents: 1) an automated quality assessment system for agricultural produce using Conditional GANs to address class imbalance, data scarcity, and annotation challenges, achieving good accuracy in seed purity testing; 2) an unsupervised geological mapping framework utilizing stacked autoencoders for dimensionality reduction, showing improved feature extraction compared to conventional methods; 3) a privacy-preserving method for autonomous driving datasets based on face detection and image inpainting; 4) Stable Diffusion-based image inpainting for replacing detected faces and license plates, advancing privacy-preserving techniques and ethical considerations in the field; and 5) an adapted diffusion model for art restoration that effectively handles multiple types of degradation through unified fine-tuning.
https://arxiv.org/abs/2512.04039
Detecting deepfake images is crucial in combating misinformation. We present a lightweight, generalizable binary classification model based on EfficientNet-B6, fine-tuned with transformation techniques to address severe class imbalances. By leveraging robust preprocessing, oversampling, and optimization strategies, our model achieves high accuracy, stability, and generalization. While incorporating Fourier transform-based phase and amplitude features showed minimal impact, our proposed framework helps non-experts to effectively identify deepfake images, making significant strides toward accessible and reliable deepfake detection.
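The Fourier phase and amplitude features whose contribution the study found minimal are straightforward to extract; the sketch below shows that feature-extraction step only (not the EfficientNet-B6 classifier), with a synthetic grayscale image standing in for real data.

```python
import numpy as np

def fourier_features(gray_image):
    """Split an image's 2D FFT into centered amplitude and phase maps,
    the two frequency-domain features referenced in the abstract."""
    spectrum = np.fft.fft2(gray_image)
    shifted = np.fft.fftshift(spectrum)   # move DC component to the center
    amplitude = np.abs(shifted)
    phase = np.angle(shifted)
    return amplitude, phase

img = np.outer(np.hanning(32), np.hanning(32))   # toy grayscale image
amp, ph = fourier_features(img)
```

Such maps are typically stacked as extra input channels alongside the RGB image; the study's finding is that doing so changed accuracy very little for its model.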
https://arxiv.org/abs/2511.19187
Glass surfaces are ubiquitous in daily life, typically appearing colorless, transparent, and lacking distinctive features. These characteristics make glass surface detection a challenging computer vision task. Existing glass surface detection methods always rely on boundary cues (e.g., window and door frames) or reflection cues to locate glass surfaces, but they fail to fully exploit the intrinsic properties of the glass itself for accurate localization. We observed that in most real-world scenes, the illumination intensity in front of the glass surface differs from that behind it, which results in variations in the reflections visible on the glass surface. Specifically, when standing on the brighter side of the glass and applying a flash towards the darker side, existing reflections on the glass surface tend to disappear. Conversely, while standing on the darker side and applying a flash towards the brighter side, distinct reflections will appear on the glass surface. Based on this phenomenon, we propose NFGlassNet, a novel method for glass surface detection that leverages the reflection dynamics present in flash/no-flash imagery. Specifically, we propose a Reflection Contrast Mining Module (RCMM) for extracting reflections, and a Reflection Guided Attention Module (RGAM) for fusing features from reflection and glass surface for accurate glass surface detection. For learning our network, we also construct a dataset consisting of 3.3K no-flash and flash image pairs captured from various scenes with corresponding ground truth annotations. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods. Our code, model, and dataset will be available upon acceptance of the manuscript.
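The flash/no-flash observation reduces, in its crudest form, to a per-pixel contrast between the two captures: pixels whose reflections appear or vanish under flash respond strongly. The sketch below is only this crude cue with an arbitrary threshold, a stand-in for the learned Reflection Contrast Mining Module, on a synthetic pair.

```python
import numpy as np

def reflection_contrast(no_flash, flash):
    """Absolute per-pixel difference between no-flash and flash captures;
    regions where a reflection appears or disappears respond strongly."""
    return np.abs(no_flash.astype(float) - flash.astype(float))

# Synthetic pair: a wall at intensity 100, with a reflection visible only
# in the no-flash capture (the flash washes it out, per the observation).
scene = np.full((6, 6), 100.0)
no_flash = scene.copy()
no_flash[1:4, 1:4] = 160.0
flash = scene.copy()

contrast = reflection_contrast(no_flash, flash)
glass_mask = contrast > 30            # threshold value is arbitrary
```

The paper's network learns this mining rather than thresholding it, and fuses the extracted reflection features with glass-surface features via its attention module.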
https://arxiv.org/abs/2511.16887
This paper addresses data quality issues in multimodal emotion recognition in conversation (MERC) through systematic quality control and multi-stage transfer learning. We implement a quality control pipeline for MELD and IEMOCAP datasets that validates speaker identity, audio-text alignment, and face detection. We leverage transfer learning from speaker and face recognition, assuming that identity-discriminative embeddings capture not only stable acoustic and facial traits but also person-specific patterns of emotional expression. We employ RecoMadeEasy(R) engines for extracting 512-dimensional speaker and face embeddings, fine-tune MPNet-v2 for emotion-aware text representations, and adapt these features through emotion-specific MLPs trained on unimodal datasets. MAMBA-based trimodal fusion achieves 64.8% accuracy on MELD and 74.3% on IEMOCAP. These results show that combining identity-based audio and visual embeddings with emotion-tuned text representations on a quality-controlled subset of data yields consistent competitive performance for multimodal emotion recognition in conversation and provides a basis for further improvement on challenging, low-frequency emotion classes.
https://arxiv.org/abs/2511.14969
Understanding and mitigating flicker effects caused by rapid variations in light intensity is critical for enhancing the performance of event cameras in diverse environments. This paper introduces an innovative autonomous mechanism for tuning the biases of event cameras, effectively addressing flicker across a wide frequency range (25 Hz to 500 Hz). Unlike traditional methods that rely on additional hardware or software for flicker filtering, our approach leverages the event camera's inherent bias settings. Utilizing a simple Convolutional Neural Network (CNN), the system identifies instances of flicker in the spatial domain and dynamically adjusts specific biases to minimize its impact. The efficacy of this autobiasing system was robustly tested using a face detector framework under both well-lit and low-light conditions, as well as across various frequencies. The results demonstrated significant improvements: enhanced YOLO confidence metrics for face detection, and an increased percentage of frames capturing detected faces. Moreover, the average gradient, which serves as an indicator of flicker presence through edge detection, decreased by 38.2% in well-lit conditions and by 53.6% in low-light conditions. These findings underscore the potential of our approach to significantly improve the functionality of event cameras in a range of adverse lighting scenarios.
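The average-gradient indicator used to quantify flicker is easy to reproduce: flicker injects spurious intensity edges into accumulated frames, raising the mean gradient magnitude. The sketch below computes that statistic on a synthetic clean frame versus one with a striped flicker artifact; the exact gradient operator the paper uses is not specified, so `np.gradient` is an assumption.

```python
import numpy as np

def average_gradient(frame):
    """Mean gradient magnitude of a frame: an edge-based statistic that
    rises when flicker injects spurious intensity edges."""
    gy, gx = np.gradient(frame.astype(float))
    return float(np.mean(np.hypot(gx, gy)))

clean = np.full((16, 16), 50.0)                 # uniform, flicker-free frame
stripes = 20.0 * np.sin(np.arange(16) * np.pi / 2)
flicker = clean + stripes                        # column-striped artifact

g_clean = average_gradient(clean)       # zero for a uniform frame
g_flicker = average_gradient(flicker)   # strictly larger under flicker
```

The reported 38.2% and 53.6% reductions are drops in exactly this kind of statistic after the system retunes the camera biases.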
https://arxiv.org/abs/2511.02180
Face Recognition (FR) models have been shown to be vulnerable to adversarial examples that subtly alter benign facial images, both exposing blind spots in these systems and offering a means of protecting user privacy. End-to-end FR systems first obtain preprocessed faces from diverse facial imagery prior to computing the similarity of the deep feature embeddings. While face preprocessing is a critical component of FR systems, and hence of adversarial attacks against them, we observe that it is often overlooked in blackbox settings. Our study investigates the transferability of several out-of-the-box state-of-the-art adversarial attacks against FR when applied against different preprocessing techniques used in a blackbox setting. We observe that the choice of face detection model can degrade the attack success rate by up to 78%, whereas the choice of interpolation method during downsampling has relatively minimal impact. Furthermore, we find that the requirement for facial preprocessing degrades attack strength even in a whitebox setting, due to the unintended interaction of the produced noise vectors with face detection models. Based on these findings, we propose a preprocessing-invariant method using input transformations that improves the transferability of the studied attacks by up to 27%. Our findings highlight the importance of preprocessing in FR systems and the need to consider it when improving the adversarial generalisation of facial adversarial examples.
https://arxiv.org/abs/2510.17169
Extreme amodal detection is the task of inferring the 2D location of objects that are not fully visible in the input image but are visible within an expanded field-of-view. This differs from amodal detection, where the object is partially visible within the input image but occluded. In this paper, we consider the sub-problem of face detection, since this class provides motivating applications involving safety and privacy, but we do not tailor our method specifically to this class. Existing approaches rely on image sequences, so that missing detections may be interpolated from surrounding frames, or make use of generative models to sample possible completions. In contrast, we consider the single-image task and propose a more efficient, sample-free approach that makes use of contextual cues from the image to infer the presence of unseen faces. We design a heatmap-based extreme amodal object detector that addresses the problem of efficiently predicting a lot (the out-of-frame region) from a little (the image) with a selective coarse-to-fine decoder. Our method establishes strong results for this new task, even outperforming less efficient generative approaches.
https://arxiv.org/abs/2510.06791
In recent years, parametric representations of point clouds have been widely applied in tasks such as memory-efficient mapping and multi-robot collaboration. Highly adaptive models, such as spline surfaces or quadrics, are computationally expensive to detect or fit. In contrast, real-time methods, such as Gaussian mixture models or planes, have low degrees of freedom, making high accuracy with few primitives difficult. To tackle this problem, a multi-model parametric representation with real-time surface detection and fitting is proposed. Specifically, a Gaussian mixture model is first employed to segment the point cloud into multiple clusters. Then, flat clusters are selected and merged into planes or curved surfaces. Planes can be easily fitted and delimited by a 2D voxel-based boundary description method. Surfaces with curvature are fitted by B-spline surfaces, and the same boundary description method is employed. Through evaluations on multiple public datasets, the proposed surface detection exhibits greater robustness than the state-of-the-art approach, with a 3.78× improvement in efficiency. Meanwhile, this representation achieves a 2-fold gain in accuracy over Gaussian mixture models, operating at 36.4 fps on a low-power onboard computer.
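The plane-fitting step for a flat cluster can be sketched as a least-squares fit z = ax + by + c, with the RMS residual serving as a flatness score for deciding whether a cluster is planar. This is a simple sketch of that one step under assumed conventions, not the paper's full GMM-segmentation-plus-B-spline pipeline.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane z = a*x + b*y + c through Nx3 points; returns
    the coefficients and the RMS residual (used here as a flatness score)."""
    A = np.column_stack([points[:, 0], points[:, 1], np.ones(len(points))])
    coeffs, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    residual = points[:, 2] - A @ coeffs
    return coeffs, float(np.sqrt(np.mean(residual ** 2)))

# Synthetic near-planar cluster: z = 0.5x - 0.2y + 1 plus small noise.
rng = np.random.default_rng(0)
xy = rng.uniform(-1, 1, size=(200, 2))
z = 0.5 * xy[:, 0] - 0.2 * xy[:, 1] + 1.0 + rng.normal(0, 0.01, 200)
cloud = np.column_stack([xy, z])

(a, b, c), rms = fit_plane(cloud)   # rms small -> cluster accepted as a plane
```

A cluster whose RMS residual exceeds some tolerance would instead be routed to the B-spline surface fit, mirroring the plane-versus-curved-surface split described above.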
https://arxiv.org/abs/2509.14773
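The flat-cluster-to-plane step can be sketched with a standard least-squares fit. This is a minimal NumPy illustration under our own assumptions (the GMM segmentation stage is omitted, and the flatness score is our stand-in for the paper's cluster-selection criterion), not the paper's implementation:

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane fit: centroid plus the normal given by the
    covariance eigenvector with the smallest eigenvalue."""
    centroid = points.mean(axis=0)
    cov = np.cov((points - centroid).T)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    normal = eigvecs[:, 0]                   # smallest-variance direction
    flatness = eigvals[0] / eigvals.sum()    # near 0 for a flat cluster
    return centroid, normal, flatness

# Synthetic flat cluster: noisy samples of the z = 0 plane.
rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(-1, 1, 200),
                       rng.uniform(-1, 1, 200),
                       rng.normal(0.0, 0.01, 200)])
centroid, normal, flatness = fit_plane(pts)
```

For this cluster the recovered normal is (up to sign) the z-axis and the flatness score is close to zero, so the cluster would be selected and merged into a plane primitive.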
Fake news detection is an important and challenging task for defending online information integrity. Existing state-of-the-art approaches typically extract semantic clues from news content, such as writing patterns that include emotional words, stylistic features, etc. However, detectors tuned solely to such semantic clues can easily fall into surface detection patterns, which can shift rapidly in dynamic environments, leading to limited performance in the evolving news landscape. To address this issue, this paper investigates a novel perspective that incorporates news intent into fake news detection, bridging intents and semantics. The core insight is that by considering news intent, one can understand the underlying motivation behind news deception, rather than the surface patterns within words alone. To this end, we propose Graph-based Intent-Semantic Joint Modeling (InSide) for fake news detection, which models deception clues from both semantic and intent signals via graph-based joint learning. Specifically, InSide reformulates news semantic and intent signals into heterogeneous graph structures, enabling long-range context interaction through entity guidance and capturing both holistic and implementation-level intent via coarse-to-fine intent modeling. To better align semantics and intents, we further develop a dynamic pathway-based graph alignment strategy that establishes a common space for effective message passing and aggregation across these signals. Extensive experiments on four benchmark datasets demonstrate the superiority of InSide over state-of-the-art methods.
https://arxiv.org/abs/2509.01660
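The entity-guidance idea can be sketched in a few lines: sentences that mention the same entity are linked, yielding long-range context edges even between distant sentences. This is our toy construction for intuition only, not InSide's actual heterogeneous graph or alignment strategy:

```python
import itertools
from collections import defaultdict

def entity_guided_edges(sentence_entities):
    """Toy entity guidance: connect any two sentence nodes that mention
    the same entity, creating long-range edges in the semantic graph."""
    by_entity = defaultdict(list)
    for idx, entities in enumerate(sentence_entities):
        for entity in entities:
            by_entity[entity].append(idx)
    edges = set()
    for nodes in by_entity.values():
        # All pairs of sentences sharing this entity become neighbours.
        for a, b in itertools.combinations(nodes, 2):
            edges.add((a, b))
    return edges

# Sentences 0 and 3 both mention "ACME Corp", so they get a long-range edge
# even though sentences 1 and 2 sit between them.
sents = [{"ACME Corp"}, {"mayor"}, set(), {"ACME Corp", "mayor"}]
edges = entity_guided_edges(sents)
```

In a full model these edges would carry message passing between sentence representations; here they only demonstrate how entity guidance turns document structure into a graph.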
Sensitive Information Exposure (SIEx) vulnerabilities (CWE-200) remain a persistent and under-addressed threat across software systems, often leading to serious security breaches. Existing detection tools rarely target the diverse subcategories of CWE-200 or provide context-aware analysis of code-level data flows. Aims: This paper presents SIExVulTS, a novel vulnerability detection system that integrates transformer-based models with static analysis to identify and verify sensitive information exposure in Java applications. Method: SIExVulTS employs a three-stage architecture: (1) an Attack Surface Detection Engine that uses sentence embeddings to identify sensitive variables, strings, comments, and sinks; (2) an Exposure Analysis Engine that instantiates CodeQL queries aligned with the CWE-200 hierarchy; and (3) a Flow Verification Engine that leverages GraphCodeBERT to semantically validate source-to-sink flows. We evaluate SIExVulTS using three curated datasets: real-world CVEs, a benchmark set of synthetic CWE-200 examples, and labeled flows from 31 open-source projects. Results: The Attack Surface Detection Engine achieved an average F1 score greater than 93%, the Exposure Analysis Engine achieved an F1 score of 85.71%, and the Flow Verification Engine increased precision from 22.61% to 87.23%. Moreover, SIExVulTS uncovered six previously unknown CVEs in major Apache projects. Conclusions: The results demonstrate that SIExVulTS is effective and practical for improving software security against sensitive data exposure, addressing limitations of existing tools in detecting and verifying CWE-200 vulnerabilities.
https://arxiv.org/abs/2508.19472
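The first stage (flagging sensitive identifiers by embedding similarity) can be sketched with a toy similarity check. Here a character-bigram count vector stands in for the sentence-embedding model, and the seed list and threshold are our assumptions, not SIExVulTS's:

```python
import math
from collections import Counter

def embed(name):
    """Toy stand-in for a sentence embedding: character-bigram counts
    over a normalised identifier."""
    s = f"#{name.lower().replace('_', '')}#"
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def cosine(a, b):
    """Cosine similarity between two bigram-count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

# Hypothetical seed terms for the "sensitive" concept.
SENSITIVE_SEEDS = ["password", "secret", "apikey", "token"]

def sensitivity(name, threshold=0.5):
    """Flag an identifier whose embedding sits close to any seed term."""
    score = max(cosine(embed(name), embed(seed)) for seed in SENSITIVE_SEEDS)
    return score, score >= threshold

score_pw, flag_pw = sensitivity("user_password")    # near "password" -> flagged
score_ctr, flag_ctr = sensitivity("loop_counter")   # far from all seeds
```

A real system would use a learned embedding model, but the shape of the decision is the same: rank identifiers by their similarity to a sensitive-concept anchor and pass the flagged ones to downstream static analysis.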