The evaluation of fairness in machine learning systems has become a central concern in high-stakes applications, including biometric recognition, healthcare decision-making, and automated risk assessment. Existing approaches typically rely on a small number of fairness metrics to assess model behaviour across group partitions, implicitly assuming that these metrics provide consistent and reliable conclusions. However, different fairness metrics capture distinct statistical properties of model performance and may therefore produce conflicting assessments when applied to the same system. In this work, we investigate the consistency of fairness evaluation by conducting a systematic multi-metric analysis of demographic bias in machine learning models. Using face recognition as a controlled experimental setting, we evaluate model performance across multiple group partitions under a range of commonly used fairness metrics, including error-rate disparities and performance-based measures. Our results demonstrate that fairness assessments can vary significantly depending on the choice of metrics, leading to contradictory conclusions regarding model bias. To quantify this phenomenon, we introduce the Fairness Disagreement Index (FDI), a measure designed to capture the degree of inconsistency across fairness metrics. We further show that disagreement remains high across thresholds and model configurations. These findings highlight a critical limitation in current fairness evaluation practices and suggest that single-metric reporting is insufficient for reliable bias assessment.
https://arxiv.org/abs/2604.15038
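The abstract does not give the FDI formula, but the idea of quantifying cross-metric disagreement can be sketched as follows. The metric names, group scores, and pairwise-disagreement definition below are illustrative assumptions, not the paper's actual definition.

```python
# Hypothetical sketch of a disagreement index across fairness metrics:
# count how often pairs of metrics flag different groups as worst off.
from itertools import combinations

def disadvantaged_group(per_group_scores, higher_is_worse=True):
    """Return the group a metric flags as worst off."""
    key = max if higher_is_worse else min
    return key(per_group_scores, key=per_group_scores.get)

def fairness_disagreement_index(metric_tables):
    """Fraction of metric pairs that disagree on the worst-off group.

    metric_tables: dict metric_name -> {group: score}, where a higher
    score means worse treatment under that metric.
    """
    verdicts = [disadvantaged_group(t) for t in metric_tables.values()]
    pairs = list(combinations(verdicts, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

# Example: three metrics, two demographic groups; the first metric
# blames group A while the other two blame group B.
metrics = {
    "false_match_rate_gap":     {"group_A": 0.020, "group_B": 0.005},
    "false_non_match_rate_gap": {"group_A": 0.010, "group_B": 0.030},
    "error_rate":               {"group_A": 0.015, "group_B": 0.018},
}
print(fairness_disagreement_index(metrics))  # 2 of 3 pairs disagree
```

A value of 0 would mean all metrics agree on which group is disadvantaged; values near 1 indicate the contradictory conclusions the abstract describes.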
Morphing is a challenge to face recognition (FR) for which several morphing attack detection solutions have been proposed. We argue that face recognition and differential morphing attack detection (D-MAD) in principle perform very similar tasks, which we support by comparing an FR system with two existing D-MAD approaches. We also show that currently used decision thresholds inherently lead to FR systems being vulnerable to morphing attacks, and that this explains the trade-off between performance on normal images and vulnerability to morphing attacks. We propose using FR systems that are already in place for morphing detection and introduce a new evaluation threshold that guarantees an upper limit to the vulnerability to morphing attacks, even of unknown types.
https://arxiv.org/abs/2604.14734
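The threshold trade-off described above can be illustrated on synthetic score distributions. The distributions, the 5% vulnerability cap, and the quantile-based thresholding below are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic similarity scores (illustration only, not from a real FR system).
mated     = rng.normal(0.75, 0.08, 2000)   # genuine comparisons
non_mated = rng.normal(0.20, 0.10, 2000)   # impostor comparisons
morphs    = rng.normal(0.55, 0.10, 2000)   # morph vs. contributing subject

def threshold_at_fmr(non_mated_scores, target_fmr):
    """Smallest threshold whose false match rate is <= target_fmr."""
    return float(np.quantile(non_mated_scores, 1.0 - target_fmr))

# Standard operating point: FMR = 0.1% on impostors, morphs largely accepted.
t_std = threshold_at_fmr(non_mated, 1e-3)
morph_accept_std = float(np.mean(morphs >= t_std))

# A stricter evaluation threshold that caps morph acceptance at 5%
# (a stand-in for the paper's guaranteed vulnerability upper limit).
t_strict = float(np.quantile(morphs, 0.95))
fnmr_strict = float(np.mean(mated < t_strict))

print(f"standard threshold {t_std:.3f}: morph acceptance {morph_accept_std:.1%}")
print(f"strict threshold {t_strict:.3f}: FNMR rises to {fnmr_strict:.1%}")
```

The sketch makes the paper's point visible: a threshold tuned only for impostor rejection can sit well below typical morph scores, so morphs pass; raising the threshold bounds morph acceptance at the cost of more false non-matches on normal images.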
Recent advances in deep learning underscore the need for systems that can not only acquire new knowledge through Continual Learning (CL) but also remove outdated, sensitive, or private information through Machine Unlearning (MU). However, while CL methods are well-developed, MU techniques remain in early stages, creating a critical gap for unified frameworks that depend on both capabilities. We find that naively combining existing CL and MU approaches results in knowledge leakage: a gradual degradation of foundational knowledge across repeated adaptation cycles. To address this, we formalize Continual Learning Unlearning (CLU) as a unified paradigm with three key goals: (i) precise deletion of unwanted knowledge, (ii) efficient integration of new knowledge while preserving prior information, and (iii) minimal knowledge leakage across cycles. We propose Bi-Directional Low-Rank Adaptation (BID-LoRA), a novel framework featuring three dedicated adapter pathways (retain, new, and unlearn) applied to attention layers, combined with escape unlearning that pushes forget-class embeddings to positions maximally distant from retained knowledge, while updating only 5% of parameters. Experiments on CIFAR-100 show that BID-LoRA outperforms CLU baselines across multiple adaptation cycles. We further evaluate on CASIA-Face100, a curated face recognition subset, demonstrating practical applicability to real-world identity management systems where new users must be enrolled and withdrawn users removed.
https://arxiv.org/abs/2604.12686
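A minimal sketch of dedicated low-rank adapter pathways on a frozen weight, assuming a plain linear layer in place of a real attention projection; the names, dimensions, and initialisation are illustrative, not BID-LoRA's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, rank = 16, 16, 2

# Frozen base projection (stands in for a pretrained attention weight).
W = rng.normal(0, 0.1, (d_out, d_in))

# Three dedicated low-rank pathways, mirroring the retain/new/unlearn split.
# Only A and B are trainable, so the update touches few parameters.
pathways = {
    name: {"A": rng.normal(0, 0.02, (rank, d_in)),
           "B": np.zeros((d_out, rank))}          # B = 0 -> no-op at init
    for name in ("retain", "new", "unlearn")
}

def forward(x, active=("retain", "new", "unlearn")):
    """Base layer plus the low-rank updates of the active pathways."""
    W_eff = W.copy()
    for name in active:
        p = pathways[name]
        W_eff += p["B"] @ p["A"]                  # rank-limited update
    return x @ W_eff.T

x = rng.normal(0, 1, (4, d_in))
# With B initialised to zero, every pathway starts as an exact no-op.
print(np.allclose(forward(x), x @ W.T))  # True
```

The zero-initialised B matrices make each pathway inert until training, which is the standard LoRA trick for starting exactly at the pretrained model.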
Face recognition embeddings encode identity, but they also encode other factors such as gender and ethnicity. Depending on how these factors are used by a downstream system, separating them from the information needed for verification is important for both privacy and fairness. We propose Variational Latent Entropy Estimation Disentanglement (VLEED), a post-hoc method that transforms pretrained embeddings with a variational autoencoder and encourages a distilled representation where the categorical variable of interest is separated from identity-relevant information. VLEED uses a mutual information-based objective realised through the estimation of the entropy of the categorical attribute in the latent space, and provides stable training with fine-grained control over information removal. We evaluate our method on IJB-C, RFW, and VGGFace2 for gender and ethnicity disentanglement, and compare it to various state-of-the-art methods. We report verification utility, predictability of the disentangled variable under linear and nonlinear classifiers, and group disparity metrics based on false match rates. Our results show that VLEED offers a wide range of privacy-utility tradeoffs over existing methods and can also reduce recognition bias across demographic groups.
https://arxiv.org/abs/2604.11250
Lightweight face recognition is increasingly important for deployment on edge and mobile devices, where strict constraints on latency, memory, and energy consumption must be met alongside reliable accuracy. Although recent hybrid CNN-Transformer architectures have advanced global context modeling, striking an effective balance between recognition performance and computational efficiency remains an open challenge. In this work, we present FaceLiVTv2, an improved version of our FaceLiVT hybrid architecture designed for efficient global-local feature interaction in mobile face recognition. At its core is Lite MHLA, a lightweight global token interaction module that replaces the original multi-layer attention design with multi-head linear token projections and affine rescale transformations, reducing redundancy while preserving representational diversity across heads. We further integrate Lite MHLA into a unified RepMix block that coordinates local and global feature interactions and adopts global depthwise convolution for adaptive spatial aggregation in the embedding stage. Under our experimental setup, results on LFW, CA-LFW, CP-LFW, CFP-FP, AgeDB-30, and IJB show that FaceLiVTv2 consistently improves the accuracy-efficiency trade-off over existing lightweight methods. Notably, FaceLiVTv2 reduces mobile inference latency by 22% relative to FaceLiVTv1, achieves speedups of up to 30.8% over GhostFaceNets on mobile devices, and delivers 20-41% latency improvements over EdgeFace and KANFace across platforms while maintaining higher recognition accuracy. These results demonstrate that FaceLiVTv2 offers a practical and deployable solution for real-time face recognition. Code is available at this https URL.
https://arxiv.org/abs/2604.09127
Face Anti-Spoofing (FAS) algorithms, designed to secure face recognition systems against spoofing, struggle with limited dataset diversity, impairing their ability to handle unseen visual domains and spoofing methods. We introduce the Pattern Conversion Generative Adversarial Network (PCGAN) to enhance domain generalization in FAS. PCGAN effectively disentangles latent vectors for spoof artifacts and facial features, allowing it to generate images with diverse artifacts. We further incorporate patch-based and multi-task learning to address partial attacks and overfitting to facial features. Our extensive experiments validate PCGAN's effectiveness in domain generalization and in detecting partial attacks, yielding a substantial improvement in face recognition security.
https://arxiv.org/abs/2604.09018
Automated face recognition has made rapid strides over the past decade due to the unprecedented rise of deep neural network (DNN) models that can be trained for domain-specific tasks. At the same time, foundation models that are pretrained on broad vision or vision-language tasks have shown impressive generalization across diverse domains, including biometrics. This raises an important question: Do different DNN models--both domain-specific and foundation models--encode facial identity in similar ways, despite being trained on different datasets, loss functions, and architectures? In this regard, we directly analyze the geometric structure of embedding spaces induced by different DNN models. Treating embeddings of face images as point clouds, we study whether simple affine transformations can align face representations of one model with another. Our findings reveal surprising cross-model compatibility: low-capacity linear mappings substantially improve cross-model face recognition over unaligned baselines for both face identification and verification tasks. Alignment patterns generalize across datasets and vary systematically across model families, indicating representational convergence in facial identity encoding. These findings have implications for model interoperability, ensemble design, and biometric template security.
https://arxiv.org/abs/2604.07282
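The affine-alignment finding can be reproduced in miniature with synthetic embeddings. Here model B's embeddings are simulated as a noisy linear transform of model A's embeddings for the same faces, which is an assumption for illustration; the fitting step (least squares on a training split) matches the abstract's "low-capacity linear mappings" in spirit only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 32

# Synthetic stand-in: model B's embeddings are an unknown linear transform
# of model A's embeddings for the same faces, plus noise.
emb_a = rng.normal(0, 1, (n, d))
M_true = rng.normal(0, 1, (d, d))
emb_b = emb_a @ M_true + rng.normal(0, 0.05, (n, d))

# Fit a linear map on a training split via least squares.
train, test = slice(0, 400), slice(400, n)
M_hat, *_ = np.linalg.lstsq(emb_a[train], emb_b[train], rcond=None)

def mean_cosine(u, v):
    """Average cosine similarity between corresponding rows of u and v."""
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return float(np.mean(np.sum(u * v, axis=1)))

# Held-out comparison: unaligned embeddings barely agree; aligned ones do.
before = mean_cosine(emb_a[test], emb_b[test])
after = mean_cosine(emb_a[test] @ M_hat, emb_b[test])
print(f"cross-model cosine: {before:.3f} unaligned -> {after:.3f} aligned")
```

With real face models the gap would be smaller and the map only approximate, but the mechanics (fit a linear map on paired embeddings, evaluate on held-out identities) are the same.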
Event cameras offer a promising sensing modality for face recognition due to their inherent advantages in illumination robustness and privacy-friendliness. However, because event streams lack the stable photometric appearance relied upon by conventional RGB-based face recognition systems, we argue that event-based face recognition should model structure-driven spatiotemporal identity representations shaped by rigid facial motion and individual facial geometry. Since dedicated datasets for event-based face recognition remain lacking, we construct EFace, a small-scale event-based face dataset captured under rigid facial motion. To learn effectively from this limited event data, we further propose EventFace, a framework for event-based face recognition that integrates spatial structure and temporal dynamics for identity modeling. Specifically, we employ Low-Rank Adaptation (LoRA) to transfer structural facial priors from pretrained RGB face models to the event domain, thereby establishing a reliable spatial basis for identity modeling. Building on this foundation, we further introduce a Motion Prompt Encoder (MPE) to explicitly encode temporal features and a Spatiotemporal Modulator (STM) to fuse them with spatial features, thereby enhancing the representation of identity-relevant event patterns. Extensive experiments demonstrate that EventFace achieves the best performance among the evaluated baselines, with a Rank-1 identification rate of 94.19% and an equal error rate (EER) of 5.35%. Results further indicate that EventFace exhibits stronger robustness under degraded illumination than the competing methods. In addition, the learned representations exhibit reduced template reconstructability.
https://arxiv.org/abs/2604.06782
Personalized text-to-image diffusion models (e.g., DreamBooth, LoRA) enable users to synthesize high-fidelity avatars from a few reference photos for social expression. However, once these generations are shared on social media platforms (e.g., Instagram, Facebook), they can be linked to the real user via face recognition systems, enabling identity tracking and profiling. Existing defenses mainly follow an anti-personalization strategy that protects publicly released reference photos by disrupting model fine-tuning. While effective against unauthorized personalization, they do not address another practical setting in which personalization is authorized, but the resulting public outputs still leak identity information. To address this problem, we introduce a new defense setting, termed model-side output immunization, whose goal is to produce a personalized model that supports authorized personalization while reducing the identity linkability of public generations, with tunable control over the privacy-utility trade-off to accommodate diverse privacy needs. To this end, we propose Identity-Decoupled personalized Diffusion Models (IDDM), a model-side defense that integrates identity decoupling into the personalization pipeline. Concretely, IDDM follows an alternating procedure that interleaves short personalization updates with identity-decoupled data optimization, using a two-stage schedule to balance identity linkability suppression and generation utility. Extensive experiments across multiple datasets, diverse prompts, and state-of-the-art face recognition systems show that IDDM consistently reduces identity linkability while preserving high-quality personalized generation.
https://arxiv.org/abs/2604.00903
Multimodal Large Language Models (MLLMs) have recently been explored as face verification systems that determine whether two face images are of the same person. Unlike dedicated face recognition systems, MLLMs approach this task through visual prompting and rely on general visual and reasoning abilities. However, the demographic fairness of these models remains largely unexplored. In this paper, we present a benchmarking study that evaluates nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups. We measure verification accuracy with the Equal Error Rate and True Match Rate at multiple operating points per demographic group, and we quantify demographic disparity with four FMR-based fairness metrics. Our results show that FaceLLM-8B, the only face-specialised model in our study, substantially outperforms general-purpose MLLMs on both benchmarks. The bias patterns we observe differ from those commonly reported for traditional face recognition, with different groups being most affected depending on the benchmark and the model. We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.
https://arxiv.org/abs/2603.25613
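A sketch of FMR-based per-group evaluation at a fixed global threshold, using synthetic scores. The group names, score distributions, and the two disparity summaries below are illustrative stand-ins for the paper's four FMR-based fairness metrics.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic per-group verification scores (illustration, not real MLLM output).
groups = {
    "G1": {"genuine": rng.normal(0.80, 0.10, 1000),
           "impostor": rng.normal(0.30, 0.10, 1000)},
    "G2": {"genuine": rng.normal(0.70, 0.12, 1000),
           "impostor": rng.normal(0.35, 0.12, 1000)},
}

# One global operating point fixed on the pooled impostor scores (FMR = 1%).
pooled_imp = np.concatenate([g["impostor"] for g in groups.values()])
threshold = float(np.quantile(pooled_imp, 0.99))

# Per-group error rates at that shared threshold.
fmr = {name: float(np.mean(g["impostor"] >= threshold))
       for name, g in groups.items()}
fnmr = {name: float(np.mean(g["genuine"] < threshold))
        for name, g in groups.items()}

# Two common FMR-based disparity summaries: max/min ratio and absolute gap.
max_min_ratio = max(fmr.values()) / max(min(fmr.values()), 1e-6)
max_abs_gap = max(fmr.values()) - min(fmr.values())
print(f"per-group FMR at global threshold: {fmr}")
print(f"disparity ratio {max_min_ratio:.2f}, gap {max_abs_gap:.4f}")
```

The shared-threshold setup is what makes the abstract's last observation possible: a uniformly bad model can show small gaps (and thus look "fair") even though every group's error rate is high.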
While the rapid development of facial recognition algorithms has enabled numerous beneficial applications, their widespread deployment has raised significant concerns about the risks of mass surveillance and threats to individual privacy. In this paper, we introduce \textit{Adversarial Camouflage} as a novel solution for protecting users' privacy. This approach is designed to be efficient and simple to reproduce for users in the physical world. The algorithm starts by defining a low-dimensional pattern space parameterized by color, shape, and angle. Optimized patterns, once found, are projected onto semantically valid facial regions for evaluation. Our method maximizes recognition error across multiple architectures, ensuring high cross-model transferability even against black-box systems. It significantly degrades the performance of all tested state-of-the-art face recognition models during simulations and demonstrates promising results in real-world human experiments, while revealing differences in model robustness and evidence of attack transferability across architectures.
https://arxiv.org/abs/2603.21867
Multimodal Large Language Models (MLLMs) have recently been proposed as a means to generate natural-language explanations for face recognition decisions. While such explanations facilitate human interpretability, their reliability on unconstrained face images remains underexplored. In this work, we systematically analyze MLLM-generated explanations for the unconstrained face verification task on the challenging IJB-S dataset, with a particular focus on extreme pose variation and surveillance imagery. Our results show that even when MLLMs produce correct verification decisions, the accompanying explanations frequently rely on non-verifiable or hallucinated facial attributes that are not supported by visual evidence. We further study the effect of incorporating information from traditional face recognition systems, viz., scores and decisions, alongside the input images. Although such information improves categorical verification performance, it does not consistently lead to faithful explanations. To evaluate the explanations beyond decision accuracy, we introduce a likelihood-ratio-based framework that measures the evidential strength of textual explanations. Our findings highlight fundamental limitations of current MLLMs for explainable face recognition and underscore the need for a principled evaluation of reliable and trustworthy explanations in biometric applications. Code is available at this https URL.
https://arxiv.org/abs/2603.16629
Despite recent advances in face recognition, robust performance remains challenging under large variations in age, pose, and occlusion. A common strategy to address these issues is to guide representation learning with auxiliary supervision from facial attributes, encouraging the visual encoder to focus on identity-relevant regions. However, existing approaches typically rely on heterogeneous and fixed sets of attributes, implicitly assuming equal relevance across attributes. This assumption is suboptimal, as different attributes exhibit varying discriminative power for identity recognition, and some may even introduce harmful biases. In this paper, we propose an attribute-aware face recognition architecture that supervises the learning of facial embeddings using identity class labels, identity-relevant facial attributes, and non-identity-related attributes. Facial attributes are organized into interpretable groups, making it possible to decompose and analyze their individual contributions in a human-understandable manner. Experiments on standard face verification benchmarks demonstrate that joint learning of identity and facial attributes improves the discriminability of face embeddings, with two major conclusions: (i) using identity-relevant subsets of facial attributes consistently outperforms supervision with a broader attribute set, and (ii) explicitly forcing embeddings to unlearn non-identity-related attributes yields further performance gains compared to leaving such attributes unsupervised. Additionally, our method serves as a diagnostic tool for assessing the trustworthiness of face recognition encoders: it measures the accuracy gained when non-identity-relevant attributes are suppressed, and such gains suggest shortcut learning from redundant attributes associated with each identity.
https://arxiv.org/abs/2603.15062
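The joint objective described above might be sketched as a weighted loss combining the three supervision signals. The weights, the plain negative-cross-entropy "unlearning" term (real implementations typically use gradient reversal on the encoder so the attribute head still trains normally), and all names below are assumptions, not the paper's formulation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    p = softmax(logits)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))

def attribute_aware_loss(id_logits, id_y, attr_logits, attr_y,
                         nonid_logits, nonid_y, w_attr=0.3, w_unlearn=0.1):
    """Identity loss, plus identity-relevant attribute supervision,
    minus a term that rewards failing on non-identity attributes
    (a crude stand-in for unlearning them; weights are illustrative)."""
    return (cross_entropy(id_logits, id_y)
            + w_attr * cross_entropy(attr_logits, attr_y)
            - w_unlearn * cross_entropy(nonid_logits, nonid_y))

# Tiny demo with illustrative logits; chance-level non-identity logits
# correspond to an embedding that carries no such attribute information.
y2 = np.array([0, 1])
id_logits = np.array([[4.0, 0.0], [0.0, 4.0]])
attr_logits = np.array([[3.0, 0.0], [0.0, 3.0]])
nonid_logits = np.zeros((2, 3))
loss = attribute_aware_loss(id_logits, y2, attr_logits, y2,
                            nonid_logits, np.array([0, 1]))
print(f"combined loss: {loss:.3f}")
```

The signs capture the paper's two conclusions: identity-relevant attributes add supervision, while non-identity attributes are actively pushed out of the embedding rather than merely left unsupervised.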
Machine unlearning (MU) addresses privacy risks in pretrained models. The main goal of MU is to remove the influence of designated data while preserving the utility of retained knowledge. Achieving this goal requires preserving semantic relations among retained instances, which existing studies often overlook. We observe that without such preservation, models suffer from progressive structural collapse, undermining the deletion-retention balance. In this work, we propose a novel structure-faithful framework that introduces stakes, i.e., semantic anchors that serve as reference points to maintain the knowledge structure. By leveraging these anchors, our framework captures and stabilizes the semantic organization of knowledge. Specifically, we instantiate the anchors from language-driven attribute descriptions encoded by a semantic encoder (e.g., CLIP). We enforce preservation of the knowledge structure via structure-aware alignment and regularization: the former aligns the organization of retained knowledge before and after unlearning around anchors, while the latter regulates updates to structure-critical parameters. Results from image classification, retrieval, and face recognition show average gains of 32.9%, 22.5%, and 19.3% in performance, balancing the deletion-retention trade-off and enhancing generalization.
https://arxiv.org/abs/2603.12915
Generative AI systems increasingly expose powerful reasoning and image refinement capabilities through user-facing chatbot interfaces. In this work, we show that the naïve exposure of such capabilities fundamentally undermines modern deepfake detectors. Rather than proposing a new image manipulation technique, we study a realistic and already-deployed usage scenario in which an adversary uses only benign, policy-compliant prompts and commercial generative AI systems. We demonstrate that state-of-the-art deepfake detection methods fail under semantic-preserving image refinement. Specifically, we show that generative AI systems articulate explicit authenticity criteria and inadvertently externalize them through unrestricted reasoning, enabling their direct reuse as refinement objectives. As a result, refined images simultaneously evade detection, preserve identity as verified by commercial face recognition APIs, and exhibit substantially higher perceptual quality. Importantly, we find that widely accessible commercial chatbot services pose a significantly greater security risk than open-source models, as their superior realism, semantic controllability, and low-barrier interfaces enable effective evasion by non-expert users. Our findings reveal a structural mismatch between the threat models assumed by current detection frameworks and the actual capabilities of real-world generative AI. While detection baselines are largely shaped by prior benchmarks, deployed systems expose unrestricted authenticity reasoning and refinement despite stringent safety controls in other domains.
https://arxiv.org/abs/2603.10504
Local Differential Privacy (LDP) is the gold standard trust model for privacy-preserving machine learning by guaranteeing privacy at the data source. However, its application to image data has long been considered impractical due to the high dimensionality of pixel space. Canonical LDP mechanisms are designed for low-dimensional data, resulting in severe utility degradation when applied to high-dimensional pixel spaces. This paper demonstrates that this utility loss is not inherent to LDP, but from its application to an inappropriate data representation. We introduce LDP-Slicing, a lightweight, training-free framework that resolves this domain mismatch. Our key insight is to decompose pixel values into a sequence of binary bit-planes. This transformation allows us to apply the LDP mechanism directly to the bit-level representation. To further strengthen privacy and preserve utility, we integrate a perceptual obfuscation module that mitigates human-perceivable leakage and an optimization-based privacy budget allocation strategy. This pipeline satisfies rigorous pixel-level $\varepsilon$-LDP while producing images that retain high utility for downstream tasks. Extensive experiments on face recognition and image classification demonstrate that LDP-Slicing outperforms existing DP/LDP baselines under comparable privacy budgets, with negligible computational overhead.
https://arxiv.org/abs/2603.03711
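The bit-plane mechanism can be sketched with per-bit randomized response. The uniform budget split below stands in for the paper's optimization-based allocation, and the parameters are illustrative; the perceptual obfuscation module is omitted.

```python
import numpy as np

rng = np.random.default_rng(4)

def bitplane_randomized_response(img, eps_total):
    """Pixel-level LDP via per-bit randomized response on bit-planes.

    Splits the budget uniformly over the 8 bit-planes (a simplification
    of the paper's optimized allocation). Each bit is kept with
    probability e^eps / (1 + e^eps), the standard randomized-response
    rate that satisfies eps-LDP for a single bit.
    """
    eps_bit = eps_total / 8.0
    p_keep = np.exp(eps_bit) / (1.0 + np.exp(eps_bit))
    out = np.zeros(img.shape, dtype=np.uint8)
    for k in range(8):
        bits = ((img >> k) & 1).astype(np.uint8)          # k-th bit-plane
        flip = (rng.random(img.shape) > p_keep).astype(np.uint8)
        out += ((bits ^ flip) << np.uint8(k)).astype(np.uint8)
    return out

img = rng.integers(0, 256, (8, 8), dtype=np.uint8)
noisy = bitplane_randomized_response(img, eps_total=16.0)
print(np.mean(img.astype(int) != noisy.astype(int)))  # fraction of changed pixels
```

This is the "domain mismatch" fix in miniature: instead of adding continuous noise across a 256-level pixel domain, each binary plane gets a mechanism designed for binary data, and the perturbed planes are reassembled into a valid image.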
Face recognition remains vulnerable to presentation attacks, calling for robust Face Anti-Spoofing (FAS) solutions. Recent MLLM-based FAS methods reformulate the binary classification task as the generation of brief textual descriptions to improve cross-domain generalization. However, their generalizability is still limited, as such descriptions mainly capture intuitive semantic cues (e.g., mask contours) while struggling to perceive fine-grained visual patterns. To address this limitation, we incorporate external visual tools into MLLMs to encourage deeper investigation of subtle spoof clues. Specifically, we propose the Tool-Augmented Reasoning FAS (TAR-FAS) framework, which reformulates the FAS task as a Chain-of-Thought with Visual Tools (CoT-VT) paradigm, allowing MLLMs to begin with intuitive observations and adaptively invoke external visual tools for fine-grained investigation. To this end, we design a tool-augmented data annotation pipeline and construct the ToolFAS-16K dataset, which contains multi-turn tool-use reasoning trajectories. Furthermore, we introduce a tool-aware FAS training pipeline, where Diverse-Tool Group Relative Policy Optimization (DT-GRPO) enables the model to autonomously learn efficient tool use. Extensive experiments under a challenging one-to-eleven cross-domain protocol demonstrate that TAR-FAS achieves SOTA performance while providing fine-grained visual investigation for trustworthy spoof detection.
https://arxiv.org/abs/2603.01038
With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and a fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external "tool" to establish Factual Groundings--precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM's reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.
https://arxiv.org/abs/2602.23228
Identical twin face verification represents an extreme fine-grained recognition challenge where even state-of-the-art systems fail due to overwhelming genetic similarity. Current face recognition methods achieve over 99.8% accuracy on standard benchmarks but drop dramatically to 88.9% when distinguishing identical twins, exposing critical vulnerabilities in biometric security systems. The difficulty lies in learning features that capture subtle, non-genetic variations that uniquely identify individuals. We propose the Asymmetric Hierarchical Attention Network (AHAN), a novel architecture specifically designed for this challenge through multi-granularity facial analysis. AHAN introduces a Hierarchical Cross-Attention (HCA) module that performs multi-scale analysis on semantic facial regions, enabling specialized processing at optimal resolutions. We further propose a Facial Asymmetry Attention Module (FAAM) that learns unique biometric signatures by computing cross-attention between left and right facial halves, capturing subtle asymmetric patterns that differ even between twins. To ensure the network learns truly individuating features, we introduce Twin-Aware Pair-Wise Cross-Attention (TA-PWCA), a training-only regularization strategy that uses each subject's own twin as the hardest possible distractor. Extensive experiments on the ND_TWIN dataset demonstrate that AHAN achieves 92.3% twin verification accuracy, representing a 3.4% improvement over state-of-the-art methods.
https://arxiv.org/abs/2602.21503
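The left-right comparison in FAAM can be illustrated with plain cross-attention between mirrored half-face token grids. The projection-free single-head attention and the residual "asymmetry" score below are simplifications for illustration, not the paper's module.

```python
import numpy as np

rng = np.random.default_rng(5)

def cross_attention(q_tokens, kv_tokens):
    """Single-head scaled dot-product cross-attention (no learned weights,
    for illustration; FAAM adds projections and multiple heads)."""
    d = q_tokens.shape[-1]
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ kv_tokens

# Toy feature map: a 4 x 8 grid of 16-d tokens for one face.
feat = rng.normal(0, 1, (4, 8, 16))
left = feat[:, :4, :].reshape(-1, 16)
right = feat[:, 4:, :][:, ::-1, :].reshape(-1, 16)  # mirror columns to align

# Attend the left half onto the mirrored right half; tokens that cannot be
# reconstructed from the other side carry the asymmetric cues.
attended = cross_attention(left, right)
asymmetry = np.linalg.norm(left - attended, axis=-1)
print(asymmetry.shape)  # one score per left-half token
```

For identical twins, globally similar geometry makes most of these residuals small, which is why a module dedicated to the remaining asymmetric patterns can add discriminative signal.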
Search and rescue (SAR) operations require rapid responses to save lives or property. Unmanned Aerial Vehicles (UAVs) equipped with vision-based systems support these missions through prior terrain investigation or real-time assistance during the mission itself. Vision-based UAV frameworks aid human search tasks by detecting and recognizing specific individuals, then tracking and following them while maintaining a safe distance. A key safety requirement for UAV following is the accurate estimation of the distance between camera and target object under real-world conditions, achieved by fusing multiple image modalities. UAVs with deep learning-based vision systems offer a new approach to the planning and execution of SAR operations. As part of a system for automatic people detection and face recognition using deep learning, in this paper we present the fusion of depth camera measurements and monocular camera-to-body distance estimation for robust tracking and following. Deep learning-based filtering of depth camera data and estimation of camera-to-body distance from a monocular camera are achieved with YOLO-pose, enabling real-time fusion of depth information using the Extended Kalman Filter (EKF) algorithm. The proposed subsystem, designed for use in drones, estimates and measures the distance between the depth camera and the human body keypoints in order to maintain a safe distance between the drone and the human target. Our system provides accurate distance estimates, validated against motion-capture ground-truth data. The system has been tested in real time indoors, where it reduces the average error, root mean square error (RMSE), and standard deviation of distance estimation by up to 15.3% in three tested scenarios.
https://arxiv.org/abs/2602.20958
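The depth/monocular fusion can be sketched with a 1-D Kalman filter; the paper uses an EKF over body keypoints, so the scalar state, random-walk motion model, and noise variances below are illustrative assumptions.

```python
import numpy as np

def kalman_fuse(depth_meas, mono_meas, r_depth=0.05**2, r_mono=0.15**2,
                q=0.01, x0=2.0, p0=1.0):
    """1-D constant-position Kalman filter fusing two distance sensors.

    State: camera-to-body distance; both sensors observe it directly.
    The noisier monocular estimate automatically receives less weight
    through its larger measurement variance. All variances illustrative.
    """
    x, p = x0, p0
    out = []
    for z_d, z_m in zip(depth_meas, mono_meas):
        p += q                                   # predict (random-walk model)
        for z, r in ((z_d, r_depth), (z_m, r_mono)):
            k = p / (p + r)                      # Kalman gain for this sensor
            x += k * (z - x)                     # sequential measurement update
            p *= (1 - k)
        out.append(x)
    return np.array(out)

rng = np.random.default_rng(6)
true_d = 2.5
depth = true_d + rng.normal(0, 0.05, 50)   # simulated depth-camera readings
mono = true_d + rng.normal(0, 0.15, 50)    # simulated monocular estimates
fused = kalman_fuse(depth, mono)
print(f"final fused distance: {fused[-1]:.2f} m")
```

Processing the two measurements sequentially per time step is equivalent to stacking them into one vector update, and keeps the sketch readable; the real EKF additionally linearises the keypoint-to-distance observation model.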