
FANVID: A Benchmark for Face and License Plate Recognition in Low-Resolution Videos

2025-06-08 22:22:00
Kavitha Viswanathan, Vrinda Goel, Shlesh Gholap, Devayan Ghosh, Madhav Gupta, Dhruvi Ganatra, Sanket Potdar, Amit Sethi

Abstract

Real-world surveillance often renders faces and license plates unrecognizable in individual low-resolution (LR) frames, hindering reliable identification. To advance temporal recognition models, we present FANVID, a novel video-based benchmark comprising nearly 1,463 LR clips (180 x 320, 20-60 FPS) featuring 63 identities and 49 license plates from three English-speaking countries. Each video includes distractor faces and plates, increasing task difficulty and realism. The dataset contains 31,096 manually verified bounding boxes and labels. FANVID defines two tasks: (1) face matching: detecting LR faces and matching them to high-resolution mugshots, and (2) license plate recognition: extracting text from LR plates without a predefined database. Videos are downsampled from high-resolution sources to ensure that faces and text are indecipherable in single frames, requiring models to exploit temporal information. We introduce evaluation metrics adapted from mean Average Precision at IoU > 0.5, prioritizing identity correctness for faces and character-level accuracy for text. A baseline method with pre-trained video super-resolution, detection, and recognition achieved performance scores of 0.58 (face matching) and 0.42 (plate recognition), highlighting both the feasibility and challenge of the tasks. FANVID's selection of faces and plates balances diversity with recognition challenge. We release the software for data access, evaluation, baseline, and annotation to support reproducibility and extension. FANVID aims to catalyze innovation in temporal modeling for LR recognition, with applications in surveillance, forensics, and autonomous vehicles.
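
The abstract describes its metrics only at a high level (IoU > 0.5, identity correctness for faces, character-level accuracy for text). The sketch below is one plausible way to compute those ingredients and is not FANVID's released evaluation code. The function names, the (x1, y1, x2, y2) box format, and the use of edit distance for character-level accuracy are all assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_face_match(pred_box, pred_id, gt_box, gt_id, iou_threshold=0.5):
    """Hypothetical rule: a detection counts only if the box overlaps the
    ground truth with IoU above the threshold AND the identity is correct."""
    return iou(pred_box, gt_box) > iou_threshold and pred_id == gt_id


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def char_accuracy(pred_text, gt_text):
    """Assumed character-level score: 1 - normalized edit distance, floored at 0."""
    if not gt_text:
        return 1.0 if not pred_text else 0.0
    return max(0.0, 1.0 - edit_distance(pred_text, gt_text) / len(gt_text))
```

For example, `is_face_match((0, 0, 10, 8), "id_07", (0, 0, 10, 10), "id_07")` accepts the detection because the boxes overlap with IoU 0.8 and the identities agree, while a plate read as `"ABC128"` against a ground truth of `"ABC123"` is one substitution away, so it gets partial credit rather than zero.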

Abstract (translated)

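Real-world surveillance systems often cannot recognize faces and license plates in a single low-resolution (LR) frame, which hinders reliable identification of individuals. To advance temporal recognition models, we present FANVID, a new video-based benchmark comprising nearly 1,463 low-resolution clips (180 x 320 pixels, 20 to 60 frames per second) covering 63 identities and 49 license plates from three English-speaking countries. Each video also includes distractor faces and plates, which increases task difficulty and realism. The dataset contains 31,096 manually verified bounding boxes and labels. FANVID defines two tasks: (1) face matching: detecting faces in low-resolution images and matching them to high-resolution photographs; (2) license plate recognition: extracting text from low-resolution plates without a predefined database. The videos are downsampled from high-resolution sources so that faces and text are unrecognizable in single frames, forcing models to exploit temporal information. We also introduce evaluation metrics based on average precision (IoU > 0.5) that emphasize identity correctness for faces and character-level accuracy for text. A baseline built from pre-trained video super-resolution, detection, and recognition methods scored 0.58 on face matching and 0.42 on license plate recognition, which shows both the feasibility and the difficulty of these tasks. FANVID's selection of faces and plates balances diversity against recognition difficulty. We release software for data access, evaluation, the baseline, and annotation to support reproducibility and extension. FANVID aims to spur innovation in temporal modeling for low-resolution recognition, with potential applications in surveillance, forensic analysis, and autonomous vehicles.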

URL

https://arxiv.org/abs/2506.07304

PDF

https://arxiv.org/pdf/2506.07304.pdf

