Abstract
Real-world surveillance often renders faces and license plates unrecognizable in individual low-resolution (LR) frames, hindering reliable identification. To advance temporal recognition models, we present FANVID, a novel video-based benchmark comprising 1,463 LR clips (180 × 320, 20-60 FPS) featuring 63 identities and 49 license plates from three English-speaking countries. Each video includes distractor faces and plates, increasing task difficulty and realism. The dataset contains 31,096 manually verified bounding boxes and labels. FANVID defines two tasks: (1) face matching, which requires detecting LR faces and matching them to high-resolution mugshots, and (2) license plate recognition, which requires extracting text from LR plates without a predefined database. Videos are downsampled from high-resolution sources so that faces and text are indecipherable in single frames, requiring models to exploit temporal information. We introduce evaluation metrics adapted from mean Average Precision at IoU > 0.5, prioritizing identity correctness for faces and character-level accuracy for text. A baseline method combining pre-trained video super-resolution, detection, and recognition achieved performance scores of 0.58 (face matching) and 0.42 (plate recognition), highlighting both the feasibility and the challenge of the tasks. FANVID's selection of faces and plates balances diversity with recognition difficulty. We release the software for data access, evaluation, baseline, and annotation to support reproducibility and extension. FANVID aims to catalyze innovation in temporal modeling for LR recognition, with applications in surveillance, forensics, and autonomous vehicles.
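The abstract describes the plate-recognition metric only at a high level: a detection counts when its IoU with the ground-truth box exceeds 0.5, and the text is then judged at character level. The following is a minimal sketch of one way such a score could be computed for a single prediction. It is not the FANVID evaluation code: the function names, the `(x1, y1, x2, y2)` box format, and the use of normalized edit distance as the character-level measure are all assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def char_accuracy(pred, truth):
    """Assumed character-level measure: 1 - normalized edit distance."""
    if not pred and not truth:
        return 1.0
    return 1.0 - edit_distance(pred, truth) / max(len(pred), len(truth))


def score_plate(pred_box, pred_text, gt_box, gt_text, iou_threshold=0.5):
    """Hypothetical per-prediction score: zero unless the box overlaps
    the ground truth with IoU above the threshold, otherwise the
    character-level accuracy of the predicted text."""
    if iou(pred_box, gt_box) <= iou_threshold:
        return 0.0
    return char_accuracy(pred_text, gt_text)
```

For example, `score_plate((0, 0, 10, 10), "ABC128", (0, 0, 10, 10), "ABC123")` gives a score of 5/6, since the box matches exactly and one of six characters is wrong, while shifting the predicted box to `(5, 0, 15, 10)` drops the IoU to 1/3 and the score to zero. A full benchmark metric would additionally aggregate such scores over ranked detections and frames, which this sketch does not attempt.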
URL
https://arxiv.org/abs/2506.07304