Abstract
Face recognition applications have grown in parallel with the size of datasets, the complexity of deep learning models, and the available computational power. However, while deep learning models become more capable and computational power keeps increasing, the datasets they rely on are being retracted and removed from public access, largely over privacy and ethical concerns. Using generative artificial intelligence, researchers have worked toward fully synthetic datasets that can be used to train face recognition systems. Nonetheless, recent advances have not been sufficient to match the performance of state-of-the-art models trained on real data. To study the performance gap between models trained on real and synthetic datasets, we leverage a massive attribute classifier (MAC) to create annotations for four datasets: two real and two synthetic. From these annotations, we study the distribution of each attribute within all four datasets, and we further inspect the differences between real and synthetic datasets over the attribute set. Comparing the attribute distributions through the Kullback-Leibler divergence, we find clear differences between real and synthetic samples. Interestingly, we verify that while the real samples suffice to explain the synthetic distribution, the opposite does not hold.
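The asymmetry the abstract describes follows from the Kullback-Leibler divergence not being symmetric: KL(real ‖ synthetic) and KL(synthetic ‖ real) can differ substantially when one distribution covers the other's support but not vice versa. The following minimal sketch illustrates this on hypothetical attribute histograms; the specific counts and the `kl_divergence` helper are illustrative assumptions, not the paper's actual data or code.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same attribute bins.

    Counts are normalized to probabilities; eps guards against log(0).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical per-attribute histograms (e.g., class counts for one facial
# attribute in a real vs. a synthetic dataset) -- illustrative values only.
real_counts = [40, 35, 25]
synthetic_counts = [70, 20, 10]

kl_real_syn = kl_divergence(real_counts, synthetic_counts)
kl_syn_real = kl_divergence(synthetic_counts, real_counts)

print(f"KL(real || synthetic) = {kl_real_syn:.4f}")
print(f"KL(synthetic || real) = {kl_syn_real:.4f}")
```

The two directions generally yield different values, which is why comparing both can reveal that one dataset "explains" the other better than the reverse.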
URL
https://arxiv.org/abs/2404.15234