Abstract
Formula-driven supervised learning (FDSL) has been shown to be an effective method for pre-training vision transformers, where ExFractalDB-21k was shown to exceed the pre-training effect of ImageNet-21k. These studies also indicate that contours matter more than textures when pre-training vision transformers. However, the lack of a systematic investigation into why these contour-oriented synthetic datasets can achieve the same accuracy as real datasets leaves much room for skepticism. In the present work, we develop a novel methodology based on circular harmonics for systematically investigating the design space of contour-oriented synthetic datasets. This allows us to efficiently search the optimal range of FDSL parameters and maximize the variety of synthetic images in the dataset, which we found to be a critical factor. When the resulting new dataset VisualAtom-21k is used for pre-training ViT-Base, the top-1 accuracy reaches 83.7% when fine-tuning on ImageNet-1k. This is close to the top-1 accuracy (84.2%) achieved by JFT-300M pre-training, while using only 1/14 as many images. Unlike JFT-300M, which is a static dataset, the quality of synthetic datasets will continue to improve, and the current work is a testament to this possibility. FDSL is also free of the common issues associated with real images, e.g., privacy/copyright issues, labeling costs/errors, and ethical biases.
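To make the "circular harmonics" idea concrete: a closed contour can be parameterized as a base circle whose radius is modulated by a sum of sinusoids of different frequencies, amplitudes, and phases. The sketch below is an illustrative assumption of this construction, not the authors' actual VisualAtom generator; the function name and parameterization are hypothetical.

```python
import math

def circular_harmonic_contour(n_points=256,
                              harmonics=((2, 0.3, 0.0), (5, 0.15, 1.0))):
    """Sample (x, y) points of a closed contour: a unit circle whose
    radius is perturbed by superposed circular harmonics.

    harmonics: iterable of (frequency, amplitude, phase) triples.
    Varying these triples is one way to span a large design space of
    contour shapes, which is the kind of variety the paper maximizes.
    """
    points = []
    for i in range(n_points):
        theta = 2 * math.pi * i / n_points
        # Radius = 1 plus a sum of sinusoidal perturbations at
        # integer frequencies (so the contour closes smoothly).
        r = 1.0 + sum(a * math.sin(f * theta + p) for f, a, p in harmonics)
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

contour = circular_harmonic_contour()
print(len(contour))  # 256
```

Sweeping the frequency/amplitude ranges of the harmonics (and drawing the resulting contours onto a canvas) would yield a family of synthetic images indexed purely by formula parameters, which is the systematic search the abstract describes.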
URL
https://arxiv.org/abs/2303.01112