Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves

2023-03-02 09:47:28
Sora Takashima, Ryo Hayamizu, Nakamasa Inoue, Hirokatsu Kataoka, Rio Yokota

Abstract

Formula-driven supervised learning (FDSL) has been shown to be an effective method for pre-training vision transformers, where ExFractalDB-21k was shown to exceed the pre-training effect of ImageNet-21k. These studies also indicate that contours matter more than textures when pre-training vision transformers. However, the lack of a systematic investigation as to why these contour-oriented synthetic datasets can achieve the same accuracy as real datasets leaves much room for skepticism. In the present work, we develop a novel methodology based on circular harmonics for systematically investigating the design space of contour-oriented synthetic datasets. This allows us to efficiently search for the optimal range of FDSL parameters and maximize the variety of synthetic images in the dataset, which we found to be a critical factor. When the resulting new dataset, VisualAtom-21k, is used for pre-training ViT-Base, the top-1 accuracy reaches 83.7% when fine-tuning on ImageNet-1k. This is close to the top-1 accuracy (84.2%) achieved by JFT-300M pre-training, while using only 1/14 as many images. Unlike JFT-300M, which is a static dataset, the quality of synthetic datasets will continue to improve, and the current work is a testament to this possibility. FDSL is also free of the common issues associated with real images, e.g. privacy/copyright issues, labeling costs/errors, and ethical biases.
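
To give a feel for what a contour built from sinusoidal waves could look like, here is a minimal sketch. It is an illustration only and not the generation procedure from the paper: the function name `wave_contour`, its parameters, and the specific form r(theta) = base_radius + sum of a * sin(n * theta + phase) are assumptions chosen for this example. Integer frequencies n keep the contour closed, since r(theta) then repeats every 2*pi.

```python
import math

def wave_contour(n_points=200, base_radius=1.0,
                 waves=((3, 0.2, 0.0), (7, 0.1, 0.5))):
    """Sample (x, y) points on a closed contour.

    Hypothetical illustration, not the paper's formula: the radius is a
    circle of size base_radius perturbed by a sum of sinusoids. Each entry
    of `waves` is (integer frequency, amplitude, phase).
    """
    points = []
    for i in range(n_points):
        theta = 2.0 * math.pi * i / n_points
        # Perturb the base circle with each sinusoidal component.
        r = base_radius + sum(a * math.sin(n * theta + p) for n, a, p in waves)
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

if __name__ == "__main__":
    pts = wave_contour()
    print(len(pts), pts[0])
```

Varying the frequencies, amplitudes, and phases yields differently shaped contours, which is one simple way to picture how a formula can produce a large variety of synthetic shapes without any real images.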

URL

https://arxiv.org/abs/2303.01112

PDF

https://arxiv.org/pdf/2303.01112.pdf

