Abstract
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT) which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed, input and modelling pipeline used by most computer vision models, and represents a promising direction for ViTs.
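The core mechanism named in the abstract is sequence packing: images of different resolutions produce different numbers of patch tokens, and several images are packed into one fixed-length sequence instead of being resized to a common shape. A minimal sketch of that idea follows, using a greedy first-fit strategy; the patch size, sequence length, and all function names here are illustrative assumptions, not taken from the paper's implementation.

```python
# Illustrative sketch of sequence packing for variable-resolution inputs.
# Assumptions (not from the paper's code): patch size 16, packed length 256,
# greedy first-fit placement. Attention masking per image is only noted
# in a comment, not implemented.

PATCH = 16      # assumed patch size
MAX_LEN = 256   # assumed packed-sequence length


def num_tokens(h, w, patch=PATCH):
    """Patch-token count for an h x w image (sides assumed divisible by patch)."""
    return (h // patch) * (w // patch)


def pack(images, max_len=MAX_LEN):
    """Greedily pack images (given as (h, w) pairs) into fixed-length sequences.

    Returns a list of (items, padding) pairs, where items is a list of
    (image_index, token_count) and padding fills the sequence to max_len.
    """
    seqs = []  # each entry: {"items": [(idx, n), ...], "used": int}
    for idx, (h, w) in enumerate(images):
        n = num_tokens(h, w)
        if n > max_len:
            raise ValueError("image produces more tokens than max_len")
        for seq in seqs:            # first-fit: reuse the first sequence with room
            if seq["used"] + n <= max_len:
                seq["items"].append((idx, n))
                seq["used"] += n
                break
        else:                       # no sequence has room: open a new one
            seqs.append({"items": [(idx, n)], "used": n})
    # The remainder of each sequence would be padding tokens; in training,
    # attention is masked so tokens only attend within their own image.
    return [(s["items"], max_len - s["used"]) for s in seqs]


packed = pack([(224, 224), (128, 256), (64, 64), (160, 96)])
# Token counts: 224x224 -> 196, 128x256 -> 128, 64x64 -> 16, 160x96 -> 60.
# First-fit packs images 0 and 2 together, then images 1 and 3.
```

This is only a packing sketch, not the model itself: the real training pipeline must also carry per-image position embeddings and attention masks so that packed images do not interact.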
URL
https://arxiv.org/abs/2307.06304