Localization vs. Semantics: How Can Language Benefit Visual Representation Learning?

2022-12-01 05:00:18

Zhuowan Li (1), Cihang Xie (2), Benjamin Van Durme (1), Alan Yuille (1) ((1) Johns Hopkins University, (2) University of California, Santa Cruz)

arXiv_CV

Abstract

Despite the superior performance brought by vision-and-language pretraining, it remains unclear whether learning with multi-modal data can help understand each individual modality. In this work, we investigate how language can help with visual representation learning from a probing perspective. Specifically, we compare vision-and-language and vision-only models by probing their visual representations on a broad range of tasks, in order to assess the quality of the learned representations in a fine-grained manner. Interestingly, our probing results suggest that vision-and-language models are better at label prediction tasks like object and attribute prediction, while vision-only models are stronger at dense prediction tasks that require more localized information. With further analysis using detailed metrics, our study suggests that language helps vision models learn better semantics, but not localization. Code is released at this https URL.
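
To make the probing setup concrete, here is a minimal sketch of a linear probe: a frozen encoder's features are fed to a small linear classifier, and the held-out accuracy is read as a measure of how much label information those features carry. This is an illustration under stated assumptions, not the authors' protocol. The function names (`fit_linear_probe`, `probe_accuracy`, `synthetic_split`), the closed-form ridge fit, and the synthetic Gaussian features standing in for encoder outputs are all hypothetical choices made for this example.

```python
import numpy as np


def fit_linear_probe(features, labels, num_classes, l2=1e-3):
    """Fit a linear probe on frozen features via closed-form ridge regression.

    Assumption: regression onto one-hot targets stands in for whatever
    probe head a real study would train; the encoder is never updated.
    """
    # Append a constant column so the probe can learn a bias term.
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    Y = np.eye(num_classes)[labels]  # one-hot targets
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)


def probe_accuracy(W, features, labels):
    """Top-1 accuracy of the fitted probe on held-out features."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    preds = (X @ W).argmax(axis=1)
    return float((preds == labels).mean())


def synthetic_split(rng, n_train, n_test, dim, num_classes, signal):
    """Hypothetical stand-in for extracting features from a frozen encoder.

    Each class has a random center; `signal` scales how strongly the
    class identity is encoded relative to unit Gaussian noise.
    """
    centers = rng.normal(size=(num_classes, dim))
    n = n_train + n_test
    labels = rng.integers(0, num_classes, size=n)
    feats = signal * centers[labels] + rng.normal(size=(n, dim))
    return feats[:n_train], labels[:n_train], feats[n_train:], labels[n_train:]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic "encoders" that differ only in how much label signal
    # their features carry; neither corresponds to a real model.
    for name, signal in [("encoder_a", 3.0), ("encoder_b", 0.5)]:
        Xtr, ytr, Xte, yte = synthetic_split(rng, 400, 200, 16, 4, signal)
        W = fit_linear_probe(Xtr, ytr, num_classes=4)
        print(name, "probe accuracy:", probe_accuracy(W, Xte, yte))
```

Because the probe is deliberately simple, differences in its accuracy are attributed to the representations rather than to the classifier. A dense prediction probe would follow the same pattern but predict a label per spatial location instead of one label per image.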

URL

https://arxiv.org/abs/2212.00281

PDF

https://arxiv.org/pdf/2212.00281.pdf