Abstract
We explore the extent to which zero-shot vision-language models exhibit gender bias across different vision tasks. Vision models have traditionally required task-specific labels and finetuning to represent concepts; zero-shot models such as CLIP instead operate with an open vocabulary, representing concepts through text embeddings rather than a fixed label set. With these capabilities in mind, we ask: do vision-language models exhibit gender bias when performing zero-shot image classification, object detection, and semantic segmentation? We evaluate several vision-language models on multiple datasets across a set of concepts and find that (i) every model evaluated shows distinct performance differences based on the perceived gender of the person co-occurring with a given concept in the image, and that aggregating analyses over all concepts can mask these disparities; (ii) model calibration (i.e., the relationship between accuracy and confidence) also differs distinctly by perceived gender, even when evaluating on similar representations of concepts; and (iii) these observed disparities align with existing gender biases in word embeddings from language models. These findings suggest that, while language greatly expands the capability of vision tasks, it can also contribute to social biases in zero-shot vision settings. Furthermore, such biases can propagate further when foundation models like CLIP are used by other models to enable zero-shot capabilities.
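To make the zero-shot mechanism concrete, here is a minimal sketch of open-vocabulary image classification with CLIP text embeddings. It assumes the openai/clip-vit-base-patch32 checkpoint accessed via Hugging Face transformers; the label strings and image path are illustrative placeholders, not details from the paper.

```python
# Minimal sketch of zero-shot classification with CLIP text embeddings.
# Checkpoint, labels, and image path are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Open vocabulary: any strings can serve as class labels; no fixed label set.
labels = ["a photo of a stethoscope", "a photo of a briefcase", "a photo of a purse"]
image = Image.open("example.jpg")  # hypothetical input image

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Scaled cosine similarities between the image embedding and each text embedding.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
pred = probs.argmax().item()
print(f"prediction: {labels[pred]} (confidence {probs[pred]:.3f})")
```

Because the "classes" are just text, the same model can be probed with arbitrary concepts, which is what enables per-concept comparisons by perceived gender. For the calibration analysis in point (ii), a standard way to quantify the accuracy-confidence relationship is expected calibration error (ECE); the sketch below uses a common equal-width binning scheme, which is our assumption rather than the paper's stated metric.

```python
# Sketch of expected calibration error (ECE): the weighted average gap
# between mean confidence and accuracy within confidence bins.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Computing ECE separately on subsets grouped by perceived gender would
# surface calibration disparities of the kind the abstract reports.
```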
URL
https://arxiv.org/abs/2301.11100