
Vision-Language Models Performing Zero-Shot Tasks Exhibit Gender-based Disparities

2023-01-26 13:44:31
Melissa Hall, Laura Gustafson, Aaron Adcock, Ishan Misra, Candace Ross

Abstract

We explore the extent to which zero-shot vision-language models exhibit gender bias for different vision tasks. Vision models traditionally required task-specific labels for representing concepts, as well as finetuning; zero-shot models like CLIP instead perform tasks with an open-vocabulary, meaning they do not need a fixed set of labels, by using text embeddings to represent concepts. With these capabilities in mind, we ask: Do vision-language models exhibit gender bias when performing zero-shot image classification, object detection and semantic segmentation? We evaluate different vision-language models with multiple datasets across a set of concepts and find (i) all models evaluated show distinct performance differences based on the perceived gender of the person co-occurring with a given concept in the image and that aggregating analyses over all concepts can mask these concerns; (ii) model calibration (i.e. the relationship between accuracy and confidence) also differs distinctly by perceived gender, even when evaluating on similar representations of concepts; and (iii) these observed disparities align with existing gender biases in word embeddings from language models. These findings suggest that, while language greatly expands the capability of vision tasks, it can also contribute to social biases in zero-shot vision settings. Furthermore, biases can further propagate when foundational models like CLIP are used by other models to enable zero-shot capabilities.
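The abstract's description of open-vocabulary prediction via text embeddings can be made concrete with a minimal zero-shot classification sketch. The snippet below uses the public CLIP checkpoint through the Hugging Face transformers API; the prompt strings, image path, and label set are illustrative assumptions and do not reflect the paper's evaluation protocol.

```python
# Minimal zero-shot image classification with CLIP (illustrative sketch, not the paper's setup).
# Candidate labels are plain text prompts, so no fixed label set or finetuning is required.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical open-vocabulary label set; any strings work because concepts are text embeddings.
prompts = ["a photo of a person playing guitar", "a photo of a person cooking"]
image = Image.open("example.jpg")  # placeholder path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores
probs = logits.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))
```

Because the label set is just text, any social bias encoded in the text embedding space can shift which prompt an image is matched to, which is the mechanism the paper probes.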
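Finding (ii) concerns calibration gaps across perceived-gender groups. A rough, assumed way to surface such a gap is to compare accuracy against mean confidence separately per group; the helper below is a sketch for illustration only and does not reproduce the paper's metrics, datasets, or annotations.

```python
import numpy as np

def per_group_report(is_correct, confidence, group):
    """Accuracy, mean confidence, and their gap per perceived-gender group (illustrative)."""
    is_correct = np.asarray(is_correct, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    group = np.asarray(group)
    report = {}
    for g in np.unique(group):
        mask = group == g
        acc = is_correct[mask].mean()
        conf = confidence[mask].mean()
        # A large confidence-minus-accuracy gap for one group but not another
        # indicates group-dependent miscalibration.
        report[str(g)] = {"accuracy": acc, "mean_confidence": conf, "gap": conf - acc}
    return report

# Example with made-up values: per-image correctness and top-label confidence,
# grouped by a hypothetical perceived-gender annotation.
print(per_group_report(
    is_correct=[1, 0, 1, 1, 0, 1],
    confidence=[0.9, 0.8, 0.7, 0.95, 0.9, 0.6],
    group=["feminine", "feminine", "feminine", "masculine", "masculine", "masculine"],
))
```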


URL

https://arxiv.org/abs/2301.11100

PDF

https://arxiv.org/pdf/2301.11100.pdf

