Interpreting the Second-Order Effects of Neurons in CLIP

2024-06-06 17:59:52
Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt

Abstract

We interpret the function of individual neurons in CLIP by automatically describing them using text. Analyzing the direct effects (i.e. the flow from a neuron through the residual stream to the output) or the indirect effects (overall contribution) fails to capture the neurons' function in CLIP. Therefore, we present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output. We find that these effects are highly selective: for each neuron, the effect is significant for <2% of the images. Moreover, each effect can be approximated by a single direction in the text-image space of CLIP. We describe neurons by decomposing these directions into sparse sets of text representations. The sets reveal polysemantic behavior - each neuron corresponds to multiple, often unrelated, concepts (e.g. ships and cars). Exploiting this neuron polysemy, we mass-produce "semantic" adversarial examples by generating images with concepts spuriously correlated to the incorrect class. Additionally, we use the second-order effects for zero-shot segmentation and attribute discovery in images. Our results indicate that a scalable understanding of neurons can be used for model deception and for introducing new model capabilities.
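The decomposition step the abstract describes lends itself to a compact illustration: given a neuron's second-order effect direction, greedily select the few text embeddings that best explain it. Below is a minimal matching-pursuit sketch, assuming direction and caption_embeddings were precomputed with a CLIP text encoder; the function name, the greedy scheme, and the caption pool are illustrative assumptions, not the paper's exact algorithm.

import torch

@torch.no_grad()
def sparse_text_decomposition(direction, caption_embeddings, k=5):
    # Greedy matching pursuit: pick k caption embeddings that best
    # reconstruct a neuron's second-order effect direction.
    # direction:          (d,)   effect direction in CLIP's shared space
    # caption_embeddings: (n, d) CLIP text embeddings of a caption pool
    direction = direction / direction.norm()
    pool = caption_embeddings / caption_embeddings.norm(dim=-1, keepdim=True)
    residual, chosen = direction.clone(), []
    for _ in range(k):
        scores = pool @ residual            # cosine match to the residual
        idx = int(scores.abs().argmax())    # best-matching caption
        chosen.append(idx)
        # project out the explained component, then look for the next concept
        residual = residual - (residual @ pool[idx]) * pool[idx]
    return chosen

# e.g.: captions = ["a photo of a ship", "a photo of a car", ...]
# top = sparse_text_decomposition(direction, caption_embeddings)
# print([captions[i] for i in top])

A neuron whose top captions span unrelated concepts (ships and cars, in the abstract's example) is exactly the polysemantic case the authors exploit to mass-produce semantic adversarial examples.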


URL

https://arxiv.org/abs/2406.04341

PDF

https://arxiv.org/pdf/2406.04341.pdf

