Abstract
Unlike human-engineered systems such as aeroplanes, where each component's role and dependencies are well understood, the inner workings of AI models remain largely opaque, hindering verifiability and undermining trust. This paper introduces SemanticLens, a universal explanation method for neural networks that maps hidden knowledge encoded by components (e.g., individual neurons) into the semantically structured, multimodal space of a foundation model such as CLIP. In this space, unique operations become possible, including (i) textual search to identify neurons encoding specific concepts, (ii) systematic analysis and comparison of model representations, (iii) automated labelling of neurons and explanation of their functional roles, and (iv) audits to validate decision-making against requirements. Fully scalable and operating without human input, SemanticLens is shown to be effective for debugging and validation, summarizing model knowledge, aligning reasoning with expectations (e.g., adherence to the ABCDE-rule in melanoma classification), and detecting components tied to spurious correlations and their associated training data. By enabling component-level understanding and validation, the proposed approach helps bridge the "trust gap" between AI models and traditional engineered systems. We provide code for SemanticLens on this https URL and a demo on this https URL.
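To make the core mechanism concrete, below is a minimal sketch, not the authors' implementation, of the idea the abstract describes: each neuron is represented in CLIP space by the embeddings of its most strongly activating example images, and a text query is embedded with CLIP's text encoder and matched by cosine similarity (operation (i)). The use of Hugging Face's `transformers` CLIP, the checkpoint name, and the helper functions `embed_neurons` / `search_neurons` are illustrative assumptions.

```python
# A minimal sketch, not the authors' implementation: the CLIP checkpoint and all
# helper names below are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def embed_neurons(examples_per_neuron: list[list[Image.Image]]) -> torch.Tensor:
    """Map each neuron into CLIP space by averaging the image embeddings of its
    most strongly activating examples (assumed to be collected beforehand)."""
    rows = []
    for images in examples_per_neuron:
        pixels = processor(images=images, return_tensors="pt")["pixel_values"].to(device)
        feats = clip.get_image_features(pixel_values=pixels)        # (k, d)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        rows.append(feats.mean(dim=0))                              # one vector per neuron
    vecs = torch.stack(rows)                                        # (num_neurons, d)
    return vecs / vecs.norm(dim=-1, keepdim=True)


@torch.no_grad()
def search_neurons(query: str, neuron_vecs: torch.Tensor, top_k: int = 5) -> list[int]:
    """Rank neurons by cosine similarity between their CLIP-space embedding and a
    text query (operation (i) in the abstract)."""
    tokens = processor(text=[query], return_tensors="pt", padding=True).to(device)
    text = clip.get_text_features(input_ids=tokens["input_ids"],
                                  attention_mask=tokens["attention_mask"])
    text = text / text.norm(dim=-1, keepdim=True)
    scores = neuron_vecs @ text.squeeze(0)                          # cosine similarities
    return scores.topk(min(top_k, scores.numel())).indices.tolist()
```

Under the same assumptions, automated labelling (operation (iii)) could follow the same pattern by scoring each neuron embedding against a candidate vocabulary of text prompts rather than a single query.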
URL
https://arxiv.org/abs/2501.05398