
MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding

2025-12-13 12:26:57
Benjamin Beilharz, Thomas S. A. Wallis

Abstract

While deep learning methods have achieved impressive success in many vision benchmarks, it remains difficult to understand and explain the representations and decisions of these models. Though vision models are typically trained on 2D inputs, they are often assumed to develop an implicit representation of the underlying 3D scene (for example, showing tolerance to partial occlusion, or the ability to reason about relative depth). Here, we introduce MRD (metamers rendered differentiably), an approach that uses physically based differentiable rendering to probe vision models' implicit understanding of generative 3D scene properties, by finding 3D scene parameters that are physically different but produce the same model activation (i.e. are model metamers). Unlike previous pixel-based methods for evaluating model representations, these reconstruction results are always grounded in physical scene descriptions. This means we can, for example, probe a model's sensitivity to object shape while holding material and lighting constant. As a proof-of-principle, we assess multiple models on their ability to recover scene parameters of geometry (shape) and bidirectional reflectance distribution function (material). The results show high similarity in model activation between target and optimized scenes, though the visual results vary. Qualitatively, these reconstructions help investigate the physical scene attributes to which models are sensitive or invariant. MRD holds promise for advancing our understanding of both computer and human vision by enabling analysis of how physical scene parameters drive changes in model responses.
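The metamer search the abstract describes can be pictured as gradient descent on scene parameters through a differentiable renderer, minimizing the distance between the target's and the candidate's model activations. The sketch below is a toy stand-in under loud assumptions: a linear "renderer" and a tiny tanh "model" replace the paper's actual physically based renderer and deep vision networks; only the optimization structure is analogous.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins (NOT the paper's components): a linear "renderer"
# mapping 3 scene parameters to 8 "pixels", and a tanh "vision model"
# mapping those pixels to a 2-D activation vector.
A = rng.normal(size=(8, 3))   # scene params (shape/material/light) -> pixels
W = rng.normal(size=(2, 8))   # pixels -> model activation

def render(theta):
    return A @ theta

def activation(theta):
    return W @ np.tanh(render(theta))

theta_target = rng.normal(size=3)       # the ground-truth scene
act_target = activation(theta_target)   # its model activation

theta = rng.normal(size=3)              # start from a different scene
init_loss = np.sum((activation(theta) - act_target) ** 2)

lr = 1e-3
for _ in range(30000):
    img = render(theta)
    resid = activation(theta) - act_target          # activation mismatch
    # Chain rule through model then renderer:
    # dL/dtheta = A^T [ sech^2(img) * (W^T (2 * resid)) ]
    grad = A.T @ ((1.0 - np.tanh(img) ** 2) * (W.T @ (2.0 * resid)))
    theta -= lr * grad

final_loss = np.sum((activation(theta) - act_target) ** 2)
param_gap = np.linalg.norm(theta - theta_target)    # physical difference
```

Because the toy activation is 2-D while the scene has 3 parameters, many physically distinct scenes share the same activation; the optimized `theta` therefore typically matches the target activation while remaining a different scene, which is the toy analogue of a model metamer.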


URL

https://arxiv.org/abs/2512.12307

PDF

https://arxiv.org/pdf/2512.12307.pdf

