Physion++: Evaluating Physical Scene Understanding that Requires Online Inference of Different Physical Properties

2023-06-27 17:59:33
Hsiao-Yu Tung, Mingyu Ding, Zhenfang Chen, Daniel Bear, Chuang Gan, Joshua B. Tenenbaum, Daniel LK Yamins, Judith E Fan, Kevin A. Smith

Abstract

General physical scene understanding requires more than simply localizing and recognizing objects -- it requires knowledge that objects can have different latent properties (e.g., mass or elasticity), and that those properties affect the outcome of physical events. While there has been great progress in physical and video prediction models in recent years, benchmarks to test their performance typically do not require an understanding that objects have individual physical properties, or at best test only those properties that are directly observable (e.g., size or color). This work proposes a novel dataset and benchmark, termed Physion++, that rigorously evaluates visual physical prediction in artificial systems under circumstances where those predictions rely on accurate estimates of the latent physical properties of objects in the scene. Specifically, we test scenarios where accurate prediction relies on estimates of properties such as mass, friction, elasticity, and deformability, and where the values of those properties can only be inferred by observing how objects move and interact with other objects or fluids. We evaluate the performance of a number of state-of-the-art prediction models that span a variety of levels of learning vs. built-in knowledge, and compare that performance to a set of human predictions. We find that models that have been trained using standard regimes and datasets do not spontaneously learn to make inferences about latent properties, but also that models that encode objectness and physical states tend to make better predictions. However, there is still a huge gap between all models and human performance, and all models' predictions correlate poorly with those made by humans, suggesting that no state-of-the-art model is learning to make physical predictions in a human-like way. Project page: this https URL

URL

https://arxiv.org/abs/2306.15668

PDF

https://arxiv.org/pdf/2306.15668.pdf

