Physion++: Evaluating Physical Scene Understanding that Requires Online Inference of Different Physical Properties

Abstract

General physical scene understanding requires more than simply localizing and recognizing objects -- it requires knowledge that objects can have different latent properties (e.g., mass or elasticity), and that those properties affect the outcome of physical events. While there has been great progress in physical and video prediction models in recent years, benchmarks to test their performance typically do not require an understanding that objects have individual physical properties, or at best test only those properties that are directly observable (e.g., size or color). This work proposes a novel dataset and benchmark, termed Physion++, that rigorously evaluates visual physical prediction in artificial systems under circumstances where those predictions rely on accurate estimates of the latent physical properties of objects in the scene. Specifically, we test scenarios where accurate prediction relies on estimates of properties such as mass, friction, elasticity, and deformability, and where the values of those properties can only be inferred by observing how objects move and interact with other objects or fluids. We evaluate the performance of a number of state-of-the-art prediction models that span a variety of levels of learning vs. built-in knowledge, and compare that performance to a set of human predictions. We find that models that have been trained using standard regimes and datasets do not spontaneously learn to make inferences about latent properties, but also that models that encode objectness and physical states tend to make better predictions. However, there is still a huge gap between all models and human performance, and all models' predictions correlate poorly with those made by humans, suggesting that no state-of-the-art model is learning to make physical predictions in a human-like way. Project page: this https URL
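
The abstract describes two kinds of comparison: how often a model's predictions are correct, and how closely they track human predictions. The sketch below illustrates both under simplifying assumptions. It is not the benchmark's evaluation code. It assumes each trial has a binary ground-truth outcome, a model probability, and a fraction of human participants who predicted a positive outcome. The function names, the 0.5 threshold, and all data values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def accuracy(predictions, labels):
    """Fraction of binary predictions that match the ground-truth labels."""
    matches = sum(int(p == y) for p, y in zip(predictions, labels))
    return matches / len(labels)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-trial values (made up for this example, not from the paper).
labels = [1, 0, 0, 1]                  # ground-truth outcome of each trial
model_probs = [0.9, 0.2, 0.6, 0.4]     # model's predicted probability of a positive outcome
human_rates = [0.8, 0.1, 0.3, 0.7]     # fraction of participants predicting a positive outcome

# Threshold the model probabilities at 0.5 (an assumed decision rule).
model_preds = [int(p >= 0.5) for p in model_probs]

model_accuracy = accuracy(model_preds, labels)
model_human_corr = pearson(model_probs, human_rates)

print(f"model accuracy: {model_accuracy:.2f}")
print(f"model-human correlation: {model_human_corr:.2f}")
```

Keeping the two measures separate matters for the claim in the abstract: a model can reach reasonable accuracy while still disagreeing with humans about which individual trials are hard, which shows up as a low correlation.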

URL

https://arxiv.org/abs/2306.15668

PDF

https://arxiv.org/pdf/2306.15668.pdf
