Abstract
Large Language Models (LLMs) handle physical commonsense information inadequately. Because they are trained in a disembodied setting, LLMs often fail to predict the outcome of an action in a given environment. Yet predicting an action's effects before it is executed is crucial for planning, where coherent sequences of actions are needed to achieve a goal. We therefore introduce the multi-modal task of predicting the outcomes of actions solely from realistic sensory inputs (images and text). We then extend an LLM to model latent representations of objects so that it better predicts action outcomes in an environment. We show that multi-modal models can capture physical commonsense when augmented with visual information. Finally, we evaluate our model's performance on novel actions and objects and find that combining modalities helps models generalize and learn physical commonsense reasoning better.
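The task described above pairs an image of the environment with a textual action description and asks the model to predict the action's outcome. The following is a minimal, illustrative sketch of such a late-fusion predictor; the encoders, dimensions, and outcome classes are all hypothetical stand-ins (not the paper's architecture), with a random linear head in place of a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IMG, D_TXT, N_OUTCOMES = 8, 8, 3  # hypothetical feature and label sizes

def encode_image(img):
    # toy stand-in for a vision backbone: normalize and truncate pixels
    return img.reshape(-1)[:D_IMG] / 255.0

def encode_text(tokens):
    # toy stand-in for an LLM text encoder: hashed bag-of-words features
    vec = np.zeros(D_TXT)
    for tok in tokens:
        vec[hash(tok) % D_TXT] += 1.0
    return vec

# randomly initialized fusion head (a trained model would learn these weights)
W = rng.normal(size=(D_IMG + D_TXT, N_OUTCOMES)) * 0.1

def predict_outcome(img, tokens):
    # concatenate visual and textual features, then score outcome classes
    z = np.concatenate([encode_image(img), encode_text(tokens)])
    logits = z @ W
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

image = rng.integers(0, 256, size=(4, 4))          # dummy 4x4 grayscale scene
probs = predict_outcome(image, ["push", "the", "cup"])
```

`probs` is a distribution over outcome classes; the point of the sketch is only the fusion of both modalities into a single prediction, which is what the abstract argues gives models their physical-commonsense advantage.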
URL
https://arxiv.org/abs/2301.11845