Unified Dynamic Scanpath Predictors Outperform Individually Trained Models

Abstract

Previous research on scanpath prediction has mainly focused on group models, disregarding the fact that the scanpaths and attentional behaviors of individuals are diverse. Disregarding these differences is especially detrimental to social human-robot interaction, where robots commonly emulate human gaze based on heuristics or predefined patterns. However, human gaze patterns are heterogeneous, and varying behaviors can significantly affect the outcomes of such human-robot interactions. To fill this gap, we developed a deep learning-based social cue integration model for saliency prediction to instead predict scanpaths in videos. Our model learned scanpaths by recursively integrating fixation history and social cues through a gating mechanism and sequential attention. We evaluated our approach on gaze datasets of dynamic social scenes, observed under the free-viewing condition. The introduction of fixation history into our models makes it possible to train a single unified model rather than the resource-intensive approach of training individual models for each set of scanpaths. We observed that the late neural integration approach surpasses early fusion when training models on a large dataset, in comparison to a smaller dataset with a similar distribution. Results also indicate that a single unified model, trained on all the observers' scanpaths, performs on par with or better than individually trained models. We hypothesize that this outcome is a result of the group saliency representations instilling universal attention in the model, while the supervisory signal guides it to learn personalized attentional behaviors, giving the unified model an advantage over individual models due to its implicit representation of universal attention.
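
The recursive, gated integration of fixation history and social cues mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture: the function names (`gated_fusion`, `rollout`), the elementwise sigmoid gate, and the flat feature vectors are all assumptions chosen for illustration, and the sequential attention component is omitted.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(history, cues, w_gate, b_gate):
    """Blend two equal-length feature vectors with an elementwise gate.

    history, cues: lists of floats of length d (hypothetical features).
    w_gate: d rows, each of length 2*d (hypothetical learned weights).
    b_gate: list of d floats (hypothetical learned biases).
    """
    joint = history + cues  # concatenate both inputs for the gate
    fused = []
    for i, (h, c) in enumerate(zip(history, cues)):
        pre = sum(w * x for w, x in zip(w_gate[i], joint)) + b_gate[i]
        g = sigmoid(pre)
        # Convex combination: g near 1 keeps history, g near 0 keeps cues.
        fused.append(g * h + (1.0 - g) * c)
    return fused


def rollout(cue_sequence, w_gate, b_gate):
    """Feed the fused state back in as the history for the next step."""
    state = [0.0] * len(b_gate)
    for cues in cue_sequence:
        state = gated_fusion(state, cues, w_gate, b_gate)
    return state


if __name__ == "__main__":
    d = 2
    zero_w = [[0.0] * (2 * d) for _ in range(d)]
    zero_b = [0.0] * d
    # With zero weights the gate is 0.5 everywhere, so each step averages
    # the running state with the new cue vector.
    print(rollout([[2.0, 2.0], [4.0, 4.0]], zero_w, zero_b))
```

In this sketch the gate decides, per feature, how much of the accumulated history to retain versus how much of the incoming cue representation to admit; a trained model would learn `w_gate` and `b_gate` rather than fixing them.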

URL

https://arxiv.org/abs/2405.02929

PDF

https://arxiv.org/pdf/2405.02929.pdf
