Paper Reading AI Learner

Unified Dynamic Scanpath Predictors Outperform Individually Trained Models

2024-05-05 13:15:11
Fares Abawi, Di Fu, Stefan Wermter

Abstract

Previous research on scanpath prediction has mainly focused on group models, disregarding the fact that the scanpaths and attentional behaviors of individuals are diverse. Disregarding these differences is especially detrimental to social human-robot interaction, in which robots commonly emulate human gaze based on heuristics or predefined patterns. However, human gaze patterns are heterogeneous, and varying behaviors can significantly affect the outcomes of such human-robot interactions. To fill this gap, we developed a deep learning-based social cue integration model for saliency prediction that instead predicts scanpaths in videos. Our model learns scanpaths by recursively integrating fixation history and social cues through a gating mechanism and sequential attention. We evaluated our approach on gaze datasets of dynamic social scenes observed under the free-viewing condition. Introducing fixation history into our models makes it possible to train a single unified model rather than taking the resource-intensive approach of training individual models for each set of scanpaths. We observed that the late neural integration approach surpasses early fusion when models are trained on a large dataset, in comparison to a smaller dataset with a similar distribution. Results also indicate that a single unified model, trained on all the observers' scanpaths, performs on par with or better than individually trained models. We hypothesize that this outcome results from the group saliency representations instilling universal attention in the model, while the supervisory signal guides it to learn personalized attentional behaviors, giving the unified model an advantage over individual models due to its implicit representation of universal attention.
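The abstract describes recursively integrating fixation history and social cues through a gating mechanism. The paper does not spell out the exact formulation here, but the general idea can be sketched as a learned sigmoid gate that blends a projection of the current frame's social-cue features with the carried fixation-history state. Below is a minimal NumPy sketch under that assumption; all names, feature sizes, and weight shapes are hypothetical, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gated_fusion(fixation_hist, social_cues, W_g, W_f):
    # A sigmoid gate, computed from both inputs, decides per feature
    # how much the projected social cues contribute versus the
    # recursively carried fixation-history state.
    z = W_g @ np.concatenate([fixation_hist, social_cues])
    gate = 1.0 / (1.0 + np.exp(-z))
    return gate * (W_f @ social_cues) + (1.0 - gate) * fixation_hist

d, s = 8, 12                       # hypothetical feature dimensions
W_g = rng.normal(size=(d, d + s))  # gate weights (learned in practice)
W_f = rng.normal(size=(d, s))      # social-cue projection weights

state = np.zeros(d)                    # initial fixation-history state
for cues in rng.normal(size=(5, s)):   # e.g. 5 video frames
    state = gated_fusion(state, cues, W_g, W_f)

print(state.shape)  # (8,)
```

Applied per frame, this lets one shared model carry observer-specific fixation history forward while the gate modulates how strongly incoming social cues override it.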

URL

https://arxiv.org/abs/2405.02929

PDF

https://arxiv.org/pdf/2405.02929.pdf
