Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors

2024-04-16 18:15:57
João Luzio, Alexandre Bernardino, Plinio Moreno

Abstract

The aim of this work is to establish how accurately a recent semantic-based foveal active perception model can complete visual tasks that humans perform routinely, namely scene exploration and visual search. The model exploits the ability of current object detectors to localize and classify a large number of object classes, and it updates a semantic description of the scene across multiple fixations. It has previously been applied to scene exploration tasks. In this paper, we revisit the model and extend its application to visual search tasks. To illustrate the benefits of using semantic information in scene exploration and visual search, we compare its performance against traditional saliency-based models. In scene exploration, the semantic-based method represents the semantic information present in the visual scene more accurately than the traditional saliency-based model. In visual search experiments, where the task is to find instances of a target class in a visual field containing multiple distractors, the semantic-based model outperforms both the saliency-driven model and a random gaze selection algorithm. Our results demonstrate that top-down semantic information significantly influences visual exploration and search, suggesting its integration with traditional bottom-up cues as a promising direction for future research.
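To make the mechanism described in the abstract concrete, the sketch below implements one plausible reading of such a loop: a grid of per-class probabilities serves as the semantic map, detections from a stubbed foveal detector are fused into it across fixations, and the next gaze target is chosen top-down. Everything here (the `detect_objects` stub, the grid resolution, the Bayes-style update) is an illustrative assumption, not the authors' implementation or API.

```python
# Minimal sketch of a semantic-map active-perception loop, assuming a grid
# world and a stubbed foveal detector. All names and parameters here are
# illustrative assumptions, not the paper's actual model.
import numpy as np

GRID, N_CLASSES, TARGET = 8, 5, 2   # map resolution, class count, search target

def detect_objects(fix, rng):
    """Stub foveal detector: yields (cell, class, confidence) triples near
    the fixation, with confidence decaying with eccentricity (foveation)."""
    dets = []
    for _ in range(3):
        cell = tuple(int(c) for c in
                     np.clip(np.array(fix) + rng.integers(-1, 2, size=2), 0, GRID - 1))
        ecc = np.linalg.norm(np.subtract(cell, fix))
        dets.append((cell, int(rng.integers(N_CLASSES)), max(0.1, 0.9 - 0.3 * ecc)))
    return dets

def entropy(p):
    """Per-cell entropy of the class distributions (an uncertainty map)."""
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

rng = np.random.default_rng(0)
semantic_map = np.full((GRID, GRID, N_CLASSES), 1.0 / N_CLASSES)  # uniform prior
fix = (GRID // 2, GRID // 2)

for t in range(10):
    # Update the semantic description with this fixation's detections.
    for cell, cls, conf in detect_objects(fix, rng):
        like = np.full(N_CLASSES, (1.0 - conf) / (N_CLASSES - 1))
        like[cls] = conf                      # simple likelihood from confidence
        post = semantic_map[cell] * like      # Bayes-style fusion across fixations
        semantic_map[cell] = post / post.sum()
    # Top-down gaze selection: most uncertain cell for exploration, or the
    # cell most likely to contain the target class for visual search.
    explore = np.unravel_index(entropy(semantic_map).argmax(), (GRID, GRID))
    search = np.unravel_index(semantic_map[..., TARGET].argmax(), (GRID, GRID))
    fix = tuple(int(i) for i in search)       # use `explore` for exploration
    print(f"t={t} fixate {fix} P(target)={semantic_map[fix][TARGET]:.2f}")
```

Swapping the `search` policy for `explore` switches the same loop between the two tasks the paper evaluates, which is the design point of interest: the map update is shared, only the gaze-selection criterion changes.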

URL

https://arxiv.org/abs/2404.10836

PDF

https://arxiv.org/pdf/2404.10836.pdf

