Combining Vision and Tactile Sensation for Video Prediction

2023-04-21 18:02:15
Willow Mandil, Amir Ghalamzan-E

Abstract

In this paper, we explore the impact of adding tactile sensation to video prediction models for physical robot interactions. Predicting the effect of robotic actions on the environment is a fundamental challenge in robotics. Current methods leverage visual and robot action data to generate video predictions over a given time horizon, which can then be used to adjust robot actions. However, humans rely on both visual and tactile feedback to develop and maintain a mental model of their physical surroundings. We therefore propose three multi-modal integration approaches and compare the performance of the resulting tactile-enhanced video prediction models. Additionally, we introduce two new robot-pushing datasets collected with a magnetic-based tactile sensor for unsupervised learning. The first dataset contains visually identical objects with different physical properties, while the second mimics existing robot-pushing datasets of household object clusters. Our results demonstrate that incorporating tactile feedback into video prediction models improves scene prediction accuracy and enhances the agent's perception of physical interactions and its understanding of cause-effect relationships during physical robot interactions.
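
The abstract names three multi-modal integration approaches but does not spell them out, so the snippet below is only a minimal PyTorch sketch of one plausible variant: early fusion, where the tactile reading and robot action are tiled over the visual feature map before a recurrent core rolls the scene forward. Every name and size here (`TactileFusionPredictor`, `tactile_dim=48`, the 512-dim GRU latent, and the fusion scheme itself) is an illustrative assumption, not the paper's architecture.

```python
import torch
import torch.nn as nn


class TactileFusionPredictor(nn.Module):
    """Hypothetical early-fusion vision-tactile video predictor (sketch only).

    Tactile and action vectors are tiled over the visual feature map and
    concatenated channel-wise before a recurrent core predicts the next frame.
    """

    def __init__(self, tactile_dim=48, action_dim=7, hidden=64):
        super().__init__()
        # 64x64 RGB frame -> (hidden, 16, 16) feature map
        self.encoder = nn.Sequential(
            nn.Conv2d(3, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
        )
        fused = hidden + tactile_dim + action_dim
        self.core = nn.GRUCell(fused * 16 * 16, 512)  # recurrent latent dynamics
        # latent -> next 64x64 RGB frame
        self.decoder = nn.Sequential(
            nn.Linear(512, hidden * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (hidden, 16, 16)),
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, frames, tactile, actions):
        # frames: (B, T, 3, 64, 64); tactile: (B, T, tactile_dim); actions: (B, T, action_dim)
        B, T = frames.shape[:2]
        h = frames.new_zeros(B, 512)
        preds = []
        for t in range(T):
            feat = self.encoder(frames[:, t])                        # (B, hidden, 16, 16)
            side = torch.cat([tactile[:, t], actions[:, t]], dim=1)  # (B, tactile+action)
            side = side[:, :, None, None].expand(-1, -1, 16, 16)     # tile over feature map
            x = torch.cat([feat, side], dim=1).flatten(1)            # early fusion
            h = self.core(x, h)
            preds.append(self.decoder(h))
        return torch.stack(preds, dim=1)  # frame predictions, step t targets frame t+1
```

As a usage example, `model(frames, tactile, actions)` with `frames` of shape `(B, T, 3, 64, 64)` returns a tensor of the same shape whose step-t slice would be trained against frame t+1; the paper's other two integration approaches (e.g., fusing later in the network) would differ in where the tactile signal enters this pipeline.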

URL

https://arxiv.org/abs/2304.11193

PDF

https://arxiv.org/pdf/2304.11193.pdf

