ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling

2024-04-24 21:30:01
Arjun Somayazulu, Sagnik Majumder, Changan Chen, Kristen Grauman

Abstract

An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment for any given source/receiver location. Traditional methods for constructing acoustic models involve expensive and time-consuming collection of large quantities of acoustic data at dense spatial locations in the space, or rely on privileged knowledge of scene geometry to intelligently select acoustic data sampling locations. We propose active acoustic sampling, a new task for efficiently building an environment acoustic model of an unmapped environment, in which a mobile agent equipped with visual and acoustic sensors jointly constructs the environment acoustic model and an occupancy map on the fly. We introduce ActiveRIR, a reinforcement learning (RL) policy that leverages information from audio-visual sensor streams to guide agent navigation and determine optimal acoustic data sampling positions, yielding a high-quality acoustic model of the environment from a minimal set of acoustic samples. We train our policy with a novel RL reward based on information gain in the environment acoustic model. Evaluated on diverse unseen indoor environments from a state-of-the-art acoustic simulation platform, ActiveRIR outperforms an array of baselines, from traditional navigation agents based on spatial novelty and visual exploration to existing state-of-the-art methods.
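
The core training signal is the RL reward based on information gain in the environment acoustic model: a newly sampled room impulse response (RIR) is valuable in proportion to how much it reduces the model's prediction error at other source/receiver poses. The paper's implementation is not reproduced here; the Python sketch below is a hypothetical illustration of that idea, using an invented stand-in acoustic model (NearestRIRModel) and an invented helper (information_gain_reward) purely for exposition.

import numpy as np

class NearestRIRModel:
    """Stand-in acoustic model: predict the RIR at a query pose by
    returning the RIR of the nearest previously sampled pose."""
    def __init__(self, rir_len=512):
        self.rir_len = rir_len
        self.poses, self.rirs = [], []

    def add_sample(self, pose, rir):
        self.poses.append(np.asarray(pose, dtype=float))
        self.rirs.append(np.asarray(rir, dtype=float))

    def predict(self, pose):
        if not self.poses:
            return np.zeros(self.rir_len)  # no samples yet: predict silence
        dists = [np.linalg.norm(np.asarray(pose, dtype=float) - p)
                 for p in self.poses]
        return self.rirs[int(np.argmin(dists))]

def information_gain_reward(model, new_pose, new_rir, eval_set):
    """Reward = drop in mean RIR prediction error over held-out
    (pose, ground-truth RIR) pairs after adding one new sample."""
    def mean_error():
        return float(np.mean([np.linalg.norm(model.predict(p) - gt)
                              for p, gt in eval_set]))
    err_before = mean_error()
    model.add_sample(new_pose, new_rir)
    return err_before - mean_error()  # positive when the sample is informative

# Toy usage: the first sample, taken near a held-out pose, earns a large reward.
rng = np.random.default_rng(0)
eval_set = [((0.0, 0.0), rng.standard_normal(512)),
            ((4.0, 1.0), rng.standard_normal(512))]
model = NearestRIRModel()
print(information_gain_reward(model, (0.1, 0.0), eval_set[0][1], eval_set))

Note that an agent in an unmapped environment has no held-out ground-truth RIRs, so in the actual task the reward would have to be computed against the model's own predictions or some learned proxy for error reduction; the sketch only makes the information-gain structure of the reward concrete.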

URL

https://arxiv.org/abs/2404.16216

PDF

https://arxiv.org/pdf/2404.16216.pdf

