Paper Reading AI Learner

360VOTS: Visual Object Tracking and Segmentation in Omnidirectional Videos

2024-04-22 07:54:53
Yinzhe Xu, Huajian Huang, Yingshu Chen, Sai-Kit Yeung

Abstract

Visual object tracking and segmentation in omnidirectional videos are challenging due to the wide field-of-view and the large spherical distortion introduced by 360° images. To alleviate these problems, we introduce a novel representation, extended bounding field-of-view (eBFoV), for target localization and use it as the foundation of a general 360 tracking framework applicable to both omnidirectional visual object tracking and segmentation tasks. Building upon our previous work on omnidirectional visual object tracking (360VOT), we propose a comprehensive dataset and benchmark that incorporates a new component called omnidirectional video object segmentation (360VOS). The 360VOS dataset includes 290 sequences accompanied by dense pixel-wise masks and covers a broader range of target categories. To support both the development and evaluation of algorithms in this domain, we divide the dataset into a training subset with 170 sequences and a testing subset with 120 sequences. Furthermore, we tailor evaluation metrics for both omnidirectional tracking and segmentation to ensure rigorous assessment. Through extensive experiments, we benchmark state-of-the-art approaches and demonstrate the effectiveness of our proposed 360 tracking framework and training dataset. Homepage: this https URL
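The spherical distortion the abstract refers to stems from how 360° video is stored: an equirectangular image maps longitude and latitude linearly onto pixel coordinates, so objects near the poles are heavily stretched and targets can wrap across the left/right image border. The abstract does not give the eBFoV formulas, but the underlying pixel-to-sphere mapping that motivates sphere-based representations can be sketched as follows (function names are illustrative, not from the paper):

```python
import math

def equirect_to_spherical(x, y, width, height):
    """Map a pixel (x, y) in an equirectangular 360-degree image to
    spherical coordinates (longitude, latitude) in radians.
    Longitude spans [-pi, pi); latitude spans [-pi/2, pi/2]."""
    lon = (x / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - y / height) * math.pi
    return lon, lat

def spherical_to_equirect(lon, lat, width, height):
    """Inverse mapping: spherical coordinates back to pixel coordinates."""
    x = (lon / (2.0 * math.pi) + 0.5) * width
    y = (0.5 - lat / math.pi) * height
    return x, y

# The image center corresponds to longitude 0, latitude 0
# (the camera's "front" direction on the sphere).
lon, lat = equirect_to_spherical(960, 480, 1920, 960)
print(lon, lat)  # 0.0 0.0
```

Because this mapping is nonlinear in apparent object size and wraps at ±π in longitude, an axis-aligned image-plane box is a poor target descriptor for 360° content, which is why representations defined on the sphere, such as the paper's (e)BFoV, are used instead.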

URL

https://arxiv.org/abs/2404.13953

PDF

https://arxiv.org/pdf/2404.13953.pdf
