
REMOTE: Real-time Ego-motion Tracking for Various Endoscopes via Multimodal Visual Feature Learning

2025-01-30 03:58:41
Liangjing Shao, Benshuang Chen, Shuting Zhao, Xinrong Chen

Abstract

Real-time ego-motion tracking for endoscopes is a significant task for efficient navigation and robotic automation of endoscopy. In this paper, a novel framework is proposed to perform real-time ego-motion tracking of the endoscope. Firstly, a multimodal visual feature learning network is proposed to perform relative pose prediction, in which the motion feature from the optical flow, the scene features, and the joint feature from two adjacent observations are all extracted for prediction. Because the channel dimension of the concatenated image carries more correlation information, a novel feature extractor is designed based on an attention mechanism to integrate multi-dimensional information from the concatenation of two consecutive frames. To extract a more complete feature representation from the fused features, a novel pose decoder is proposed to predict the pose transformation from the concatenated feature map at the end of the framework. Finally, the absolute pose of the endoscope is calculated from the relative poses. Experiments are conducted on three datasets of various endoscopic scenes, and the results demonstrate that the proposed method outperforms state-of-the-art methods. Moreover, the inference speed of the proposed method exceeds 30 frames per second, meeting the real-time requirement. The project page is here: this https URL
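
As a rough illustration of the three-branch design sketched in the abstract, the PyTorch snippet below extracts a motion feature from the optical flow, a scene feature from each frame, and a joint feature from the channel-concatenated frame pair, then fuses all of them to regress a relative pose. All module names, layer sizes, and the 6-DoF output parameterization are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One strided conv stage: halves the spatial resolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.ReLU(inplace=True),
    )

class MultimodalPoseNet(nn.Module):
    """Illustrative three-branch relative-pose network: a motion branch
    on the optical flow, a scene branch shared by both frames, and a
    joint branch on the channel-concatenated pair. Hypothetical sizes."""

    def __init__(self, feat=32):
        super().__init__()
        self.motion_enc = nn.Sequential(conv_block(2, feat), conv_block(feat, feat))
        self.scene_enc = nn.Sequential(conv_block(3, feat), conv_block(feat, feat))
        self.joint_enc = nn.Sequential(conv_block(6, feat), conv_block(feat, feat))
        self.decoder = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(4 * feat, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, 6),  # 3 translation + 3 rotation parameters
        )

    def forward(self, frame_t, frame_t1, flow):
        motion = self.motion_enc(flow)  # motion feature from optical flow
        scene = torch.cat([self.scene_enc(frame_t), self.scene_enc(frame_t1)], dim=1)
        joint = self.joint_enc(torch.cat([frame_t, frame_t1], dim=1))
        fused = torch.cat([motion, scene, joint], dim=1)  # (B, 4*feat, h, w)
        return self.decoder(fused)  # relative 6-DoF pose
```

For example, `MultimodalPoseNet()(torch.randn(1, 3, 128, 160), torch.randn(1, 3, 128, 160), torch.randn(1, 2, 128, 160))` returns a `(1, 6)` relative-pose vector.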
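
The abstract motivates the attention-based extractor by the extra correlation information along the channel dimension of the concatenated image. One plausible reading is a squeeze-and-excitation-style channel gate over features of the 6-channel frame pair; the sketch below follows that assumption and is not the paper's actual module.

```python
import torch
import torch.nn as nn

class ChannelAttentionExtractor(nn.Module):
    """Gate feature channels computed from two channel-concatenated RGB
    frames (6 input channels), squeeze-and-excitation style."""

    def __init__(self, in_channels=6, feat_channels=64, reduction=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Global pooling -> bottleneck -> per-channel sigmoid gates.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(feat_channels, feat_channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels // reduction, feat_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, frame_t, frame_t1):
        x = torch.cat([frame_t, frame_t1], dim=1)  # (B, 6, H, W)
        feat = self.conv(x)
        return feat * self.gate(feat)  # channel-reweighted joint feature
```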
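
Recovering the absolute endoscope pose from the predicted relative poses is a standard chaining of SE(3) transforms, starting from the identity at the first frame. The helper below assumes the network outputs a 6-DoF vector with an axis-angle rotation, a common but here hypothetical parameterization.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_vec_to_mat(vec):
    """(tx, ty, tz, rx, ry, rz) with axis-angle rotation -> 4x4 SE(3) matrix."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(vec[3:]).as_matrix()
    T[:3, 3] = vec[:3]
    return T

def accumulate_poses(relative_vecs):
    """Chain per-frame relative poses into absolute poses."""
    absolute = [np.eye(4)]  # world frame anchored at the first view
    for vec in relative_vecs:
        absolute.append(absolute[-1] @ pose_vec_to_mat(vec))
    return absolute
```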


URL

https://arxiv.org/abs/2501.18124

PDF

https://arxiv.org/pdf/2501.18124.pdf

