Abstract
Real-time ego-motion tracking of an endoscope is essential for efficient navigation and for the robotic automation of endoscopy. In this paper, a novel framework is proposed for real-time ego-motion tracking of an endoscope. First, a multi-modal visual feature learning network is proposed for relative pose prediction, in which motion features from the optical flow, scene features, and joint features from two adjacent observations are all extracted for prediction. Because the channel dimension of the concatenated image carries richer correlation information, a novel attention-based feature extractor is designed to integrate multi-dimensional information from the concatenation of two consecutive frames. To obtain a more complete feature representation from the fused features, a novel pose decoder is proposed to predict the pose transformation from the concatenated feature map at the end of the framework. Finally, the absolute pose of the endoscope is computed from the relative poses. Experiments are conducted on three datasets covering various endoscopic scenes, and the results demonstrate that the proposed method outperforms state-of-the-art methods. In addition, the proposed method runs at over 30 frames per second, meeting the real-time requirement. The project page is available at: this https URL
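As a rough illustration of the pipeline the abstract describes, the sketch below shows, in PyTorch, how a multi-branch relative-pose network with channel attention over two concatenated frames and a fused pose decoder might be wired up, and how relative poses could be chained into an absolute trajectory. This is a minimal sketch under stated assumptions, not the authors' implementation; all module names, layer sizes, and variable names are hypothetical.

```python
# Minimal sketch (not the authors' implementation) of the described pipeline:
# (1) channel attention over two RGB frames concatenated along the channel axis,
# (2) fusing motion (optical-flow) features with joint/scene features,
# (3) a pose decoder predicting a 6-DoF relative pose,
# (4) composing relative 4x4 transforms into an absolute trajectory.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style gating over the channel dimension."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)


class RelativePoseNet(nn.Module):
    """Hypothetical multi-branch network combining joint and motion features."""
    def __init__(self):
        super().__init__()
        # Joint branch: two RGB frames concatenated along channels (6 channels in).
        self.joint_encoder = nn.Sequential(
            nn.Conv2d(6, 32, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            ChannelAttention(32),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
        )
        # Motion branch: 2-channel optical flow (a stand-in here; the paper
        # derives the motion feature from the flow between the two observations).
        self.motion_encoder = nn.Sequential(
            nn.Conv2d(2, 32, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
        )
        # Pose decoder: maps the fused feature map to a 6-DoF relative pose
        # (3 translation + 3 rotation parameters).
        self.pose_decoder = nn.Sequential(
            nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 6),
        )

    def forward(self, frame_t, frame_t1, flow):
        joint = self.joint_encoder(torch.cat([frame_t, frame_t1], dim=1))
        motion = self.motion_encoder(flow)
        fused = torch.cat([joint, motion], dim=1)   # concatenated feature map
        return self.pose_decoder(fused)             # (B, 6) relative pose


def compose_trajectory(relative_transforms):
    """Chain 4x4 relative transforms T_{t->t+1} into absolute poses.
    In practice the 6-DoF network output is first converted to a 4x4 SE(3)
    matrix (e.g. via the axis-angle exponential map) before composition."""
    absolute = [torch.eye(4)]
    for T_rel in relative_transforms:
        absolute.append(absolute[-1] @ T_rel)
    return absolute


if __name__ == "__main__":
    net = RelativePoseNet()
    f0 = torch.randn(1, 3, 128, 160)    # frame t
    f1 = torch.randn(1, 3, 128, 160)    # frame t+1
    flow = torch.randn(1, 2, 128, 160)  # optical flow between them (stand-in)
    rel = net(f0, f1, flow)
    print(rel.shape)  # torch.Size([1, 6])
```

The joint and motion branches downsample by the same factor, so their feature maps can be concatenated channel-wise before decoding; whether the real network fuses them this way, or how the scene features are injected, is not specified in the abstract.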
URL
https://arxiv.org/abs/2501.18124