Abstract
We introduce a lightweight and accurate architecture for resource-efficient visual correspondence. Our method, dubbed XFeat (Accelerated Features), revisits fundamental design choices in convolutional neural networks for detecting, extracting, and matching local features. Our new model satisfies a critical need for fast and robust algorithms suitable for resource-limited devices. In particular, accurate image matching requires sufficiently large image resolutions; for this reason, we keep the resolution as large as possible while limiting the number of channels in the network. In addition, our model is designed to offer the choice of matching at the sparse or semi-dense level, each of which may be better suited to different downstream applications, such as visual navigation and augmented reality. Our model is the first to offer semi-dense matching efficiently, leveraging a novel match refinement module that relies on coarse local descriptors. XFeat is versatile and hardware-independent, surpassing current deep learning-based local features in speed (up to 5x faster) with comparable or better accuracy, as demonstrated in pose estimation and visual localization. We showcase it running in real time on an inexpensive laptop CPU without specialized hardware optimizations. Code and weights are available at this http URL.
URL
https://arxiv.org/abs/2404.19174