Abstract
We revisit the problem of training attention-based sparse image matching models for various local features. We first identify one critical, previously overlooked design choice that significantly impacts the performance of the LightGlue model. We then investigate the role of detectors and descriptors within the transformer-based matching framework, finding that detectors, rather than descriptors, are often the primary cause of performance differences. Finally, we propose a novel approach to fine-tune existing image matching models using keypoints from a diverse set of detectors, resulting in a universal, detector-agnostic model. When deployed as a zero-shot matcher for novel detectors, the resulting model matches or exceeds the accuracy of models specifically trained for those features. Our findings offer valuable insights for the deployment of transformer-based matching models and the future design of local features.
URL
https://arxiv.org/abs/2602.08430