Abstract
Exact nearest neighbor search is a computationally intensive process, and even its simpler sibling -- vector retrieval -- can be computationally complex. This is exacerbated when retrieving vectors which have high-dimension $d$ relative to the number of vectors, $N$, in the database. Exact nearest neighbor retrieval has been generally acknowledged to be a $O(Nd)$ problem with no sub-linear solutions. Attention has instead shifted towards Approximate Nearest-Neighbor (ANN) retrieval techniques, many of which have sub-linear or even logarithmic time complexities. However, if our intuition from binary search problems (e.g. $d=1$ vector retrieval) carries, there ought to be a way to retrieve an organized representation of vectors without brute-forcing our way to a solution. For low dimension (e.g. $d=2$ or $d=3$ cases), \texttt{kd-trees} provide a $O(d\log N)$ algorithm for retrieval. Unfortunately the algorithm deteriorates rapidly to a $O(dN)$ solution at high dimensions (e.g. $k=128$), in practice. We propose a novel algorithm for logarithmic Fast Exact Retrieval for Nearest-neighbor lookup (FERN), inspired by \texttt{kd-trees}. The algorithm achieves $O(d\log N)$ look-up with 100\% recall on 10 million $d=128$ uniformly randomly generated vectors.\footnote{Code available at this https URL}
Abstract (translated)
精确近邻搜索是一个计算密集型过程,甚至连其简单的兄弟——向量检索——也可能具有计算复杂性。当检索具有高维$d$与数据库中向量数量$N$相对时,这种情况尤为加剧。精确近邻检索通常被认为是$O(Nd)$问题,没有sub-线性解决方案。相反,注意力转向了向量检索的近似近邻(ANN)技术,其中许多具有sub-linear或甚至对数时间复杂性。然而,从二进制搜索问题(例如,$d=1$向量检索)的直觉告诉我们,应该有一种不需要暴力搜索解决方案的方法来检索向量的组织表示。对于低维度(例如,$d=2$或$d=3$的情况),\texttt{kd-trees}提供了一个$O(d\log N)$的检索算法。然而,在高层维度(例如,$k=128$)上,该算法迅速退化为一$O(dN)$的解决方案,在实际应用中。我们提出了一个新颖的logarithmic Fast Exact Retrieval for Nearest-neighbor lookup (FERN)算法,灵感来自\texttt{kd-trees}。该算法在100\%召回率的情况下,具有$O(d\log N)$的查找速度,对128个均匀随机生成的$d=128$维向量进行检索。附录中的代码可在该https URL上找到。
URL
https://arxiv.org/abs/2405.04435