Abstract
k-Nearest Neighbors is one of the most fundamental yet effective classification models. In this paper, we propose two families of models, one built on a sequence-to-sequence model and one on a memory network model, that mimic the k-Nearest Neighbors model. Each model generates a sequence of labels, a sequence of out-of-sample feature vectors, and a final label for classification, so the models can also function as oversamplers. We also propose 'out-of-core' versions of our models, which assume that only a small portion of the data can be loaded into memory. Computational experiments show that our models outperform k-Nearest Neighbors, a feed-forward neural network, and a memory network, because our models must produce additional output and not just the label. As an oversampler on imbalanced datasets, the sequence-to-sequence kNN model often outperforms the Synthetic Minority Over-sampling Technique and Adaptive Synthetic Sampling.
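For readers unfamiliar with the two baselines named in the abstract, the sketch below illustrates plain k-Nearest Neighbors classification and the interpolation step at the core of SMOTE-style oversampling. It is not the sequence-to-sequence or memory network model proposed in the paper. The function names (`knn_predict`, `smote_like_sample`), the toy data, and the parameter values are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Rank training indices by distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def smote_like_sample(minority_x, k=2, rng=None):
    """Create one synthetic point by interpolating between a minority sample
    and one of its k nearest minority neighbours (the basic SMOTE idea)."""
    rng = rng or random.Random(0)
    base = rng.choice(minority_x)
    neighbours = sorted(
        (p for p in minority_x if p is not base),
        key=lambda p: euclidean(p, base),
    )[:k]
    other = rng.choice(neighbours)
    t = rng.random()  # interpolation weight in [0, 1)
    return [b + t * (o - b) for b, o in zip(base, other)]


# Made-up toy data for illustration only: class 1 is the minority class.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
     [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]]
y = [0, 0, 0, 0, 1, 1, 1]

label = knn_predict(X, y, [0.95, 1.0], k=3)
synthetic = smote_like_sample([x for x, c in zip(X, y) if c == 1])
print(label, synthetic)
```

The synthetic point always lies on the line segment between two existing minority samples. The abstract's models, by contrast, generate out-of-sample feature vectors as part of their output sequence.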
URL
https://arxiv.org/abs/1804.11214