Abstract
Hash codes are efficient data representations for coping with ever-growing amounts of data. In this paper, we introduce a random forest semantic hashing scheme that embeds tiny convolutional neural networks (CNNs) into shallow random forests, with near-optimal information-theoretic code aggregation among trees. We start with a simple hashing scheme, where random trees in a forest act as hashing functions by setting '1' for the visited tree leaf and '0' for the rest. We show that traditional random forests fail to generate hashes that preserve the underlying similarity between the trees, which makes the random forest approach to hashing challenging. To address this, we propose to first randomly group the classes arriving at each tree split node into two groups, obtaining a significantly simplified two-class classification problem that can be handled by a light-weight CNN weak learner. This random class grouping scheme enables code uniqueness by forcing each class to share its code with different classes in different trees. A non-conventional low-rank loss is further adopted for the CNN weak learners to encourage code consistency by minimizing intra-class variations and maximizing inter-class distance for the two random class groups. Finally, we introduce an information-theoretic approach for aggregating the codes of individual trees into a single hash code, producing a near-optimal unique hash for each class. The proposed approach significantly outperforms state-of-the-art hashing methods for image retrieval tasks on large-scale public datasets, and performs at the level of other state-of-the-art image classification techniques while using a more compact, efficient, and scalable representation. This work proposes a principled and robust procedure to train and deploy in parallel an ensemble of light-weight CNNs, instead of simply going deeper.
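The simple hashing scheme and the random class grouping described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the fixed number of leaves per tree, and the uniformly random cut point used to split the classes are all assumptions made for illustration, and the CNN weak learners, low-rank loss, and information-theoretic aggregation are omitted.

```python
import random


def leaf_indicator_code(leaf_indices, leaves_per_tree):
    """Concatenate one indicator block per tree (hypothetical helper).

    leaf_indices[t] is the leaf that a sample reaches in tree t.
    """
    code = []
    for leaf in leaf_indices:
        if not 0 <= leaf < leaves_per_tree:
            raise ValueError("leaf index out of range")
        block = [0] * leaves_per_tree
        block[leaf] = 1  # '1' for the visited leaf, '0' for the rest
        code.extend(block)
    return code


def random_class_split(classes, rng):
    """Randomly partition the arriving classes into two non-empty groups.

    Assumption: the cut point is drawn uniformly; the abstract only says
    the grouping is random.
    """
    shuffled = list(classes)
    if len(shuffled) < 2:
        raise ValueError("need at least two classes to split")
    rng.shuffle(shuffled)
    cut = rng.randint(1, len(shuffled) - 1)
    return set(shuffled[:cut]), set(shuffled[cut:])


def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)
left, right = random_class_split(range(10), rng)

# Two samples that reach the same leaf in trees 0 and 2 but not in tree 1.
code_a = leaf_indicator_code([2, 0, 3], leaves_per_tree=4)
code_b = leaf_indicator_code([2, 1, 3], leaves_per_tree=4)
print(code_a, code_b, hamming(code_a, code_b))
```

In this toy setup, each tree on which two samples disagree contributes two differing bits to the concatenated code, so the Hamming distance counts disagreeing trees twice.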
URL
https://arxiv.org/abs/1711.08364