Abstract
Cluster analysis plays a crucial role in database mining, and one of the most widely used algorithms in this field is DBSCAN. However, DBSCAN has several limitations, such as difficulty in handling high-dimensional large-scale data, sensitivity to input parameters, and lack of robustness in producing clustering results. This paper introduces an improved version of DBSCAN that leverages the block-diagonal property of the similarity graph to guide the clustering procedure of DBSCAN. The key idea is to construct a graph that measures the similarity between high-dimensional large-scale data points and has the potential to be transformed into a block-diagonal form through an unknown permutation, followed by a cluster-ordering procedure to generate the desired permutation. The clustering structure can be easily determined by identifying the diagonal blocks in the permuted graph. We propose a gradient descent-based method to solve the proposed problem. Additionally, we develop a DBSCAN-based points traversal algorithm that identifies clusters with high densities in the graph and generates an augmented ordering of clusters. The block-diagonal structure of the graph is then achieved through permutation based on the traversal order, providing a flexible foundation for both automatic and interactive cluster analysis. We introduce a split-and-refine algorithm to automatically search for all diagonal blocks in the permuted graph with theoretically optimal guarantees under specific cases. We extensively evaluate our proposed approach on twelve challenging real-world benchmark clustering datasets and demonstrate its superior performance compared to the state-of-the-art clustering method on every dataset.
Abstract (translated)
聚类分析在数据库挖掘中扮演着关键角色,该领域中最常用的算法是 DBSCAN。然而,DBSCAN 具有多个限制,例如处理高维大型数据集的困难性、对输入参数的敏感性以及产生聚类结果缺乏鲁棒性。本文介绍了一种改进的 DBSCAN 版本,它利用相似图的块状性质指导 DBSCAN 的聚类过程。关键思想是构建一个衡量高维大型数据点之间相似性的图形,具有将经过未知排列后可能变为块状的潜在可能性,然后通过聚类排序程序生成所需的排列。通过确定 permuted 图形中的块状行,可以轻松确定聚类结构。我们提出了一种基于梯度下降的方法来解决所提出的问题。此外,我们还开发了一种基于 DBSCAN 的点遍历算法,它可以在图中识别高密度聚类,并生成聚类的增强顺序。通过基于遍历顺序的排列,为自动和交互式聚类分析提供了灵活的基础。我们对特定情况下的理论最优保证下的所有 permuted 图形中的块状行进行自动搜索,并从每个数据集的大量挑战实例中进行全面评估。我们证明了所提出方法在所有数据集上的优越性能,与最先进的聚类方法相比。
URL
https://arxiv.org/abs/2404.01341