Date of Award
Campus Access Dissertation
Doctor of Philosophy (PhD)
Dan A. Simovici
Clustering is a central topic in unsupervised learning and has a wide variety of applications. However, the growing need to cluster massive datasets and the high cost of running clustering algorithms pose difficult problems for users, and selecting the best clustering model with a suitable number of clusters is a further primary concern. In this thesis, we focus on determining whether a dataset is clusterable and what the natural number of clusters in a dataset is.
First, we approach data clusterability from an ultrametric-based perspective. We propose a novel approach for measuring the ultrametricity of a dataset by means of a special type of matrix product, and we use this measure to evaluate the dataset's clusterability. We then show that repeatedly applying this matrix product to the distance matrix eventually yields the subdominant ultrametric space of the original dataset. In addition, if a dataset has a unimodal or otherwise poorly structured distribution, its ultrametricity is lower than that of other datasets with the same cardinality. We also show that increasing the clusterability of a dataset improves the performance of even a poor clustering algorithm on that dataset.
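As an illustration of the idea in the preceding paragraph, the following is a minimal sketch that assumes the "special type of matrix product" is the min-max product, (A ⊗ B)[i, j] = min over k of max(A[i, k], B[k, j]); the abstract does not name the product, so this choice is an assumption. The function names `minmax_product` and `subdominant_ultrametric` and the four-point example are hypothetical and are not taken from the thesis. Under that assumption, repeatedly multiplying a distance matrix by itself reaches a fixed point, which is the subdominant ultrametric (the minimax path distance).

```python
import numpy as np


def minmax_product(a, b):
    """Min-max product: result[i, j] = min over k of max(a[i, k], b[k, j])."""
    # Broadcast to shape (i, k, j), take the elementwise max, then min over k.
    return np.min(np.maximum(a[:, :, None], b[None, :, :]), axis=1)


def subdominant_ultrametric(d, max_iter=64):
    """Iterate u <- u (x) u until the matrix stops changing.

    Assumes d is a symmetric distance matrix with a zero diagonal.
    Every entry of the result is an entry of d, so exact comparison is safe.
    """
    u = np.asarray(d, dtype=float)
    for _ in range(max_iter):
        nxt = minmax_product(u, u)
        if np.array_equal(nxt, u):
            break
        u = nxt
    return u


# Hypothetical example: four points on a line at positions 0, 1, 3 and 7.
positions = np.array([0.0, 1.0, 3.0, 7.0])
dist = np.abs(positions[:, None] - positions[None, :])
ultra = subdominant_ultrametric(dist)
```

Each entry of `ultra` is the largest step on the best path between two points, so it never exceeds the original distance, and `ultra` satisfies the strong triangle inequality u[i, j] ≤ max(u[i, k], u[k, j]). How the thesis turns this construction into an ultrametricity score is not stated in the abstract and is not reproduced here.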
Second, we present an information-theoretic technique for determining the natural number of clusters in a dataset. Our approach involves a bi-criteria optimization that uses the entropy and the cohesion of a partition. The experimental results are validated using two quite distinct clustering methods, the k-means algorithm and Ward hierarchical clustering, together with their contour curves. We also show that, by adjusting its parameter, our approach can handle datasets with heavily imbalanced cluster structures, a case that is harder to deal with in practice.
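To make the two criteria named above concrete, here is a minimal sketch that computes them for a given partition. It assumes that entropy means the Shannon entropy of the relative cluster sizes and that cohesion can be measured by the within-cluster sum of squared errors; the abstract does not give either definition, so both are assumptions. The function names `partition_entropy` and `partition_sse` and the toy data are hypothetical.

```python
import numpy as np


def partition_entropy(labels):
    """Shannon entropy (in bits) of the relative cluster sizes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def partition_sse(points, labels):
    """Within-cluster sum of squared distances to each cluster centroid.

    Used here as an assumed stand-in for the cohesion of a partition.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        block = points[labels == c]
        total += ((block - block.mean(axis=0)) ** 2).sum()
    return float(total)


# Hypothetical toy data: two well-separated pairs of 1-D points.
points = np.array([[0.0], [2.0], [10.0], [12.0]])
two_clusters = np.array([0, 0, 1, 1])
one_cluster = np.array([0, 0, 0, 0])

entropy_two = partition_entropy(two_clusters)
sse_two = partition_sse(points, two_clusters)
entropy_one = partition_entropy(one_cluster)
sse_one = partition_sse(points, one_cluster)
```

On this toy data, splitting into two clusters raises the entropy and lowers the squared error compared with a single cluster, which is the kind of trade-off a bi-criteria selection of the number of clusters has to balance. How the thesis combines the two criteria, and which parameter it tunes for imbalanced data, is not stated in the abstract.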
Hua, Kaixun, "Clusterability, Model Selection and Evaluation" (2019). Graduate Doctoral Dissertations. 474.