============================= Divise Clustering with HiPart ============================= :Author: Sotiris Tasoulis .. contents:: :depth: 4 .. Divisive Clustering and Visualization ===================================== Herein is an example of the recent HiPart library for divisive clustering (Anagnostou et al. (2022)). In this tutorial, we will perform divisive clustering through one of the methods provided by the library, and we will examine it’s static visualizations options. There is an interactive plot tool provided by the package as well, but we will not cover it here. Initially we load the required libraries for this example, mainly the sklearn datasets and matplotlib for ploting. .. container:: cell .. code:: python # install HiPart using pip # %pip install HiPart import numpy as np import pandas as pd import matplotlib.pyplot as plt from HiPart.clustering import dePDDP from HiPart import visualizations as hvs from sklearn.datasets import make_blobs from sklearn.datasets import load_iris from sklearn.datasets import load_linnerud The iris example ================ For consistency (and historical reasons :) ) we start with an example on clustering the iris dataset. The algorithm of choice here is dePDDP because it provides efficient density clustering and its parameterization unveils interesting aspects (S. K. Tasoulis, Tasoulis, and Plagianakos (2010)). The actual number of clusters is given (3), along with a “bandwidth” parameter that affects the sensitivity of the kernel density estimation, similarly to the bin_size parameter we usually provide to histograms. The basic attribute of this algorithm is that it clusters the data based on such one dimensional density estimations. Since, the clustering is **divisive** the algorithm generate binary splits thus, in contrast to other clustering methods, it is straight forward to define parameters such as the smallest size of cluster allowed, and how uneven should the resulting subclusters be. These are controlled by the “min_sample_split” and “percentile” parameters respectively. .. container:: cell .. code:: python data = load_iris() Y = data.target X = data.data clustered_class = dePDDP(max_clusters_number=3, bandwidth_scale=0.5, percentile=0.2, min_sample_split=5) clu_res = clustered_class.fit(X) # clu_res.output_matrix m = hvs.split_visualization(clu_res) .. container:: cell-output cell-output-display .. figure:: hipart-example1_files/figure-rst/fig0-output-1.png :alt: Divisive Clustering Example for 3 clusters :name: fig0 Divisive Clustering Example for 3 clusters Notice: - The output_matrix output: It can be very useful to get the whole tree structure in the form of a matrix. Each row represents a data point and each column cluster ids (predicted labels). More precisely each column contain the clustering ids to a corresponding iterative step of the clustering procedure, so the first column contains the clustering ids for a binary split, then the second the ids after the second break (cluster ids for three clusters) and the last column the final cluster ids according to the desired number of cluster given to the algorithm as input. - Lets change the desired number of clusters The dePDDP algorithm is deterministic. In practice this means than when we rerun the algorithm “allowing” it to retrieve a higher number of clusters we just relax the stopping criterion. The algorithm will choose to split further a leaf node from the previous clustering result. .. container:: cell .. code:: python clustered_class = dePDDP(max_clusters_number=5, bandwidth_scale=0.5, percentile=0.2, min_sample_split=5) clu_res = clustered_class.fit(X) m = hvs.split_visualization(clu_res) .. container:: cell-output cell-output-display .. figure:: hipart-example1_files/figure-rst/fig2-output-1.png :alt: Divisive Clustering Example for 5 clusters :name: fig2 Divisive Clustering Example for 5 clusters Simplicity, Sensitivity ======================= Setting up the parameters for dePDDP is super simple to understand but can be tricky as well. In practice, there are two dominant parameters that need to be specified carefully since they interact. For example the “number of clusters” parameter can be skipped completely. This will also allow us to use the algorithm for cluster number estimation. However, the bandwidth selection for the kernel density estimation could greatly affect the results in this case. In the following example, we only set the “bandwidth_scale” parameter and the algorithm estimate the number of clusters itself. We plot the results along with the “true labels”. We observe that two clusters have great scale of overlap constituting in practice one cluster with double the size. The algorithm identifies it as one cluster and provides a reasonable estimate for the number of cluster in this example. .. container:: cell :name: fig3 .. code:: python X, y = make_blobs(n_samples=1500, centers=6, random_state=0) clustered_class = dePDDP(bandwidth_scale=1.5) # get only the predicted class clu_res = clustered_class.fit_predict(X) fig, (ax1, ax2) = plt.subplots(1, 2, constrained_layout=True, sharey=True) ax1.scatter(X[:,0], X[:,1], c=y, alpha=0.5) ax1.set_title('True Labels') ax2.scatter(X[:,0], X[:,1], c=clu_res, alpha=0.5) ax2.set_title('Clustering result') .. container:: cell-output cell-output-display :name: fig3-1 :: Text(0.5, 1.0, 'Clustering result') Divisive Clustering Example .. container:: cell-output cell-output-display .. image:: hipart-example1_files/figure-rst/fig3-output-2.png :name: fig3-2 Dendrogram example ================== Yay, we can also plot dendrograms! Remember that divisive clustering operates in a top down fashion. The orange cluster has been split from the rest of the dataset first. Although there are no cluster similaries estimated here, we may assume that the green and red clusters are more similar to each other because they have been split later in the iterative procedure. .. container:: cell .. code:: python data = load_linnerud() Y = data.target X = data.data clustered_class = dePDDP(max_clusters_number=3, bandwidth_scale=0.5, percentile=0.1, min_sample_split=1) clu_res = clustered_class.fit(X) m = hvs.dendrogram_visualization(clu_res) .. container:: cell-output cell-output-display .. figure:: hipart-example1_files/figure-rst/fig4-output-1.png :alt: Divisive Clustering Example :name: fig4 Divisive Clustering Example Why bother? =========== - Speed and efficiency comparison In this example we will test HiPart for high dimensional data to expose the computational advantages of divisive over agglomerative clustering. We employ the super simple iPDDP algorithm (S. Tasoulis and Tasoulis (2008)) as we expect extensive data sparsity and a minimal deegre of cluster overlap, so the distance based clustering criterio seems more appropriate than the density one. We also use a dataset with uneven clusters to make the comparison more interesting. .. container:: cell .. code:: python from HiPart.clustering import iPDDP from sklearn.cluster import AgglomerativeClustering from sklearn.cluster import KMeans import time from sklearn import metrics # generate the dataset X, y = make_blobs(n_samples=[5000,2000,1000,500,500,200,100], centers=None, n_features= 1000,cluster_std=20,random_state=2) # -------------- # set the model clustered_class = iPDDP(max_clusters_number=7, percentile=0.1, min_sample_split=5) st = time.time() iPDDP_res = clustered_class.fit_predict(X) et = time.time() iPDDP_elapsed_time = et - st # -------------- # set the model Agglo_model = AgglomerativeClustering(n_clusters=7,linkage='average') st = time.time() Agglo_res = Agglo_model.fit_predict(X) et = time.time() Agglo_elapsed_time = et - st # -------------- # set the model kmeans_model = KMeans(n_clusters=7, random_state=0 ) st = time.time() kmeans_res = kmeans_model.fit_predict(X) et = time.time() kmeans_elapsed_time = et - st # Measure clustering efficiency as well with the Ajusted Rand Index metric iPDDP_ARS = metrics.adjusted_rand_score(y, iPDDP_res) agglo_ARS = metrics.adjusted_rand_score(y, Agglo_res) kmeans_ARS = metrics.adjusted_rand_score(y, kmeans_res) results = pd.DataFrame({"method" : ["iPDDP","Agglo","kmeans"], "Seconds" : [iPDDP_elapsed_time, Agglo_elapsed_time,kmeans_elapsed_time], "ARS" : [iPDDP_ARS,agglo_ARS,kmeans_ARS] }) print(results) .. container:: cell-output cell-output-stdout :: method Seconds ARS 0 iPDDP 2.134749 0.999651 1 Agglo 30.910616 0.975736 2 kmeans 5.666819 0.583536 .. container:: references csl-bib-body hanging-indent :name: refs .. container:: csl-entry :name: ref-anagnostou2022hipart Anagnostou, Panagiotis, Sotiris Tasoulis, Vassilis Plagianakos, and Dimitris Tasoulis. 2022. “HiPart: Hierarchical Divisive Clustering Toolbox.” arXiv. https://doi.org/10.48550/ARXIV.2209.08680. .. container:: csl-entry :name: ref-TASOULIS20103391 Tasoulis, S. K., D. K. Tasoulis, and V. P. Plagianakos. 2010. “Enhancing Principal Direction Divisive Clustering.” *Pattern Recognition* 43 (10): 3391–411. https://doi.org/https://doi.org/10.1016/j.patcog.2010.05.025. .. container:: csl-entry :name: ref-tasoulis2008improving Tasoulis, SK, and DK Tasoulis. 2008. “Improving Principal Direction Divisive Clustering.” In *14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008), Workshop on Data Mining Using Matrices and Tensors*.