HiPart Visualization Capabilities 

Author: Panagiotis Anagnostou

1 Introduction 

In this example, we present a few visualization capabilities of the HiPart package, along with its compatibility with well-established data science visualization tools.

For that purpose, we will use the Cancer dataset, which can be found in the package repository.

1.1 Initialization 

We begin with the needed modules for the example.

from HiPart.clustering import DePDDP
from scipy.cluster import hierarchy

import h5py
import HiPart.interactive_visualization as iv
import HiPart.visualizations as viz
import matplotlib
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

1.2 Data loading 

The real-world dataset we use is in .h5 format. In what follows, we define a custom function (h5file()) for data loading.

# h5 files input function
def h5file(data_folder, name):
    # The context manager closes the file even if reading fails
    with h5py.File(data_folder + name + ".h5", "r") as f:
        inData = f["data"]["matrix"][:].transpose()
        inTarget = f["class"]["categories"][:]
    inTarget = np.int32(inTarget) - 1

    # Make sure the samples lie on the rows of the data matrix
    if inData.shape[0] != len(inTarget):
        inData = inData.transpose()

    return inData, inTarget

# input the data
X, y = h5file("./", "Cancer")
# y contains the categories as integers, and their numbering starts from -1
y = y + 1   # numbering correction

We cluster the Cancer dataset with the dePDDP algorithm. The only parameter we provide is max_clusters_number, the number of clusters we expect the algorithm to retrieve. Keep in mind that the algorithm has its own termination criterion, so the retrieved number of clusters can be smaller than max_clusters_number.

# Return a model that contains the clustering of the data
clustering = DePDDP(
    max_clusters_number=np.unique(y).shape[0],
).fit(X)
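
Because the algorithm can stop before reaching max_clusters_number, it can be useful to check how many clusters were actually retrieved. The following is a minimal sketch, not part of the original example: the helper name count_clusters is hypothetical, and a small stand-in array is used in place of clustering.labels_.

```python
import numpy as np


def count_clusters(labels):
    """Return the number of distinct labels and the size of each cluster."""
    unique, counts = np.unique(labels, return_counts=True)
    return unique.shape[0], dict(zip(unique.tolist(), counts.tolist()))


# Stand-in for clustering.labels_ (illustrative values, not the Cancer data)
example_labels = np.array([0, 0, 1, 2, 2, 2])
n_found, sizes = count_clusters(example_labels)
print(n_found, sizes)
```

With the fitted model, the same helper could be called as count_clusters(clustering.labels_) and the result compared against np.unique(y).shape[0].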

1.4 Utilities 

We initialize utilities for this example.

# Create a list of colors for the clusters to use
color_map = plt.get_cmap("tab20", 20)  # matplotlib.cm.get_cmap() was removed in matplotlib 3.9
color_list = [iv._convert_to_hex(color_map(i)) for i in range(color_map.N)]
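
The helper iv._convert_to_hex has a leading underscore, which by Python convention marks it as private. As a hedged alternative, a comparable list of hex colors can be built with matplotlib's public API only. This sketch assumes a matplotlib version that provides the matplotlib.colormaps registry; the names tab20 and hex_colors are illustrative, and the output is not guaranteed to be formatted identically to the HiPart helper.

```python
import matplotlib
from matplotlib.colors import to_hex

# Look up the qualitative "tab20" colormap (20 distinct colors)
tab20 = matplotlib.colormaps["tab20"]

# Convert every RGBA entry of the colormap to a "#rrggbb" string
hex_colors = [to_hex(tab20(i)) for i in range(tab20.N)]
print(len(hex_colors), hex_colors[:3])
```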

2 Built-in visualizations 

We begin with the split visualization of HiPart, which provides a clear view of the hierarchical algorithmic procedure.

spl_viz = viz.split_visualization(clustering)
spl_viz.show()

Each 2D scatter plot corresponds to the PCA projection used to estimate the separating hyperplane, which is shown as a vertical red line.

Next, we use the dendrogram visualization, which illustrates the constructed binary tree.

dendrogram_viz = viz.dendrogram_visualization(clustering)

3 Seaborn `clustermap()`

Seaborn is a popular library for statistical visualization in Python. It is built on top of the matplotlib library and is closely integrated with the data structures of pandas.

The clustermap() function from seaborn plots a hierarchically clustered heat map of the data matrix. Seaborn already provides a variety of built-in hierarchical agglomerative methods. Using the HiPart package, you can also pass the results of its hierarchical divisive methods as input to the clustermap() function.

# create a linkage that represents the row-wise clustering
row_linkage = viz.linkage(clustering)
# create colors for the rows
row_colors = np.take(color_list, clustering.labels_.astype("int"))

# Cluster the data by column and create a linkage that represents the column-wise clustering
column_clustering = DePDDP(
    max_clusters_number=7,
).fit(X.transpose())
column_linkage = viz.linkage(column_clustering)
# create colors for the columns
column_colors = np.take(color_list, column_clustering.labels_.astype("int"))

heatmap = sns.clustermap(
    X,
    figsize=(10, 6),
    cmap="rocket_r",
    row_linkage=row_linkage,  # these four inputs are the key inputs for the heatmap visualization
    row_colors=row_colors,
    col_linkage=column_linkage,
    col_colors=column_colors,
    dendrogram_ratio=0.12,
)
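
The row_linkage and col_linkage arguments of clustermap() take precomputed linkage matrices in the format used by scipy.cluster.hierarchy. The sketch below illustrates that format on toy data. It uses SciPy's agglomerative linkage() as a stand-in, under the assumption that viz.linkage returns a matrix with the same layout; the random toy data is purely illustrative.

```python
import numpy as np
from scipy.cluster import hierarchy

# Toy data: 6 samples with 3 features (illustrative, not the Cancer data)
rng = np.random.default_rng(0)
toy = rng.normal(size=(6, 3))

# A linkage matrix has one row per merge: (n_samples - 1) rows and 4 columns
# holding the two merged cluster ids, their distance, and the new cluster size
Z = hierarchy.linkage(toy, method="ward")
print(Z.shape)
print(hierarchy.is_valid_linkage(Z))
```

Running hierarchy.is_valid_linkage() on row_linkage and column_linkage before calling clustermap() is a cheap way to catch a malformed matrix early.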

4 Dendrogram for evaluation 

Finally, we present a dendrogram example with a color line at the bottom of the visualization. This line presents the labels of the samples for a given data matrix, when these are available.

For the creation of the figure, we will utilize the GridSpec class of the matplotlib package. This way, we can create a figure with two subplots. The first subplot is the axes dendro, and the second subplot is the axes labels.

# Set figure size
fig = plt.figure(figsize=(6, 4.5))
# Create a grid with 1 column and 25 rows. The dendrogram takes the
# first 24 rows, so it is much bigger than the last row, which
# shows the real labels.
gs = gridspec.GridSpec(25, 1, fig, wspace=0.01, hspace=0.2)

# Dendrogram subplot
dendro = plt.subplot(gs[0:24, 0:1]) # use the first 24 rows of the
                                    # grid for the dendro axes
hierarchy.set_link_color_palette(color_list) # use the color palette we created
den_data = viz.dendrogram_visualization(
    clustering,
    no_labels=True, # important: do not print labels on the dendro axes
    ax=dendro,
)
dendro.axis("off") # Do not show axis data around the figure

# color the purity line
colors = np.take(color_list, y[den_data["leaves"]]) # sort the samples the
                                                    # same way as in the
                                                    # dendrogram, then apply
                                                    # the color list

# create the purity line
labels = plt.subplot(gs[24:25, 0:1]) # use the last row of the
                                     # grid for the labels axes
labels.scatter(  # labels subplot creation with the use of a scatter plot
    np.arange(X.shape[0]),
    np.zeros(X.shape[0]),
    s=65,
    c=colors,
    marker="|",
)
labels.axis([0, X.shape[0], -0.05, 0.05]) # set the axis for the scatter plot
labels.axis("off") # Do not show axis data around the figure

plt.show()
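
The key step in the code above is reordering the true labels by the leaf order of the dendrogram, so that each colored tick sits under its own sample. The following sketch isolates that step on toy data. It uses SciPy's dendrogram() with no_plot=True as a stand-in, under the assumption that the dictionary returned by viz.dendrogram_visualization exposes the same "leaves" entry; the toy data, labels, and palette are illustrative.

```python
import numpy as np
from scipy.cluster import hierarchy

# Four 1-D samples forming two well-separated groups
toy = np.array([[0.0], [10.0], [0.5], [10.5]])
toy_labels = np.array([0, 1, 0, 1])
palette = ["#1f77b4", "#ff7f0e"]

# Build a stand-in tree and read the left-to-right order of its leaves
Z = hierarchy.linkage(toy, method="single")
leaves = hierarchy.dendrogram(Z, no_plot=True)["leaves"]

# Reorder the labels to follow the leaves, then map each label to a color
ordered_labels = toy_labels[leaves]
ordered_colors = np.take(palette, ordered_labels)
print(leaves, ordered_labels, ordered_colors)
```

After the reordering, samples with the same label sit next to each other, which is exactly what makes the colored line readable as a visual check of the clustering.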

In this way, we can investigate the correspondence between the labels and the clusters retrieved by the dePDDP algorithm.
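
The visual check can be complemented with a number. One common choice is cluster purity: the fraction of samples that belong to the majority class of their cluster. The sketch below is an addition to the original example, not something HiPart is shown to provide here; the function name purity and the toy arrays are illustrative.

```python
import numpy as np


def purity(true_labels, cluster_labels):
    """Fraction of samples that match the majority class of their cluster."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    majority_total = 0
    for cluster in np.unique(cluster_labels):
        # True labels of the samples assigned to this cluster
        members = true_labels[cluster_labels == cluster]
        _, counts = np.unique(members, return_counts=True)
        majority_total += counts.max()
    return majority_total / true_labels.shape[0]


toy_true = np.array([0, 0, 0, 1, 1, 1])
toy_pred = np.array([0, 0, 1, 1, 1, 1])
print(purity(toy_true, toy_pred))
```

For the example in this document, the call would be purity(y, clustering.labels_). A value of 1.0 means every cluster contains samples of a single class.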