Package contents

HiPart: Hierarchical divisive clustering toolbox

HiPart is a package that implements hierarchical divisive clustering algorithms. Within this family, its specialty is high-performance algorithms for high-dimensional big data. Its execution principles are similar to those of the scikit-learn package. It also provides two types of static visualizations for all the algorithms in the package, along with linkage generation for the divisive hierarchical clustering structure. Finally, the package provides an interactive visualization for manipulating the split-point of each split that the PDDP-based algorithms generate during the clustering process.

Clustering module

Implementation of the clustering algorithms, members of the HiPart package.

@author Panagiotis Anagnostou @author Nicos Pavlidis

class HiPart.clustering.BisectingKmeans(max_clusters_number=100, min_sample_split=5, random_state=None)[source]

Bases: Partition

Class BisectingKmeans. It executes the bisecting k-Means algorithm.

References

Savaresi, S. M., & Boley, D. L. (2001, April). On the performance of bisecting K-means and PDDP. In Proceedings of the 2001 SIAM International Conference on Data Mining (pp. 1-14). Society for Industrial and Applied Mathematics.

Parameters:
  • max_clusters_number (int, (optional)) – Desired maximum number of clusters for the algorithm.

  • min_sample_split (int, (optional)) – The minimum number of points needed in a cluster for a split to occur.

  • random_state (int, (optional)) – The random seed fed in the k-Means algorithm.

output_matrix

Model’s step by step execution output.

Type:

numpy.ndarray

labels_

Extracted clusters from the algorithm.

Type:

numpy.ndarray

tree

The object which contains all the information about the execution of the bisecting k-Means algorithm.

Type:

treelib.Tree

samples_number

The number of samples contained in the data.

Type:

int

fit_predict(X)

Returns the results of the fit method in the form of the predicted cluster labels.

Parameters:

X (numpy.ndarray) – Data matrix with the samples on the rows and the variables on the columns. If the distance_matrix is True then X should be a square distance matrix.

Returns:

labels_ – Extracted clusters from the algorithm.

Return type:

numpy.ndarray

calculate_node_data(indices, key)[source]

Execution of the binary k-Means algorithm on the samples indicated by the indices of the data matrix. The two resulting clusters become the two new clusters if the leaf is chosen to be split. The splitting criterion is also calculated.

Parameters:
  • indices (numpy.ndarray) – The indices of the samples in the original data matrix.

  • key (int) – The value of the color for each node.

Returns:

data – The necessary data for each node, such as the splitting point.

Return type:

dict
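The splitting step described above can be sketched with scikit-learn's KMeans. This is an illustration of the idea only, not HiPart's internal implementation, and the names in it are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic groups in four dimensions.
rng = np.random.default_rng(0)
data_matrix = rng.normal(size=(60, 4))
data_matrix[30:] += 5.0

# The leaf currently under consideration holds these sample indices.
indices = np.arange(60)

# Binary (2-cluster) k-means on the leaf's samples.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
child_labels = km.fit_predict(data_matrix[indices])

# The two candidate children, as index sets into the original data matrix.
left = indices[child_labels == 0]
right = indices[child_labels == 1]
print(len(left) + len(right))  # 60
```
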

fit(X)[source]

Execute the BisectingKmeans algorithm and return all the execution data in the form of a BisectingKmeans class object.

Parameters:

X (numpy.ndarray) – Data matrix with the samples on the rows and the variables on the columns.

Returns:

A BisectingKmeans class type object, with complete results on the algorithm’s analysis.

Return type:

self

split_function(tree, selected_node)[source]

Split the indicated node by clustering the data with a binary k-means clustering algorithm.

Because Python passes objects by reference, this function does not need a return statement.

Parameters:
  • tree (treelib.tree.Tree) – The tree built by the BisectingKmeans algorithm in order to cluster the input data.

  • selected_node (int) – The numerical identifier of the tree node that is about to be split.

Returns:

None – This function does not return anything. Its results propagate through Python’s pass-by-reference semantics.

class HiPart.clustering.DePDDP(decomposition_method='pca', max_clusters_number=100, bandwidth_scale=0.5, percentile=0.1, min_sample_split=5, visualization_utility=True, distance_matrix=False, **decomposition_args)[source]

Bases: Partition

Class dePDDP. It executes the dePDDP algorithm.

References

Tasoulis, S. K., Tasoulis, D. K., & Plagianakos, V. P. (2010). Enhancing principal direction divisive clustering. Pattern Recognition, 43(10), 3391- 3411.

Parameters:
  • decomposition_method (str, (optional)) – One of the (‘pca’, ‘kpca’, ‘ica’, ‘tsne’) supported decomposition methods used as kernel for the dePDDP algorithm.

  • max_clusters_number (int, (optional)) – Desired maximum number of clusters for the dePDDP algorithm to find.

  • bandwidth_scale (float, (optional)) – Standard deviation scaler for the density approximation.

  • percentile (float, (optional)) – The percentile distance from the dataset’s edge in which a split can not occur. [0,0.5) values are allowed.

  • min_sample_split (int, (optional)) – The minimum number of points needed in a cluster for a split to occur.

  • visualization_utility (bool, (optional)) – If True, generate the data needed by the visualization utilities of the package; otherwise, the split_visualization and interactive_visualization of the package cannot be created. The ‘tsne’ decomposition method does not support visualization because it would affect the correct execution of the dePDDP algorithm.

  • distance_matrix (bool, (optional)) – If True, the input data are treated as a distance matrix rather than a data matrix. The distance matrix is a square matrix with the samples on both the rows and the columns. It is used only in conjunction with the ‘mds’ decomposition method and no other of the supported decomposition methods.

  • **decomposition_args – Arguments for each of the decomposition methods (“decomposition.PCA” as “pca”, “decomposition.KernelPCA” as “kpca”, “decomposition.FastICA” as “ica”, “manifold.TSNE” as “tsne”) utilized by the HiPart package, as documented in the scikit-learn package, from which they are implemented.

output_matrix

Model’s step by step execution output.

Type:

numpy.ndarray

labels_

Extracted clusters from the algorithm.

Type:

numpy.ndarray

tree

The object which contains all the information about the execution of the dePDDP algorithm.

Type:

treelib.Tree

samples_number

The number of samples contained in the data.

Type:

int

fit_predict(X)

Returns the results of the fit method in the form of the predicted cluster labels.

Parameters:

X (numpy.ndarray) – Data matrix with the samples on the rows and the variables on the columns. If the distance_matrix is True then X should be a square distance matrix.

Returns:

labels_ – Extracted clusters from the algorithm.

Return type:

numpy.ndarray

calculate_node_data(indices, key)[source]

Calculation of the projections onto the Principal Components using the “Principal Component Analysis”, “Kernel Principal Component Analysis”, “Independent Component Analysis”, or “t-SNE” methods.

With the incorporation of the “Multi-Dimensional Scaling” method, the function can also be used on distance matrices. The distance matrix is used only in conjunction with the “mds” decomposition method, so the configuration of the parameters is checked each time the function is executed.

Determination of the projection’s density and search for its local minima. The lowest local minimum point within the allowed sample percentiles of the projection’s density representation is selected as the split point.

This function implements the second stopping criterion of the algorithm.

Parameters:
  • indices (numpy.ndarray) – The index of the samples in the original data matrix.

  • key (int) – The value of the color for each node.

Returns:

data – The necessary data for each node, such as the splitting point.

Return type:

dict
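The split rule described above (the lowest local minimum of the projection's density inside the allowed percentile band) can be sketched as follows. This is a hedged illustration with NumPy, SciPy, and scikit-learn, not HiPart's implementation:

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.decomposition import PCA

# Two synthetic groups in five dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-4.0, 1.0, size=(100, 5)),
               rng.normal(4.0, 1.0, size=(100, 5))])

# Project onto the first principal component.
proj = PCA(n_components=1).fit_transform(X).ravel()

# Density of the projection on a grid restricted to the allowed
# percentile band [percentile, 1 - percentile].
percentile = 0.1
lo, hi = np.quantile(proj, [percentile, 1 - percentile])
grid = np.linspace(lo, hi, 512)
density = gaussian_kde(proj)(grid)

# Interior local minima of the density; the lowest one is the split point.
interior = (density[1:-1] < density[:-2]) & (density[1:-1] < density[2:])
minima = np.where(interior)[0] + 1
split_point = grid[minima[np.argmin(density[minima])]]
```
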

fit(X)[source]

Execute the dePDDP algorithm and return all the execution data in the form of a dePDDP class object.

Parameters:

X (numpy.ndarray) – Data matrix with the samples on the rows and the variables on the columns. If the distance_matrix is True then X should be a square distance matrix.

Returns:

A dePDDP class type object, with complete results on the algorithm’s analysis.

Return type:

self

class HiPart.clustering.IPDDP(decomposition_method='pca', max_clusters_number=100, percentile=0.1, min_sample_split=5, visualization_utility=True, distance_matrix=False, **decomposition_args)[source]

Bases: Partition

Class IPDDP. It executes the iPDDP algorithm.

References

Tasoulis, S. K., Tasoulis, D. K., & Plagianakos, V. P. (2010). Enhancing principal direction divisive clustering. Pattern Recognition, 43(10), 3391- 3411.

Parameters:
  • decomposition_method (str, (optional)) – One of the (‘pca’, ‘kpca’, ‘ica’, ‘tsne’) supported decomposition methods used as kernel for the iPDDP algorithm.

  • max_clusters_number (int, (optional)) – Desired maximum number of clusters for the algorithm.

  • percentile (float, (optional)) – The percentile distance from the dataset’s edge in which a split can not occur. [0,0.5) values are allowed.

  • min_sample_split (int, (optional)) – The minimum number of points needed in a cluster for a split to occur.

  • visualization_utility (bool, (optional)) – If True, generate the data needed by the visualization utilities of the package; otherwise, the split_visualization and interactive_visualization of the package cannot be created. The ‘tsne’ decomposition method does not support visualization because it would affect the correct execution of the iPDDP algorithm.

  • distance_matrix (bool, (optional)) – If True, the input data are treated as a distance matrix rather than a data matrix. The distance matrix is a square matrix with the samples on both the rows and the columns. It is used only in conjunction with the ‘mds’ decomposition method and no other of the supported decomposition methods.

  • **decomposition_args – Arguments for each of the decomposition methods (“decomposition.PCA” as “pca”, “decomposition.KernelPCA” as “kpca”, “decomposition.FastICA” as “ica”, “manifold.TSNE” as “tsne”) utilized by the HiPart package, as documented in the scikit-learn package, from which they are implemented.

output_matrix

Model’s step by step execution output.

Type:

numpy.ndarray

labels_

Extracted clusters from the algorithm.

Type:

numpy.ndarray

tree

The object which contains all the information about the execution of the iPDDP algorithm.

Type:

treelib.Tree

samples_number

The number of samples contained in the data.

Type:

int

fit_predict(X)

Returns the results of the fit method in the form of the predicted cluster labels.

Parameters:

X (numpy.ndarray) – Data matrix with the samples on the rows and the variables on the columns. If the distance_matrix is True then X should be a square distance matrix.

Returns:

labels_ – Extracted clusters from the algorithm.

Return type:

numpy.ndarray

calculate_node_data(indices, key)[source]

Calculation of the projections onto the Principal Components using the “Principal Component Analysis”, “Kernel Principal Component Analysis”, “Independent Component Analysis”, or “t-SNE” methods.

Determination of the maximum distance between two consecutive points of the projection, which is chosen as the split-point for this node.

This function implements the second stopping criterion of the algorithm.

Parameters:
  • indices (ndarray of shape (n_samples,)) – The index of the samples in the original data matrix.

  • key (int) – The value of the color for each node.

Returns:

data – The necessary data for each node, such as the splitting point.

Return type:

dict
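The largest-gap split described above can be sketched directly. Again, this is an illustration with NumPy and scikit-learn, not HiPart's implementation:

```python
import numpy as np
from sklearn.decomposition import PCA

# Two synthetic groups in four dimensions.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-3.0, 0.5, size=(50, 4)),
               rng.normal(3.0, 0.5, size=(50, 4))])

# Sorted projections onto the first principal component.
proj = np.sort(PCA(n_components=1).fit_transform(X).ravel())

# Only consider gaps inside the allowed percentile band.
percentile = 0.1
lo, hi = np.quantile(proj, [percentile, 1 - percentile])
candidates = proj[(proj >= lo) & (proj <= hi)]

# The split point sits in the middle of the largest consecutive gap.
gaps = np.diff(candidates)
i = int(np.argmax(gaps))
split_point = (candidates[i] + candidates[i + 1]) / 2
```
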

fit(X)[source]

Execute the iPDDP algorithm and return all the execution data in the form of an IPDDP class object.

Parameters:

X (numpy.ndarray) – Data matrix with the samples on the rows and the variables on the columns.

Returns:

An iPDDP class type object, with complete results on the algorithm’s analysis.

Return type:

self

class HiPart.clustering.KMPDDP(decomposition_method='pca', max_clusters_number=100, min_sample_split=15, visualization_utility=True, distance_matrix=False, random_state=None, **decomposition_args)[source]

Bases: Partition

Class KMPDDP. It executes the kMeans-PDDP algorithm.

References

Zeimpekis, D., & Gallopoulos, E. (2008). Principal direction divisive Partition with kernels and k-means steering. In Survey of Text Mining II (pp. 45-64). Springer, London.

Parameters:
  • decomposition_method (str, (optional)) – One of the (‘pca’, ‘kpca’, ‘ica’, ‘tsne’) supported decomposition methods used as kernel for the kMeans-PDDP algorithm.

  • max_clusters_number (int, (optional)) – Desired maximum number of clusters for the algorithm.

  • min_sample_split (int, (optional)) – The minimum number of points needed in a cluster for a split to occur.

  • visualization_utility (bool, (optional)) – If True, generate the data needed by the visualization utilities of the package; otherwise, the split_visualization and interactive_visualization of the package cannot be created. The ‘tsne’ decomposition method does not support visualization because it would affect the correct execution of the kMeans-PDDP algorithm.

  • distance_matrix (bool, (optional)) – If True, the input data are treated as a distance matrix rather than a data matrix. The distance matrix is a square matrix with the samples on both the rows and the columns. It is used only in conjunction with the ‘mds’ decomposition method and no other of the supported decomposition methods.

  • random_state (int, (optional)) – The random seed fed in the k-Means algorithm

  • **decomposition_args – Arguments for each of the decomposition methods (“decomposition.PCA” as “pca”, “decomposition.KernelPCA” as “kpca”, “decomposition.FastICA” as “ica”, “manifold.TSNE” as “tsne”) utilized by the HiPart package, as documented in the scikit-learn package, from which they are implemented.

output_matrix

Model’s step by step execution output.

Type:

numpy.ndarray

labels_

Extracted clusters from the algorithm.

Type:

numpy.ndarray

tree

The object which contains all the information about the execution of the kMeans-PDDP algorithm.

Type:

treelib.Tree

samples_number

The number of samples contained in the data.

Type:

int

fit_predict(X)

Returns the results of the fit method in the form of the predicted cluster labels.

Parameters:

X (numpy.ndarray) – Data matrix with the samples on the rows and the variables on the columns. If the distance_matrix is True then X should be a square distance matrix.

Returns:

labels_ – Extracted clusters from the algorithm.

Return type:

numpy.ndarray

calculate_node_data(indices, key)[source]

Calculation of the projections onto the Principal Components using the “Principal Component Analysis”, “Kernel Principal Component Analysis”, “Independent Component Analysis”, or “t-SNE” methods.

Determination of the projection’s clusters by utilizing the binary k-means clustering algorithm.

This function implements the second stopping criterion of the algorithm.

Parameters:
  • indices (numpy.ndarray) – The index of the samples in the original data matrix.

  • key (int) – The value of the color for each node.

Returns:

data – The necessary data for each node, such as the splitting point.

Return type:

dict
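The projection-then-bisect step described above can be sketched as follows (an illustration with scikit-learn, not HiPart's implementation):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Two synthetic groups in six dimensions.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-3.0, 1.0, size=(40, 6)),
               rng.normal(3.0, 1.0, size=(40, 6))])

# Project onto the first principal component, then split the
# one-dimensional projections with binary k-means.
proj = PCA(n_components=1).fit_transform(X)       # shape (80, 1)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(proj)
print(sorted(set(labels.tolist())))  # [0, 1]
```
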

fit(X)[source]

Execute the kMeans-PDDP algorithm and return all the execution data in the form of a KMPDDP class object.

Parameters:

X (numpy.ndarray) – Data matrix with the samples on the rows and the variables on the columns.

Returns:

A KMPDDP class type object, with complete results on the algorithm’s analysis.

Return type:

self

class HiPart.clustering.MDH(max_clusters_number=100, max_iterations=10, k=2.3, percentile=0.1, min_sample_split=5, random_state=None)[source]

Bases: Partition

Class MDH. It executes the MDH algorithm.

References

Pavlidis, N. G., Hofmeyr, D. P., & Tasoulis, S. K. (2016). Minimum density hyperplanes. Journal of Machine Learning Research, 17 (156), 1-33.

Parameters:
  • max_clusters_number (int, optional) – Desired maximum number of clusters for the MDH algorithm to find.

  • max_iterations (int, optional) – Maximum number of iterations on the search for the minimum density hyperplane.

  • k (float, optional) – The multiple of the standard deviation within which a splitting hyperplane is allowed to exist. The default value is 2.3.

  • percentile (float, optional) – The percentile distance from the dataset’s edge in which a split can not occur. [0,0.5) values are allowed.

  • min_sample_split (int, optional) – The minimum number of points needed in a cluster for a split to occur.

  • random_state (int, optional) – The random seed to be used in the algorithm’s execution.

output_matrix

Model’s step by step execution output.

Type:

numpy.ndarray

labels_

Extracted clusters from the algorithm.

Type:

numpy.ndarray

tree

The object which contains all the information about the execution of the MDH algorithm.

Type:

treelib.Tree

samples_number

The number of samples contained in the data.

Type:

int

fit_predict(X)

Returns the results of the fit method in the form of the predicted cluster labels.

Parameters:

X (numpy.ndarray) – Data matrix with the samples on the rows and the variables on the columns. If the distance_matrix is True then X should be a square distance matrix.

Returns:

labels_ – Extracted clusters from the algorithm.

Return type:

numpy.ndarray

calculate_node_data(indices, key)[source]

Find a minimum density hyperplane to bisect the data. The minimum density hyperplane is found by minimizing the first derivative of the density function. This is made possible through the use of the “Sequential Quadratic Programming” (SQP) method, which simultaneously finds the optimal projection vector v and the minimum density point b.

This function implements the second stopping criterion of the algorithm.

Parameters:
  • indices (numpy.ndarray) – The index of the samples in the original data matrix.

  • key (int) – The value of the color for each node.

Returns:

data – The necessary data for each node, such as the splitting point.

Return type:

dict
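The hyperplane search described above can be caricatured as a direct minimization of the projected density over the pair (v, b). HiPart uses an SQP-based procedure with additional constraints (see the reference), so this NumPy/SciPy sketch illustrates the objective only:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gaussian_kde

# Two synthetic groups in the plane.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-3.0, 1.0, size=(60, 2)),
               rng.normal(3.0, 1.0, size=(60, 2))])

def projected_density(theta):
    """KDE density of the projected data at offset b, for theta = (v, b)."""
    v, b = theta[:-1], theta[-1]
    v = v / np.linalg.norm(v)          # keep v a unit projection vector
    proj = X @ v
    return gaussian_kde(proj)(np.array([b]))[0]

# Local search started between the two modes; MDH additionally constrains
# b to lie within k standard deviations of the projection's mean.
theta0 = np.array([1.0, 0.0, 0.0])
res = minimize(projected_density, theta0, method="Nelder-Mead")

v = res.x[:-1] / np.linalg.norm(res.x[:-1])
b = res.x[-1]
side = X @ v > b                       # bisect with {x : v @ x = b}
```
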

fit(X)[source]

Execute the MDH algorithm and return all the execution data in the form of a MDH class object.

Parameters:

X (numpy.ndarray) – Data matrix with the samples on the rows and the variables on the columns.

Returns:

An MDH class type object, with complete results on the algorithm’s analysis.

Return type:

self

class HiPart.clustering.PDDP(decomposition_method='pca', max_clusters_number=100, min_sample_split=5, visualization_utility=True, distance_matrix=False, **decomposition_args)[source]

Bases: Partition

Class PDDP. It executes the PDDP algorithm.

References

Boley, D. (1998). Principal direction divisive Partition. Data mining and knowledge discovery, 2(4), 325-344.

Parameters:
  • decomposition_method (str, (optional)) – One of the (‘pca’, ‘kpca’, ‘ica’, ‘tsne’) supported decomposition methods used as kernel for the PDDP algorithm.

  • max_clusters_number (int, (optional)) – Desired maximum number of clusters for the algorithm.

  • min_sample_split (int, (optional)) – The minimum number of points needed in a cluster for a split to occur.

  • visualization_utility (bool, (optional)) – If True, generate the data needed by the visualization utilities of the package; otherwise, the split_visualization and interactive_visualization of the package cannot be created. The ‘tsne’ decomposition method does not support visualization because it would affect the correct execution of the PDDP algorithm.

  • distance_matrix (bool, (optional)) – If True, the input data are treated as a distance matrix rather than a data matrix. The distance matrix is a square matrix with the samples on both the rows and the columns. It is used only in conjunction with the ‘mds’ decomposition method and no other of the supported decomposition methods.

  • **decomposition_args – Arguments for each of the decomposition methods (“decomposition.PCA” as “pca”, “decomposition.KernelPCA” as “kpca”, “decomposition.FastICA” as “ica”, “manifold.TSNE” as “tsne”) utilized by the HiPart package, as documented in the scikit-learn package, from which they are implemented.

output_matrix

Model’s step by step execution output.

Type:

numpy.ndarray

labels_

Extracted clusters from the algorithm.

Type:

numpy.ndarray

tree

The object which contains all the information about the execution of the PDDP algorithm.

Type:

treelib.Tree

samples_number

The number of samples contained in the data.

Type:

int

fit_predict(X)

Returns the results of the fit method in the form of the predicted cluster labels.

Parameters:

X (numpy.ndarray) – Data matrix with the samples on the rows and the variables on the columns. If the distance_matrix is True then X should be a square distance matrix.

Returns:

labels_ – Extracted clusters from the algorithm.

Return type:

numpy.ndarray

calculate_node_data(indices, key)[source]

Calculation of the projections onto the Principal Components using the “Principal Component Analysis”, “Kernel Principal Component Analysis”, “Independent Component Analysis”, or “t-SNE” methods.

The projection’s clusters are split at the median of the projected data.

This function implements the second stopping criterion of the algorithm.

Parameters:
  • indices (numpy.ndarray) – The index of the samples in the original data matrix.

  • key (int) – The value of the color for each node.

Returns:

data – The necessary data for each node, such as the splitting point.

Return type:

dict
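The median split described above is simple enough to sketch directly; again this is an illustration with scikit-learn's PCA, not HiPart's implementation:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))

# Project onto the first principal component and split at the median.
proj = PCA(n_components=1).fit_transform(X).ravel()
split_point = np.median(proj)

left = np.where(proj <= split_point)[0]
right = np.where(proj > split_point)[0]
print(len(left), len(right))  # 50 50
```
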

fit(X)[source]

Execute the PDDP algorithm and return all the execution data in the form of a PDDP class object.

Parameters:

X (numpy.ndarray) – Data matrix with the samples on the rows and the variables on the columns.

Returns:

A PDDP class type object, with complete results on the algorithm’s analysis.

Return type:

self

Interactive_visualization module

Interactive visualization module for the algorithm members of the HiPart package that utilise a decomposition method to project the data to one dimension in order to split them.

@author: Panagiotis Anagnostou

HiPart.interactive_visualization.Splitpoint_Manipulation_Callback(data, current_figure, split_number, maximum_number_splits, split_marks, splitpoint_position, splitpoint_minimum, splitpoint_max, splitpoint_marks, apply_button)[source]

Function triggered by the change in value on the Input elements. The triggers are:

  1. the change in the split of the data.

  2. the changes in the split-point.

  3. the apply button for the change to the new split-point (partial algorithm execution from the selected split).

The rest of the function’s inputs cannot trigger this callback, but their data are necessary for its execution.

Parameters:
  • splitpoint_marks

  • data (dict) – The paths of the temporary files created for the execution of the interactive visualization.

  • current_figure (dict) – A dictionary created from the plotly express object plots. It is used for the manipulation of the current figure.

  • split_number (int) – The value of the split to project (or is projected, depending on the callback context triggered state).

  • maximum_number_splits (int) – The number of splits there are in the model created/manipulated.

  • split_marks (dict) – The split number is the key of the dictionary and the assigned value is the value of the dictionary.

  • splitpoint_position (float) – The current position of the shape represents the split-point of the currently selected split, extracted from the split-point slider.

  • splitpoint_minimum (float) – The minimum value that can be assigned to the split-point, extracted from the split-point slider.

  • splitpoint_max (float) – The maximum value that can be assigned to the split-point, extracted from the split-point slider.

  • apply_button (int) – The number of clicks of the apply button (not needed in the function’s execution, but necessary for the callback definition).

Returns:

  • figure (dash.dcc.Graph) – A figure that can be integrated into Dash’s HTML components.

  • splitMax (float) – The new value for the maximum split number.

  • splitMarks (dict) – The updated marks the split slider can utilize as values.

  • splitpoint (float) – The newly generated split-point by the callback.

  • splitPMin (int) – The minimum value the split-point can take.

  • splitPMax (int) – The maximum value the split-point can take.

HiPart.interactive_visualization.app_layout(app, tmpFileNames)[source]

Basic interface creation for the interactive application. The given inputs let the user manage the application correctly.

Parameters:
  • app (dash.Dash) – The application we want to create the layout on.

  • tmpFileNames (dict) – A dictionary with the names (paths) of the temporary files needed for the execution of the algorithm.

HiPart.interactive_visualization.display_menu(pathname, data)[source]

Menu interchange callback.

Parameters:
  • pathname (str) – The new url pathname that the webpage needs to go to.

  • data (dict) – The paths of the temporary files created for the execution of the interactive visualization.

Returns:

div – A div containing all the visual components needed for each of the page’s components.

Return type:

dash.html

HiPart.interactive_visualization.main(inputData)[source]

The main function of the interactive visualization.

Parameters:

inputData (dePDDP or iPDDP or kM_PDDP or PDDP object) – The object to be visualized.

Returns:

obj – The manipulated object after the execution of the interactive visualization. (Currently not working correctly)

Return type:

dePDDP or iPDDP or kM_PDDP or PDDP object

Visualizations module

Implementation module for the static visualization of the algorithms implemented in the HiPart package.

@author: Panagiotis Anagnostou @author: Nicos Pavlidis

HiPart.visualizations.dendrogram_visualization(hipart_object, cmap='viridis', default_coloring=True, **dendrogram_parameters)[source]

Create a dendrogram visualization of the divisive clustering based on the HiPart algorithm’s execution. The characteristic of this dendrogram is that the depicted distance between clusters is not their actual distance; it represents the distance of each node from the base of the tree, also known as the leaves.

!! Important Note !! The “count_sort” parameter is set to True by default to preserve the hierarchical split structure of the tree. If the “count_sort” parameter is set to False, the dendrogram colors will not be assigned to the correct clusters.

Parameters:
  • hipart_object (DePDDP or IPDDP or KMPDDP or PDDP or MDH or BisectingKmeans) – The object member of the HiPart package that this function operates on.

  • cmap (string) – The name of the matplotlib color map to be used for the data visualization.

  • default_coloring (bool, optional) – If True, the dendrogram will be colored according to the default HiPart tree coloring, based on the clustering implemented by the package. If False, the dendrogram will be colored according to the default methodology used by the scipy.cluster.hierarchy.dendrogram function. Note that either way the “color_threshold” parameter can be changed.

  • **dendrogram_parameters (optional) – All the parameters the scipy.cluster.hierarchy.dendrogram function can take, except for the “color_threshold” parameter. That parameter is given a default threshold because, in the linkage, we do not calculate the actual distance of the found clusters but only their hierarchy.

Raises:

TypeError – If the hipart_object is not a member of the HiPart package, the function raises a TypeError. The error guards against the unexpected faults that could occur if the elements of the hipart_object are not correctly structured.

Returns:

dn – A dictionary of data structures created to render the dendrogram.

Return type:

dict

HiPart.visualizations.linkage(hipart_object)[source]

Create the linkage of the data based on the divisive hierarchical clustering for the members of the HiPart package. The linkage’s characteristic is that the cluster distance in the third column of the linkage is not the distance of the clusters from each other; it represents the distance of each node from the base of the tree, also known as the leaves.

Parameters:

hipart_object (DePDDP or IPDDP or KMPDDP or PDDP or MDH or BisectingKmeans) – The object member of the HiPart package that this function operates on.

Returns:

Z – The divisive clustering encoded as a linkage matrix.

Return type:

numpy.ndarray
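The returned linkage matrix follows SciPy's encoding (each row records one merge: the two cluster ids, a height, and a member count), so it can be passed to scipy.cluster.hierarchy.dendrogram directly. The sketch below uses a SciPy-computed linkage as a stand-in for HiPart's output, since both share the same format:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(6)
X = rng.normal(size=(12, 3))

# Stand-in for Z = HiPart.visualizations.linkage(hipart_object).
Z = linkage(X, method="ward")
print(Z.shape)  # (11, 4): n - 1 merges, 4 columns each

# no_plot=True returns the dendrogram's data structures without drawing.
dn = dendrogram(Z, no_plot=True)
print(sorted(int(leaf) for leaf in dn["ivl"]))  # leaves 0..11
```
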

HiPart.visualizations.mdh_visualization(mdh_obj, color_map='viridis')[source]

Create the visualization of each of the splits generated by the MDH algorithm, a member of the HiPart package. For each split, we visualize the data on the first two principal components, while the color of each sample is chosen depending on the cluster it belongs to. The colors are consistent across the separate splits, so each color represents the same cluster throughout.

Moreover, we add the line of the projection vector projected onto the first two principal components, as well as a dashed line representing the splitting hyperplane. The splitting hyperplane is a line that is perpendicular to the projection vector and passes through the minimum density point of the data along the projection direction. In addition, we add the density of the data along the projection vector in the form of a blue line. With this information one can validate the execution and the results of the algorithm, and examining this visualization can help with the parametrization of the algorithms.

Warning: the projection vector and the splitting hyperplane in the plot may not appear perpendicular to each other. This is because the projection vector is projected onto the first two principal components, and the viewing angle of the plot across those dimensions cannot be predicted. For the same reason, the dotted line representing the splitting hyperplane may appear misplaced relative to the data it splits.

Parameters:
  • mdh_obj (MDH) – The object member of the HiPart package that this function operates on.

  • color_map (string) – The name of the matplotlib color map to be used for the data visualization.

Raises:
  • TypeError – If the hipart_object is not a member of the HiPart package, an error is raised to guard against the unexpected errors that would occur if the elements of the object are not correctly structured.

  • ValueError – If the “visualization_utility” attribute of the imported object is False, some of the data needed for this visualization are not created, and this error is raised.

Returns:

plt – The created visualization by this function.

Return type:

pyplot (module)

HiPart.visualizations.split_visualization(hipart_object, color_map='viridis', mdh_split_plot=True)[source]

Create the visualization of each of the splits generated by one of the divisive hierarchical clustering algorithms, members of the HiPart package. For each split, we visualize the data on the first two principal components, while the color of each sample is chosen depending on the cluster it belongs to. The colors are consistent across the separate splits, so each color represents the same cluster throughout.

Depending on the input object, the visualization is enriched with additional information related to the algorithm that created it. For the:

1. DePDDP object, the visualization adds a marginal plot on the X-axis that represents the density of the data as extracted from the kernel density estimation for the first principal component. That is the information the dePDDP algorithm utilizes to split each cluster.

2. KMPDDP object, the visualization adds a marginal plot on the X-axis that represents the data as they are projected on the first principal component, with the addition of the centers the k-Means finds within its execution on each split. That is the information the kM_PDDP algorithm utilizes to split each cluster.

3. iPDDP along with PDDP objects, the visualization adds a marginal plot on the X-axis that represents the data as they are projected on the first principal component. That is the information the iPDDP algorithm utilizes to split each cluster.

4. BisectingKmeans object does not include additional information in the visualization. That is because of the nature of the algorithm it implements.

5. MDH object does not include additional information in the visualization. That is because of the nature of the algorithm it implements.

Finally, for all the objects the visualization adds the selected split point for each split by each algorithm. This way one can validate the execution and the results of each algorithm. Moreover, examining this visualization can help the parametrization of the algorithms.

Parameters:
  • hipart_object (DePDDP or IPDDP or KMPDDP or PDDP or MDH or BisectingKmeans) – The object member of the HiPart package that this function operates on.

  • color_map (string) – The name of the matplotlib color map to be used for the data visualization.

  • mdh_split_plot (bool) – If True, the visualization of the MDH object will include the projection vector and the splitting hyperplane visualization. See visualization.mdh_visualization for more information.

Raises:
  • TypeError – If the hipart_object is not a member of the HiPart package, an error is raised to guard against the unexpected errors that would occur if the elements of the object are not correctly structured.

  • ValueError – If the “visualization_utility” attribute of the imported object is False, some of the data needed for this visualization are not created, and this error is raised.

Returns:

plt – The created visualization by this function.

Return type:

pyplot (module)