Wiki

Available Trees

The SHiP framework supports a variety of ultrametric similarity tree types, which form the basis for hierarchy construction. Each tree type encapsulates a specific strategy for representing pairwise similarity across the dataset.

The enum below lists all available tree types supported in the framework:

Warning

doxygenenum: Cannot find enum “UltrametricTreeType” in doxygen xml output for project “SHiP” from directory: ../doxygen/xml


[1] Based on the paper:
Anna Beer, Andrew Draganov, Ellen Hohma, Philipp Jahn, Christian M.M. Frey, Ira Assent.
Connecting the Dots – Density-Connectivity Distance unifies DBSCAN, k-Center and Spectral Clustering.
Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2023.

[2] Adapted from the original codebase: ICDE21-HST and is based on the paper:
Yuxiang Zeng, Yongxin Tong, Lei Chen.
HST+: An Efficient Index for Embedding Arbitrary Metric Spaces.
IEEE International Conference on Data Engineering (ICDE), 2021.

[3] Implemented in the mlpack 4 C++ machine learning library:
Ryan R. Curtin, Marcus Edel, Omar Shrit, Shubham Agrawal, Suryoday Basak, James J. Balamuta, Ryan Birmingham, Kartik Dutt, Dirk Eddelbuettel, Rishabh Garg, Shikhar Jaiswal, Aakash Kaushik, Sangyeon Kim, Anjishnu Mukherjee, Nanubala Gnana Sai, Nippun Sharma, Yashwant Singh Parihar, Roshan Swain, Conrad Sanderson.
mlpack 4: a fast, header-only C++ machine learning library
Journal of Open Source Software, 2023.

Available Hierarchies

Once a similarity tree is built, it can be transformed into a clustering hierarchy. These hierarchies generalize well-known clustering paradigms such as \(k\)-means and \(k\)-median, enabling fine-grained control over how cluster centroids and merges are computed.

We support all possible \((k,z)\)-hierarchies, allowing flexibility in choosing the most suitable hierarchy for a given dataset.

  • \(z = 0\)\(k\)-center (actually in theory: \(z = ∞\), but in this implementation we use 0 for \(∞\))

  • \(z = 1\)\(k\)-median

  • \(z = 2\)\(k\)-means

Available Partitioning Methods

The SHiP framework provides multiple partitioning methods to extract flat clusterings from hierarchies. Each method corresponds to a specific partitioning objective or heuristic, such as fixed-\(k\), elbow detection, and also function-based selection strategies, as e.g., the HDBSCAN stability function.

The enum below shows all available partitioning strategies:

enum class PartitioningMethod

Available methods for partitioning a hierarchy into flat clusters.

Values:

enumerator K

Partition into a fixed number of clusters \(k\).

**Parameters:**
- `k` (default: `2`): Number of clusters the data should be partitioned into.

enumerator Elbow

Use the Elbow method to determine the optimal number of clusters.

enumerator Threshold

Cut the hierarchy at a fixed similarity threshold.

**Parameters:**
- `k` (default: `2`): Number of clusters the data should be partitioned into.

enumerator ThresholdElbow

Combine threshold cutting and the Elbow method for more adaptive partitioning.

enumerator QCoverage

Use \(q\)-coverage to ensure that a given proportion of mass is retained in the clustering.

**Parameters:**
- `k` (default: `2`): Number of clusters the data should be partitioned into.
- `min_cluster_size` (default: `5`): Minimum number of points which a cluster should contain.

enumerator QCoverageElbow

Combine \(q\)-coverage and the Elbow method for adaptive and coverage-aware partitioning.

**Parameters:**
- `min_cluster_size` (default: `5`): Minimum number of points which a cluster should contain.

enumerator QStem

Use the \(q\)-stem criterion to extract partitions based on internal branch structure.

**Parameters:**
- `k` (default: `2`): Number of clusters the data should be partitioned into.
- `min_cluster_size` (default: `5`): Minimum number of points which a cluster should contain.

enumerator QStemElbow

Combine \(q\)-stem with Elbow-based refinement for improved robustness.

**Parameters:**
- `min_cluster_size` (default: `5`): Minimum number of points which a cluster should contain.

enumerator LcaNoiseElbow

Elbow method that incorporates noise suppression using least common ancestors (LCA).

enumerator LcaNoiseElbowNoTriangle

Noise-aware Elbow method that excludes triangular merging to improve sharpness.

enumerator MedianOfElbows

Take the median of multiple Elbow criteria to find a more stable clustering.

**Parameters:**
- `elbow_start_z` (default: `1`): Range start to use for elbow method.
- `elbow_end_z` (default: `5`): Range end to use for elbow method.

enumerator MeanOfElbows

Take the mean of multiple Elbow criteria to smooth over noise in the hierarchy.

**Parameters:**
- `elbow_start_z` (default: `1`): Range start to use for elbow method.
- `elbow_end_z` (default: `5`): Range end to use for elbow method.

enumerator Stability

Optimize cluster stability across multiple resolutions to find a robust partitioning, see also HDBSCAN.

**Parameters:**
- `min_cluster_size` (default: `5`): Minimum number of points which a cluster should contain.

enumerator NormalizedStability

Use normalized cluster stability to allow comparisons across hierarchies of different sizes.

**Parameters:**
- `min_cluster_size` (default: `5`): Minimum number of points which a cluster should contain.


Customization and Composition

Each of the above components — trees, hierarchies, and partitioning strategies — can be independently selected and composed. This enables flexible experimentation and tailored clustering behavior for a wide range of data types and analysis goals.

Example:

from SHiP import SHiP
from SHiP.ultrametric_tree import UltrametricTreeType
from SHiP.partitioning import PartitioningMethod

ship = SHiP(data=my_data, treeType=UltrametricTreeType.DCTree)
labels = ship.fit_predict(hierarchy=2, partitioningMethod=PartitioningMethod.Elbow)