Wiki
Available Trees
The SHiP framework supports a variety of ultrametric similarity tree types, which form the basis for hierarchy construction. Each tree type encapsulates a specific strategy for representing pairwise similarity across the dataset.
The enum below lists all available tree types supported in the framework:
Warning
doxygenenum: Cannot find enum “UltrametricTreeType” in doxygen xml output for project “SHiP” from directory: ../doxygen/xml
Available Hierarchies
Once a similarity tree is built, it can be transformed into a clustering hierarchy. These hierarchies generalize well-known clustering paradigms such as \(k\)-means and \(k\)-median, enabling fine-grained control over how cluster centroids and merges are computed.
We support all possible \((k,z)\)-hierarchies, allowing flexibility in choosing the most suitable hierarchy for a given dataset.
\(z = 0\) → \(k\)-center (actually in theory: \(z = ∞\), but in this implementation we use 0 for \(∞\))
\(z = 1\) → \(k\)-median
\(z = 2\) → \(k\)-means
…
Available Partitioning Methods
The SHiP framework provides multiple partitioning methods to extract flat clusterings from hierarchies. Each method corresponds to a specific partitioning objective or heuristic, such as fixed-\(k\), elbow detection, and also function-based selection strategies, as e.g., the HDBSCAN stability function.
The enum below shows all available partitioning strategies:
-
enum class PartitioningMethod
Available methods for partitioning a hierarchy into flat clusters.
Values:
-
enumerator K
Partition into a fixed number of clusters \(k\).
**Parameters:** - `k` (default: `2`): Number of clusters the data should be partitioned into.
-
enumerator Elbow
Use the Elbow method to determine the optimal number of clusters.
-
enumerator Threshold
Cut the hierarchy at a fixed similarity threshold.
**Parameters:** - `k` (default: `2`): Number of clusters the data should be partitioned into.
-
enumerator ThresholdElbow
Combine threshold cutting and the Elbow method for more adaptive partitioning.
-
enumerator QCoverage
Use \(q\)-coverage to ensure that a given proportion of mass is retained in the clustering.
**Parameters:** - `k` (default: `2`): Number of clusters the data should be partitioned into. - `min_cluster_size` (default: `5`): Minimum number of points which a cluster should contain.
-
enumerator QCoverageElbow
Combine \(q\)-coverage and the Elbow method for adaptive and coverage-aware partitioning.
**Parameters:** - `min_cluster_size` (default: `5`): Minimum number of points which a cluster should contain.
-
enumerator QStem
Use the \(q\)-stem criterion to extract partitions based on internal branch structure.
**Parameters:** - `k` (default: `2`): Number of clusters the data should be partitioned into. - `min_cluster_size` (default: `5`): Minimum number of points which a cluster should contain.
-
enumerator QStemElbow
Combine \(q\)-stem with Elbow-based refinement for improved robustness.
**Parameters:** - `min_cluster_size` (default: `5`): Minimum number of points which a cluster should contain.
-
enumerator LcaNoiseElbow
Elbow method that incorporates noise suppression using least common ancestors (LCA).
-
enumerator LcaNoiseElbowNoTriangle
Noise-aware Elbow method that excludes triangular merging to improve sharpness.
-
enumerator MedianOfElbows
Take the median of multiple Elbow criteria to find a more stable clustering.
**Parameters:** - `elbow_start_z` (default: `1`): Range start to use for elbow method. - `elbow_end_z` (default: `5`): Range end to use for elbow method.
-
enumerator MeanOfElbows
Take the mean of multiple Elbow criteria to smooth over noise in the hierarchy.
**Parameters:** - `elbow_start_z` (default: `1`): Range start to use for elbow method. - `elbow_end_z` (default: `5`): Range end to use for elbow method.
-
enumerator Stability
Optimize cluster stability across multiple resolutions to find a robust partitioning, see also HDBSCAN.
**Parameters:** - `min_cluster_size` (default: `5`): Minimum number of points which a cluster should contain.
-
enumerator NormalizedStability
Use normalized cluster stability to allow comparisons across hierarchies of different sizes.
**Parameters:** - `min_cluster_size` (default: `5`): Minimum number of points which a cluster should contain.
-
enumerator K
Customization and Composition
Each of the above components — trees, hierarchies, and partitioning strategies — can be independently selected and composed. This enables flexible experimentation and tailored clustering behavior for a wide range of data types and analysis goals.
Example:
from SHiP import SHiP
from SHiP.ultrametric_tree import UltrametricTreeType
from SHiP.partitioning import PartitioningMethod
ship = SHiP(data=my_data, treeType=UltrametricTreeType.DCTree)
labels = ship.fit_predict(hierarchy=2, partitioningMethod=PartitioningMethod.Elbow)