Clustering metrics#
Clustering is a fundamental technique in unsupervised machine learning that groups similar data points together to uncover underlying patterns or structures in a dataset. To evaluate the performance and quality of clustering algorithms, various metrics are used to assess how well an algorithm has grouped the data. Among the most widely used are the Silhouette Score, the Davies-Bouldin Index, the Calinski-Harabasz Index, and the Adjusted Rand Index (ARI). Each of these metrics offers a unique perspective on the effectiveness of a clustering algorithm.
Silhouette Score#
The Silhouette Score is a metric used to calculate the goodness of a clustering technique for a given dataset. It measures how well-defined the clusters in the data are. The score is based on both the average distance between data points within the same cluster (cohesion) and the average distance between different clusters (separation).
The Silhouette Score for the \(i\)-th data point, denoted \(s(i)\), is given by:
\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \]
where:
\(a(i)\): the average distance from the \(i\)-th data point to the other data points in the same cluster. It represents cohesion.
\(b(i)\): the average distance from the \(i\)-th data point to the data points in the nearest cluster (i.e., the closest cluster that the data point is not a part of). It represents separation.
The overall Silhouette Score for the entire dataset is usually computed as the average of the \(s(i)\) values over all data points:
\[ S = \frac{1}{n} \sum_{i=1}^{n} s(i) \]
where:
\(n\): the number of data points in the dataset.
\(s(i)\): the Silhouette Score for the \(i\)-th data point.
Range of Values:
-1: indicates that the data point has been assigned to the wrong cluster. It implies that the data point is better matched to a neighboring cluster than to its own cluster. This scenario suggests a poor clustering result with substantial overlap between clusters.
0: indicates that the data point lies on the boundary between two clusters. The average distance to points in its own cluster is comparable to the average distance to points in the nearest cluster, suggesting overlapping or unclear cluster boundaries.
1: indicates that the data point is well matched to its own cluster and poorly matched to neighboring clusters. This is the ideal scenario, suggesting distinct, well-defined clusters with clear boundaries.
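To make these ranges concrete, here is a minimal sketch (not part of the original discussion; the make_blobs settings below are chosen purely for illustration) that compares the average Silhouette Score of well-separated blobs with that of heavily overlapping ones:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Well-separated blobs: the average silhouette should be close to 1.
X_tight, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.4, random_state=0)
labels_tight = KMeans(n_clusters=3, n_init="auto", random_state=0).fit_predict(X_tight)
print("well separated:", round(silhouette_score(X_tight, labels_tight), 3))

# Heavily overlapping blobs: the average silhouette drops toward 0.
X_mixed, _ = make_blobs(n_samples=300, centers=3, cluster_std=5.0, random_state=0)
labels_mixed = KMeans(n_clusters=3, n_init="auto", random_state=0).fit_predict(X_mixed)
print("overlapping:   ", round(silhouette_score(X_mixed, labels_mixed), 3))
Tightening or widening cluster_std moves the average score toward 1 or toward 0, mirroring the value ranges above.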
Calculating Silhouette Score in Python#
The Silhouette Score can be computed in Python with the scikit-learn library.
The code below generates a synthetic dataset of 500 data points drawn from four blobs using the make_blobs function (one blob is distinct while three lie close together). For each candidate number of clusters from 2 to 6, it fits KMeans, plots the per-sample silhouette values next to the clustered data and its centroids using Matplotlib, and prints the average silhouette score, a metric that quantifies the quality of the clustering; a higher score indicates better-defined clusters. In this example, the highest average silhouette score (about 0.70) is obtained for two clusters, because three of the four blobs overlap heavily.
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Generating the sample data from make_blobs
# This particular setting has one distinct cluster and 3 clusters placed close
# together.
X, y = make_blobs(
    n_samples=500,
    n_features=2,
    centers=4,
    cluster_std=1,
    center_box=(-10.0, 10.0),
    shuffle=True,
    random_state=1,
)  # For reproducibility

range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, n_init="auto", random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print(
        "For n_clusters =",
        n_clusters,
        "The average silhouette_score is :",
        silhouette_avg,
    )

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(
            np.arange(y_lower, y_upper),
            0,
            ith_cluster_silhouette_values,
            facecolor=color,
            edgecolor=color,
            alpha=0.7,
        )

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(
        X[:, 0], X[:, 1], marker=".", s=30, lw=0, alpha=0.7, c=colors, edgecolor="k"
    )

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(
        centers[:, 0],
        centers[:, 1],
        marker="o",
        c="white",
        alpha=1,
        s=200,
        edgecolor="k",
    )

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker="$%d$" % i, alpha=1, s=50, edgecolor="k")

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(
        "Silhouette analysis for KMeans clustering on sample data with n_clusters = %d"
        % n_clusters,
        fontsize=14,
        fontweight="bold",
    )

plt.show()
For n_clusters = 2 The average silhouette_score is : 0.7049787496083262
For n_clusters = 3 The average silhouette_score is : 0.5882004012129721
For n_clusters = 4 The average silhouette_score is : 0.6505186632729437
For n_clusters = 5 The average silhouette_score is : 0.561464362648773
For n_clusters = 6 The average silhouette_score is : 0.4857596147013469
Exercise
You are provided with a simple dataset consisting of two data points and their cluster assignments. Your task is to calculate the Silhouette Score for each data point. The Silhouette Score measures how similar a data point is to its own cluster compared to other clusters.
Data
Data point 1: [1, 2] (Cluster 1)
Data point 2: [4, 5] (Cluster 2)
Cluster Assignment
Data point 1 is assigned to Cluster 1
Data point 2 is assigned to Cluster 2
Note
You can check your answer or see the explanation below
Explanation!
Step 1: Calculate the within-cluster distance (a) for each data point
Each cluster contains only a single point, so there are no other points in the same cluster from which to compute \(a_1\) or \(a_2\).
Step 2: Calculate the nearest-cluster distance (b) for each data point
\( b_1 = d([1, 2], [4, 5]) = \sqrt{(1-4)^2 + (2-5)^2} = \sqrt{18} = 3\sqrt{2} \)
\( b_2 = d([4, 5], [1, 2]) = \sqrt{(4-1)^2 + (5-2)^2} = \sqrt{18} = 3\sqrt{2} \)
Step 3: Calculate the Silhouette Score for each data point
The usual formula \( s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \) cannot be applied when \(a_i\) is undefined. By the standard convention (also used by scikit-learn), a point that is the sole member of its cluster is assigned a Silhouette Score of 0:
\( s_1 = 0 \)
\( s_2 = 0 \)
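As a quick check of this result, here is a minimal NumPy sketch (illustrative, not part of the original exercise) that applies the silhouette formula directly and uses the convention, also followed by scikit-learn, that a point which is the sole member of its cluster receives a score of 0:
import numpy as np

def silhouette_point(i, X, labels):
    """Silhouette score of point i; singleton clusters score 0 by convention."""
    same = labels == labels[i]
    same[i] = False
    if not same.any():  # point i is the only member of its cluster
        return 0.0
    a = np.linalg.norm(X[same] - X[i], axis=1).mean()
    b = min(
        np.linalg.norm(X[labels == c] - X[i], axis=1).mean()
        for c in np.unique(labels)
        if c != labels[i]
    )
    return (b - a) / max(a, b)

X = np.array([[1.0, 2.0], [4.0, 5.0]])
labels = np.array([0, 1])
print([silhouette_point(i, X, labels) for i in range(len(X))])  # [0.0, 0.0]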
Davies-Bouldin Index#
The Davies-Bouldin index (DBI) is a validation metric used to evaluate clustering models. It is computed by averaging, over all clusters, each cluster's similarity to the cluster it most resembles, where similarity is quantified as the ratio of intra-cluster scatter to inter-cluster separation.
The Davies-Bouldin index for a given clustering can be computed with the following formula:
\[ DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \frac{S_i + S_j}{M_{ij}} \]
Where:
\(n\) is the total number of clusters.
\(S_i\) is the average scatter within cluster \(i\) (compactness).
\(M_{ij}\) is the separation between clusters \(i\) and \(j\).
Compactness (\(S_i\)):
\(S_i\) represents the average scatter of the points within cluster \(i\), i.e., the compactness of the cluster.
The similarity within a cluster is typically calculated using the average distance or dissimilarity between pairs of points within that cluster.
The specific measure of similarity (or dissimilarity) depends on the distance metric chosen for the clustering algorithm. For example, if Euclidean distance is used, \(S_i\) might be the average Euclidean distance between all pairs of points in cluster \(i\).
Separation (\(M_{ij}\)):
\(M_{ij}\) represents the separation between clusters \(i\) and \(j\).
As with compactness, the measure depends on the chosen distance metric. It is typically the distance between the centroids (center points) of the two clusters.
The larger \(M_{ij}\), the better, as it indicates that the two clusters are well separated.
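To connect the formula to code, the sketch below is an illustrative implementation under the assumptions above (Euclidean distances, \(S_i\) as the average distance to the cluster centroid, \(M_{ij}\) as the distance between centroids); the synthetic data and KMeans setup are placeholders. It compares the hand-rolled value with scikit-learn's davies_bouldin_score:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Toy data and labels (assumed setup, for illustration only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init="auto", random_state=42).fit_predict(X)

clusters = np.unique(labels)
centroids = np.array([X[labels == k].mean(axis=0) for k in clusters])

# S_i: average distance of the points of cluster i to its centroid (compactness)
S = np.array([
    np.linalg.norm(X[labels == k] - centroids[i], axis=1).mean()
    for i, k in enumerate(clusters)
])

# For each cluster, take the worst (largest) ratio (S_i + S_j) / M_ij over the
# other clusters, where M_ij is the distance between centroids, then average.
worst_ratios = [
    max(
        (S[i] + S[j]) / np.linalg.norm(centroids[i] - centroids[j])
        for j in range(len(clusters)) if j != i
    )
    for i in range(len(clusters))
]

print("manual DBI: ", np.mean(worst_ratios))
print("sklearn DBI:", davies_bouldin_score(X, labels))
The two printed values should agree, since scikit-learn uses the same centroid-based definitions.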
Lower vs. Higher DB Index Values#
The DBI is an internal validity index: it is computed from the data and the cluster assignments alone, without reference to ground-truth labels. Its best possible value is 0, and a lower DBI value indicates a better clustering solution.
Important
Higher DB index values correspond to poorer clustering solutions. This is because a higher DBI value indicates that the clusters are not well-separated and/or that the clusters are not compact.
However, a lower DB index value is desirable. It indicates that the clusters are well-separated and compact, which is often a good indication of a successful clustering solution.
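As a rough demonstration of this behaviour (the make_blobs parameters below are chosen only for illustration), compact, well-separated blobs should yield a noticeably lower DBI than heavily overlapping ones:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

for std in (0.5, 4.0):  # tight clusters vs. heavily overlapping clusters
    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=std, random_state=0)
    labels = KMeans(n_clusters=3, n_init="auto", random_state=0).fit_predict(X)
    print(f"cluster_std={std}: DBI = {davies_bouldin_score(X, labels):.3f}")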
Syntax#
The davies_bouldin_score function is available in the scikit-learn library's sklearn.metrics module. The recommended syntax is as follows:
sklearn.metrics.davies_bouldin_score(X, labels)
Implementation#
Below is a Python implementation of the DB index using the sklearn library.
The code conducts Agglomerative Hierarchical Clustering (AHC) on the Wine dataset and computes the Davies-Bouldin Index (DBI) to assess the quality of the clustering outcomes. Additionally, it generates a dendrogram to visually represent the hierarchical organization of the data.
from sklearn.datasets import load_wine
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import davies_bouldin_score
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
# Load the Wine dataset
wine = load_wine()
data = wine.data
# Perform Agglomerative Hierarchical Clustering
agg_clustering = AgglomerativeClustering(n_clusters=3)
agg_clustering.fit(data)
# Get the cluster labels
labels = agg_clustering.labels_
# Calculate Davies-Bouldin Index
db_index = davies_bouldin_score(data, labels)
print(f"Davies-Bouldin Index: {db_index}")
# Create a linkage matrix for dendrogram
linkage_matrix = linkage(data, method='ward')
# Plot the dendrogram
plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix, orientation="top", labels=labels, distance_sort='descending')
plt.title('Dendrogram')
plt.show()
Davies-Bouldin Index: 0.5357343073560251

Implementation 2#
The code performs Gaussian Mixture Model (GMM) clustering on a synthetic dataset generated using scikit-learn’s make_blobs function. It also calculates the Davies-Bouldin Index (DBI) to evaluate the clustering results and plots a scatter plot to visualize the clustering results.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.metrics import davies_bouldin_score
import matplotlib.pyplot as plt
# Generate a synthetic dataset with blobs
n_samples = 300
n_features = 2
n_clusters = 3
random_state = 42
data, ground_truth_labels = make_blobs(n_samples=n_samples, n_features=n_features, centers=n_clusters, random_state=random_state)
# Perform Gaussian Mixture Model (GMM) clustering
gmm = GaussianMixture(n_components=n_clusters, random_state=random_state)
gmm.fit(data)
# Get the cluster labels
labels = gmm.predict(data)
# Calculate Davies-Bouldin Index
db_index = davies_bouldin_score(data, labels)
print(f"Davies-Bouldin Index: {db_index}")
# Plot the clustering results in a scatter plot
plt.figure(figsize=(8, 6))
# Scatter plot the data points with different colors for each cluster
for i in range(n_clusters):
    cluster_data = data[labels == i]
    plt.scatter(cluster_data[:, 0], cluster_data[:, 1], label=f'Cluster {i + 1}')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Gaussian Mixture Model Clustering Results')
plt.legend()
plt.show()
Davies-Bouldin Index: 0.21231599538998416

Question#
Test your understanding of clustering evaluation metrics by identifying the key parameter assessed by the Davies-Bouldin Index for gauging the quality of clustering models.
Adjusted Rand Index (ARI)#
The Adjusted Rand Index (ARI) is a measure used in clustering and classification tasks to assess the similarity between two data partitions, taking into account chance agreement. It is an adjustment of the Rand Index and provides a normalized score that considers random clustering.
The Adjusted Rand Index is defined as:
\[ \text{ARI} = \frac{\text{RI} - E[\text{RI}]}{\max(\text{RI}) - E[\text{RI}]} \]
where \(\text{RI}\) is the Rand Index of the two partitions, \(E[\text{RI}]\) is its expected value under random labeling, and \(\max(\text{RI})\) is its maximum attainable value. An ARI of 1 means the two partitions are identical, a value near 0 means the agreement is no better than chance, and negative values mean the agreement is worse than chance.
The Rand Index itself is calculated using the formula:
\[ \text{RI} = \frac{a + b}{\binom{n}{2}} \]
where \(a\) is the number of pairs of points assigned to the same cluster in both partitions, \(b\) is the number of pairs assigned to different clusters in both partitions, and \(\binom{n}{2}\) is the total number of pairs among the \(n\) points.
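A minimal sketch (the toy labelings below are invented for illustration) highlights two practical properties: the ARI ignores how clusters are named, and the chance adjustment allows values at or below 0:
from sklearn.metrics import adjusted_rand_score

y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Identical partition with permuted label names: ARI = 1.0
print(adjusted_rand_score(y_true, [2, 2, 2, 0, 0, 0, 1, 1, 1]))

# One point moved to a different cluster: ARI drops below 1
print(adjusted_rand_score(y_true, [0, 0, 1, 1, 1, 1, 2, 2, 2]))

# A partition that splits every true cluster across all predicted clusters:
# the chance adjustment makes the ARI negative
print(adjusted_rand_score(y_true, [0, 1, 2, 0, 1, 2, 0, 1, 2]))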
Example: Digits Dataset#
The dataset looks like this:
import matplotlib.pyplot as plt
from sklearn import datasets
# Load the digits dataset
digits = datasets.load_digits()
# Display a few images and their labels
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for i in range(10):
    axes[i // 5, i % 5].imshow(digits.images[i], cmap=plt.cm.gray_r, interpolation='nearest')
    axes[i // 5, i % 5].set_title(f"Label: {digits.target[i]}")
    axes[i // 5, i % 5].axis('off')
plt.show()

Assume we are using the KMeans clustering algorithm with 10 clusters (digits 0-9), where X = digits.data and y_true = digits.target:
kmeans = KMeans(n_clusters=10, random_state=42)
y_pred = kmeans.fit_predict(X)
Calculate Adjusted Rand Index:
ari_score = adjusted_rand_score(y_true, y_pred)
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.palettes import Category10_10
from bokeh.io import output_notebook
# Load the digits dataset
digits = datasets.load_digits()
data = digits.data
target = digits.target
# Reduce dimensionality for visualization (using PCA)
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data)
# Perform K-means clustering
kmeans = KMeans(n_clusters=10, random_state=42)
clusters = kmeans.fit_predict(data)
# Calculate Adjusted Rand Index (ARI)
ari = adjusted_rand_score(target, clusters)
print(f"Adjusted Rand Index (ARI): {ari}")
# Create a Bokeh figure
output_notebook()
p = figure(title=f"Interactive Clustering of Digits Dataset\nARI: {ari:.4f}", width=800, height=600)
# Map cluster labels to colors
colors = [Category10_10[i] for i in clusters]
source = ColumnDataSource(data=dict(x=data_pca[:, 0], y=data_pca[:, 1], color=colors, digit=target, cluster=clusters))
# Add glyphs to the plot
scatter = p.scatter(x='x', y='y', size=8, color='color', legend_field='digit', source=source, fill_alpha=0.6, line_alpha=0.6)
# Customize plot aesthetics
p.title.text_font_size = '16pt'
p.legend.title = 'Digit'
p.legend.label_text_font_size = '10pt'
p.xaxis.axis_label = 'Principal Component 1'
p.yaxis.axis_label = 'Principal Component 2'
# Add tooltips with images
hover = HoverTool()
hover.tooltips = [("Digit", "@digit"), ("Cluster", "@cluster")]
hover.renderers = [scatter]
p.add_tools(hover)
# Show the interactive plot
show(p)
Calinski-Harabasz index#
The Calinski-Harabasz index (also known as the Variance Ratio Criterion) is calculated as the ratio of the between-cluster dispersion to the within-cluster dispersion, summed over all clusters (where dispersion is a sum of squared distances).
The Calinski-Harabasz index (CH) measures how well a clustering algorithm, such as K-Means, divides data into clusters. It helps assess the effectiveness of the algorithm in creating meaningful groups when a specific number of clusters is used.
How do you interpret Calinski-Harabasz index?#
A high Calinski-Harabasz index (CH) indicates improved clustering because it signifies that data points within each cluster are closely packed (denser), while the clusters themselves are well-separated from one another.
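As a quick illustration of this interpretation (the make_blobs parameters are chosen for demonstration only), dense, well-separated blobs should produce a much larger CH value than diffuse, overlapping ones:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

for std in (0.5, 3.0):  # dense, well-separated blobs vs. diffuse, overlapping blobs
    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=std, random_state=0)
    labels = KMeans(n_clusters=3, n_init="auto", random_state=0).fit_predict(X)
    print(f"cluster_std={std}: CH = {calinski_harabasz_score(X, labels):.1f}")
Note that the direction is the opposite of the Davies-Bouldin index: for CH, larger is better.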
In the upcoming section, we’ll delve into a detailed explanation of how to calculate the CH, complete with a few illustrative examples.
To proceed with this tutorial, make sure you have the following Python libraries installed: scikit-learn and matplotlib.
Calinski-Harabasz Index Explained#
In this section, we will break down each calculation step and offer insightful examples to enhance your comprehension of the formulas.
Initially, we calculate the inter-cluster dispersion, also known as the between-group sum of squares (BGSS).
In the CH context, inter-cluster dispersion gauges the weighted sum of squared distances between the centroids of clusters and the centroid of the entire dataset (barycenter).
The calculation for the between-group sum of squares is as follows:
\[ \text{BGSS} = \sum_{k=1}^{K} n_k \, \lVert C_k - C \rVert^2 \]
Here are the key terms used in the formula:
\(n_k\): the number of observations in cluster \(k\).
\(C_k\): the centroid of cluster \(k\).
\(C\): the centroid of the dataset (barycenter).
\(K\): the number of clusters.
Next, the second step involves calculating the intra-cluster dispersion, or the within-group sum of squares (WGSS).
In the CH context, intra-cluster dispersion assesses the sum of squared distances between each observation and the centroid of its corresponding cluster.
For each cluster \(k\), we compute \(\text{WGSS}_k\) as:
\[ \text{WGSS}_k = \sum_{i=1}^{n_k} \lVert X_{ik} - C_k \rVert^2 \]
Here are the key terms used in the formula:
\(n_k\): the number of observations in cluster \(k\).
\(X_{ik}\): the \(i\)-th observation of cluster \(k\).
\(C_k\): the centroid of cluster \(k\).
We then sum the individual within-group sums of squares:
\[ \text{WGSS} = \sum_{k=1}^{K} \text{WGSS}_k \]
Here are the key terms used in the formula:
\(\text{WGSS}_k\): the within-group sum of squares of cluster \(k\).
\(K\): the number of clusters.
Calculate Calinski-Harabasz Index#
The Calinski-Harabasz index is the ratio of the between-group dispersion to the within-group dispersion, each normalized by its degrees of freedom.
The calculation for the Calinski-Harabasz index is as follows:
\[ \text{CH} = \frac{\text{BGSS} / (K - 1)}{\text{WGSS} / (N - K)} \]
Here are the key terms used in the formula:
BGSS: between-group sum of squares (between-group dispersion).
WGSS: within-group sum of squares (within-group dispersion).
\(N\): the total number of observations.
\(K\): the total number of clusters.
From the above formula, we can conclude that larger values of the Calinski-Harabasz index indicate better clustering.
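The following sketch (illustrative only) computes BGSS, WGSS, and the resulting index from the formulas above and checks the result against sklearn.metrics.calinski_harabasz_score; it reuses the first two Iris features that appear in the example below:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import calinski_harabasz_score

X = load_iris().data[:, :2]  # the same two Iris features used in the example below
labels = KMeans(n_clusters=3, n_init="auto", random_state=30).fit_predict(X)

N, K = len(X), len(np.unique(labels))  # KMeans labels run from 0 to K - 1
barycenter = X.mean(axis=0)

# BGSS: weighted squared distances of the cluster centroids to the barycenter
bgss = sum(
    (labels == k).sum() * np.linalg.norm(X[labels == k].mean(axis=0) - barycenter) ** 2
    for k in range(K)
)
# WGSS: squared distances of each observation to its own cluster centroid
wgss = sum(
    (np.linalg.norm(X[labels == k] - X[labels == k].mean(axis=0), axis=1) ** 2).sum()
    for k in range(K)
)

ch_manual = (bgss / (K - 1)) / (wgss / (N - K))
print("manual CH: ", ch_manual)
print("sklearn CH:", calinski_harabasz_score(X, labels))
The two printed values should match, since scikit-learn uses the same dispersion definitions.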
Calinski-Harabasz Index Example in Python#
In this part, we’ll walk through an example of computing the Calinski-Harabasz index for a K-Means clustering algorithm in Python.
To begin, import the necessary dependencies:
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
import matplotlib.pyplot as plt
Feel free to utilize any dataset with the provided code. For simplicity, we'll use the built-in Iris dataset, focusing specifically on the first two features, sepal length and sepal width:
iris = load_iris()
X = iris.data[:, :2]
We'll begin by running K-Means with a target of 3 clusters:
kmeans = KMeans(n_clusters=3, random_state=30)
labels = kmeans.fit_predict(X)
And check the Calinski-Harabasz index for the above results:
ch_index = calinski_harabasz_score(X, labels)
print(ch_index)
185.33266845949427
You should get a score of 185.33266845949427, or approximately 185.33.
To put the clusters in perspective, let's visualize them:
unique_labels = list(set(labels))
colors = ['red', 'orange', 'grey']
for i in unique_labels:
    filtered_label = X[labels == i]
    plt.scatter(filtered_label[:, 0],
                filtered_label[:, 1],
                color=colors[i],
                edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.show()

The plot above shows the original 3 clusters.
Given that we computed the CH index for 3 clusters and the original data has 3 labels, we expect the CH index to be highest for 3 clusters compared to other cluster numbers.
Now, let’s compute the CH index for a range of cluster numbers and identify the highest values:
results = {}
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i, random_state=30)
    labels = kmeans.fit_predict(X)
    ch_index = calinski_harabasz_score(X, labels)
    results.update({i: ch_index})
and visualize it:
plt.plot(list(results.keys()), list(results.values()))
plt.xlabel("Number of clusters")
plt.ylabel("Calinski-Harabasz Index")
plt.show()

An interesting observation emerges: 5 clusters and 10 clusters yield higher Calinski-Harabasz index values than 3 clusters, despite the actual number of labels in the data being 3.
It’s worth noting that while we can obtain higher CH index values for cluster numbers other than 3, the index values remain within a very close range, roughly between 175 and 200.