Elbow Method vs Silhouette Score – Which is Better?

In this post, you will learn about two methods for finding the optimal number of clusters in K-means clustering: the Elbow method and Silhouette analysis. Selecting the optimal number of clusters is key to applying a clustering algorithm to a dataset. As a data scientist, knowing these two techniques will prove very helpful.

In this post, we will use the Yellowbrick machine learning visualization library to create the plots for the Elbow method and the Silhouette score. The following topics are covered:

  • Elbow method plot vs Silhouette analysis plot
  • Which method to use – Elbow method vs Silhouette score

Elbow Method / SSE Plot vs Silhouette Analysis Plot

In this section, you will learn how to create the SSE plot and the Silhouette plot for determining the optimal number of clusters in K-means clustering. Recall that SSE is the within-cluster sum of squared errors: the sum, over all points, of the squared Euclidean distance between each point and the centroid of its cluster.
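
To make the SSE definition concrete, here is a minimal sketch that computes SSE by hand and compares it with the `inertia_` attribute that scikit-learn's KMeans exposes. The choice of `n_clusters=3` and `n_init=10` is an assumption made only for illustration; it is not taken from the plots in this post.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data

# n_clusters=3 is an arbitrary illustrative choice
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Look up, for every sample, the centroid of the cluster it was assigned to
assigned_centroids = km.cluster_centers_[km.labels_]

# SSE: sum of squared Euclidean distances between samples and their centroids
sse = float(np.sum((X - assigned_centroids) ** 2))

print(f"manual SSE:  {sse:.3f}")
print(f"km.inertia_: {km.inertia_:.3f}")
```

The two printed values should agree up to floating-point error, since `inertia_` is scikit-learn's name for this same quantity.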

Here is the Python code that uses the Yellowbrick library to create the Elbow method / SSE plot for the Sklearn IRIS dataset. In the Elbow method, SSE is plotted as a line against k; if the line looks like an arm, the “elbow” on the arm is the best value of k.

from sklearn import datasets
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
#
# Load the IRIS dataset
#
iris = datasets.load_iris()
X = iris.data
y = iris.target
#
# Instantiate the clustering model and visualizer
#
km = KMeans(random_state=42)
visualizer = KElbowVisualizer(km, k=(2,10))

visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure

Here is what the Elbow / SSE plot looks like. In the plot below, the elbow is at n_clusters = 4: beyond that point, increasing k brings diminishing returns and the line starts looking linear.

Fig 1. SSE Plot / Elbow Method for finding optimal number of clusters
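
If you want the numbers behind the elbow plot rather than the picture, a rough sketch using scikit-learn alone follows. It fits K-means for k = 2 to 9 and prints how much SSE drops with each additional cluster. The range of k and `n_init=10` are assumptions for illustration, so the values may not match the figure exactly.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data

# SSE (inertia) for each candidate number of clusters
sse = {}
for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse[k] = km.inertia_

# Drop in SSE obtained by moving from k-1 to k clusters;
# the elbow is roughly where these drops become small and similar in size
for k in range(3, 10):
    drop = sse[k - 1] - sse[k]
    print(f"k={k}: SSE={sse[k]:.2f}, drop from k={k - 1}: {drop:.2f}")
```

Reading the elbow from such a table is still a judgment call: it only makes the "diminishing returns" visible as numbers.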

Here is the Python code that uses the Yellowbrick library to create the Silhouette plots:

from sklearn import datasets
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from yellowbrick.cluster import SilhouetteVisualizer
#
# Load the IRIS dataset
#
iris = datasets.load_iris()
X = iris.data
y = iris.target
 
fig, ax = plt.subplots(3, 2, figsize=(15,8))
for i in [2, 3, 4, 5, 6, 7]:
    # Create a KMeans instance for each number of clusters
    km = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=100, random_state=42)
    # Map i = 2..7 onto the 3 x 2 grid of axes: row q-1, column mod
    q, mod = divmod(i, 2)
    # Create a SilhouetteVisualizer with the KMeans instance and fit it
    visualizer = SilhouetteVisualizer(km, colors='yellowbrick', ax=ax[q-1][mod])
    visualizer.fit(X)

Here is what the Silhouette plots look like for 2 to 7 clusters.

Fig 2. Silhouette plots for n_clusters = 2 to n_clusters = 7
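
The average Silhouette score that each plot is compared against can also be computed directly. The sketch below uses `silhouette_score` from `sklearn.metrics` for the same range of cluster counts. The KMeans settings are simplified assumptions, so treat the output as indicative rather than as an exact reproduction of the figure.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = datasets.load_iris().data

# Average silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: average silhouette score = {scores[k]:.3f}")
```

A single average hides how individual clusters behave, which is why the per-cluster plots are worth inspecting in addition to this number.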

Which Method to use – Elbow method vs Silhouette Score

The Elbow method / SSE plot and the Silhouette method can both be used to select the number of clusters, depending on the details the plots reveal. It is a good idea to use both plots to make sure you select the optimal number of clusters.

In the Elbow method, where an SSE line plot is drawn, if the line looks like an arm, the “elbow” on the arm is the best value of k. It is the point from which the decrease in SSE starts looking linear.

Silhouette analysis and the related Silhouette plots appear to have an edge over the Elbow method, because the clusters can be evaluated on multiple criteria, such as the ones listed below. This makes it more likely that you end up with the optimal number of clusters in K-means. The Silhouette plots shown below were created on the Sklearn IRIS dataset.

  • Whether every cluster’s Silhouette plot extends beyond the average Silhouette score. If the Silhouette plot for any cluster falls below the average Silhouette score, that number of clusters can be rejected. Thus, the choice of n_clusters = 4 is sub-optimal.
Fig 3. K-means clusters Silhouette Plot for n_clusters = 4 (Below Avg Score)
  • Whether there are wide fluctuations in the size of the cluster plots. If there are wide fluctuations like the following, the number of clusters is sub-optimal. In the diagram below, one cluster is below the average score, another is very large, and the rest fall in between. Thus, the choice of n_clusters = 5 is sub-optimal.
Fig 4. K-means clusters Silhouette Plot for n_clusters = 5 (Wide fluctuations)
  • Whether the thickness of the clusters’ Silhouette plots is uniform. If the clusters have non-uniform thickness, the number of clusters is sub-optimal. In the diagram below, the two clusters’ Silhouette plots have non-uniform thickness, one being much thicker than the other. Thus, the choice of n_clusters = 2 is sub-optimal.
Fig 5. K-means clusters Silhouette Plot for n_clusters = 2 (Nonuniform Thickness)
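
The criteria above are judged visually, but they can be approximated numerically. The following is a rough sketch, not the method Yellowbrick uses internally: `silhouette_report` is a hypothetical helper introduced here, and reading "beyond the average" as "the cluster's highest silhouette value exceeds the overall average" is an assumption. Cluster sizes stand in for plot thickness.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples


def silhouette_report(X, k, random_state=42):
    """Hypothetical helper: summarize per-cluster silhouette values for k clusters."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
    values = silhouette_samples(X, labels)   # one silhouette value per sample
    average = float(values.mean())           # the overall average silhouette score
    sizes = np.bincount(labels, minlength=k) # cluster sizes ~ plot thickness
    # Highest silhouette value reached inside each cluster
    peaks = np.array([values[labels == c].max() for c in range(k)])
    return {
        "k": k,
        "average": average,
        "sizes": sizes.tolist(),
        # Assumed reading of criterion 1: every cluster reaches past the average
        "all_beyond_average": bool((peaks > average).all()),
        # Assumed proxy for criteria 2 and 3: largest cluster / smallest cluster
        "size_ratio": float(sizes.max() / sizes.min()),
    }


X = datasets.load_iris().data
for k in range(2, 8):
    print(silhouette_report(X, k))
```

A `size_ratio` close to 1 suggests clusters of similar thickness, while a large ratio points to the fluctuations described above. Any cut-off for "too large" is a judgment call, so the plots remain the primary tool.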

Given the above, the Silhouette plot for n_clusters = 3 looks more appropriate than the others, as it holds up against all three criteria: no cluster below the average Silhouette score, no wide fluctuations in the size of the plots, and no non-uniform thickness. Here is what the plot looks like:

Fig 6. K-means clusters Silhouette Plot for n_clusters = 3 (Optimal)

Conclusions

Here is a summary of what you learned about choosing between the Elbow method and the Silhouette score for finding the optimal number of clusters in K-means clustering:

  • When using the Elbow method, look for the point from which the SSE plot starts looking linear. In other words, SSE does not decrease much after that point.
  • When using the Silhouette plot, look for the number of clusters where every cluster’s plot extends beyond the average Silhouette score, the thickness is mostly uniform, and the sizes do not fluctuate widely.
  • You can also use both the SSE / Elbow plot and the Silhouette plot to make sure you select the optimal number of clusters in K-means clustering.
Ajitesh Kumar