In this post, you will learn about K-Means clustering concepts with the help of fitting a K-Means model using Python Sklearn KMeans clustering implementation. Before getting into details, let’s briefly understand the concept of clustering.
Clustering represents a set of unsupervised machine learning algorithms belonging to different categories such as prototype-based clustering, hierarchical clustering, density-based clustering etc. K-means is one of the most popular clustering algorithm belong to prototype-based clustering category. The idea is to create K clusters of data where data in each of the K clusters have greater similarity with other data in the same cluster. The different clustering algorithms sets out rules based on how the data needs to be clustered together. Here is a diagram representing creation of clusters using K-means algorithms.
In the above diagram, pay attention to some of the following:
- There are three different clusters represented by green, orange and blue color.
- Each cluster is created around a central point called as cluster centroid or cluster center.
The following topics is covered in this post:
- What is K-Means Clustering?
- K-Means clustering Python example
What is K-Means Clustering?
K-means clustering algorithm partitions data into K clusters (and, hence, K-means name). K-means algorithm belongs to the category, prototype-based clustering. Prototype-based clustering algorithms are based on one of the following:
- Centroid-based clusters: Each cluster built around a point which is termed as the centroid (average) of similar points with continuous features. K-means algorithm results in creation of centroid-based clusters.
- Medoid-based clusters: Each cluster built around a point which is termed as the medoid which represents the point that minimises the distance to all other points that belong to a particular cluster, in the case of categorical features.
Here are some of the points covered in relation to K-means clustering:
- What are key steps of K-means clustering algorithm?
- What is the objective function in K-means which get optimised?
- What are the key features of K-means clustering algorithm?
- How to find the most optimal value of K?
What are key steps of K-Means clustering algorithm?
The following represents the key steps of K-means clustering algorithm:
- Define number of clusters, K, which need to be found out. Randomly select K cluster data points (cluster centers) or cluster centroids. The goal is to optimise the position of the K centroids.
- For each observation, find out the Euclidean distance between the observation and all the K cluster centers. Of all distances, find the nearest distance between the observation and one of the K cluster centroids (cluster centers) and assign the observation to that cluster.
- Move the K-centroids to the center of the points assigned to it.
- Repeat the above two steps until there is no change in the cluster centroids or maximum number of iterations or user-defined tolerance is reached.
What is the objective function in K-means which get optimized?
K-means clustering algorithm is an optimization problem where the goal is to minimise the within-cluster sum of squared errors (SSE). At times, SSE is also termed as cluster inertia. SSE is the sum of the squared differences between each observation and the cluster centroid. At each stage of cluster analysis the total SSE is minimised with SSEtotal = SSE1 + SSE2 + SSE3 + SSE4 …. + SSEn.
The below represents the objective function which needs to be minimized:
What are key features of K-means algorithm?
The following are some of the key features of K-means clustering algorithm:
- One needs to define the number of clusters (K) beforehand. This is unlike other clustering algorithms related to hierarchical clustering or density-based clustering algorithms. The need to define the number of clusters, K, a priori can be considered as a disadvantage because for the real-world applications, it may not always be evident as to how many clusters can the data be partitioned into.
- K-means clusters do not overlap and are not hierarchical.
How to find most optimal value of K?
The technique used to find the most optimal value of K is draw a reduction in variation vs number of clusters (K) plot. Alternatively, one could draw the squared sum of error (SSE) vs number of clusters (K) plot. Here is the diagram representing the plot of SSE vs K (no. of clusters). In the diagram below, the point representing the optimal number of clusters can also be called as elbow point. The elbow point can be seen as the point after which the distortion/cluster inertia/SSE start decreasing in a linear fashion.
K-Means Clustering Python Example
In this section, we will see how to create K-Means clusters using Sklearn IRIS dataset.
import matplotlib.pyplot as plt from sklearn import datasets from sklearn.cluster import KMeans # # Load Sklearn IRIS dataset # iris = datasets.load_iris() X = iris.data y = iris.target # # Do the scatter plot and see that clusters are evident # plt.scatter(X[:,1], X[:,3], color='white', marker='o', edgecolor='red', s=50) plt.grid() plt.tight_layout() plt.show()
Here is how the plot would look like:
Now, lets fit a K-Means cluster model. Pay attention to some of the following in relation to instantiation of K-means:
- Number of clusters defined upfront via n_clusters = 2
- init (default as k-means++): Represents method for initialisation. The default value of k-means++ represents the selection of the initial cluster centers (centroids) in a smart manner (place the initial centroids far away from each other ) to speed up the convergence. The other values of init can be random, which represents the selection of n_clusters observations at random from data for the initial centroids.
- n_init (default as 10): Represents the number of time the k-means algorithm will be run independently, with different random centroids in order to choose the final model as the one with the lowest SSE.
- max_iter (default as 300): Represents the maximum number of iterations for each run. The iteration stops after the maximum number of iterations is reached even if the convergence criterion is not satisfied. This number must be between 1 and 999. In this paper (Scalable K-Means by ranked retrieval), the authors stated that K-means converges after 20-50 iterations in all practical situations, even on high dimensional datasets as they tested.
- tol (default as 1e-04): Tolerance value is used to check if the error is greater than the tolerance value. For error greater than tolerance value, K-means algorithm is run until the error falls below the tolerance value which implies that the algorithm has converged.
# # Create an instance of K-Means # kmc = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300,tol=1e-04, random_state=0) # # Fit and make predictions # y_kmc = kmc.fit_predict(X) # # Create the K-means cluster plot # plt.scatter(X[y_kmc == 0, 1], X[y_kmc == 0, 3], s=50, c='lightgreen', marker='s', edgecolor='black', label='Cluster 1') plt.scatter(X[y_kmc == 1, 1], X[y_kmc == 1, 3], s=50, c='orange', marker='o', edgecolor='black', label='Cluster 2') plt.scatter(X[y_kmc == 2, 1], X[y_kmc == 2, 3], s=50, c='blue', marker='P', edgecolor='black', label='Cluster 3') plt.scatter(kmc.cluster_centers_[:, 1], kmc.cluster_centers_[:, 3], s=250, marker='*', c='red', edgecolor='black', label='Centroids') plt.legend(scatterpoints=1) plt.grid() plt.tight_layout() plt.show()
Here is how the K-means clusters will look like drawn using Matplotlib.pyplot. Pay attention to some of the following in the plot given below:
- There are three clusters represented using green, orange and blue colors.
- Red stars represent Centroids and three clusters created are centered around these star points.
Here is a great tutorial video on K-means clustering by StatQuest youtube channel.
Here is the summary of what you learned about K-Means clustering in general:
- There are different categories of clustering such as prototype-based clustering, hierarchical clustering, density-based clustering
- K-means clustering belongs to prototype-based clustering
- K-means clustering algorithm results in creation of clusters around centroid (average) of similar points with continuous features.
- K-means is part of sklearn.cluster package.
- K-means requires that one defines the number of clusters (K) beforehand.
- K-means clusters do not overlap and are not hierarchical.
- The objective function of the K-means is within-cluster sum of squared errors (SSE). SSE is squared sum of different between each observation and the cluster centroid.
- The optimal number of clusters, K, can be found by drawing sum of squared errors vs number of clusters point.