
Data Science – 3 Key Aspects of Applying KMeans Algorithm for Clustering Tasks

This article covers key concepts around the KMeans algorithm, including the key aspects to consider and the R command to use when you are working on clustering tasks. Please feel free to comment or suggest if I missed one or more important points.

Following are the key points described later in this article:

  • Key aspects of applying KMeans algorithm
  • KMeans Algorithm – R Commands

 

Key aspects of applying KMeans Algorithm

The key aspects of applying the KMeans algorithm are as follows:

  • Selecting the right combination of features: In the data set, you may observe some of the following:
    • One or more features hold non-numeric (character) data. As KMeans requires a data frame containing only numeric data, the challenge is to define a numeric representation of such features or to exclude them from the analysis.
    • One or more features have missing data. There are different techniques one can use to take care of missing data. For a numeric feature, one may compute the mean (sometimes per group, using the aggregate function) and assign it to the missing values. For a feature holding nominal data, one could use the dummy coding technique to come up with a new set of variables.

    Thus, you may end up using only a select set of features from a given data set and doing the analysis on those features.

  • Normalizing the data: Another key aspect, common across distance-based algorithms, is to bring the features to a comparable scale so that features with large ranges do not dominate the distance calculation. One could use either min-max normalization or z-score standardization. The following example scales every column of the data frame, someDF. Pay attention to the scale() function, which performs z-score standardization by default.
    someDF_z <- as.data.frame(lapply(someDF, scale))  
    
  • Selecting an optimum number of clusters: Choosing the number of clusters (K) is key, because kmeans() needs K to be specified up front. There are different perspectives on this, and I shall talk about them in another blog.
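
The three aspects above can be sketched on a small made-up data frame. This is only an illustration under stated assumptions: the data frame toyDF, its columns, and the helper minMax() are hypothetical and do not come from the original example. Comparing the total within-cluster sum of squares across several values of K (often called the "elbow" heuristic) is just one common way of looking at K, not the only one.

```r
# Hypothetical toy data: names and values are made up for illustration.
toyDF <- data.frame(
  age    = c(23, 35, NA, 52, 46, 31, 28, 60),
  income = c(30, 52, 41, NA, 75, 48, 35, 90),
  gender = c("F", "M", "M", "F", "F", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# 1a. Missing numeric values: replace each NA with the column mean.
for (col in c("age", "income")) {
  colMean <- mean(toyDF[[col]], na.rm = TRUE)
  toyDF[[col]][is.na(toyDF[[col]])] <- colMean
}

# 1b. Nominal feature: dummy coding into a 0/1 variable.
toyDF$genderM <- ifelse(toyDF$gender == "M", 1, 0)

# Keep only the numeric features for clustering.
featuresDF <- toyDF[, c("age", "income", "genderM")]

# 2. Normalization: z-score via scale(), or min-max via a small helper.
featuresDF_z <- as.data.frame(scale(featuresDF))
minMax <- function(x) (x - min(x)) / (max(x) - min(x))
featuresDF_mm <- as.data.frame(lapply(featuresDF, minMax))

# 3. Total within-cluster sum of squares for K = 1..4.
#    kmeans() starts from random centers, so fix the seed first.
set.seed(42)
wss <- sapply(1:4, function(k) {
  kmeans(featuresDF_z, centers = k, nstart = 10)$tot.withinss
})
print(round(wss, 2))
```

A K beyond which the within-cluster sum of squares stops dropping sharply is one candidate to consider; on real data the choice usually also depends on how the clusters will be used.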

 

KMeans Algorithm – R Commands

There is a kmeans() function in the stats package in R. Note that the stats package ships with every R installation and is loaded by default, so no separate installation is needed.
The following is an example call:

# Apply kmeans() to the data frame someDF, using only columns 4 to 9
# (the features holding numeric values) and K = 5 clusters
kmeansDF <- kmeans(someDF[4:9], 5)
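
Since someDF is not shown here, the following is a minimal sketch of running kmeans() and inspecting its result on R's built-in iris data set instead. Using iris as a stand-in, the choice of K = 3, and the names irisFeatures and irisClusters are all assumptions made for illustration.

```r
# iris ships with R; columns 1 to 4 are numeric, column 5 (Species) is a factor.
irisFeatures <- as.data.frame(scale(iris[1:4]))

# kmeans() starts from randomly chosen centers, so fix the seed for
# reproducibility; nstart tries several random starts and keeps the best.
set.seed(123)
irisClusters <- kmeans(irisFeatures, centers = 3, nstart = 25)

# The result is a list-like object, not a data frame.
print(irisClusters$size)            # number of observations per cluster
print(irisClusters$centers)         # one row of feature means per cluster
print(head(irisClusters$cluster))   # cluster assigned to the first few rows
```

The cluster component has one entry per row of the input, so it can be attached to the original data frame as a new column for further analysis.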

 

Ajitesh Kumar

