If you work on data analytics or machine learning (ML) projects, you have probably heard of the K-nearest neighbors (KNN) algorithm. But what is it, exactly? And more importantly, how can you use it in your own AI / ML projects? In this post, you will learn about the K-nearest neighbors algorithm, which is used for solving both classification and regression problems, and walk through simple Python Sklearn examples.
K-nearest neighbors is a supervised machine learning algorithm that can be used for both classification and regression. In both cases, the input consists of the k closest training examples in the feature space; the output depends on whether the algorithm is being used for classification or regression. The main idea behind K-NN is to find the K nearest data points, or neighbors, to a given data point and then predict the label or value of that data point based on the labels or values of its K nearest neighbors. K can be any positive integer, but in practice K is often small, such as 3 or 5. The “K” in K-nearest neighbors refers to the number of neighbors the algorithm uses to make its prediction, whether it is a classification problem or a regression problem. The following diagram represents how K nearest neighbors are used for making predictions.
The following are key aspects of the K-nearest neighbors algorithm.
K-nearest neighbors is a non-parametric method, which means that it does not make any assumptions about the underlying data distribution. This is advantageous over parametric methods, which do make such assumptions. The model does not learn parameters from the training data set to come up with a discriminative function for classifying the test or unseen data set; rather, it memorizes the training data set. This is why the K-NN classifier is also called a lazy learner. Here is what is done as part of the algorithm to classify the data (a from-scratch sketch of these steps follows the list):
1. Choose the number of neighbors, K, and a distance metric (Euclidean distance is the most common choice).
2. Compute the distance from the query point to every point in the training data set.
3. Select the K training points closest to the query point.
4. Assign the class by majority vote among the labels of those K neighbors (for regression, take the mean of their values).
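The steps above translate almost line-for-line into code. Here is a minimal from-scratch sketch of the K-NN prediction step using NumPy; the function name knn_predict and the tiny toy dataset are purely illustrative, not part of any library:

import numpy as np
from collections import Counter
#
# Predict the class of x_query by majority vote among its k nearest neighbors
# (for regression, the vote would be replaced by the mean of the neighbors' values)
#
def knn_predict(X_train, y_train, x_query, k=3):
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]
#
# Tiny illustrative dataset: two features, two classes
#
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [3.0, 3.2], [3.1, 2.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # prints 0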
In the diagram below, if K = 3, the class of the new data point (green) is assigned as the orange triangle (2 votes to 1 in favor of the orange triangle). If K = 5, the class of the data point gets assigned as the blue square (3 votes to 2 in favor of the blue square).
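A quick way to see this vote flip in code is to vary n_neighbors on a toy dataset. The 2-D points below are made up purely to mirror the diagram, with class 0 playing the orange triangle and class 1 the blue square:

from sklearn.neighbors import KNeighborsClassifier
#
# Illustrative points: class 0 = "orange triangle", class 1 = "blue square"
#
X = [[1.0, 1.0], [1.5, 1.2], [3.0, 3.0], [3.2, 2.8], [2.4, 2.0]]
y = [0, 0, 1, 1, 1]
query = [[1.8, 1.5]]
#
# With these points, the prediction flips from class 0 at K = 3
# to class 1 at K = 5
#
for k in (3, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print('K=%d -> predicted class %d' % (k, knn.predict(query)[0]))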
The advantages of using the K-NN algorithm to train models are some of the following:
- It is simple to understand and implement, and has no explicit training phase.
- It makes no assumptions about the underlying data distribution (non-parametric).
- It naturally supports both classification and regression, including multi-class problems.
- New training data can be added at any time without retraining a model.
The disadvantages of the K-NN algorithm are some of the following:
- Prediction is slow and memory-intensive for large data sets, since distances to all training points must be computed and the training data must be kept in memory.
- It is sensitive to the scale of the features, so feature scaling (e.g., standardization) is usually required.
- It performs poorly in high-dimensional feature spaces (the curse of dimensionality).
- It is sensitive to irrelevant features and to noisy data or outliers.
It is of utmost importance to choose the appropriate value of K in order to avoid issues related to overfitting or underfitting. Note that for a larger value of K, the model may underfit, and for a smaller value of K, the model may overfit. You may want to check out this post to have a good understanding of the underfitting and overfitting concepts. One can draw a validation curve to assess this, as shown in the sketch after this list. Here are some strategies which can be used for selecting the most appropriate value of K:
- Plot a validation curve of training and cross-validation accuracy against K, and pick the K where both scores are high and close to each other.
- Use cross-validation, for example via grid search as shown later in this post, to select the K with the best cross-validated score.
- Prefer an odd value of K for binary classification so that majority votes cannot tie.
- A common rule-of-thumb starting point is K ≈ √n, where n is the number of training samples, to be refined by validation.
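As an illustration, here is a minimal sketch of drawing such a validation curve for K with Sklearn's validation_curve utility; the range of K values and the plotting details are just one reasonable choice:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier
#
# Cross-validated training and validation scores for each value of K
#
X, y = datasets.load_iris(return_X_y=True)
k_range = np.arange(1, 21)
train_scores, val_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name='n_neighbors', param_range=k_range, cv=10)
#
# Plot mean accuracy vs K; look for the K where both curves are high and close
#
plt.plot(k_range, train_scores.mean(axis=1), label='Training accuracy')
plt.plot(k_range, val_scores.mean(axis=1), label='Cross-validation accuracy')
plt.xlabel('K (n_neighbors)')
plt.ylabel('Accuracy')
plt.legend()
plt.show()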
Here is the Python Sklearn code for training the model using K-nearest neighbors. Two different versions of the code are presented: a very simple one, and one using a pipeline and grid search.
Here is the simple code for fitting the K-NN model using the Sklearn IRIS dataset. Pay attention to some of the following in the code given below:
- The IRIS dataset is split into training and test sets (70 / 30), stratified on the class labels.
- Features are standardized with StandardScaler; the scaler is fit on the training data only and then applied to both splits. Because K-NN is distance-based, feature scaling matters.
- A KNeighborsClassifier is fit with n_neighbors=5, Euclidean distance (p=2) and uniform weights.
- Accuracy is evaluated on both the training and the test data.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
#
# Load the Sklearn IRIS dataset
#
iris = datasets.load_iris()
X = iris.data
y = iris.target
#
# Create train and test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
#
# Feature Scaling using StandardScaler
#
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Fit the model
#
# n_neighbors=5: the value of K; p=2: Euclidean distance (Minkowski power parameter);
# weights='uniform': all K neighbors get an equal vote
knn = KNeighborsClassifier(n_neighbors=5, p=2, weights='uniform', algorithm='auto')
knn.fit(X_train_std, y_train)
#
# Evaluate the training and test score
#
print('Training accuracy score: %.3f' % knn.score(X_train_std, y_train))
print('Test accuracy score: %.3f' % knn.score(X_test_std, y_test))
The model score for the training data set comes out to be 0.981, and for the test data set, 0.911. Let's take a look at the usage of a pipeline and GridSearchCV for training / fitting the K-NN model.
Here is the code for fitting the model using the Sklearn K-nearest neighbors implementation with a pipeline and grid search. Pay attention to some of the following:
- make_pipeline chains StandardScaler and KNeighborsClassifier, so that scaling is re-fit on each cross-validation fold and no information leaks from the validation data.
- The parameter grid searches over the number of neighbors, the Minkowski distance power p (1 = Manhattan, 2 = Euclidean), the weighting scheme, and the neighbor-search algorithm; pipeline step parameters are addressed with the kneighborsclassifier__ prefix.
- GridSearchCV runs 10-fold cross-validation with accuracy as the scoring metric and, because refit=True, refits the best model on the whole training set.
- The best cross-validation score and parameters are printed, followed by the accuracy on the held-out test data.
#
# Load the Sklearn IRIS dataset
#
iris = datasets.load_iris()
X = iris.data
y = iris.target
#
# Create train and test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
#
# Create a pipeline
#
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
#
# Create the parameter grid
#
param_grid = [{
'kneighborsclassifier__n_neighbors': [2, 3, 4, 5, 6, 7, 8, 9, 10],
'kneighborsclassifier__p': [1, 2],
'kneighborsclassifier__weights': ['uniform', 'distance'],
'kneighborsclassifier__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
}]
#
# Create a grid search instance
#
gs = GridSearchCV(pipeline, param_grid = param_grid,
scoring='accuracy',
refit=True,
cv=10,
verbose=1,
n_jobs=2)
#
# Fit the most optimal model
#
gs.fit(X_train, y_train)
#
# Print the best model parameters and scores
#
print('Best Score: %.3f' % gs.best_score_, '\nBest Parameters: ', gs.best_params_)
#
# Print the model score for test data
#
print('Score: %.3f' % gs.score(X_test, y_test))
The best cross-validated score on the training data comes out to be 0.972, and the score on the test data set comes out to be 0.911.
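Because refit=True, the fitted GridSearchCV instance can be used directly for prediction with the best model found. Here is a minimal sketch; the flower measurement below is made up for illustration:

#
# Predict the species of a new, hypothetical iris measurement
# (sepal length, sepal width, petal length, petal width, in cm)
#
new_sample = [[5.1, 3.5, 1.4, 0.2]]
print('Predicted class:', iris.target_names[gs.predict(new_sample)][0])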
Here is the summary of what you learned in relation to the K-nearest Neighbors Classifier:
- K-NN is a supervised, non-parametric, lazy-learning algorithm that predicts a label (or value) from the majority vote (or mean) of the K nearest training points.
- Feature scaling is important because predictions are based on distances in the feature space.
- The value of K controls the bias-variance trade-off: a small K can overfit and a large K can underfit; a validation curve or cross-validation helps pick K.
- Sklearn's KNeighborsClassifier can be trained directly or, better, inside a pipeline tuned with GridSearchCV.