# K-Nearest Neighbors Explained with Python Examples In this post, you will learn about K-nearest neighbors algorithm with Python Sklearn examples. K-nearest neighbors algorithm is used for solving both classification and regression machine learning problems. The following topics will get covered in this post:

• Introduction to K-nearest neighbors
• What is the most appropriate value of K?
• K-NN Python example

## Introduction to K-nearest neighbors

K-nearest neighbors is a supervised learning algorithm which can be used to solve both classification and regression problems. It belongs to the class of non-parametric models. The models don’t learn parameters from training data set to come up with a discriminative function in order to classify the test or unseen data set. Rather model memorizes the training data set. This is why K-NN classifier is also called as lazy learner. Here is what is done as part of algorithm to classify the data:

• First and foremost, the value of K and distance metric is selected. The value of K represents the number of nearest data points that would be selected for making the prediction. Distance metric would be used to measure the similarity of the data points. The distance metric typically used is Euclidean distance. Read about some of the most popular distance metric measures in this post – Different types of distance measures in machine learning. In SKlearn KNeighborsClassifier, distance metric is specified using the parameter metric. The default value of metric is minkowski. Another parameter is p. With value of metric as minkowski, the value of p = 1 means Manhattan distance and the value of p = 2 means Euclidean distance.
• As a next step, the k-nearest neighbors of the data record, that needs to be classified, is found.
• Finally, the class label of data is assigned based on the majority vote.

In the diagram below, if K = 3, the class of the data point (green) is assigned as orange triangle (2 votes to 1 in favour of orange triangle). If K = 5, the class of the data point gets assigned as blue square (3 votes to 2 votes in favour of blue square). Fig 1. K-nearest neighbor algorithm with K = 3 and K = 5

The main advantage of K-NN classifier is that the classifier immediately adapts based on the new training data set. However, the main drawback is that the computational complexity for classifying new unseen data grows linearly with the increasing training dataset.

### What is the most appropriate value of K in K-NN?

It is of utmost important to choose appropriate value of K in order to avoid issues related to overfitting or underfitting. Note that for larger value of K, the model may underfit and for smaller value of K, the model may overfit. Thus, it is of utmost importance to choose the most appropriate value of K. You may want to check out this post to have a good understanding of underfitting and overfitting concepts. One can draw validation curve to assess the same. Here are some strategies which can be used for selecting the most appropriate value of K.

• Using coprime class and K combinations: Coprime numbers are the numbers which don’t share common divisors other than 1. For example, 8 and 15 or 3 and 4. Thus, if there are 3 classes, then the value or K can be 2 or 4 or 5 or 7 or 8 etc.
• Choosing the value of K which is greater or equal to the number of classes plus 1
• K = No of Classes + 1
• Choosing the value of K which is low enough to avoid noise; When using larger value of K can result in high model bias. At the same time, keep in mind that choosing very less value of K can result in high variance or overfitting.

## K-NN Python Sklearn Example

Here is the Python Sklearn code for training the model using K-nearest neighbors. Two different version of code is presented. One is very simplistic way. Another is using pipeline and gridsearch.

### Simplistic Python Code for Fitting K-NN Model

Here is the simplistic code for fitting K-NN model using Sklearn IRIS dataset. Pay attention to some of the following in the code given below:

• Sklearn.model_selection train_test_split is used for creating training and test split data
• Sklearn.preprocessing StandardScaler (fit & trainsform method) is used for feature scaling.
• Sklearn.neighbors KNeighborsClassifier is used as implementation for K-nearest neighbors algorithm for fitting the model.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
#
# Load the Sklearn IRIS dataset
#
iris = datasets.load_iris()
X = iris.data
y = iris.target
#
# Create train and test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
#
# Feature Scaling using StandardScaler
#
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Fit the model
#
knn = KNeighborsClassifier(n_neighbors=5, p=2, weights='uniform', algorithm='auto')
knn.fit(X_train_std, y_train)
#
# Evaluate the training and test score
#
print('Training accuracy score: %.3f' % knn.score(X_train_std, y_train))
print('Test accuracy score: %.3f' % knn.score(X_test_std, y_test))


The model score for training data set comes out to be 0.981 and test data set is 0.911. Lets take a look at the usage of pipeline and gridsearchcv for training / fitting the K-NN model

### Pipeline and GridSearchCV for fitting K-NN model

Here is the code for fitting the model using Sklearn K-nearest neighbors implementation. Pay attention to some of the following:

• Sklearn.pipeline make_pipeline is used for create a pipeline having steps including StandardScaler and KNeighborsClassifier
• GridSearchCV is used for tuning model parameters and selecting the best estimator with most optimal score
#
# Load the Sklearn IRIS dataset
#
iris = datasets.load_iris()
X = iris.data
y = iris.target
#
# Create train and test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
#
# Create a pipeline
#
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
#
# Create the parameter grid
#
param_grid = [{
'kneighborsclassifier__n_neighbors': [2, 3, 4, 5, 6, 7, 8, 9, 10],
'kneighborsclassifier__p': [1, 2],
'kneighborsclassifier__weights': ['uniform', 'distance'],
'kneighborsclassifier__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
}]
#
# Create a grid search instance
#
gs = GridSearchCV(pipeline, param_grid = param_grid,
scoring='accuracy',
refit=True,
cv=10,
verbose=1,
n_jobs=2)
#
# Fit the most optimal model
#
gs.fit(X_train, y_train)
#
# Print the best model parameters and scores
#
print('Best Score: %.3f' % gs.best_score_, '\nBest Parameters: ', gs.best_params_)
#
# Print the model score for test data
#
print('Score: %.3f' % gs.score(X_test, y_test))


The score of test data set comes out to be 0.911 and training data comes out to be 0.972. Here is how the output looks like after executing the above code. Fig 2. K-NeighborsClassifier fit using Pipeline and GridSearchCV

## Some Good Youtube Tutorials on K-Nearest Neighbors

Statquest: K-nearest neighbors, clearly explained

How K-NN algorithm works?

## Conclusions

Here is the summary of what you learned in relation to K-nearest Neighbors Classifier:

• K-NN algorithm belongs to non-parametric class of machine learning algorithms.
• In K-NN algorithm, K represents the number of data points which need to be chosen for making the predictions.
• Out of K data points selected for prediction, the voting takes place in terms of how many of them belongs to which class. Based on that, the class of the data gets decided.
• It is recommended to select odd value of K for binary classification
• It is recommended that the value of K should not be a multiple of number of classes.
• One can opt for either Euclidean or Manhattan distance for measuring the similarity between the data points. For Sklearn KNeighborsClassifier, with metric as minkowski, the value of p = 1 means Manhattan distance and the value of p = 2 means Euclidean distance. The default is Euclidean distance with metric = ‘minkowski’ and p = 2.
• For real world examples, often Euclidean distance is used. However, it is recommended to standardize the data set before hand.
Follow me ## Ajitesh Kumar

I have been recently working in the area of Data Science and Machine Learning / Deep Learning. In addition, I am also passionate about various different technologies including programming languages such as Java/JEE, Javascript, Python, R, Julia etc and technologies such as Blockchain, mobile computing, cloud-native technologies, application security, cloud computing platforms, big data etc. I would love to connect with you on Linkedin.
Posted in Data Science, Machine Learning, Python. Tagged with , , , .