In this post, you will learn about another machine learning model hyperparameter optimization technique called as Grid Search with the help of Python Sklearn code examples. In one of the earlier posts, you learned about another hyperparamater optimization technique namely validation curve. As a data scientist, it will be useful to learn some of these model tuning techniques (tuning hyperparameters) as it would help us select most appropriate models with most appropriate parameters.
The following are some of the topics covered in this post:
Grid Search technique helps in performing exhaustive search over specified parameter (hyper parameters) values for an estimator. One can use any kind of estimator such as sklearn.svm SVC, sklearn.linear_model LogisticRegression or sklearn.ensemble RandomForestClassifier.
The outcome of grid search is the optimal combination of one or more hyper parameters that gives the most optimal model complying to bias-variance tradeoff. This is owing to the fact that cross-validation techniques get applied in order to train and test model and come up with most optimal parameters combination. The manner in which grid search is different than validation curve technique is it allows you to search the parameters from the parameter grid. This is unlike validation curve where you can specify one parameter for optimization purpose.
Although Grid search is a very powerful approach for finding the optimal set of parameters, the evaluation of all possible parameter combinations is also computationally very expensive. An alternative approach for sampling different parameter combinations using sklearn is randomized search. This will be dealt in one of the future posts.
The grid search is implemented in Python Sklearn using the class, GridSearchCV. The class implements two methods such as fit, predict and score method.
In this post, the grid search is applied to the following estimators:
In this section, you will see Python Sklearn code example of Grid Search algorithm applied to different estimators such as RandomForestClassifier, LogisticRegression and SVC. Pay attention to some of the following in the code given below:
For example given below, Sklearn Breast Cancer data set is used. Here is the related code:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
When applied to sklearn.ensemble RandomForestClassifier, one can tune the models against different paramaters such as max_features, max_depth etc. Here is an example demonstrating the usage of Grid Search for selection of most optimal values of max_depth and max_features hyper parameters. Note the parameter grid, param_grid_rfc
pipelineRFC = make_pipeline(StandardScaler(), RandomForestClassifier(criterion='gini', random_state=1))
#
# Create the parameter grid
#
param_grid_rfc = [{
'randomforestclassifier__max_depth':[2, 3, 4],
'randomforestclassifier__max_features':[2, 3, 4, 5, 6]
}]
#
# Create an instance of GridSearch Cross-validation estimator
#
gsRFC = GridSearchCV(estimator=pipelineRFC,
param_grid = param_grid_rfc,
scoring='accuracy',
cv=10,
refit=True,
n_jobs=1)
#
# Train the RandomForestClassifier
#
gsRFC = gsRFC.fit(X_train, y_train)
#
# Print the training score of the best model
#
print(gsRFC.best_score_)
#
# Print the model parameters of the best model
#
print(gsRFC.best_params_)
#
# Print the test score of the best model
#
clfRFC = gsRFC.best_estimator_
print('Test accuracy: %.3f' % clfRFC.score(X_test, y_test))
When applied to sklearn.svm SVC, one can tune the models against different paramaters such as the following:
Here is an example demonstrating the usage of Grid Search for selection of most optimal values of hyper parameters for SVC algorithm. Note the parameter grid, param_grid_svc
pipelineSVC = make_pipeline(StandardScaler(), SVC(random_state=1))
#
# Create the parameter grid
#
param_grid_svc = [{
'svc__C': [0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 10.0],
'svc__kernel': ['linear']
},
{
'svc__C': [0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 10.0],
'svc__gamma': [0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 10.0],
'svc__kernel': ['rbf']
}]
#
# Create an instance of GridSearch Cross-validation estimator
#
gsSVC = GridSearchCV(estimator=pipelineSVC,
param_grid = param_grid_svc,
scoring='accuracy',
cv=10,
refit=True,
n_jobs=1)
#
# Train the SVM classifier
#
gsSVC.fit(X_train, y_train)
#
# Print the training score of the best model
#
print(gsSVC.best_score_)
#
# Print the model parameters of the best model
#
print(gsSVC.best_params_)
#
# Print the model score on the test data using GridSearchCV score method
#
print('Test accuracy: %.3f' % gsSVC.score(X_test, y_test))
#
# Print the model score on the test data using Best estimator instance
#
clfSVC = gsSVC.best_estimator_
print('Test accuracy: %.3f' % clfSVC.score(X_test, y_test))
When applied to sklearn.linear_model LogisticRegression, one can tune the models against different paramaters such as inverse regularization parameter C. Note the parameter grid, param_grid_lr. Here is the sample Python sklearn code:
pipelineLR = make_pipeline(StandardScaler(), LogisticRegression(random_state=1, penalty='l2', solver='lbfgs'))
#
# Create the parameter grid
#
param_grid_lr = [{
'logisticregression__C': [0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 10.0],
}]
#
# Create an instance of GridSearch Cross-validation estimator
#
gsLR = GridSearchCV(estimator=pipelineLR,
param_grid = param_grid_lr,
scoring='accuracy',
cv=10,
refit=True,
n_jobs=1)
#
# Train the LogisticRegression Classifier
#
gsLR = gsLR.fit(X_train, y_train)
#
# Print the training score of the best model
#
print(gsLR.best_score_)
#
# Print the model parameters of the best model
#
print(gsLR.best_params_)
#
# Print the test score of the best model
#
clfLR = gsLR.best_estimator_
print('Test accuracy: %.3f' % clfLR.score(X_test, y_test))
Here is the summary of what you learned in relation to Grid Search technique for finding most optimal combination of hyper parameters:
Artificial Intelligence (AI) agents have started becoming an integral part of our lives. Imagine asking…
In the ever-evolving landscape of agentic AI workflows and applications, understanding and leveraging design patterns…
In this blog, I aim to provide a comprehensive list of valuable resources for learning…
Have you ever wondered how systems determine whether to grant or deny access, and how…
What revolutionary technologies and industries will define the future of business in 2025? As we…
For data scientists and machine learning researchers, 2024 has been a landmark year in AI…