Sklearn Machine Learning Pipeline – Python Example


In this post, you will learn about the concept of a machine learning (ML) pipeline and how to build one using the Python Sklearn pipeline package (sklearn.pipeline). Knowing how to use sklearn.pipeline effectively for training and testing machine learning models helps automate activities such as feature scaling, feature selection / extraction and model training and testing. Data scientists working in Python will benefit from a good understanding of sklearn.pipeline. The following topics are covered in this post:

  • Introduction to ML Pipeline
  • Sklearn ML Pipeline Python code example

Introduction to ML Pipeline

A machine learning (ML) pipeline, conceptually, represents the sequence of steps, including data transformation and prediction, through which data passes. The outcome of fitting the pipeline is a trained model which can be used for making predictions. Sklearn.pipeline is a Python implementation of the ML pipeline. Instead of running the data transformation and model fitting steps separately for the training and test datasets, you can use sklearn.pipeline to chain and automate these steps. The diagram below (Fig 1) represents a pipeline for training a machine learning model based on supervised learning. Note the following in relation to the Sklearn implementation of the pipeline:

  • Every step except the last one must be an Sklearn transformer. Transformers serve different purposes including feature scaling, feature selection, feature extraction etc. In the diagram given below, these transformers are represented using StandardScaler (feature scaling) and PCA (unsupervised feature extraction / dimensionality reduction). These transformers must implement the following methods for the pipeline to work:
    • fit
    • transform
  • The last step is an Sklearn estimator which is used for making predictions. In the diagram given below, the estimator is represented by ML algorithm implementations such as SVC, RandomForestClassifier etc. The estimator in the last step must implement the following methods:
    • fit
    • predict
  • For supervised learning, the input is the training data and labels, and the output is the trained model.
  • Invoking the fit method on the pipeline instance executes the pipeline on the training data: each transformer is fitted and applied in turn, and the final estimator is fitted on the transformed data. This is illustrated in the code example in the next section.
Fig 1. Machine Learning Pipeline (Sklearn Implementation)
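
To make the training flow in Fig 1 concrete, the sketch below does by hand roughly what calling fit on the pipeline does: each transformer is fitted on the training data and transforms it, and the final estimator is fitted on the result. The breast cancer dataset, the split parameters and the variable names are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data: any numeric feature matrix with labels would do.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Manual version: fit_transform on each transformer, then fit the estimator.
scaler = StandardScaler()
pca = PCA(n_components=8)
clf = RandomForestClassifier(n_estimators=50, max_depth=2, random_state=1)

X_scaled = scaler.fit_transform(X_train)
X_reduced = pca.fit_transform(X_scaled)
clf.fit(X_reduced, y_train)

# Pipeline version: a single call to fit runs the same sequence of steps.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=8),
    RandomForestClassifier(n_estimators=50, max_depth=2, random_state=1))
pipeline.fit(X_train, y_train)

# pipeline[:-1] is the sub-pipeline of transformers (everything but the estimator).
pipeline_reduced = pipeline[:-1].transform(X_train)

# Compare the transformed training data produced by the two versions.
print(np.allclose(X_reduced, pipeline_reduced))
```

The final check compares the transformed training data from the manual steps with the output of the pipeline's own transformers.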

Here is what the above pipeline looks like for test data. Note the following in the diagram given below:

  • The input is the test data, plus the labels when scoring the model.
  • The output is either predictions or a model performance score.
  • Only the transform method (not fit) is invoked on the test data in the data transformation stages, so the parameters learned from the training data are reused.
  • Methods such as predict or score are invoked on the pipeline instance to get predictions or the model score.
Fig 2. Machine Learning Pipeline (Test data prediction or model scoring)
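
The test-time flow in Fig 2 can be sketched in the same way. The example below fits a pipeline, then reproduces predict by hand using the already fitted steps: transform (without refitting) in each transformation stage, followed by predict on the estimator. The dataset and split are assumptions for illustration; the step names are the ones make_pipeline generates from the lowercased class names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data, as a stand-in for your own training and test sets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=8),
    RandomForestClassifier(n_estimators=50, max_depth=2, random_state=1))
pipeline.fit(X_train, y_train)

# Fetch the fitted steps by their auto-generated names.
scaler = pipeline.named_steps["standardscaler"]
pca = pipeline.named_steps["pca"]
clf = pipeline.named_steps["randomforestclassifier"]

# Manual test-time flow: transform only (no refitting), then predict.
manual_pred = clf.predict(pca.transform(scaler.transform(X_test)))

# Pipeline test-time flow: predict and score do the same in one call.
pipeline_pred = pipeline.predict(X_test)
pipeline_score = pipeline.score(X_test, y_test)

# Compare predictions, and compare score with accuracy on the manual predictions.
print(np.array_equal(manual_pred, pipeline_pred))
print(np.isclose(pipeline_score, accuracy_score(y_test, manual_pred)))
```

For a classifier such as RandomForestClassifier, score returns the mean accuracy on the given test data and labels.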

Sklearn ML Pipeline Python Code Example

Here is a Python code example for creating an Sklearn pipeline, fitting it and using it for prediction. The following points are covered in the code below:

  • The pipeline is instantiated by passing its steps in order: feature scaling, feature extraction and an estimator for prediction. The last step must be the algorithm which does the prediction. In the code below, the steps are StandardScaler, PCA and RandomForestClassifier as the final estimator.
  • Fit is invoked on the pipeline instance to perform the following activities in sequence:
    • Data transformation using transformers for feature scaling, dimensionality reduction etc. Transformers must implement the fit and transform methods.
    • Model fitting using the estimator. Estimators must implement the fit and predict methods.
  • The predict or score method is called on the pipeline instance to make predictions on the test data or to score model performance, respectively.
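
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
#
# Load the data and create the training and test splits.
# Assumption: the data set is not specified here; the Sklearn breast cancer
# data set and the split parameters below are used as a stand-in.
#
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
#
# Create the pipeline: feature scaling, then PCA, then the classifier
#
pipeline = make_pipeline(StandardScaler(), PCA(n_components=8),
                         RandomForestClassifier(criterion='gini', n_estimators=50, max_depth=2, random_state=1))
#
# Fit the pipeline on the training data
#
pipeline.fit(X_train, y_train)
#
# Score the model on the test data (accuracy for a classifier)
#
print('Model Accuracy: %.3f' % pipeline.score(X_test, y_test))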
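
The points above also mention predict, which the example does not call. The following sketch is a variant that builds the same steps with the Pipeline class and explicit step names, then calls predict and inspects a fitted step. The dataset, the split and the step names ("scale", "reduce", "model") are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data, as a stand-in for your own training and test sets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Pipeline takes a list of (name, estimator) pairs; the names are arbitrary labels.
pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=8)),
    ("model", RandomForestClassifier(n_estimators=50, max_depth=2, random_state=1)),
])
pipeline.fit(X_train, y_train)

# predict returns one class label per row of the test data.
y_pred = pipeline.predict(X_test)
print(y_pred[:10])

# Named steps give access to each fitted component, here the fitted PCA.
explained = pipeline.named_steps["reduce"].explained_variance_ratio_
print("Variance explained by 8 components: %.3f" % explained.sum())
```

make_pipeline is a shorthand for this form: it builds the same kind of Pipeline object but generates the step names automatically from the class names.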

Conclusions

Here is the summary of what you learned:

  • Use a machine learning pipeline (Sklearn implementation) to automate most of the data transformation and estimation tasks.
  • The make_pipeline function of sklearn.pipeline can be used to create the pipeline.
  • Data transformers must implement the fit and transform methods.
  • The estimator must implement the fit and predict methods.
  • The pipeline fit method is invoked to fit the model using training data.
  • The pipeline predict or score method is invoked to get predictions or to determine model performance scores.
Ajitesh Kumar