In this post, you will learn the concepts behind a machine learning (ML) pipeline and how to build one using the Python Sklearn Pipeline (sklearn.pipeline) package. Knowing how to use sklearn.pipeline effectively for training and testing machine learning models helps automate activities such as feature scaling, feature selection / extraction and training/testing the models. Data scientists working in Python are well advised to get a good understanding of sklearn.pipeline. The following topics are covered in this post:
- Introduction to ML Pipeline
- Sklearn ML Pipeline Python code example
Introduction to ML Pipeline
A machine learning (ML) pipeline, conceptually, represents the sequence of steps, including data transformation and prediction, through which data passes. The outcome of the pipeline is a trained model that can be used for making predictions. The sklearn.pipeline module is a Python implementation of an ML pipeline. Instead of going through the data transformation and model fitting steps separately for the training and test datasets, you can use sklearn.pipeline to automate these steps. Here is a diagram representing a pipeline for training a machine learning model based on supervised learning. Note the following in relation to the Sklearn implementation of a pipeline:
- Every step except the last one is an Sklearn transformer used for purposes such as feature scaling, feature selection or feature extraction. In the diagram given below, these transformers are represented by StandardScaler (feature scaling) and PCA (unsupervised feature extraction / dimensionality reduction). These transformers must implement the fit and transform methods for the pipeline to work.
- The last step is an Sklearn estimator which is used for making the predictions. In the diagram given below, the estimator is represented by ML algorithm implementations such as SVC or RandomForestClassifier. The estimator in the last step must implement the fit method, and also predict if the pipeline is used for making predictions.
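- For supervised learning, the input is the training data and labels, and the output is the fitted model.
- Invoking the fit method on the pipeline instance executes the pipeline on the training data: each transformer is fitted and applied in turn, and the final estimator is fitted on the transformed data. This is illustrated in the code example in the next section.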
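To make the training flow concrete, here is a minimal sketch that writes out by hand what fitting a pipeline does and compares the result with an actual pipeline. The dataset (load_breast_cancer), the 70/30 split, the choice of SVC and the explicit svd_solver="full" setting are assumptions made for illustration and are not taken from the diagram.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: an illustrative dataset standing in for "training data and labels".
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# The training steps written out by hand
scaler = StandardScaler()
pca = PCA(n_components=8, svd_solver="full")  # deterministic solver, chosen for the comparison
clf = SVC()

X_scaled = scaler.fit_transform(X_train)   # transformer 1: fit, then transform
X_reduced = pca.fit_transform(X_scaled)    # transformer 2: fit, then transform
clf.fit(X_reduced, y_train)                # final estimator: fit only

# The same sequence expressed as a pipeline
pipeline = make_pipeline(
    StandardScaler(), PCA(n_components=8, svd_solver="full"), SVC()
)
pipeline.fit(X_train, y_train)

# Compare the fitted state of the manual steps with the pipeline's steps
same_components = bool(
    np.allclose(pipeline.named_steps["pca"].components_, pca.components_)
)
same_support = bool(
    np.array_equal(pipeline.named_steps["svc"].support_, clf.support_)
)
print(same_components, same_support)
```

The single call to pipeline.fit replaces the three manual calls, and the intermediate arrays no longer need to be tracked by hand.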
Here is what the above pipeline looks like for test data. Note the following in the diagram given below:
- The input can be the test data and labels.
- The output can be either predictions or a model performance score.
- The transform method (not fit) is invoked on the test data in the data transformation stages, so the parameters learned from the training data are reused.
- Methods such as predict or score are invoked on the pipeline instance to get predictions or the model score.
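The test-time flow can be checked in the same spirit. The sketch below fits a pipeline, then reproduces predict and score by calling transform on each fitted transformer and predict on the fitted estimator. The dataset, the split and the RandomForestClassifier settings are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: an illustrative dataset and a 70/30 stratified split.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=8),
    RandomForestClassifier(n_estimators=50, random_state=1),
)
pipeline.fit(X_train, y_train)

# Pull the already fitted steps out of the pipeline
# (make_pipeline names each step after its lowercased class name)
scaler = pipeline.named_steps["standardscaler"]
pca = pipeline.named_steps["pca"]
forest = pipeline.named_steps["randomforestclassifier"]

# Test data is only transformed, never used for fitting
X_test_transformed = pca.transform(scaler.transform(X_test))
manual_predictions = forest.predict(X_test_transformed)
manual_accuracy = accuracy_score(y_test, manual_predictions)

# The pipeline performs the same chain internally
pipeline_predictions = pipeline.predict(X_test)
pipeline_accuracy = pipeline.score(X_test, y_test)

predictions_match = bool(np.array_equal(manual_predictions, pipeline_predictions))
print(predictions_match, manual_accuracy, pipeline_accuracy)
```

For a classifier as the last step, score reports mean accuracy, which is why it agrees with accuracy_score here.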
Sklearn ML Pipeline Python Code Example
Here is a Python code example for creating an Sklearn Pipeline, fitting the pipeline and using the pipeline for prediction. The following points are covered in the code below:
- The pipeline is instantiated by passing its components/steps in order: feature scaling, feature extraction and the estimator used for prediction. The last step must be the algorithm that does the prediction. In the code below, the sequential steps are StandardScaler and PCA, followed by the final estimator RandomForestClassifier (used for prediction).
- Fit is invoked on the pipeline instance to perform the following activities in sequence:
  - Data transformation using transformers for feature scaling, dimensionality reduction etc. Transformers must implement the fit and transform methods.
  - Model fitting using the estimator. The estimator must implement fit, and predict when it is used for prediction.
- The predict or score method is called on the pipeline instance to make predictions on the test data or to score the model performance, respectively.
```python
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Note: X_train, X_test, y_train and y_test are not defined in this snippet;
# they are assumed to come from an earlier train_test_split on your data.

# Create the pipeline
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=8),
    RandomForestClassifier(criterion='gini', n_estimators=50,
                           max_depth=2, random_state=1),
)

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Score the model
print('Model Accuracy: %.3f' % pipeline.score(X_test, y_test))
```
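The snippet above relies on X_train, X_test, y_train and y_test, which it does not define. One way to run it end to end is sketched below. The dataset (load_breast_cancer, which has 30 numeric features so that PCA(n_components=8) is valid) and the 70/30 stratified split are assumptions for illustration; the post does not state which data was used.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: an illustrative dataset; substitute your own features and labels.
X, y = load_breast_cancer(return_X_y=True)

# Assumption: a 70/30 stratified split with a fixed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Same pipeline definition as in the post
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=8),
    RandomForestClassifier(
        criterion="gini", n_estimators=50, max_depth=2, random_state=1
    ),
)

# Fit on the training data only
pipeline.fit(X_train, y_train)

# Score and predict on the held-out test data
accuracy = pipeline.score(X_test, y_test)
predictions = pipeline.predict(X_test)
print("Model Accuracy: %.3f" % accuracy)
print("First five predictions:", predictions[:5])
```

Splitting before fitting matters here: because the scaler and PCA are fitted inside the pipeline on X_train only, no statistics from the test data leak into training.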
Here is a summary of what you learned:
- Use a machine learning pipeline (the Sklearn implementation) to automate most of the data transformation and estimation tasks.
- The make_pipeline function of sklearn.pipeline can be used to create the pipeline.
- Data transformers must implement the fit and transform methods.
- The estimator must implement the fit method, and predict when the pipeline is used for prediction.
- The pipeline fit method is invoked to fit the model using training data.
- The pipeline predict or score method is invoked to get predictions or to determine the model performance score.
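The fit/transform contract for transformers can also be seen by writing one yourself. The sketch below defines a hypothetical PercentileClipper (not part of Sklearn) that learns per-feature percentile bounds in fit and clips values in transform, then uses it as a pipeline step. The synthetic data and the choice of LogisticRegression are assumptions for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class PercentileClipper(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: clips each feature to percentiles learned in fit."""

    def __init__(self, lower=1.0, upper=99.0):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Learn the per-feature bounds from the training data
        X = np.asarray(X, dtype=float)
        self.low_ = np.percentile(X, self.lower, axis=0)
        self.high_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        # Apply the learned bounds to any data, training or test
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.low_, self.high_)


# Assumption: synthetic data, generated only to exercise the pipeline
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipeline = make_pipeline(PercentileClipper(), StandardScaler(), LogisticRegression())
pipeline.fit(X, y)
predictions = pipeline.predict(X)

# The fitted custom step is available under its lowercased class name
clipper = pipeline.named_steps["percentileclipper"]
clipped = clipper.transform(X)
print(clipped.shape, predictions.shape)
```

Because the class provides fit and transform, the pipeline treats it exactly like StandardScaler or PCA.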