In this post, you will learn the concepts behind the Support Vector Machine (SVM) with the help of a Python code example that builds a machine learning classification model. We will use the Python Sklearn (scikit-learn) package to build the model. As data scientists, it is important to get a good grasp of the SVM algorithm and related aspects.
What is Support Vector Machine (SVM)?
Support vector machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression tasks. SVM for classification is often called support vector classification (SVC), and SVM for regression is called support vector regression (SVR). In this post, we will learn about the SVM classifier.
The main idea behind the SVM classifier is to find a hyperplane that maximally separates the data points of different classes. In other words, we are looking for the largest margin between the two classes. Given labeled training data (supervised learning), the SVM classification algorithm outputs an optimal hyperplane, which is then used to classify new data points. The Support Vector Machine classifier is also called a maximum margin classifier, meaning that it finds the line or hyperplane that has the largest distance to the nearest training data points of any class.
Let’s take an example to understand Support Vector Machine better. Say you have been asked to predict whether a customer will churn or not, and you have all their past transaction records as well as demographic information. After exploring the data, you find that there is not much difference between the average transaction amount of customers who churned and those who didn’t. You also find that most of the customers who churned live relatively far from the city center. Based on these findings, you decide to use the Support Vector Machine classification algorithm to build your prediction model. A model trained using the SVM classification algorithm will be able to classify customers as high risk (likely to churn) or otherwise.
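There are some key concepts that are important to understand when working with SVMs. First, the data points that are closest to the hyperplane are called support vectors. These points have a direct impact on the position and orientation of the hyperplane. Second, two key hyperparameters control the SVM model: C and gamma. C controls the trade-off between maximizing the margin and minimizing training error: a small C tolerates more training errors in exchange for a wider margin, while a large C penalizes training errors more heavily. Gamma is the coefficient of non-linear kernels such as RBF. It controls how far the influence of a single training example reaches and therefore how curved the decision boundary can be. Gamma has no effect when a linear kernel is used.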
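To make support vectors and the role of C concrete, here is a minimal sketch. The toy data (two overlapping Gaussian clusters) and the two C values are assumptions chosen for illustration only. A smaller C typically yields a wider margin and therefore more support vectors, though the exact counts depend on the data.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two overlapping Gaussian clusters in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Fit the same RBF-kernel model with a small and a large C.
sv_counts = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="rbf", C=C, gamma="scale").fit(X, y)
    # support_vectors_ holds the training points that define the boundary.
    sv_counts[C] = clf.support_vectors_.shape[0]
    print(f"C={C}: {sv_counts[C]} support vectors, "
          f"training accuracy {clf.score(X, y):.3f}")
```

Only the support vectors are needed at prediction time, which is why their number is a useful rough indicator of model complexity.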
As an example, let’s say we have a dataset with two features (x1 and x2) and two classes (0 and 1). We can visualize this data by plotting it in a two-dimensional space, with each point colored according to its class label. Look at the diagram below.
In the above case, we can see that there are many different straight lines that can perfectly separate the two classes. The question is which of them to choose. The Support Vector Machine algorithm picks the one that leaves the largest margin on both sides. As mentioned above, training an SVM model means finding the hyperplane (dashed line in the picture below) that separates the data belonging to the two classes by the largest possible margin. The points closest to this hyperplane are called support vectors. Note this in the diagram given below.
The blue square points represent one class and the red dots represent another class. The black line is the decision boundary learned by an SVM. As you can see, the SVM has placed the boundary in such a way as to maximize the margin between the two classes.
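The margin can also be inspected numerically. For a linear SVM with weight vector w, the width of the margin is 2 / ||w||, and the support vectors sit on the margin edges, where the decision function equals +1 or -1. The sketch below uses a handful of hypothetical, linearly separable points (not the data in the diagram) and a very large C to approximate a hard-margin SVM.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2D points (not the data in the diagram).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane

# Distance between the two margin edges.
margin_width = 2.0 / np.linalg.norm(w)

# Support vectors lie on the margin edges: decision value close to +1 or -1.
sv_scores = clf.decision_function(clf.support_vectors_)

print("w =", w, "b =", b)
print("Margin width: %.3f" % margin_width)
print("Support vectors:\n", clf.support_vectors_)
print("Decision values at support vectors:", np.round(sv_scores, 3))
```

Moving or removing any point that is not a support vector leaves the boundary unchanged, as long as that point stays outside the margin.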
Support vector machines are a powerful tool for classification, but like any machine learning algorithm, they need careful hyperparameter tuning to perform well. The most important hyperparameters are the kernel function and the regularization parameter C. The kernel function determines how data points are implicitly mapped into a higher-dimensional space, and C controls the trade-off between a wide margin (a simpler model) and fitting the training data closely (with a higher risk of overfitting). Other settings can also matter, including the kernel coefficient gamma, the maximum number of iterations, and the stopping tolerance of the solver. Note that Sklearn's SVC has no learning rate; a learning rate only applies when a linear SVM is trained with stochastic gradient descent, for example with SGDClassifier and hinge loss. By carefully tuning these hyperparameters, it is possible to achieve significantly better performance from a Support Vector Machine.
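As a rough sketch of what such tuning can look like, the example below runs a small grid search over the kernel, C and gamma with Sklearn's GridSearchCV, using the built-in breast cancer dataset. The grid values are assumptions picked for illustration, not recommended settings.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

bc = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    bc.data, bc.target, test_size=0.3, random_state=1, stratify=bc.target
)

# Scaling lives inside the pipeline so that each cross-validation fold
# is scaled using its own training part only.
pipe = make_pipeline(StandardScaler(), SVC())

# Illustrative grid (an assumption, not a recommendation).
# gamma is ignored by the linear kernel.
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1.0, 10.0],
    "svc__gamma": ["scale", 0.01],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy: %.3f" % search.best_score_)
print("Test accuracy: %.3f" % search.score(X_test, y_test))
```

The test set is only used once, at the very end, so the reported test accuracy is not influenced by the search itself.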
Support vector machine (SVM) Python example
The following steps are covered for training an SVM model in Python:
- Load the data
- Create training and test split
- Perform feature scaling
- Instantiate an SVC classifier
- Fit the model
- Measure the model performance
First and foremost we will load appropriate Sklearn modules and classes.
# Basic packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sklearn modules & classes
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
from sklearn import metrics
Let's get started by loading the data set and creating the training and test split. Pay attention to the stratification used when creating the split: passing stratify=y keeps the class proportions the same in the training and test sets. The train_test_split function of sklearn.model_selection is used for creating the training and test split.
# Load the data set; in this example, the breast cancer dataset is loaded.
bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target

# Create training and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
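A quick way to see what stratification does is to compare the class proportions of the full data set with those of the two splits. The following self-contained sketch repeats the split above and prints the proportions; the helper name class_share is hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

def class_share(labels):
    """Return the fraction of samples in each class (hypothetical helper)."""
    return np.bincount(labels) / labels.shape[0]

# With stratify=y, the three rows should show nearly identical proportions.
for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
    print(f"{name:5s} n={labels.shape[0]:3d} class shares={np.round(class_share(labels), 3)}")
```

Without stratify=y, the proportions in a small test set can drift away from those of the full data, which makes accuracy figures harder to compare.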
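The next step is to perform feature scaling. SVMs are sensitive to the scale of the features, so scaling makes sure that no feature dominates simply because of its units. The StandardScaler class of sklearn.preprocessing is used; it standardizes each feature to zero mean and unit variance. Note that the scaler is fit on the training data only and then applied to both the training and test sets.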
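# Fit the scaler on the training data only
sc = StandardScaler()
sc.fit(X_train)

# Apply the same transformation to training and test data
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)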
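To check what the scaler has done, one can look at the per-feature mean and standard deviation after the transformation. This self-contained sketch repeats the steps above; the variable names train_means, train_stds and test_means are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

bc = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    bc.data, bc.target, test_size=0.3, random_state=1, stratify=bc.target
)

sc = StandardScaler().fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# On the training data, every feature ends up with mean 0 and std 1.
train_means = X_train_std.mean(axis=0)
train_stds = X_train_std.std(axis=0)

# The test data is scaled with the training statistics, so its means
# are close to 0 but not exactly 0.
test_means = X_test_std.mean(axis=0)

print("Raw feature ranges differ widely:", X_train.min(), "to", X_train.max())
print("Max |mean| on train:", np.abs(train_means).max())
print("Max |std - 1| on train:", np.abs(train_stds - 1.0).max())
print("Max |mean| on test:", np.abs(test_means).max())
```

Fitting the scaler on the training data alone keeps information about the test set from leaking into the model.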
The next step is to instantiate an SVC (Support Vector Classifier) and fit the model. The SVC class of the sklearn.svm module is used.
# Instantiate the Support Vector Classifier (SVC)
svc = SVC(C=1.0, random_state=1, kernel='linear')

# Fit the model
svc.fit(X_train_std, y_train)
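Once the model is fitted, its learned attributes can be inspected. With a linear kernel, SVC exposes the weight vector as coef_, so one rough way to see which features drive the decision is to rank them by absolute weight. The sketch below is self-contained and repeats the earlier steps; the top-5 cut-off is an arbitrary choice.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

bc = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    bc.data, bc.target, test_size=0.3, random_state=1, stratify=bc.target
)
sc = StandardScaler().fit(X_train)
X_train_std = sc.transform(X_train)

svc = SVC(C=1.0, random_state=1, kernel="linear").fit(X_train_std, y_train)

# Number of support vectors found for each class.
print("Support vectors per class:", svc.n_support_)

# coef_ is only available for the linear kernel: one weight per feature.
print("Weight vector shape:", svc.coef_.shape)

# Rank features by absolute weight (meaningful here because the features
# were standardized first).
top = np.argsort(np.abs(svc.coef_[0]))[::-1][:5]
for i in top:
    print(f"{bc.feature_names[i]}: {svc.coef_[0][i]:+.3f}")
```

For non-linear kernels such as RBF there is no weight vector in the original feature space, so this kind of inspection does not carry over directly.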
Finally, it is time to measure the model performance. Here is the code for doing the same:
# Make the predictions
y_predict = svc.predict(X_test_std)

# Measure the performance
print("Accuracy score %.3f" % metrics.accuracy_score(y_test, y_predict))
The accuracy of the model on the test set turns out to be 0.953.
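Accuracy alone hides which kind of mistakes the model makes. As one possible extension, the sketch below wraps the scaler and the classifier into a single Sklearn Pipeline, then prints a confusion matrix, a per-class report and a 5-fold cross-validated accuracy on the training data. Using a pipeline and 5 folds are choices made for this example rather than part of the steps above.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

bc = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    bc.data, bc.target, test_size=0.3, random_state=1, stratify=bc.target
)

# Scaler and classifier combined: fit() scales and trains in one call.
model = make_pipeline(
    StandardScaler(),
    SVC(C=1.0, kernel="linear", random_state=1),
)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_test, y_predict)
print(cm)

# Precision, recall and F1 score for each class.
print(metrics.classification_report(y_test, y_predict,
                                    target_names=bc.target_names))

# Cross-validation gives a less split-dependent estimate than one test set.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
```

For a medical data set like this one, recall on the malignant class is often more important than overall accuracy, and the report makes that figure visible.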