In this post, you will learn techniques for determining whether a given data set is linear or non-linear. The appropriate technique depends on the type of machine learning problem (classification or regression) you are trying to solve. For a data scientist, knowing whether the data is linear matters because it helps in choosing appropriate algorithms to train a high-performance model. The techniques covered are the following:

• Use scatter plots when dealing with classification problems
• Use scatter plots and the least squares error of a simple linear regression when dealing with regression problems

Use Scatter Plots for Classification Problems

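For a classification problem, the simplest way to find out whether the data is linear or non-linear (that is, linearly separable or not) is to draw 2-dimensional scatter plots in which each class has its own color or marker. Keep in mind that a 2-dimensional plot shows only two features at a time, so classes that overlap in one pair of features may still be separable when more features are used. Take a look at the following examples to understand linearly separable and inseparable datasets.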

Here is an example of a linear (linearly separable) data set. The data set used is the IRIS data set from the sklearn.datasets package. The data represents two classes, Setosa and Versicolor. Note that the data, drawn with green and black marks, can easily be separated with a linear hyperplane (a straight line in two dimensions).

The following code draws the scatter plot for this example:

from sklearn import datasets
import matplotlib.pyplot as plt

# Load the IRIS dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Create a scatter plot of sepal length (column 0) against sepal width (column 1)
# Rows 0-49 are setosa, rows 50-99 are versicolor
plt.scatter(X[:50, 0], X[:50, 1], color='green', marker='o', label='setosa')
plt.scatter(X[50:100, 0], X[50:100, 1], color='black', marker='x', label='versicolor')
plt.xlabel('sepal length [cm]')
plt.ylabel('sepal width [cm]')
plt.legend(loc='upper left')
plt.show()


Here is an example of a non-linear (not linearly separable) data set. The data set used is again the IRIS data set from the sklearn.datasets package. The data represents two classes, Versicolor and Virginica. Note that the data, drawn with black and red marks, cannot be separated with a linear hyperplane. Thus, this data can be called non-linear data.

The following code draws the scatter plot used to identify the non-linear dataset:

from sklearn import datasets
import matplotlib.pyplot as plt

# Load the IRIS dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Create a scatter plot of sepal length (column 0) against sepal width (column 1)
# Rows 50-99 are versicolor, rows 100-149 are virginica
plt.scatter(X[50:100, 0], X[50:100, 1], color='black', marker='x', label='versicolor')
plt.scatter(X[100:150, 0], X[100:150, 1], color='red', marker='+', label='virginica')
plt.xlabel('sepal length [cm]')
plt.ylabel('sepal width [cm]')
plt.legend(loc='upper left')
plt.show()
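
A scatter plot shows only two features at a time, so it can help to cross-check the visual impression numerically. The sketch below is one possible way to do that and is not part of the original method: it assumes SciPy is installed, and it uses a hypothetical helper, is_linearly_separable, that asks a linear-programming solver whether any line w . x + b puts every point on the correct side. The helper name and the choice of scipy.optimize.linprog are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linprog
from sklearn import datasets


def is_linearly_separable(X, y):
    """Hypothetical helper: True if a hyperplane splits the two classes in y."""
    classes = np.unique(y)
    if len(classes) != 2:
        raise ValueError("expected exactly two classes")
    # Encode the two classes as -1 and +1.
    signs = np.where(y == classes[0], -1.0, 1.0)
    # Unknowns are [w_1, ..., w_d, b]; the column of ones multiplies the bias b.
    features = np.hstack([X, np.ones((X.shape[0], 1))])
    # Separability means signs_i * (w . x_i + b) >= 1 for every sample.
    # linprog expects A_ub @ z <= b_ub, so both sides are negated.
    A_ub = -signs[:, None] * features
    b_ub = -np.ones(X.shape[0])
    # The objective is all zeros: only feasibility matters here.
    result = linprog(c=np.zeros(features.shape[1]), A_ub=A_ub, b_ub=b_ub,
                     bounds=(None, None))
    return result.status == 0  # status 0 means a feasible solution was found


iris = datasets.load_iris()
X = iris.data[:, :2]  # sepal length and sepal width, as in the plots above
y = iris.target

# Rows 0-99 hold setosa and versicolor; rows 50-149 hold versicolor and virginica.
setosa_vs_versicolor = is_linearly_separable(X[:100], y[:100])
versicolor_vs_virginica = is_linearly_separable(X[50:150], y[50:150])
print("setosa vs versicolor separable:", setosa_vs_versicolor)
print("versicolor vs virginica separable:", versicolor_vs_virginica)
```

Passing more columns of iris.data to the helper checks separability in more than two dimensions, which a single scatter plot cannot show.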


Use Simple Linear Regression for Regression Problems

When you are predicting a numerical value, the technique is to draw a scatter plot of the data, fit a simple linear regression to it, and then check the least squares error. If the error is low relative to the spread of the target values, the dataset can be treated as linear in nature; otherwise it is likely non-linear. Here is how the scatter plot would look for a linear data set when dealing with a regression problem.
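
The least squares check described above can be sketched in a few lines. The example below is an illustration on synthetic data, not data from this post: it assumes scikit-learn's LinearRegression and mean_squared_error, and the two generated data sets (a noisy straight line and a noisy sine curve), the helper straight_line_mse, and all variable names are made up for the demonstration. Because the mean squared error depends on the scale of the target, it is compared here against the known noise level rather than read in isolation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100).reshape(-1, 1)  # one feature, 100 samples
noise = rng.normal(0.0, 1.0, size=100)      # Gaussian noise with variance 1

y_linear = 3.0 * x.ravel() + 2.0 + noise     # straight line plus noise
y_curved = 10.0 * np.sin(x.ravel()) + noise  # sine curve plus noise


def straight_line_mse(x, y):
    """Fit y = a * x + b by least squares and return the mean squared error."""
    model = LinearRegression().fit(x, y)
    return mean_squared_error(y, model.predict(x))


mse_linear = straight_line_mse(x, y_linear)
mse_curved = straight_line_mse(x, y_curved)
print(f"MSE on linear data: {mse_linear:.2f}")
print(f"MSE on curved data: {mse_curved:.2f}")
```

On the linear data the error should sit near the noise variance of 1, while on the sine data the straight line leaves most of the variation unexplained and the error is far larger.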

In addition to the above, you could also fit a regression model and calculate the R-squared value. If the value is close to 1, the data set can be seen as linear.
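
As a rough illustration of the R-squared check, the sketch below fits scikit-learn's LinearRegression to two synthetic data sets and reads R-squared from its score method. The data sets (a noisy line and a noisy parabola on a symmetric range) and the names used are assumptions for the demonstration only. A high R-squared does not prove linearity on its own, so it is worth looking at the scatter plot or the residuals as well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x = rng.uniform(-3.0, 3.0, size=200).reshape(-1, 1)  # one feature, 200 samples
noise = rng.normal(0.0, 0.5, size=200)

y_linear = 2.0 * x.ravel() + 1.0 + noise  # linear relationship
y_curved = x.ravel() ** 2 + noise         # quadratic relationship

# LinearRegression.score returns R-squared on the data passed to it.
r2_linear = LinearRegression().fit(x, y_linear).score(x, y_linear)
r2_curved = LinearRegression().fit(x, y_curved).score(x, y_curved)
print(f"R-squared on linear data: {r2_linear:.3f}")
print(f"R-squared on curved data: {r2_curved:.3f}")
```

The straight-line fit should score close to 1 on the linear data and much lower on the parabola, where a line cannot follow the curve.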