In this post, you will learn techniques for determining whether **a given data set is linear or non-linear.** The technique to apply depends on the type of **machine learning** problem you are trying to solve (classification or regression). For a data scientist, it is important to know whether the data is linear, because that helps in choosing appropriate algorithms to train a high-performance model. The techniques covered are the following:

- Use scatter plots when dealing with classification problems.
- Use scatter plots and the least squares error of a simple linear regression model when dealing with regression problems.

## Use Scatter Plots for Classification Problems

For a classification problem, **the simplest way to find out whether the data is linear or non-linear** (linearly separable or not) is to **draw 2-dimensional scatter plots** with each class marked differently. The following examples show a linearly separable data set and one that is not linearly separable.

Here is an example of a linear (linearly separable) data set. The data set used is the **Iris** data set from the **sklearn.datasets** package. The data represents two classes, Setosa and Versicolor. **Note that the green (Setosa) and black (Versicolor) marks can easily be separated with a straight line, that is, a linear hyperplane.**

The following code draws this scatter plot:

    from sklearn import datasets
    import matplotlib.pyplot as plt

    # Load the Iris data set
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # Rows 0-49 are setosa and rows 50-99 are versicolor.
    # Column 0 is sepal length and column 1 is sepal width.
    plt.scatter(X[:50, 0], X[:50, 1], color='green', marker='o', label='setosa')
    plt.scatter(X[50:100, 0], X[50:100, 1], color='black', marker='x', label='versicolor')
    plt.xlabel('sepal length [cm]')
    plt.ylabel('sepal width [cm]')
    plt.legend(loc='upper left')
    plt.show()

Here is an example of a non-linear data set, that is, one that is not linearly separable. The data set used is again the **Iris** data set from the **sklearn.datasets** package. The data represents two classes, Virginica and Versicolor. **Note that the black (Versicolor) and red (Virginica) marks cannot be separated with a linear hyperplane.** Thus, this data can be called non-linear.

The following code draws the **scatter plot** used to identify the non-linear data set:

    from sklearn import datasets
    import matplotlib.pyplot as plt

    # Load the Iris data set
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # Rows 50-99 are versicolor and rows 100-149 are virginica.
    # Column 0 is sepal length and column 1 is sepal width.
    plt.scatter(X[50:100, 0], X[50:100, 1], color='black', marker='x', label='versicolor')
    plt.scatter(X[100:150, 0], X[100:150, 1], color='red', marker='+', label='virginica')
    plt.xlabel('sepal length [cm]')
    plt.ylabel('sepal width [cm]')
    plt.legend(loc='upper left')
    plt.show()
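
A scatter plot is judged by eye, so a rough numeric cross-check can be useful. The sketch below is not part of the scatter plot technique itself: it assumes that fitting a linear classifier (here scikit-learn's `SVC` with a linear kernel and a large `C`, both illustrative choices) and looking at its training accuracy is an acceptable proxy. The helper name `linear_training_accuracy` is hypothetical.

```python
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris.data[:, :2]  # same two features as the plots: sepal length, sepal width
y = iris.target


def linear_training_accuracy(X_pair, y_pair):
    """Fit a linear classifier and return its accuracy on the data it was fit on."""
    # A large C makes the soft-margin SVM behave close to a hard-margin one
    # (illustrative value, an assumption rather than a tuned setting).
    clf = SVC(kernel="linear", C=1000.0)
    clf.fit(X_pair, y_pair)
    return clf.score(X_pair, y_pair)


# Rows 0-99: setosa and versicolor. Rows 50-149: versicolor and virginica.
acc_setosa_versicolor = linear_training_accuracy(X[:100], y[:100])
acc_versicolor_virginica = linear_training_accuracy(X[50:], y[50:])

print("setosa vs versicolor:   ", acc_setosa_versicolor)
print("versicolor vs virginica:", acc_versicolor_virginica)
```

A training accuracy of 1.0 means the classifier found a separating line. A lower value suggests, but does not by itself prove, that the two classes are not linearly separable in these two features.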

## Use a Simple Regression Method for Regression Problems

When you are **predicting a numerical value**, the technique is to **use scatter plots** and also to **apply simple linear regression** to the data set and then check the least squares error. If the least squares error is low, meaning the fitted line tracks the data closely, the data set can be treated as linear; otherwise it is likely non-linear. In a scatter plot of a linear data set, the points fall roughly along a straight line.
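
As a minimal sketch of the least squares check, the example below fits a straight line to two synthetic data sets and compares the mean squared error. The data, the noise level, and the helper name `least_squares_error` are all assumptions made for illustration; they do not come from a real data set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (an assumption for illustration): one linear and one curved target
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200).reshape(-1, 1)
noise = rng.normal(0.0, 1.0, size=200)
y_linear = 3.0 * x.ravel() + 2.0 + noise
y_curved = (x.ravel() - 5.0) ** 2 + noise


def least_squares_error(x, y):
    """Fit a straight line by least squares and return the mean squared error."""
    model = LinearRegression().fit(x, y)
    return mean_squared_error(y, model.predict(x))


mse_linear = least_squares_error(x, y_linear)
mse_curved = least_squares_error(x, y_curved)

print("MSE on the linear data:", mse_linear)
print("MSE on the curved data:", mse_curved)
```

The error is measured in the squared units of the target, so it is only meaningful relative to how much the target itself varies. Here both targets share the same noise, which is what makes the two numbers comparable.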

In addition, you could **fit a regression model** and calculate the **R-squared value**. If the value is close to 1, the data set can be regarded as linear.
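
The R-squared check can be sketched as follows. The two synthetic data sets are assumptions chosen for illustration, one generated from a straight line and one from a parabola; `r2_score` and `LinearRegression` are standard scikit-learn calls.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 200).reshape(-1, 1)
y_linear = 3.0 * x.ravel() + 2.0 + rng.normal(0.0, 1.0, size=200)
y_curved = (x.ravel() - 5.0) ** 2 + rng.normal(0.0, 1.0, size=200)

# Fit a straight line to each target and score it on the same data
r2_linear = r2_score(y_linear, LinearRegression().fit(x, y_linear).predict(x))
r2_curved = r2_score(y_curved, LinearRegression().fit(x, y_curved).predict(x))

print("R-squared on the linear data:", r2_linear)
print("R-squared on the curved data:", r2_curved)
```

Unlike the raw least squares error, R-squared is scaled by the variance of the target, so the same threshold can be applied across data sets. What counts as "close to 1" remains a judgment call.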
