In this post, you will learn which data structure to use, **Pandas Dataframe** or **Numpy array**, when working with **Scikit-learn** libraries. As a data scientist, it is important to understand the difference between a Numpy array and a Pandas Dataframe, and when to use each.

Here are some facts:

- Scikit-learn was originally developed to work well with Numpy arrays.
- Numpy ndarray provides many convenient and optimized methods for performing mathematical operations on vectors. A Numpy array can be instantiated as follows: **np.array([4, 5, 6])**
- Pandas Dataframe is an in-memory, 2-dimensional tabular representation of data. In simpler words, it can be seen as a spreadsheet with rows and columns. A Pandas Dataframe can also be thought of as a SQL table, while a Numpy array is closer to a C array. For this reason, **it is often found to be more convenient for data preprocessing**, thanks to some of the following useful features:
    - Row and column operations such as addition / removal of columns, extracting row / column information etc. Some of the most commonly used methods are **head(), tail(), info(), describe()** etc.
    - Group operations (here the Pandas dataframe is the winner due to ease of use)
    - Storing items of different types
    - Creation of pivot tables
    - Ease of creation of plots using Matplotlib
    - Time-series functionalities
- **Pandas dataframe columns are typically stored as Numpy arrays**, and many dataframe operations are thin wrappers around Numpy operations. **It is recommended to use Numpy arrays, whenever possible,** with Scikit-learn libraries, since Scikit-learn's handling of Numpy data is the most mature.
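The facts above can be illustrated with a small sketch. The column names and values below are made up for illustration and are not from any real dataset; the sketch only shows a vectorized Numpy operation next to a few of the Pandas preprocessing operations listed above.

```python
import numpy as np
import pandas as pd

# Numpy array: homogeneous values with vectorized math
vec = np.array([4, 5, 6])
doubled = vec * 2      # element-wise multiplication
total = vec.sum()      # sum of all elements

# Pandas Dataframe: labeled columns that may hold different types
# (hypothetical data, for illustration only)
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "year": [2020, 2020, 2021, 2021],
    "sales": [10.0, 20.0, 30.0, 40.0],
})

# Column operations: add a derived column, then remove it again
df["sales_k"] = df["sales"] / 1000
df = df.drop(columns=["sales_k"])

# Group operation: mean sales per city
mean_by_city = df.groupby("city")["sales"].mean()

# Pivot table: cities as rows, years as columns
pivot = df.pivot_table(index="city", columns="year", values="sales")

# A numeric column can be handed over as a Numpy array
sales_array = df["sales"].to_numpy()

print(doubled, total)
print(mean_by_city)
print(pivot)
print(type(sales_array))
```

Grouping and pivoting take one line each on the Dataframe, while the element-wise math on `vec` needs no labels at all, which is roughly the division of labor described above.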

## How to Convert Dataframe to Numpy Array?

Here is the code which can be used to convert Pandas dataframe to Numpy array:

```
import pandas as pd

# Load data as Pandas Dataframe
df = pd.read_csv("...")

# Convert dataframe to Numpy array.
# df.values also works; the pandas docs recommend to_numpy().
arr = df.to_numpy()
print(arr)
```
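As a self-contained variant that does not depend on a CSV file, here is a minimal sketch of converting a Dataframe to Numpy arrays and passing them to a Scikit-learn estimator. The column names `x1`, `x2`, `y` and the data are hypothetical, and `LinearRegression` is chosen only as a simple example estimator.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data constructed so that y = 2*x1 + 3*x2 + 1
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [2.0, 1.0, 4.0, 3.0],
})
df["y"] = 2 * df["x1"] + 3 * df["x2"] + 1

# Convert the feature columns and the target column to Numpy arrays
X = df[["x1", "x2"]].to_numpy()   # 2-D array, one row per sample
y = df["y"].to_numpy()            # 1-D array of targets

# Fit a Scikit-learn estimator on the Numpy arrays
model = LinearRegression()
model.fit(X, y)

print(isinstance(X, np.ndarray), X.shape, y.shape)
print(model.coef_, model.intercept_)
```

Note that this works cleanly because all selected columns are numeric. Converting a Dataframe that mixes strings and numbers produces an array of dtype `object`, so non-numeric columns usually need to be encoded or dropped first.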

## Conclusion

In this post, you learned about the **difference** between **Numpy array** and **Pandas Dataframe**. Simply put, use a **Numpy array** when there are **complex mathematical operations** to be performed. Use a **Pandas dataframe** for convenient **data preprocessing**, including group operations, creation of Matplotlib plots, and row and column operations. In practice, you can use both, choosing between Pandas Dataframe and Numpy array based on your data preprocessing and data processing needs.
