Feature Extraction using PCA – Python Example


In this post, you will learn how to use principal component analysis (PCA) for extracting important features (a technique also termed feature extraction) from a list of given features. As a machine learning / data science practitioner, it is important to learn the PCA technique for feature extraction, as it helps you understand the data in terms of the explained variance contributed by each direction in the dataset. The following topics are covered in this post:

  • What is principal component analysis?
  • PCA algorithm for feature extraction
  • PCA Python implementation step-by-step
  • PCA Python Sklearn example

What is Principal Component Analysis?

Principal component analysis (PCA) is an unsupervised linear transformation technique which is primarily used for feature extraction and dimensionality reduction. It aims to find the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with equal or fewer dimensions than the original one. In the diagram given below, note the directions of maximum variance of the data, represented by PC1 (the direction of greatest variance) and PC2 (the direction of second-greatest variance).

Fig 1. PCA – Directions of maximum variance
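
To make the idea of principal directions concrete, here is a minimal sketch using a small made-up 2-dimensional dataset (the numbers are purely illustrative). PC1 and PC2 are recovered as the eigenvectors of the covariance matrix, with the eigenvalues giving the variance along each direction:

import numpy as np

# Made-up 2-D dataset with two correlated features (illustrative only)
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=200)
X = np.column_stack([x1, x2])

# Center the data and compute the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigendecomposition: eigenvectors are the principal directions,
# eigenvalues are the variance captured along each direction
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so reverse to get PC1 first
order = np.argsort(eigenvalues)[::-1]
print("PC1 direction:", eigenvectors[:, order[0]], "variance:", eigenvalues[order[0]])
print("PC2 direction:", eigenvectors[:, order[1]], "variance:", eigenvalues[order[1]])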

It is the direction of maximum variance of the data that helps us identify an object. For example, in a movie, it is often enough to identify objects using just two dimensions, as these dimensions capture the directions of maximum variance. Take a look at a real-world example of the direction of maximum variance in the following picture of the Taj Mahal in Agra. The diagram below represents the side view of the Taj Mahal. There are multiple dimensions carrying information (maximum variance) which help identify the picture as the Taj Mahal.


Fig 2. Taj Mahal Side View

Take a look at the following picture of the Taj Mahal from the top view. Note that there are fewer dimensions in which the information varies, and the variance along them is also small. Hence, from the top view it is difficult to tell whether the picture is of the Taj Mahal at all. Thus, the top view can safely be ignored.


Fig 3. Taj Mahal Top View

Thus, when training a model to classify whether a given structure is the Taj Mahal or not, one would want to ignore the dimensions / features related to the top view, as they do not provide much information (owing to their low variance).

How is PCA different from other feature selection techniques?

What makes PCA different from feature selection techniques such as random forest feature importance, regularization, or forward/backward selection is that it does not require class labels to be present; this is why it is called an unsupervised technique. More details, along with Python code examples, will be shared in future posts.
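
As a quick illustration of this difference, here is a minimal sketch contrasting PCA with a supervised selector (sklearn's SelectKBest with f_regression is used purely as an example; X_train_std and y_train refer to the standardized training data and labels prepared later in this post):

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

# PCA is unsupervised: it only looks at the features X, never at the labels y
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_train_std)

# A supervised selector such as SelectKBest needs the labels as well
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X_train_std, y_train)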

PCA Algorithm for Feature Extraction

The following are the six steps of the principal component analysis (PCA) algorithm:

  1. Standardize the dataset: Standardizing / normalizing the dataset is the first step to take before performing PCA. PCA calculates a new projection of the given dataset representing one or more features, and the new axes are based on the standard deviation of those features. A feature / variable with a high standard deviation will therefore get a higher weight in the calculation of the axes than a feature / variable with a low standard deviation. If the data is normalized / standardized, the standard deviations of all features / variables are measured on the same scale, so all variables get the same weight and PCA calculates the relevant axes appropriately. Note that the data should be standardized / normalized after creating the training / test split. Python's sklearn.preprocessing StandardScaler class can be used for standardizing the dataset.
  2. Construct the covariance matrix: Once the data is standardized, the next step is to create an n x n covariance matrix, where n is the number of dimensions (features) in the dataset. The covariance matrix stores the pairwise covariances between the different features. Note that a positive covariance between two features indicates that the features increase or decrease together, whereas a negative covariance indicates that the features vary in opposite directions. NumPy's cov method can be used to create the covariance matrix.
  3. Perform eigendecomposition of the covariance matrix: The next step is to decompose the covariance matrix into its eigenvectors and eigenvalues. The eigenvectors of the covariance matrix represent the principal components (the directions of maximum variance), whereas the corresponding eigenvalues define their magnitude. NumPy's linalg.eig or linalg.eigh can be used for decomposing the covariance matrix into eigenvectors and eigenvalues.
  4. Select the most important eigenvectors / eigenvalues: Sort the eigenvalues in decreasing order to rank the corresponding eigenvectors. Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace. One can use the concept of explained variance to select the k most important eigenvectors (see the short sketch after this list).
  5. Projection matrix creation of important eigenvectors: Construct a projection matrix, W, from the top k eigenvectors.
  6. Training / test dataset transformation: Finally, transform the d-dimensional input training and test dataset using the projection matrix to obtain the new k-dimensional feature subspace.
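
The step that usually needs the most judgment is step 4, choosing k. Here is a minimal sketch (assuming eigenvalues is the 1-D array returned by the eigendecomposition in step 3) of how one might pick k from the cumulative explained variance:

import numpy as np

# eigenvalues is assumed to come from the eigendecomposition of the covariance matrix (step 3)
eigenvalues = np.sort(eigenvalues)[::-1]                  # largest variance first
explained_variance_ratio = eigenvalues / eigenvalues.sum()
cumulative_variance = np.cumsum(explained_variance_ratio)

# Smallest k whose components together explain at least 75% of the variance
k = int(np.argmax(cumulative_variance >= 0.75)) + 1
print("Number of components to keep:", k)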

PCA Python Implementation Step-by-Step

This section presents custom Python code for extracting the features using PCA. The Kaggle campus recruitment dataset is used.

Fig 4. Dataset for PCA

Here are the steps followed for performing PCA:

  • Perform one-hot encoding to transform categorical data set to numerical data set
  • Perform training / test split of the dataset
  • Standardize the training and test data set
  • Construct covariance matrix of the training data set
  • Construct eigendecomposition of the covariance matrix
  • Select the most important features using explained variance
  • Construct the projection matrix; in the code below, the projection matrix is created using the five eigenvectors that correspond to the five largest eigenvalues, which capture about 75% of the variance in this dataset
  • Transform the training data set into new feature subspace

Here is the custom Python code (without using sklearn.decomposition PCA class) to achieve the above PCA algorithm steps for feature extraction:

#
# Import required packages; df is assumed to be a pandas DataFrame holding
# the Kaggle campus recruitment dataset (loaded earlier, e.g. with pd.read_csv)
#
import numpy as np
import pandas as pd
#
# Perform one-hot encoding
#
categorical_columns = df.columns[df.dtypes == object]  # Find all categorical columns

df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
#
# Create training / test split
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[df.columns[df.columns != 'salary']],
                                                     df['salary'], test_size=0.25, random_state=1)
#
# Standardize the dataset; This is very important before you apply PCA
#
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Import eigh method for calculating eigenvalues and eigenvectors
#
from numpy.linalg import eigh
#
# Determine covariance matrix
#
cov_matrix = np.cov(X_train_std, rowvar=False)
#
# Determine eigenvalues and eigenvectors
#
egnvalues, egnvectors = eigh(cov_matrix)
#
# Determine explained variance and select the most important eigenvectors based on explained variance
#
total_egnvalues = sum(egnvalues)
var_exp = [(i/total_egnvalues) for i in sorted(egnvalues, reverse=True)]
#
# Construct projection matrix using the five eigenvectors that correspond to the top five eigenvalues (largest), to capture about 75% of the variance in this dataset
#
egnpairs = [(np.abs(egnvalues[i]), egnvectors[:, i])
                for i in range(len(egnvalues))]
egnpairs.sort(key=lambda k: k[0], reverse=True)
projectionMatrix = np.hstack((egnpairs[0][1][:, np.newaxis], 
                              egnpairs[1][1][:, np.newaxis],
                              egnpairs[2][1][:, np.newaxis],
                              egnpairs[3][1][:, np.newaxis],
                              egnpairs[4][1][:, np.newaxis]))
#
# Transform the training data set
#
X_train_pca = X_train_std.dot(projectionMatrix)
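
To sanity-check the claim that the top five eigenvectors capture roughly 75% of the variance, one can look at the cumulative explained variance (a small addition reusing the var_exp list computed above):

#
# Cumulative explained variance of the sorted eigenvalues; the fifth entry
# should be close to 0.75 for this dataset
#
cum_var_exp = np.cumsum(var_exp)
print(cum_var_exp[:5])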

PCA Python Sklearn Example

This section presents Python code for extracting the features using the sklearn.decomposition PCA class. The Kaggle campus recruitment dataset is used. Here is a screenshot of the data used. Salary is the label; the goal is to predict the salary.

Fig 4. Dataset for PCA

Here are the steps followed for performing PCA:

  • Perform one-hot encoding to transform categorical data set to numerical data set
  • Perform training / test split of the dataset
  • Standardize the training and test data set
  • Perform PCA by fitting and transforming the training data set into the new feature subspace, and then transforming the test data set with the same fitted PCA
  • As a final step, the transformed dataset can be used for training / testing the model (see the short sketch after the code below)

Here is the Python code to achieve the above PCA algorithm steps for feature extraction:

#
# Import required packages; df is assumed to be a pandas DataFrame holding
# the Kaggle campus recruitment dataset (loaded earlier, e.g. with pd.read_csv)
#
import pandas as pd
#
# Perform one-hot encoding
#
categorical_columns = df.columns[df.dtypes == object]  # Find all categorical columns

df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
#
# Create training / test split
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[df.columns[df.columns != 'salary']],
                                                     df['salary'], test_size=0.25, random_state=1)
#
# Standardize the dataset; This is very important before you apply PCA
#
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Perform PCA; with no n_components argument, all principal components are kept
#
from sklearn.decomposition import PCA
pca = PCA()
#
# Determine transformed features
#
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
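
Once the data has been transformed, one can inspect how much variance each component explains and train a model on the reduced features. Here is a minimal sketch; keeping five components and using LinearRegression as the model are illustrative choices, not part of the recipe above:

#
# Explained variance ratio of each principal component
#
print(pca.explained_variance_ratio_)
#
# Re-fit keeping only the first five components and train a simple model
# on the reduced feature subspace (illustrative model choice)
#
from sklearn.linear_model import LinearRegression
pca_5 = PCA(n_components=5)
X_train_pca = pca_5.fit_transform(X_train_std)
X_test_pca = pca_5.transform(X_test_std)
model = LinearRegression()
model.fit(X_train_pca, y_train)
print(model.score(X_test_pca, y_test))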

Conclusions

Here is a summary of what you learned about applying principal component analysis (PCA) for feature extraction:

  • Feature extraction is about transforming the original features into a new feature subspace while retaining the information present in the original features.
  • The new feature subspace is created by transforming the d-dimensional dataset into a k-dimensional dataset using a projection matrix, which is constructed from the k most important eigenvectors.
  • The most important eigenvectors are selected by sorting the eigenvalues and examining their explained variance ratios.
  • Eigenvectors and eigenvalues are obtained through eigendecomposition of the covariance matrix of the (standardized) dataset.
  • Before performing PCA, the data set must be standardized.