Last updated: 17 Sept, 2024
Principal component analysis (PCA)is a dimensionality reduction technique that reduces the number of dimensions or features in a dataset without sacrificing a lot of information. What if it is told that you could take a dataset with 500 columns, use PCA to reduce it to 50 columns, and still able to retain 90% or more of the information in the original dataset? Wouldn’t that sound like a miracle?
In this post, you will learn about how to use PCA for extracting important features (also termed as feature extraction technique) from a list of given features. As a machine learning / data scientist, it is very important to learn the PCA technique for feature extraction as it helps you visualize the data in the lights of importance of explained variance of data set. The following topics get covered in this post:
Principal component analysis (PCA) is an unsupervised linear transformation technique which is primarily used for dimensionality reduction and feature extraction. It aims to find the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with equal or fewer dimensions than the original one. In the diagram given below, note the directions of maximum variance of data. This is represented using PCA1 (first maximum variance) and PC2 (2nd maximum variance).
It is the direction of maximum variance of data that helps us identify an object. For example, in a movie, it is okay to identify objects by 2-dimensions as these dimensions represent direction of maximum variance. Take a look at a real-world example of understanding direction of maximum variance in the following picture representing Taj Mahal of Agra. The diagram below represents the side view of Taj Mahal. There are multiple dimensions consisting of information (maximum variance) which helps identify the picture as Taj Mahal.
Take a look the following picture of Taj Mahal from top view. Note that there are only fewer dimensions in which information is varying and the variance is also not much. Hence, it is difficult to identify from top view whether the picture is of Taj Mahal. Thus, top view can be ignored easily.
Thus, when training a model to classify whether a given structure is of Taj Mahal or not, one would want to ignore the dimensions / features related to top view as they don’t provide much information (as a result of low variance).
The way PCA is different from other feature selection techniques such as random forest, regularization techniques, forward/backward selection techniques etc is that it does not require class labels to be present (thus called as unsupervised). More details along with Python code example will be shared in future posts.
The following represents 6 steps of principal component analysis (PCA) algorithm:
This section represents custom Python code for extracting the features using PCA. The Kaggle campus recruitment dataset is used.
Here are the steps followed for performing PCA:
Here is the custom Python code (without using sklearn.decomposition PCA class) to achieve the above PCA algorithm steps for feature extraction:
#
# Perform one-hot encoding
#
categorical_columns = df.columns[df.dtypes == object] # Find all categorical columns
df = pd.get_dummies(df, columns = categorical_columns, drop_first=True)
#
# Create training / test split
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = X_train, X_test, y_train, y_test = train_test_split(df[df.columns[df.columns != 'salary']],
df['salary'], test_size=0.25, random_state=1)
#
# Standardize the dataset; This is very important before you apply PCA
#
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Import eigh method for calculating eigenvalues and eigenvectirs
#
from numpy.linalg import eigh
#
# Determine covariance matrix
#
cov_matrix = np.cov(X_train_std, rowvar=False)
#
# Determine eigenvalues and eigenvectors
#
egnvalues, egnvectors = eigh(cov_matrix)
#
# Determine explained variance and select the most important eigenvectors based on explained variance
#
total_egnvalues = sum(egnvalues)
var_exp = [(i/total_egnvalues) for i in sorted(egnvalues, reverse=True)]
#
# Construct projection matrix using the five eigenvectors that correspond to the top five eigenvalues (largest), to capture about 75% of the variance in this dataset
#
egnpairs = [(np.abs(egnvalues[i]), egnvectors[:, i])
for i in range(len(egnvalues))]
egnpairs.sort(key=lambda k: k[0], reverse=True)
projectionMatrix = np.hstack((egnpairs[0][1][:, np.newaxis],
egnpairs[1][1][:, np.newaxis],
egnpairs[2][1][:, np.newaxis],
egnpairs[3][1][:, np.newaxis],
egnpairs[4][1][:, np.newaxis]))
#
# Transform the training data set
#
X_train_pca = X_train_std.dot(projectionMatrix)
This section represents Python code for extracting the features using sklearn.decomposition class PCA. The Kaggle campus recruitment dataset is used. Here is the screenshot of the data used. Salary is the label. The goal is to predict the salary.
Here are the steps followed for performing PCA:
Here is the Python code to achieve the above PCA algorithm steps for feature extraction:
#
# Perform one-hot encoding
#
categorical_columns = df.columns[df.dtypes == object] # Find all categorical columns
df = pd.get_dummies(df, columns = categorical_columns, drop_first=True)
#
# Create training / test split
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = X_train, X_test, y_train, y_test = train_test_split(df[df.columns[df.columns != 'salary']],
df['salary'], test_size=0.25, random_state=1)
#
# Standardize the dataset; This is very important before you apply PCA
#
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Perform PCA
#
from sklearn.decomposition import PCA
pca = PCA()
#
# Determine transformed features
#
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
The following are some of the benefits of using PCA techniques:
Here is the summary of what you learned in relation to applying principal component analysis (PCA) for feature extraction.
Artificial Intelligence (AI) agents have started becoming an integral part of our lives. Imagine asking…
In the ever-evolving landscape of agentic AI workflows and applications, understanding and leveraging design patterns…
In this blog, I aim to provide a comprehensive list of valuable resources for learning…
Have you ever wondered how systems determine whether to grant or deny access, and how…
What revolutionary technologies and industries will define the future of business in 2025? As we…
For data scientists and machine learning researchers, 2024 has been a landmark year in AI…