PCA Explained Variance Concepts with Python Example


In this post, you will learn about the concept of explained variance, one of the key concepts related to principal component analysis (PCA). The concept will be illustrated with Python code examples. The following topics will be covered:

  • What is explained variance?
  • Python code examples of explained variance

What is Explained Variance?

Explained variance refers to the variance explained by each of the principal components (eigenvectors). It can be expressed as the ratio of a principal component's eigenvalue to the sum of the eigenvalues of all eigenvectors. Suppose there are N eigenvectors; then the explained variance of each eigenvector (principal component) is the ratio of its eigenvalue \(\lambda_i\) to the sum of all eigenvalues \((\lambda_1 + \lambda_2 + \ldots + \lambda_N)\):

\( \frac{\lambda_i}{\lambda_1 + \lambda_2 + \ldots + \lambda_N} \)
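For example, suppose the covariance matrix of a three-feature dataset has eigenvalues 4.0, 1.5 and 0.5 (hypothetical values chosen for illustration). A minimal sketch of the calculation:

#
# Hypothetical eigenvalues of a covariance matrix
#
import numpy as np
eigenvalues = np.array([4.0, 1.5, 0.5])
#
# Explained variance ratio of each principal component
#
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio)  # approximately [0.6667 0.25 0.0833]

The first principal component alone explains about 67% of the total variance in this example.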

 

Recall that a set of eigenvectors and their related eigenvalues are found as part of the eigen decomposition of a transformation matrix, which in the case of principal component analysis (PCA) is the covariance matrix. These eigenvectors represent the principal components that contain most of the information (variance) carried by the features (independent variables). The explained variance ratio represents the variance explained by a particular eigenvector. In the diagram below, there are two principal components, PC1 and PC2. Note that PC1 is the eigenvector which explains most of the information (variance), while PC2 explains less.

Fig 1. Principal Components representing variance in two dimensions
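As a minimal illustration of Fig 1 (using synthetic two-dimensional data, not the dataset analysed later in this post), the eigen decomposition of a 2 x 2 covariance matrix yields the two principal components and their explained variance ratios:

#
# Synthetic, correlated two-dimensional data (hypothetical example)
#
import numpy as np
rng = np.random.RandomState(42)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 2], [2, 2]], size=500)
#
# Eigen decomposition of the covariance matrix
#
cov_matrix = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
#
# The eigenvector with the largest eigenvalue is PC1
#
print(eigenvalues / eigenvalues.sum())

Note that np.linalg.eigh returns eigenvalues in ascending order, so the last entry corresponds to PC1.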

Explained Variance using Python Code

The explained variance can be calculated using the two techniques listed below. Kaggle data related to campus placement is used in the code given in the following sections.

  • sklearn PCA class
  • Custom Python code (without sklearn PCA) for determining explained variance

Sklearn PCA Class for determining Explained Variance

In this section, you will learn the code which makes use of the PCA class of sklearn.decomposition for doing the eigen decomposition of the transformation matrix (the covariance matrix created from X_train_std in the example given below). Here is a snapshot of the data after being cleaned up.

Fig. Data used for analysing the explained variance of principal components (eigenvectors)
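The snippets below assume that X_train and X_test already exist. As a hedged sketch, they could be created along the following lines; the file name, the numeric-column selection and the status target are assumptions based on the Kaggle campus placement dataset:

#
# Load the campus placement data (file name is an assumption)
#
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('Placement_Data_Full_Class.csv')
#
# Keep numeric feature columns; drop the identifier and the salary column
# (salary is assumed to contain missing values for unplaced students)
#
X = df.select_dtypes(include='number').drop(columns=['sl_no', 'salary'], errors='ignore')
y = df['status']
#
# Split into training and test sets
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)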

Note some of the following in the python code given below:

  • The explained_variance_ratio_ attribute of PCA is used to get the ratio of variance explained by each component (eigenvalue / total of eigenvalues).
  • A bar chart is used to represent the individual explained variances.
  • A step plot is used to represent the cumulative variance explained by the principal components.
  • The data needs to be scaled before applying the PCA technique.
#
# Scale the dataset; This is very important before you apply PCA
#
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Instantiate PCA
#
pca = PCA()
#
# Determine transformed features
#
X_train_pca = pca.fit_transform(X_train_std)
#
# Determine explained variance using explained_variance_ratio_ attribute
#
exp_var_pca = pca.explained_variance_ratio_
#
# Cumulative sum of eigenvalues; This will be used to create step plot
# for visualizing the variance explained by each principal component.
#
cum_sum_eigenvalues = np.cumsum(exp_var_pca)
#
# Create the visualization plot
#
plt.bar(range(0,len(exp_var_pca)), exp_var_pca, alpha=0.5, align='center', label='Individual explained variance')
plt.step(range(0,len(cum_sum_eigenvalues)), cum_sum_eigenvalues, where='mid',label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

The Python code given above results in the following plot.

Fig 2. Explained variance using sklearn PCA
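A common follow-up, once the cumulative explained variance is available, is to pick the smallest number of components that covers a target share of the variance. Here is a short sketch using the variables from the snippet above; the 95% threshold is an arbitrary example:

#
# Smallest number of components whose cumulative explained variance >= 95%
#
n_components = np.argmax(cum_sum_eigenvalues >= 0.95) + 1
print(n_components)
#
# Equivalently, sklearn PCA accepts the variance target directly
#
pca_95 = PCA(n_components=0.95)
X_train_pca_95 = pca_95.fit_transform(X_train_std)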

Custom Python Code (without using sklearn PCA) for determining Explained Variance

In this section, you will learn how to determine the explained variance without using the sklearn PCA class. Note some of the following in the code given below:

  • The training data was scaled.
  • The eigh method of the numpy.linalg module is used.
  • The covariance matrix of the training dataset was created.
  • The eigenvalues and eigenvectors of the covariance matrix were determined.
  • The explained variance was calculated.
  • A visualization plot was created for the explained variance.
#
# Scale the dataset; This is very important before you apply PCA
#
import numpy as np
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Import eigh method for calculating eigenvalues and eigenvectors
#
from numpy.linalg import eigh
#
# Determine covariance matrix
#
cov_matrix = np.cov(X_train_std, rowvar=False)
#
# Determine eigenvalues and eigenvectors
#
egnvalues, egnvectors = eigh(cov_matrix)
#
# Determine explained variance
#
total_egnvalues = sum(egnvalues)
var_exp = [(i/total_egnvalues) for i in sorted(egnvalues, reverse=True)]
#
# Plot the explained variance against cumulative explained variance
#
import matplotlib.pyplot as plt
cum_sum_exp = np.cumsum(var_exp)
plt.bar(range(0,len(var_exp)), var_exp, alpha=0.5, align='center', label='Individual explained variance')
plt.step(range(0,len(cum_sum_exp)), cum_sum_exp, where='mid',label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

Here is what the explained variance plot looks like:

Fig 3. Explained variance using custom Python code (without using sklearn PCA)
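Both approaches should produce the same ratios. As a quick sanity check, assuming the sklearn snippet above was run in the same session:

#
# The eigenvalue ratios and sklearn's explained_variance_ratio_ should match
# up to floating point error
#
print(np.allclose(var_exp, exp_var_pca))  # expected: True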


Conclusion

Here are the conclusions / learnings from this post:

  • Explained variance represents the information (variance) explained by a particular principal component (eigenvector).
  • Explained variance is calculated as the ratio of the eigenvalue of a particular principal component (eigenvector) to the total of all eigenvalues.
  • Explained variance can be obtained via the explained_variance_ratio_ attribute of a PCA instance created using the sklearn.decomposition PCA class.