QA & Data Science – How to Test Features Relevance

In this post, I present a perspective on why QA / testing teams should test feature relevance when testing machine learning models as part of data science QA initiatives, along with different techniques which could be used to test, or perform QA on, feature relevance.

Feature relevance is also termed feature importance. Simply speaking, a feature is said to be relevant or important if it adds real predictive value to the underlying model. Relevant features must display a stable statistical relationship, or association, with the outcome variable. An association does not imply causation. However, a relevant feature, or a feature with appropriate importance, should be part of the causal matrix which gives rise to the outcome. Read the related details on this page.

In short, the QA or testing team needs to test feature relevance/importance from time to time so that the complexity and performance of ML models can be managed well.

One of the key aspects of building a machine learning model is determining the feature set which results in a high-performing model. Once the model is built and deployed, it becomes even more important to test whether the features stay relevant, that is, whether they continue to carry useful information for the problem and impact model performance in a positive manner. If features become redundant and either cease to impact the model or increase the error rates, they need to be removed or replaced with new features.

The QA team would need to run feature relevance tests at least on a quarterly basis. The goal of testing feature relevance is to achieve the following objectives:

  • Ensure that the features used in the model contain useful information for the problem.

  • Where features are found not to contribute to model performance, raise them as defects so that they are filtered out from time to time.

There are different techniques/approaches for testing feature relevance for a machine learning (ML) model from time to time. The following are some of them:

  • Statistical approaches

  • Feature importance techniques

There are other feature selection techniques such as grid search which could also be applied for testing feature relevance. However, for now, we will focus on the ones which do not require much knowledge of machine learning.

Testing Feature Relevance – Statistical Approaches

Testing feature relevance using statistical approaches would require QA / test engineers to learn basic statistics fundamentals such as mean, mode, variance, probability distributions, correlation, and chi-square tests.

The following are some of the statistical approaches which could be adopted for measuring feature relevance in relation to its impact on model performance.

  • Correlation of feature variable with the outcome variable

  • Feature variance

Correlation of Feature with Outcome Variable

Features are selected on the basis of their scores in statistical tests of their correlation with the outcome variable. The following table can be used to determine which method to use for measuring the relationship between the feature variable and the response variable.

Feature type \ Response type | Continuous                   | Categorical / Discrete
Continuous                   | Pearson’s correlation        | Linear Discriminant Analysis (LDA)
Categorical                  | Analysis of Variance (ANOVA) | Chi-square

There are other techniques which could be used for testing the features’ impact on the model, for example, wrapper methods and embedded methods. However, to keep it simple, one could test a feature’s relationship with the outcome variable using correlation coefficients.
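As a rough illustration of a correlation-based relevance check, here is a minimal Python sketch using only the standard library. The feature names, the sample values, and the 0.3 cut-off on the absolute value of Pearson’s r are all assumptions made for illustration. A real threshold would have to be agreed with the data science team.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length numeric lists.

    Assumes neither list is constant (a constant list gives a zero denominator).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def relevance_by_correlation(features, outcome, threshold=0.3):
    """Map each feature name to (r, passes), where passes means |r| >= threshold.

    The threshold is a hypothetical acceptance criterion, not a standard value.
    """
    report = {}
    for name, values in features.items():
        r = pearson_r(values, outcome)
        report[name] = (r, abs(r) >= threshold)
    return report


# Hypothetical sample: six observations of a continuous outcome and two features.
outcome = [10.0, 12.1, 13.9, 16.2, 18.0, 20.1]
features = {
    "ad_spend": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],        # rises with the outcome
    "store_id_digit": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],  # largely unrelated
}

report = relevance_by_correlation(features, outcome)
for name, (r, passes) in report.items():
    status = "relevant" if passes else "raise defect"
    print(f"{name}: r = {r:.3f} -> {status}")
```

Pearson’s r only captures linear relationships between continuous variables, so a feature flagged here is a candidate for review rather than an automatic removal.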

Test engineers would have to be trained in some of the following statistical tests to perform the testing:

  • Pearson’s correlation

  • LDA

  • ANOVA

  • Chi-square tests

Some of these tests, such as Pearson’s correlation and chi-square tests, could be done in an Excel spreadsheet. We will go into details in later articles.
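For a categorical feature and a categorical outcome, the chi-square test of independence can also be scripted. The sketch below computes the plain Pearson chi-square statistic, without a continuity correction, from two lists of labels. The `plan` and `status` data are hypothetical. The critical value 3.841 is the standard 5% cut-off for one degree of freedom. For larger tables a statistics library or a chi-square table would be needed to obtain the cut-off.

```python
from collections import Counter


def chi_square_statistic(feature, outcome):
    """Return (statistic, degrees_of_freedom) for two equal-length label lists."""
    n = len(feature)
    observed = Counter(zip(feature, outcome))  # counts per (feature, outcome) cell
    row_totals = Counter(feature)
    col_totals = Counter(outcome)

    statistic = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            # Expected count if the feature and the outcome were independent.
            expected = row_total * col_total / n
            statistic += (observed[(row, col)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical sample: subscription plan (feature) against customer status (outcome).
plan = ["premium"] * 20 + ["basic"] * 20
status = ["stay"] * 15 + ["churn"] * 5 + ["stay"] * 5 + ["churn"] * 15

statistic, dof = chi_square_statistic(plan, status)
CRITICAL_5_PERCENT_DOF_1 = 3.841  # chi-square critical value, alpha = 0.05, 1 d.o.f.
is_associated = dof == 1 and statistic > CRITICAL_5_PERCENT_DOF_1
print(f"chi-square = {statistic:.2f}, dof = {dof}, associated = {is_associated}")
```

A statistic above the critical value suggests the feature and the outcome are not independent. A statistic below it would be grounds for raising the feature for review.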

Feature Variance

Features whose values remain the same, or do not change much, across the different samples taken for hypothesis testing could be considered insignificant for building the models. Such features are termed features with low variance.

Features with variance below a certain threshold could be removed. Test engineers could write scripts to test the variance of features from time to time and raise an appropriate defect for the removal of such features.
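A script of this kind could be as small as the following sketch, which uses Python’s standard `statistics` module. The column names, the sample values, and the 0.01 threshold are assumptions. Because variance depends on the scale of each feature, a real check would either scale the features first or use a per-feature threshold.

```python
from statistics import pvariance


def low_variance_features(columns, threshold=0.01):
    """Return the sorted names of features whose population variance is below threshold.

    `columns` maps a feature name to its list of numeric values.
    The default threshold is a hypothetical value chosen for this example.
    """
    return sorted(
        name for name, values in columns.items() if pvariance(values) < threshold
    )


# Hypothetical sample of three features taken from a scoring dataset.
sample = {
    "age": [23, 35, 41, 29, 52, 38],                       # varies widely
    "country_code": [1, 1, 1, 1, 1, 1],                    # constant
    "discount_rate": [0.50, 0.50, 0.51, 0.50, 0.50, 0.50],  # nearly constant
}

flagged = low_variance_features(sample)
for name in flagged:
    print(f"Low-variance feature, candidate defect: {name}")
```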

In later articles, we will discuss different techniques in Python and R which could be used for removing features with low variance.

Testing Feature Relevance – Feature Importance Technique

A given set of features could be run through some of the following estimators to test feature importance. This technique is also called an embedded method of feature selection: the processes of feature selection and model training are merged, and the training process used for building the ML model generates a presumably relevant subset of features as a byproduct. QA/test engineers should be trained to work with the following techniques:

  • Recursive partitioning tree-based estimators such as random forest algorithm could be used to compute feature importance, which in turn can be used to discard irrelevant features.

  • The linear model with Lasso regularization

  • Neural networks, SVM, K-nearest neighbor etc.

One could get started with the simplest of the above, such as tree-based estimators and Lasso.
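To make the embedded idea concrete, here is a small, dependency-free sketch of a linear model with Lasso regularization, fitted by coordinate descent. It is a teaching stand-in for a library implementation such as scikit-learn’s `Lasso`, not a replacement for one. It assumes the feature columns are already standardised and the outcome is centred. The feature names, the data, and `alpha=0.1` are hypothetical.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; anything within [-penalty, penalty] becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coefficients(X, y, alpha=0.1, n_iter=50):
    """Minimise (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1 by coordinate descent.

    X is a list of rows; no intercept is fitted, so y should be centred.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho / n, alpha) / scale
    return w


# Hypothetical standardised features (mean 0, variance 1, mutually uncorrelated).
names = ["usage_hours", "signup_weekday", "random_id"]
x1 = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
x2 = [-1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0]
x3 = [-1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0, 1.0]
X = list(zip(x1, x2, x3))
# The outcome depends strongly on x1, very weakly on x2, and not at all on x3.
y = [3.0 * a + 0.05 * b for a, b in zip(x1, x2)]

weights = lasso_coefficients(X, y, alpha=0.1)
dropped = [name for name, w in zip(names, weights) if w == 0.0]
for name, w in zip(names, weights):
    print(f"{name}: coefficient = {w:.3f}")
print("Candidates to discard:", dropped)
```

The L1 penalty drives the coefficients of weak or unrelated features to exactly zero, so the list of zero-coefficient features falls out of training itself. A tester could compare that list against the features currently shipped with the model.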

Summary

When starting out with QA or testing practices for predictive analytics or data science projects, testing feature relevance in relation to machine learning models is key and must be considered. We have seen some of the techniques, such as statistical approaches and feature importance techniques, which could be used for testing feature relevance. In future posts, I will present code samples and related perspectives to help you get started quickly.

Ajitesh Kumar
