QA – Why Machine Learning Systems are Non-testable

This post presents my views on why machine learning (ML) systems or models are termed non-testable from a quality control/quality assurance (QA) perspective. Before I proceed, let me humbly acknowledge that the data science/machine learning community maintains that ML models are testable: they are first trained and then tested using techniques such as cross-validation, which are also used to improve and optimize model performance. However, “testing” in that sense refers to the development (model building) phase, when data scientists evaluate model performance by comparing the model outputs (predicted values) with the actual values. This is not the same as testing the model for any given input for which the expected output value is not known beforehand. In this post, I am talking about the testability of ML models from the traditional software testing perspective.
Because machine learning systems are non-testable in this sense, performing QA or quality control checks on them is not easy. That is a matter of concern given the trust end users need to place in such systems. Project stakeholders need to understand the non-testability aspects of machine learning systems in order to put appropriate quality controls in place and serve trustworthy models to end users in production. This applies especially to healthcare and financial systems, where even a couple of false negatives (type II errors) could cause serious trouble for the stakeholders.
This, in a way, presents an opportunity for the AI / data science / machine learning community to create frameworks that enable testability of ML systems from a QA perspective. As a matter of fact, frameworks such as LIME could play a key role in achieving this. I will be digging deeper and posting my research work in this field in the next 3-6 months. Stay tuned!

Introduction

A software application needs to go through many quality control (QC) checks (testing) as part of quality assurance (QA) practices before it is moved into production for consumption by end users. It is easier to perform QC checks/testing on traditional software applications because the outputs for different classes of inputs can be verified against expected values that are known before testing starts. The mechanism that supplies these expected values is termed a test oracle, which we will discuss later in this post.
For machine learning (ML) systems built around machine learning / predictive models, there are no well-defined expected values against which the outputs can be verified and declared correct or incorrect. In other words, the test oracle is not clearly defined for testing ML systems.
This is why machine learning systems can be termed non-testable. We will see the details later in this post.

What is a Test Oracle?

In the testing of software applications, a frequently invoked assumption is that there are testers or external mechanisms, such as automated software tests (unit tests/integration tests), which can accurately determine whether or not the output produced by the program is correct. These testers or automated tests are termed an oracle or test oracle. The assumption that such testing mechanisms can accurately determine program correctness from input-output pairs is called the oracle assumption. A program can be termed non-testable in the following scenarios:
  • A test oracle (testers/test programs) does not exist because the correctness of the program output can’t be verified against an expected value, perhaps because the expected values are not well defined in the first place.
  • The testers must expend an extraordinary amount of time and effort to determine whether or not the output is correct; or the test programs become very difficult to execute and maintain when program correctness has to be tested at regular intervals.
When the testers or test mechanisms can state whether the program output is correct or not without knowing the correct answer, they are termed a partial oracle.
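To make the idea of a test oracle concrete, here is a minimal sketch in Python. The function `apply_discount` and the cases in `ORACLE_CASES` are hypothetical and exist only for illustration; the point is that, for a traditional program, the expected value of every input is written down before the test runs.

```python
def apply_discount(price: float, percent: float) -> float:
    """Deterministic business rule: the expected output is known in advance."""
    return round(price * (1 - percent / 100), 2)


# The test oracle: (input, expected output) pairs defined before testing starts.
# All values here are made up for illustration.
ORACLE_CASES = [
    ((100.0, 10.0), 90.0),
    ((59.99, 0.0), 59.99),
    ((20.0, 50.0), 10.0),
]


def run_oracle_checks() -> list:
    """Compare each actual output with the expected value supplied by the oracle."""
    return [apply_discount(*args) == expected for args, expected in ORACLE_CASES]


results = run_oracle_checks()
print(results)
```

For an ML model there is usually no equivalent of `ORACLE_CASES` for unseen inputs, which is the gap the rest of this post discusses.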

Why are Machine Learning models/systems Non-testable?

As defined earlier, a program is said to be non-testable in the absence of a test oracle, that is, the testers/test mechanisms that could be used to verify the correctness of program outputs. Machine learning programs or models fall under the category of non-testable programs. The following are some thoughts on why machine learning programs are termed non-testable:
  • Unlike traditional software apps, where outputs in the form of expected values are known beforehand, the outputs of machine learning models are predictive in nature. That is, there are no expected values beforehand. Rather, the output is predicted by executing the machine learning model on a given set of input values. Only experts can tell whether the prediction made by the model for a given set of input values is correct or not.
  • Machine learning models built with a specific algorithm, when optimized with techniques such as cross-validation or grid search, could give improved results/outputs. This makes testing difficult because the same set of input values, when fed into the optimized model, could give different outputs. Let’s take a look at an example where classifiers are built using different algorithms to predict the quality of red wine. The example comes from a Kaggle project on predicting the quality of wine. Pay attention to the following points, which reflect the non-testability of the models given the different outputs possible with optimized models:
    • A support vector classifier model was built to classify the quality of the red wine. The precision was found to be 0.86 and the recall 0.88. The model was then optimized using the grid search technique, after which the precision was found to be 0.90 and the recall 0.90.
    • A random forest classifier model was trained to classify the quality of the wine. The precision and recall were found to be 0.87 and 0.88 respectively. Later, the classifier was optimized with the cross-validation technique and the accuracy (precision value) improved to 91%.
    The above illustrates the challenge that testers (the oracle) could face in determining the correctness of the model, given that the models in the two scenarios (non-optimized and optimized) produce different outputs.
  • Machine learning models built with different algorithms give different results, depending on the accuracy of each model. In the same example, the models built with random forest, stochastic gradient descent and support vector classifier have different accuracy in terms of precision value: 87%, 84% and 86% respectively. This makes it hard for the test oracle to determine the correctness of the outcome, because the same set of input values fed into models built with different algorithms could give different output values (predictions).
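The effect described above can be reproduced with a small, dependency-free sketch. This is not the Kaggle project's code: the training rows are made-up wine samples (alcohol, volatile acidity), and a toy k-nearest-neighbours classifier stands in for the support vector and random forest models, with `k` playing the role of the tuned hyperparameter.

```python
from collections import Counter
from math import dist

# Hypothetical training data: ((alcohol, volatile_acidity), label),
# where label 1 = "good" wine and 0 = "bad" wine. Values are illustrative only.
TRAIN = [
    ((9.0, 0.70), 0),
    ((9.5, 0.60), 0),
    ((10.0, 0.65), 0),
    ((10.4, 0.50), 1),
    ((11.5, 0.40), 1),
    ((12.0, 0.35), 1),
]


def knn_predict(sample, k):
    """Predict by majority vote among the k training rows closest to `sample`."""
    neighbours = sorted(TRAIN, key=lambda row: dist(row[0], sample))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


sample = (10.3, 0.55)                         # the same input for both models
baseline_prediction = knn_predict(sample, k=1)  # "non-optimized" model
tuned_prediction = knn_predict(sample, k=3)     # same algorithm, different hyperparameter

print(baseline_prediction, tuned_prediction)
```

The two models disagree on the same input, and without an expert judgement on this particular wine a tester has no expected value against which to decide which of the two outputs is the correct one.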

Thoughts on making Machine Learning Models Testable

Given that ML models are non-testable due to the absence of a test oracle, let’s look at some of the ways (pseudo-oracles) in which quality control checks could still be performed on machine learning models. This is by no means an exhaustive list. I will be posting research findings in later posts in the coming weeks/months.
  • Dual-coding technique for quality control checks of machine learning models: Build multiple models using different algorithms. In the above example, models using the random forest, stochastic gradient descent and support vector classification (SVC) algorithms were built to predict the quality of the wine. Say that, based on performance, the random forest classifier is accepted as the final model to be moved to production. In the QA environment, another model with the second-best accuracy, say the support vector classifier, is run alongside it. If the predictions made by these two models differ, an alert is raised for QA/data scientists to validate the result.
  • Compare the ML model outcome with that of a simplified linear model: Build a simplified (less complex) linear model and compare the predictions of the actual model with those of the simplified model.
  • Metamorphic testing technique for quality control checks of ML models: When a relationship between input and output variables is known, the predictions (output values) for related sets of input values can be compared against that relationship. The model can then be fed with a known set of inputs and its outputs evaluated for correctness, even though the exact expected value of each output is unknown. I will go into details in one of the posts in the near future.
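A minimal sketch of two of these checks follows, under stated assumptions. Both models are hypothetical hand-written rules standing in for trained classifiers, the sample values are made up, and the metamorphic relation (raising alcohol while holding acidity fixed should never lower the predicted quality) is an assumption chosen for illustration, not a property established in this post.

```python
def primary_model(sample):
    """Stand-in for the production model (hypothetical rule, not a trained model)."""
    alcohol, acidity = sample
    return 1 if alcohol - 4.0 * acidity > 8.0 else 0


def shadow_model(sample):
    """Stand-in for the second-best or simplified model run in the QA environment."""
    alcohol, _ = sample
    return 1 if alcohol > 10.5 else 0


def dual_coding_alerts(samples, primary, shadow):
    """Return the samples on which the two models disagree, for manual review."""
    return [s for s in samples if primary(s) != shadow(s)]


def violates_monotonic_relation(model, sample, delta=0.5):
    """Check the assumed metamorphic relation: more alcohol must not lower the prediction."""
    alcohol, acidity = sample
    return model((alcohol + delta, acidity)) < model(sample)


# Hypothetical (alcohol, volatile_acidity) inputs with no known expected label.
samples = [(9.0, 0.70), (10.4, 0.50), (11.0, 0.80), (12.0, 0.35)]

alerts = dual_coding_alerts(samples, primary_model, shadow_model)
violations = [s for s in samples if violates_monotonic_relation(primary_model, s)]

print("needs review:", alerts)
print("metamorphic violations:", violations)
```

Neither check says which prediction is correct. The first only flags inputs where the models disagree, and the second only flags outputs that contradict the assumed relation, which is why both act as pseudo-oracles rather than true test oracles.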

Summary

In this post, you learned why machine learning systems/models are termed non-testable. Given this, if you are part of a QA team or are a data scientist, and you cannot find specialized QA practices for performing quality control checks on machine learning models, reach out to the stakeholders in your company and get started on this. Please feel free to reach out to me, or to comment or make suggestions.
Ajitesh Kumar
