Introduction
A software application goes through many quality control (QC) checks (testing) as part of quality assurance (QA) practices before it moves into production for end users. QC checks are comparatively easy to perform on traditional software applications because the outputs for different classes of inputs can be verified against expected values that are known before testing starts. The mechanism that makes this verification possible is termed the test oracle, which is discussed later in this post.
For machine learning (ML) systems comprising machine learning / predictive models, there are no well-defined expected values against which the outputs can be verified and declared correct or incorrect. In other words, the test oracle is not clearly defined for ML systems. This is why machine learning systems can be termed non-testable. The details follow later in this post.
What is a Test Oracle?
In testing of software applications, a frequently invoked assumption is that there are testers or external mechanisms, such as automated software tests (unit tests/integration tests), that can accurately determine whether or not the output produced by the program is correct. These testers or automated tests are termed the oracle, or test oracle. The assumption that such testing mechanisms can accurately determine program correctness from input-output pairs is called the oracle assumption. A program can be termed non-testable in the following scenarios:
- A test oracle (testers/test programs) does not exist because the correctness of the program output can't be verified against an expected value, possibly because the expected values are not well defined in the first place.
- The testers must expend an extraordinary amount of time and effort to determine whether or not the output is correct, or the test programs become very difficult to execute and maintain when program correctness has to be tested at regular intervals.
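To make the oracle idea concrete, here is a minimal sketch of a test oracle for a traditional program. The `shipping_fee` function, its threshold, and the `EXPECTED` table are hypothetical and invented purely for illustration. The point is only that the expected value for each class of input is written down before the program runs.

```python
def shipping_fee(order_total: float) -> float:
    """Hypothetical business rule: orders of 50.0 or more ship free, others pay 4.99."""
    return 0.0 if order_total >= 50.0 else 4.99


# The oracle: expected outputs fixed in advance, one entry per class of input.
EXPECTED = {
    10.0: 4.99,   # below the threshold
    50.0: 0.0,    # exactly at the threshold
    120.0: 0.0,   # above the threshold
}


def run_oracle_checks() -> list:
    """Return the inputs whose actual output differs from the expected value."""
    return [x for x, want in EXPECTED.items() if shipping_fee(x) != want]


failures = run_oracle_checks()
print("failures:", failures)
```

An empty list means every output matched its pre-defined expected value. For an ML model there is usually no such table to write down ahead of time.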
Why are Machine Learning models/systems Non-testable?
As defined earlier, a program is said to be non-testable in the absence of a test oracle, that is, the testers or test mechanisms that could verify the correctness of program outputs. Machine learning programs or models fall under the category of non-testable programs. The following points explain why:
- Unlike traditional software apps, where outputs in the form of expected values are known beforehand, the outputs of machine learning models are predictive in nature. There are no expected values beforehand; the output is predicted by executing the model on a given set of input values. Often only a domain expert can tell whether the prediction made for a given set of input values is correct.
- Machine learning models built with a specific algorithm can give improved results when optimized with techniques such as cross-validation or grid search. This makes testing difficult because the same set of input values, when fed into the optimized model, could give different outputs. As an example, consider a Kaggle project on predicting the quality of red wine, in which classifiers are built using different algorithms. The following observations reflect on the non-testability of the models, given the different outputs possible with optimized models:
  - A support vector classifier was built to classify the quality of the red wine. The precision was found to be 0.86 and the recall 0.88. After the model was optimized using grid search, the precision was found to be 0.90 and the recall 0.90.
  - A random forest classifier was trained to classify the quality of the wine. The precision and recall were found to be 0.87 and 0.88 respectively. After the classifier was optimized with cross-validation, the accuracy (precision value) improved to 91%.
  The above illustrates the challenge that testers (the oracle) face in determining the correctness of a model, given that the non-optimized and optimized versions of the same model produce different outputs.
- Machine learning models built with different algorithms give different results, with different levels of accuracy. In the same example, the models built with random forest, stochastic gradient descent and support vector classifier have precision values of 87%, 84%, and 86% respectively. This is a challenge for the test oracle: the same set of input values fed into models built with different algorithms could give different output values (predictions), so there is no single expected value to check against.
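The points above can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: the two "models" are hand-written threshold rules standing in for trained classifiers, and the `samples` list is made-up toy data, not the Kaggle wine dataset. It shows how precision and recall are computed, and that two models can disagree on the very same inputs.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute (precision, recall) for the positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical stand-ins for two trained classifiers (1 = good wine, 0 = not good).
def model_a(alcohol):
    return 1 if alcohol >= 11.0 else 0


def model_b(alcohol):
    return 1 if alcohol >= 10.0 else 0


# Made-up (alcohol, label) pairs, for illustration only.
samples = [(9.4, 0), (9.8, 0), (10.2, 0), (10.5, 1),
           (11.2, 1), (12.0, 1), (10.8, 0), (11.5, 1)]
y_true = [label for _, label in samples]
pred_a = [model_a(x) for x, _ in samples]
pred_b = [model_b(x) for x, _ in samples]

print("model_a precision/recall:", precision_recall(y_true, pred_a))
print("model_b precision/recall:", precision_recall(y_true, pred_b))
# Inputs where the two models give different predictions.
print("disagree on:", [x for x, _ in samples if model_a(x) != model_b(x)])
```

On this toy data one model is more precise and the other has higher recall, and they disagree on three inputs. Without labels for new production inputs, a tester cannot say which prediction is the "expected" one.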
Thoughts on making Machine Learning Models Testable
Given that ML models are non-testable due to the absence of a test oracle, let's look at some techniques (pseudo-oracles) that can be used to perform quality control checks on machine learning models. This is not an exhaustive list by any means. I will be posting further research findings in later posts in the coming weeks/months.
- Dual-coding technique for quality control checks of machine learning models: Build multiple models using different algorithms. In the above example, models using Random Forest, Stochastic Gradient Descent and Support Vector Classification (SVC) algorithms were built to predict the quality of the wine. Say that, based on performance, the random forest classifier is accepted as the final model to be moved to production. In the QA environment, another model with the second-best accuracy, say the support vector classifier, is run alongside it. If the predictions made by the two models differ, an alert is raised for QA/data scientists to validate the result.
- Compare ML model outcome with that of a Simplified Linear Model: Build a simplified (less complex) linear model and compare the predictions of the actual model with those of the simplified model.
- Metamorphic testing technique for quality control checks of ML models: If a relationship between input and output variables is known, it can be used to check predictions without knowing the exact expected values. The model is fed a known set of inputs and transformed versions of them, and the outputs are evaluated for correctness against that relationship. I will go into the details in one of the posts in the near future.
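The pseudo-oracle ideas above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a definitive implementation: `primary_model` and `shadow_model` are hypothetical threshold rules standing in for trained classifiers, and the metamorphic relation used (raising the input feature must never lower the predicted class) is an assumed relation chosen for the example; a real relation has to come from domain knowledge.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[float], int]  # maps one feature value to a class label


def primary_model(alcohol: float) -> int:
    """Hypothetical stand-in for the model accepted for production."""
    return 1 if alcohol >= 11.0 else 0


def shadow_model(alcohol: float) -> int:
    """Hypothetical stand-in for the second-best (or simplified) model kept in QA."""
    return 1 if alcohol >= 10.5 else 0


def disagreements(a: Model, b: Model, inputs: Sequence[float]) -> List[float]:
    """Dual coding: return the inputs on which the two models predict differently."""
    return [x for x in inputs if a(x) != b(x)]


def monotonicity_violations(
    model: Model, inputs: Sequence[float], delta: float = 0.5
) -> List[Tuple[float, float]]:
    """Metamorphic check under the ASSUMED relation that increasing the
    feature by delta must never lower the predicted class."""
    return [(x, x + delta) for x in inputs if model(x + delta) < model(x)]


inputs = [9.0, 10.0, 10.6, 10.9, 11.0, 12.5]
needs_review = disagreements(primary_model, shadow_model, inputs)
violations = monotonicity_violations(primary_model, inputs)
print("needs review:", needs_review)            # inputs to escalate to QA/data scientists
print("metamorphic violations:", violations)    # empty when the relation holds
```

The same `disagreements` helper covers the simplified-model comparison: pass the simpler baseline as the second model. Neither check proves a prediction correct; both only flag inputs that deserve a human look.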