In this post, you will learn about data quality assessment frameworks and techniques for machine learning, and why you need to assess data quality when building high-performance machine learning models. As a data science architect or development manager, you must have a good sense of the importance of data quality in building high-performance machine learning models. The idea is to understand the value of a dataset and whether that value can be quantified. This matters because knowing whether the data contains rich, valuable information helps in building better models and informs stakeholders on data collection strategy and related decisions.
Here are the top 3 data quality assessment frameworks which can be considered for assessing or evaluating data quality when training models using supervised learning algorithms:

- Data Valuation using Reinforcement Learning (DVRL)
- Data Shapley
- Leave-one-out (LOO)
Here are some of the topics which will be covered in this post:

- Why assess data quality for machine learning?
- Data quality assessment frameworks (DVRL, Data Shapley, LOO)
In order to train machine learning models having high performance, it is of utmost importance to make sure that the data used for training the models is of very high quality. According to one survey, 43% of respondents said that data quality is the biggest barrier to machine learning.
The following represents scenarios in which the quality of the data used for training machine learning models can be said to be low:
Here are a couple of reasons why you must assess the data quality while training one or more machine learning models:
Here are some interesting data quality assessment frameworks that can be used to quantify data quality and thus realise the benefits, mentioned in the previous section, of having a high-quality dataset for training ML models.
Google has tested a framework called the Data Valuation using Reinforcement Learning (DVRL) framework, which applies reinforcement learning (instead of gradient-descent methods) to estimate data values and select the most valuable samples to train the predictor model. It is a novel meta-learning framework for data valuation which determines how likely each training sample is to be used in training the predictor model. The training process jointly learns the data value estimator and the predictor model: more valuable training samples are used more often to train the predictor than less valuable samples. The animation below illustrates the data value estimator being used with the predictive model; pay attention to how the data value estimator, the sampler, the predictive model and the loss computation work together to find the data points having high value.
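To make that loop concrete, here is a minimal sketch of a DVRL-style training loop. This is an illustration of the idea, not Google's official DVRL implementation: a simple linear value estimator assigns a selection probability to each training sample, a binary mask is sampled, a predictor is trained on the selected subset, and validation accuracy is fed back as a REINFORCE reward to update the estimator. The dataset, model choice and hyperparameters are placeholders.

```python
# A minimal sketch of DVRL-style data valuation (illustration only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

n = len(X_tr)
feats = np.hstack([X_tr, np.ones((n, 1))])   # sample features + bias term
w = np.zeros(feats.shape[1])                 # value-estimator parameters
baseline = 0.5                               # moving baseline for the reward

for step in range(50):
    probs = 1.0 / (1.0 + np.exp(-feats @ w))  # selection probability per sample
    mask = rng.random(n) < probs              # sample a binary selection mask
    if mask.sum() < 10 or len(np.unique(y_tr[mask])) < 2:
        continue                              # skip degenerate subsets
    predictor = LogisticRegression(max_iter=200).fit(X_tr[mask], y_tr[mask])
    score = predictor.score(X_val, y_val)     # validation accuracy
    reward = score - baseline                 # advantage over the moving baseline
    baseline = 0.9 * baseline + 0.1 * score
    # REINFORCE update: gradient of the log-probability of the sampled
    # mask under the estimator, scaled by the reward
    grad = feats.T @ (mask.astype(float) - probs) * reward / n
    w += grad

data_values = 1.0 / (1.0 + np.exp(-feats @ w))  # final per-sample value scores
print("Five highest-value training samples:", np.argsort(-data_values)[:5])
```

After training, samples whose selection probability converges high are the ones the estimator found valuable for the predictor; in the real DVRL framework the estimator is a neural network trained jointly with the predictor rather than a fixed linear model.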
Of the three data valuation frameworks listed in this post (Data Shapley, LOO and DVRL), the Google DVRL framework was found to perform the best in terms of determining the value of each data point.
Data Shapley is an equitable framework used to quantify the value of individual training data sources. The Data Shapley framework uniquely satisfies three natural properties of equitable data valuation, which relate to the following:

- Null element: if a data point does not change the model performance when added to any subset of the training data, its value is zero.
- Symmetry: if two data points contribute equally to every subset of the training data, they are assigned equal value.
- Additivity: the value of a data point under a sum of performance metrics equals the sum of its values under each metric.
Data Shapley provides a metric for evaluating each training data point's contribution to the machine learning model's performance.
Here is an example Python Jupyter notebook showing how to use Data Shapley to evaluate the value of data. The related paper can be found here – What is your data worth? Equitable valuation of data
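Since exact Shapley values would require training a model on every subset of the data, they are typically approximated by Monte Carlo sampling over permutations. Below is a minimal sketch of such a Monte Carlo approximation (an illustration, not the authors' reference implementation); the dataset, model and number of permutations are placeholders.

```python
# A minimal Monte Carlo approximation of Data Shapley values (illustration only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
n = len(X_tr)
rng = np.random.default_rng(0)

def performance(idx):
    """Validation accuracy of a model trained on the subset idx."""
    if len(idx) < 2 or len(np.unique(y_tr[idx])) < 2:
        return 0.5                       # score of a random guess
    model = LogisticRegression(max_iter=200).fit(X_tr[idx], y_tr[idx])
    return model.score(X_val, y_val)

shapley = np.zeros(n)
n_permutations = 20                      # more permutations -> better estimate
for _ in range(n_permutations):
    perm = rng.permutation(n)
    prev_score = 0.5                     # performance of the empty subset
    for k in range(n):
        curr_score = performance(perm[: k + 1])
        # Marginal contribution of the point added at position k
        shapley[perm[k]] += curr_score - prev_score
        prev_score = curr_score
shapley /= n_permutations

print("Five most valuable training points:", np.argsort(-shapley)[:5])
```

Each point's Shapley value is its average marginal contribution across random orderings, which is what gives the method its equitable properties; the original paper additionally truncates each permutation once the score stops changing, to save computation.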
The leave-one-out (LOO) method is one of the most common approaches to valuing individual training data points based on model performance. In this method, the value of each data point is found by removing that point from the training dataset and measuring the model performance before and after the removal; the value of the removed point is the difference between the two. In other words, LOO evaluates how much model performance changes once a data point is removed.
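Here is a minimal sketch of LOO valuation, assuming a scikit-learn-style model; the dataset and model are placeholders chosen only for illustration.

```python
# A minimal leave-one-out (LOO) data valuation sketch (illustration only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=150, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
n = len(X_tr)

# Baseline: model performance with the full training set
full_score = LogisticRegression(max_iter=200).fit(X_tr, y_tr).score(X_val, y_val)

loo_values = np.zeros(n)
for i in range(n):
    idx = np.delete(np.arange(n), i)     # all training points except point i
    score = LogisticRegression(max_iter=200).fit(X_tr[idx], y_tr[idx]).score(X_val, y_val)
    loo_values[i] = full_score - score   # large value => removing i hurts performance

print("Five most valuable training points:", np.argsort(-loo_values)[:5])
```

Note that this re-trains the model once per data point, which is exactly the computational cost discussed next.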
The LOO method has a key limitation: it is computationally infeasible, on anything but small datasets, to re-train and re-evaluate the model once for every single data point.
Another limitation of the LOO method is that it does not satisfy the equitable valuation conditions listed in the previous section (Data Shapley).
Here is a summary of what you learned in this post about evaluating or assessing the value of data points using different frameworks, thereby determining the data quality:

- High-quality data is critical for training high-performance machine learning models; data quality is frequently cited as the biggest barrier to machine learning.
- DVRL is a meta-learning framework from Google that uses reinforcement learning to estimate data values and select the most valuable training samples; of the three frameworks covered here, it was found to perform the best.
- Data Shapley equitably quantifies the value of individual training data points and uniquely satisfies natural properties of equitable valuation.
- Leave-one-out (LOO) values each point by the change in model performance when it is removed, but it is computationally expensive and does not satisfy the equitable valuation conditions.