In this post, you will learn about some of the key data quality challenges that need to be dealt with in a consistent and sustained manner to ensure high-quality machine learning models. Here, a high-quality model is one that generalizes well, meaning it has a lower true prediction error on unseen data or on data drawn from the larger population. As a data science architect or quality assurance (QA) professional responsible for the quality of machine learning models, you need to understand these challenges and plan appropriate development processes to deal with them.
Here are some of the key data quality challenges which need to be tackled appropriately in order to ensure high performance models:
- Data interdependencies
- Varied data ownership vs model ownership
- Low-valued data management issues
- Feature management issues
Data Interdependencies
Machine learning models require a training data set that includes different features so that a high-performance model can be trained. The data / features used for training the model do not work in isolation; they are highly interdependent. In fact, a particular group of features comprising both high-valued (high-importance) and low-valued data can result in a high-performance model. Here are a few data interdependency scenarios that impact model performance:
- Data distribution of one or more features changes
- One or more features get removed
- One or more features get added
This makes it challenging to keep model performance consistent, because a change in the data for any of the reasons cited above can change the model performance in an unpredictable manner.
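One way to notice the first scenario, a change in the data distribution of a feature, is to compare the values seen at training time with recent values. The following is a minimal sketch using the two-sample Kolmogorov-Smirnov statistic in plain Python. The function names and the 0.3 threshold are assumptions made for illustration, not part of any particular library, and a real threshold would need tuning per feature.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Return the two-sample KS statistic: the largest gap between the empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # Both CDFs are step functions, so the gap can only change at an observed value.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def has_drifted(reference, current, threshold=0.3):
    """Flag a feature whose KS statistic exceeds a hand-picked (hypothetical) threshold."""
    return ks_statistic(reference, current) > threshold


# Hypothetical feature values at training time vs. values observed later.
training_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
recent_values = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
print(ks_statistic(training_values, recent_values))
print(has_drifted(training_values, recent_values))
```

A statistic near 0 means the two samples look alike, while a value near 1 means they barely overlap. Running such a check per feature at regular intervals gives an early signal before the model performance degrades.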
Here are a few techniques that can help address the challenge posed by data interdependencies:
- Use model ensembles, which help reduce the variance introduced by the data set
- Use regularization (L1 regularization in particular), which helps make sure that only features with high importance end up in the model
- Use model interpretability techniques to assess the data interdependencies
Varied Data Ownership vs Model Ownership
Many times, the team owning the data and the team building the models are not one and the same. This poses a challenge in ensuring high model performance in a consistent and sustained manner. The product teams change the data in response to changes in the business landscape and, more often than not, these changes are not communicated to the data science (DS) team on a priority basis. The change management team therefore needs to lay down the related processes. The change in the data could be related to some of the following:
- Data distribution related to a specific feature may change
- One or more data sets representing different features may become obsolete to the business. Ideally, these should be removed from modeling, so the DS team needs to be informed.
- One or more new data sets get introduced in the business. The DS team needs to be informed so that it can do feature engineering around the new data and include the related features appropriately.
The data and model ownership can differ in setups such as the following:
- Different product teams work with a horizontal data science team. The product teams and the data science team are internal to the organization. The ownership of data lies with the product teams while the ownership of models rests with the data science team.
- Product teams and the data science team belong to partner organizations. In this setup, it is even more challenging to make sure data owners and model owners communicate with each other at regular intervals about changes in data.
- In a SaaS-based setup, the data lies with the end customer and the DS team has to work hard to get the data from the customers at the right time.
Here are a few techniques that can help address the challenge posed by different ownership:
- A change management strategy for data needs to be put in place and agreed upon by the different teams. Regular change management meetups should take place to review the data landscape.
- Data versioning should be put in place so that data can be compared when it changes and appropriate action, such as retraining the model, can be taken.
- Regular collaboration / communication should take place between the data owners (product teams, customers) and data science teams. The accompanying cartoon represents the aspects of regular communication / collaboration between owners of data and models 🙂
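The data versioning idea can be sketched as a small snapshot of each data delivery: a per-feature summary plus a hash that serves as a version identifier. Comparing two snapshots then shows which features were added, removed, or shifted. This is a minimal illustration under stated assumptions: rows are dictionaries of numeric values, the helper names `snapshot` and `diff_snapshots` are hypothetical, and the 10% mean tolerance is arbitrary.

```python
import hashlib
import json
from statistics import mean, pstdev


def snapshot(rows):
    """Summarize a data set (list of dicts with numeric values) as a versioned snapshot."""
    columns = sorted({key for row in rows for key in row})
    stats = {}
    for col in columns:
        values = [row[col] for row in rows if col in row]
        stats[col] = {
            "count": len(values),
            "mean": round(mean(values), 6),
            "std": round(pstdev(values), 6),
        }
    # Hash the summary so that any change in schema or statistics yields a new version id.
    payload = json.dumps(stats, sort_keys=True)
    version = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return {"version": version, "stats": stats}


def diff_snapshots(old, new, mean_tolerance=0.1):
    """Report features that were added, removed, or whose mean moved beyond the tolerance."""
    old_cols, new_cols = set(old["stats"]), set(new["stats"])
    shifted = []
    for col in sorted(old_cols & new_cols):
        old_mean = old["stats"][col]["mean"]
        new_mean = new["stats"][col]["mean"]
        scale = abs(old_mean) or 1.0  # avoid dividing by zero for zero-mean features
        if abs(new_mean - old_mean) / scale > mean_tolerance:
            shifted.append(col)
    return {
        "added": sorted(new_cols - old_cols),
        "removed": sorted(old_cols - new_cols),
        "shifted": shifted,
    }


# Hypothetical deliveries from a product team, before and after a business change.
v1 = snapshot([{"age": 30, "income": 100}, {"age": 40, "income": 200}])
v2 = snapshot([{"age": 30, "clicks": 5}, {"age": 80, "clicks": 7}])
print(v1["version"], v2["version"])
print(diff_snapshots(v1, v2))
```

A non-empty diff is a natural trigger for the change management conversation and, where needed, for retraining the model.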
Low-valued Data Management Issues
Many times, the data used for training the models is found to have low value, in the sense that it results in only an incremental increase in model performance. You may check my related post on data quality assessment frameworks, which talks about what the value of data is, how it impacts model performance, and how to determine it.
The following are some of the reasons why low-value data gets included for model training:
- In the initial stages of model development, due to a lack of data, low-value data that contributes only a little to overall model performance gets included. As time passes, one forgets to remove this data from the modeling. This impacts the model performance (the model fails to generalize to unseen data), in addition to the cost involved in maintaining / collecting the low-value data set. If the data distribution of such a low-valued data set changes, the model performance is impacted and the issue becomes hard to debug.
- Many times, a low-value data set results in good model performance when combined with other data sets. As time passes, new data comes into the picture and improves performance further. However, the low-value data does not get removed, which poses a challenge in terms of impact on model performance.
- Many times, one achieves a minor gain by adding new low-value features. This, however, impacts the model performance in the longer term, as the low-value data sticks around without data scientists making an effort to remove it.
Here is how the challenge posed by low-valued data management issues can be addressed:
- Monitor the low-valued data sets at regular intervals by determining their impact on overall model performance, and remove such data from the training data set. Tools and techniques such as Google's data value estimator, Data Shapley and the leave-one-out technique can be used for evaluating the value of data against the model performance metric. The details can be found in this post: Assessing the data value vis-a-vis model performance. The accompanying picture represents cleaning up low-valued data from the training data set at regular intervals.
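To make the leave-one-out idea concrete, here is a rough sketch that applies it at the feature level: the value of a feature is taken as the validation accuracy with all features minus the accuracy with that feature left out. Applying leave-one-out to features rather than to individual training points is an assumption made here to match the discussion of low-valued features. The nearest-centroid classifier is only a stand-in for whatever model is actually in use, and all names are hypothetical.

```python
def nearest_centroid_accuracy(train, valid, features):
    """Fit a nearest-centroid classifier on `features` and return validation accuracy."""
    sums, counts = {}, {}
    for row, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, name in enumerate(features):
            acc[i] += row[name]
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

    def predict(row):
        def sq_dist(label):
            return sum((row[name] - c) ** 2 for name, c in zip(features, centroids[label]))
        # Sorting the labels makes tie-breaking deterministic.
        return min(sorted(centroids), key=sq_dist)

    hits = sum(1 for row, label in valid if predict(row) == label)
    return hits / len(valid)


def leave_one_out_values(train, valid, features):
    """Value of a feature = accuracy with all features minus accuracy without it."""
    baseline = nearest_centroid_accuracy(train, valid, features)
    values = {}
    for name in features:
        remaining = [f for f in features if f != name]
        values[name] = baseline - nearest_centroid_accuracy(train, valid, remaining)
    return values


# Hypothetical data: "signal" separates the classes, "noise" does not.
train = [
    ({"signal": 0.0, "noise": 1.0}, "a"),
    ({"signal": 0.2, "noise": 0.0}, "a"),
    ({"signal": 1.0, "noise": 1.0}, "b"),
    ({"signal": 0.8, "noise": 0.0}, "b"),
]
valid = [
    ({"signal": 0.1, "noise": 0.3}, "a"),
    ({"signal": 0.9, "noise": 0.7}, "b"),
    ({"signal": 0.3, "noise": 0.9}, "a"),
    ({"signal": 0.7, "noise": 0.1}, "b"),
]
values = leave_one_out_values(train, valid, ["signal", "noise"])
print(values)
```

Features whose value stays near zero (or below zero) across several monitoring runs are candidates for removal from the training data set. Note that leave-one-out can understate the value of features that are only useful in combination, which is one reason Shapley-style methods are also used.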
Tracking Data / Feature Usage in Different Models
One of the key data quality challenges is to track the quality of the data / features used in different models. In a SaaS-based setup, the same use case gets adopted by different clients using a master list of features and algorithms. When use cases / solutions are adopted by a large number of clients, it can become a challenge to keep track of which set of features got used for which models and whether all high-valued / high-quality features are being used by all the models. Failing to do so will impact model performance in the longer run.
Here is a technique that can help address the challenge posed by the lack of feature tracking.
What is required is some sort of Model Feature Catalog that keeps track of models, algorithms, features, model performance and feature / data importance. Doing so would help in governing data quality by tracking features in a consistent manner. The accompanying picture represents the need to track the features across different models in production.
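A Model Feature Catalog of this kind could start out as a small in-memory registry. The sketch below is one possible shape, not a reference design: the class and method names, the choice of performance metric and the 0.2 importance cutoff are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    name: str
    algorithm: str
    performance: float  # e.g. a validation metric; which metric is an assumption
    feature_importance: dict = field(default_factory=dict)  # feature name -> importance


class ModelFeatureCatalog:
    """Tracks which features each model uses and how much they matter."""

    def __init__(self):
        self._models = {}

    def register(self, record):
        self._models[record.name] = record

    def models_using(self, feature):
        return sorted(n for n, r in self._models.items() if feature in r.feature_importance)

    def high_value_features(self, min_importance=0.2):
        """Features that reach the importance cutoff in at least one model."""
        return sorted({
            feature
            for record in self._models.values()
            for feature, score in record.feature_importance.items()
            if score >= min_importance
        })

    def missing_high_value_features(self, min_importance=0.2):
        """Map each model to the high-value features it does not use."""
        high = set(self.high_value_features(min_importance))
        gaps = {}
        for name, record in self._models.items():
            missing = sorted(high - set(record.feature_importance))
            if missing:
                gaps[name] = missing
        return gaps


# Hypothetical models built for two clients from the same master feature list.
catalog = ModelFeatureCatalog()
catalog.register(ModelRecord("churn_client_a", "gradient_boosting", 0.86,
                             {"tenure": 0.5, "plan": 0.3, "region": 0.05}))
catalog.register(ModelRecord("churn_client_b", "logistic_regression", 0.79,
                             {"tenure": 0.6, "region": 0.1}))
print(catalog.models_using("plan"))
print(catalog.missing_high_value_features())
```

In practice the records would live in a database or a model registry rather than in memory, but the same queries apply: which models use a given feature, and which models are missing features that proved valuable elsewhere.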
Conclusions
Here is a summary of what you learned in this post about data quality challenges for machine learning and how to deal with them:
- One of the key data quality challenges is to understand the data interdependencies in the data set used for training the model. Techniques to deal with the challenges posed by data interdependencies include ensembles, regularization and model interpretability.
- Data quality is found to be impacted when data and models have different owners. Most likely, this would prevail in many real-world scenarios. Techniques to deal with the challenges posed by data / model ownership issues include a change management strategy, data versioning and regular collaboration / communication.
- Another data quality challenge is posed by using low-valued data in training the model and not filtering it out from time to time. The technique that can be used to deal with this challenge is regular monitoring of data quality from a data value perspective.