Detecting bias in machine learning models has become increasingly important. A biased model makes predictions that systematically advantage certain privileged groups and systematically disadvantage certain unprivileged groups. The primary cause of unwanted bias is bias in the training data, arising either from prejudice in the labels or from under-sampling or over-sampling of the data. In the banking, finance, and insurance industries in particular, customers, partners, and regulators are asking businesses tough questions about the steps they take to avoid and detect bias. One example is a system that uses a machine learning model to predict who is likely to re-offend (recidivism is the tendency of a convicted criminal to re-offend). You may want to check one of our related articles on understanding AI/Machine Learning Bias using Examples.
In this post, you will learn about a bias detection technique that uses the FairML framework, which can be used to detect and test for the presence of bias in machine learning models.
FairML detects bias by finding the relative significance (importance) of the features used in the machine learning model. If a feature is one of the protected attributes, such as gender, race, or religion, and is found to have high significance, the model is said to be overly dependent on that feature. This implies that the feature representing the protected attribute plays an important role in the model's predictions, so the model can be said to be biased and hence unfair. If the feature is found to have low importance, the model's dependence on it is very low, and the model cannot be said to be biased with respect to that feature. The following diagram represents the usage of FairML for bias detection:
To find the relative significance of the features, FairML uses four ranking algorithms: iterative orthogonal feature projection, minimum redundancy maximum relevance (mRMR), LASSO, and random forest. It then evaluates a combined ranking of feature significance from the rankings produced by each of these algorithms.
Iterative orthogonal feature projection works as follows: for each feature in turn, the feature is removed from the dataset and the remaining features are made orthogonal to it, and the model's predictions on this transformed dataset are compared with its predictions on the initial dataset.
If the difference between the predictions made using the initial dataset and the transformed dataset is statistically significant, the feature can be said to be of high significance/importance.
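The idea can be sketched in a few lines of NumPy. This is an illustrative sketch, not FairML's own code: the function name `projection_dependence`, the toy data, and the toy `predict` function are assumptions, and the mean absolute change in predictions stands in for a formal significance test.

```python
import numpy as np

def projection_dependence(predict, X, j):
    """Mean absolute change in predictions after removing feature j and
    the linear component of every other feature along it (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    direction = X[:, j] - X[:, j].mean()      # centered copy of feature j
    denom = direction @ direction
    X_t = X.copy()
    if denom > 0:
        # Regression slope of every column on feature j
        coeffs = (X.T @ direction) / denom
        # Subtract each column's projection onto feature j
        X_t = X - np.outer(direction, coeffs)
    X_t[:, j] = X[:, j].mean()                # feature j carries no information now
    return float(np.mean(np.abs(predict(X) - predict(X_t))))

# Toy data: x1 is a close proxy for x0, x2 is independent of both.
rng = np.random.default_rng(0)
n = 500
x0 = rng.normal(size=n)
x1 = x0 + 0.1 * rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2])

def predict(data):
    # Toy "black-box" model that reads only feature 0
    return 2.0 * data[:, 0]

scores = [projection_dependence(predict, X, j) for j in range(X.shape[1])]
print(scores)
```

In this toy setup the model never reads feature 1 directly, yet its score is large because feature 1 is a proxy for feature 0. Catching such indirect dependence is the reason for making the other features orthogonal rather than simply dropping the column.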
In the minimum redundancy, maximum relevance (mRMR) technique, the idea is to select features that correlate most strongly with the target variable while being mutually far away from previously selected features. Selection based on maximum relevance alone picks the features that correlate strongly with the target variable. However, input features are often correlated with each other, and when one of them is already significant to the prediction model, the others act as redundant features. The idea is to exclude such redundant features: include features that have maximum relevance to the output variable but minimum redundancy with the other input variables. Heuristic algorithms such as sequential forward, backward, or bidirectional selection can be used to implement mRMR.
The mRMR ranking module in FairML was programmed in R, leveraging the mRMRe package available on CRAN.
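A greedy forward-selection version of mRMR can be sketched as follows. This is a minimal illustration and not the mRMRe implementation: the function name `mrmr_rank` is hypothetical, and absolute Pearson correlation is assumed as the measure of both relevance and redundancy.

```python
import numpy as np

def mrmr_rank(X, y):
    """Rank features greedily by relevance minus mean redundancy
    with the features already selected (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    # Absolute correlation matrix of all features plus the target
    corr = np.abs(np.corrcoef(np.column_stack([X, y]), rowvar=False))
    relevance = corr[:n_features, -1]          # |corr(feature, target)|
    pairwise = corr[:n_features, :n_features]  # |corr(feature, feature)|

    selected = [int(np.argmax(relevance))]     # start with the most relevant feature
    remaining = [k for k in range(n_features) if k != selected[0]]
    while remaining:
        scores = [relevance[k] - pairwise[k, selected].mean() for k in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: x1 is a noisy copy of x0 (redundant), x2 carries new information.
rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = x0 + 0.5 * rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
y = x0 + 0.5 * x2

ranking = mrmr_rank(X, y)
print(ranking)
```

Although feature 1 correlates more strongly with the target than feature 2 does, it is ranked last here because it adds little beyond feature 0.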
LASSO stands for Least Absolute Shrinkage and Selection Operator. Linear regression uses the ordinary least squares (OLS) method to estimate the coefficients of the features. However, OLS estimates tend to have low bias but high variance. Regularization can improve prediction accuracy by shrinking the coefficients of insignificant features to near zero (ridge regression) or to exactly zero (LASSO regression). Ridge regression therefore helps in estimating which features are important, whereas LASSO regression firms up the most important features because the coefficients of non-significant features are set to 0.
For the LASSO ranking, FairML uses the implementation provided by the popular scikit-learn package for Python.
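As a rough illustration of ranking features by LASSO coefficients with scikit-learn, consider the sketch below. The synthetic data, the `alpha=0.1` penalty, and the choice to standardize the features first are assumptions made for the example, not FairML's settings.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only features 0 and 1 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Standardize so that coefficient magnitudes are comparable across features
X_std = StandardScaler().fit_transform(X)

model = Lasso(alpha=0.1).fit(X_std, y)

# Absolute coefficient size serves as the importance score
importance = np.abs(model.coef_)
ranking = np.argsort(importance)[::-1]   # most important feature first
print(importance, ranking)
```

With this setup, the L1 penalty is expected to drive the coefficients of the two irrelevant features to zero, leaving features 0 and 1 at the top of the ranking.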
Random forest, an ensemble modeling technique, is used to determine feature importance by making use of some of the following techniques:
For the random forest ranking, FairML uses the implementation provided by the popular scikit-learn package for Python.
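A minimal sketch of a random forest ranking with scikit-learn is shown below. The synthetic data and hyperparameters are assumptions for illustration. The `feature_importances_` attribute used here is scikit-learn's impurity-based importance, which may or may not match the exact measure FairML relies on.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: feature 0 matters most, feature 1 less, features 2 and 3 not at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1
importance = forest.feature_importances_
ranking = np.argsort(importance)[::-1]   # most important feature first
print(importance, ranking)
```

If one of the highly ranked columns represented a protected attribute, that would be the signal of over-dependence described earlier in the post.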
In this post, you learned about the FairML framework and the techniques it uses to detect bias in a machine learning model. Primarily, if it is claimed that a model is biased against a specific class of data, FairML helps you determine the relative significance of the features representing those attributes and so provide an explanation in response to the claim.