Data readiness levels (DRLs) and the assessments behind them are an important part of data analytics. A data readiness level describes the quality and maturity of a dataset as one of several stages. Data science is becoming increasingly popular, but not every company has data that is ready for this type of work. Assessing data readiness levels gives insight into the quality and quantity of your current datasets and helps gauge the likely success of a data analytics project. This blog post explains what data readiness levels are and why assessing them matters.
What are data readiness levels?
Data readiness is the degree to which data is ready for a particular use, such as building AI / machine learning models. Knowing the data readiness level of a dataset lets project stakeholders act early to mitigate risks caused by missing or unsuitable data. To determine data readiness levels (DRLs), data readiness assessment tests are performed at different stages of project execution, starting at the beginning of the project and continuing as implementation moves along. The resulting data readiness reports are shared with key stakeholders, including the data science, engineering and business teams, so that they can remain confident in the decisions they make based on the data. Data readiness for a project or product is assessed in three bands:
- Band A: Band A represents the assessment of data utility in terms of data appropriateness to solve the desired business problem.
- Band B: Band B represents the assessment of data validity in terms of data completeness and data correctness (data quality). The activities used to evaluate readiness at band B resemble the exploratory data analysis done while building ML models. At B1, the data is ready to be used to build machine learning models.
- Band C: Band C represents the assessment of data in terms of aspects such as accessibility, privacy, legalities and format. Band C has levels ranging from C1 to C4, with C4 being the lowest. At C4, the data is assumed to exist but its existence has not been verified. At C1, the data is ready to be loaded into analysis software or made available for others to access. It is machine readable, and ethical procedures for data handling have been addressed. Bringing data to C1 often requires significant effort involving many lines of code and human understanding of systems, ethics and the law.
The data readiness level assessment starts at band C, moves on to band B and finishes at band A.
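The ordering of bands and levels described above can be made explicit in code. The following is a minimal sketch, not a standard implementation: the `ReadinessLevel` class and its method names are hypothetical, and it assumes only what is stated above, namely that the assessment proceeds from band C to B to A and that level 1 is the most ready level within a band.

```python
from dataclasses import dataclass

# Bands in the order the assessment proceeds: C first, then B, then A.
BAND_ORDER = {"C": 0, "B": 1, "A": 2}


@dataclass(frozen=True)
class ReadinessLevel:
    """Hypothetical helper representing a label such as 'C4' or 'B1'."""
    band: str   # "A", "B" or "C"
    level: int  # 1 is the most ready level within a band

    def __post_init__(self):
        if self.band not in BAND_ORDER:
            raise ValueError(f"unknown band: {self.band!r}")
        if self.level < 1:
            raise ValueError("level must be 1 or greater")

    @classmethod
    def parse(cls, label: str) -> "ReadinessLevel":
        """Parse a label such as 'C4' into a band and a level number."""
        label = label.strip().upper()
        return cls(band=label[0], level=int(label[1:]))

    def sort_key(self):
        # Later bands rank higher; within a band, a smaller number ranks higher.
        return (BAND_ORDER[self.band], -self.level)

    def is_at_least(self, other: "ReadinessLevel") -> bool:
        """True if this level is as ready as, or more ready than, `other`."""
        return self.sort_key() >= other.sort_key()

    def __str__(self):
        return f"{self.band}{self.level}"


current = ReadinessLevel.parse("C2")
print(current.is_at_least(ReadinessLevel.parse("C4")))  # True
print(current.is_at_least(ReadinessLevel.parse("B1")))  # False
```

A comparison like this could be used to gate a project stage, for example refusing to start model building until the dataset is at least at B1.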
Why are data readiness assessment tests important?
Data that enters a machine learning (ML) pipeline is pre-processed by various stakeholders, each in their own way, using tools (such as Jupyter Notebook and RStudio) and methods (such as exploratory data analysis). The ad-hoc and iterative nature of this work limits reuse and reduces productivity. Data practitioners such as data analysts and data scientists spend a significant share of their time exploring and tackling data accessibility and validity issues. This is due to a lack of knowledge about the problems that the incoming data poses, and about whether it has been modified and, if so, by whom. What is needed is a practice that assesses data readiness well in advance and at regular intervals. This is where the concept of a data readiness report comes in.
A data readiness report documents a data quality and readiness assessment. It gives data consumers such as data scientists and ML engineers regular, detailed insight into the quality of input data across standardized dimensions. It serves as a comprehensive record of data properties and quality issues, including the operations performed on the data by various personas, and so shows how the data has evolved.
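To make the idea of a data readiness report more concrete, here is one possible shape for such a record. This is a sketch under assumptions: the `ReadinessReport` and `Operation` classes, their field names, and the example dataset and values are all made up for illustration and do not come from any published report format.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Operation:
    """One change made to the data (hypothetical structure)."""
    performed_by: str  # persona, e.g. "data engineer"
    description: str
    performed_on: str  # ISO date string


@dataclass
class ReadinessReport:
    """Hypothetical report: quality issues per dimension plus a change history."""
    dataset: str
    assessed_on: str
    issues: dict = field(default_factory=dict)      # dimension -> list of issues
    operations: list = field(default_factory=list)  # ordered history of changes

    def add_issue(self, dimension: str, issue: str) -> None:
        self.issues.setdefault(dimension, []).append(issue)

    def log_operation(self, performed_by: str, description: str, performed_on: str) -> None:
        self.operations.append(Operation(performed_by, description, performed_on))

    def to_json(self) -> str:
        # asdict() converts nested dataclasses, lists and dicts recursively.
        return json.dumps(asdict(self), indent=2)


# Illustrative values only; not real findings.
report = ReadinessReport(dataset="customer_orders", assessed_on="2024-01-15")
report.add_issue("validity", "some rows are missing a delivery date")
report.log_operation("data engineer", "dropped duplicate order ids", "2024-01-10")
print(report.to_json())
```

Keeping the operations as an ordered list is what lets the report show how the data has evolved and who changed it.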
How to perform a data readiness levels assessment test?
Data readiness is evaluated at regular intervals to help ensure the success of data analytics projects, including advanced analytics and machine learning projects. The following are some of the criteria / parameters against which the evaluation needs to be carried out.
- Data accessibility: Data accessibility measures how easy it is to get access to the data. You will need to assess accessibility-related aspects such as the following:
  - Programmatic access: Is there a programmatic interface available for data access?
  - Data format: Is the data available in an appropriate format for easy access? Often, data is in PDF or Word format rather than a machine-readable format. Such data needs to be converted into a format such as plain text so that programs can easily read and process it.
  - Data licenses: Are suitable licenses available to access the data if it is procured from a third party? This relates to the legal aspects of data acquisition.
  - Ethics & safety: Is it safe and ethical to give different teams access to the historical and current data? For example, it may not be appropriate to give an engineering team access to medical information.
  - Secured data access: Is the data secured after the access permissions are provided?
  - Usage restrictions / Access permissions: Does the team have appropriate permissions to read, write and execute?
- Data validity: Data validity measures whether the available dataset is valid for use in applications.
  - Basic metadata check: Is the required metadata available? This includes basic information about the dataset such as name, description, date of creation, data ownership, contact person and type (structured, unstructured etc.).
  - Data profile evaluation: Has the data profile been gathered and evaluated? This includes capturing the basic characteristics and statistical properties of the input data. For instance, for structured data, a data profile could cover the number of rows and columns and a description of each column, including data type, minimum value, maximum value, missing data percentage, number of unique values and other statistical measures such as column correlations.
  - Data quality profile evaluation: The data must be accurate and up to date for any analysis or computation involving it. Datasets should be as complete as possible, since missing values affect the accuracy of later predictions and computations. In addition, aspects such as class imbalance, outliers, inconsistent values and data bias are evaluated.
- Data utility: Data utility is assessed to determine whether the data is appropriate for solving the desired business problem. A business problem can be broken down into business sub-problems (questions), and the data set aside for the project must be appropriate for answering all of them. Data that is appropriate is at level A1. The assessment at band A yields the requirements for data collection.
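Some of the data validity checks above, such as the data profile and the class imbalance check, are easy to automate. The following is a minimal sketch that uses only the Python standard library. The function names (`profile_column`, `profile_table`, `class_balance`) and the sample rows are hypothetical, and the sketch assumes a table held in memory as a list of dictionaries with `None` marking a missing value.

```python
from collections import Counter


def profile_column(values):
    """Summarise one column given as a list; None marks a missing value."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    summary = {
        "rows": len(values),
        "missing_pct": 100.0 * missing / len(values) if values else 0.0,
        "unique": len(set(present)),
    }
    # Report min and max only when every present value is a number.
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if present and len(numeric) == len(present):
        summary["min"] = min(numeric)
        summary["max"] = max(numeric)
    return summary


def profile_table(rows):
    """Profile a table given as a list of dicts (one dict per row)."""
    columns = sorted({name for row in rows for name in row})
    return {name: profile_column([row.get(name) for row in rows])
            for name in columns}


def class_balance(labels):
    """Return the share of each class label, ignoring missing labels."""
    counts = Counter(label for label in labels if label is not None)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()} if total else {}


# Made-up sample rows for illustration.
rows = [
    {"age": 34, "churned": "no"},
    {"age": None, "churned": "no"},
    {"age": 51, "churned": "yes"},
    {"age": 29, "churned": "no"},
]
profile = profile_table(rows)
balance = class_balance([row["churned"] for row in rows])
print(profile["age"])  # 25% missing, 3 unique values, min 29, max 51
print(balance)         # share of "no" is 0.75, share of "yes" is 0.25
```

In practice, a dataframe library would usually do this work, but the quantities computed are the same ones listed in the data profile and data quality profile bullets.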
Data readiness levels are an important component of data analytics projects. The assessment should cover three criteria: data accessibility, data validity and data utility. After completing these steps, you will have a better idea of whether your data is ready to solve the business problem at hand.