In this post, you will learn about interview questions that can be asked if you are going for a data science architect job. A data science architect needs to have knowledge of both data science/machine learning and cloud architecture. It also helps if the person is hands-on with programming languages such as Python and R. Without further ado, let's get into some of the common questions. I will add further questions over time.
Q1. How do you go about architecting a data science or machine learning solution for any business problem?
- Understand the business problem: First and foremost, a data science architect should work with product managers / business analysts to understand the business problem. He/she could be part of a design thinking workshop to surface the real problems, and could use analytical approaches such as breaking the problem down into sub-problems to get a holistic picture. He/she should be well versed in questioning techniques such as the 5 whys and the Socratic method, which help arrive at the actual problem.
- Lay down the hypotheses: Once the problem is well understood, one should lay down one or more hypotheses related to the solution of the problem. One can use value-complexity mapping to select the top three hypotheses to work on.
- Define the KPIs: The hypotheses map to decisions and actions, which in turn map to the final outcome. Each action is measured by leading KPIs, and the final outcome is measured by lagging KPIs. Thus, before going ahead, the most important leading and lagging KPIs must be defined for hypothesis validation.
- Identify the business levers: The business levers represent the inputs to the system that can influence the business outcome. These inputs include variables that can be controlled (levers that can be pulled) and variables that cannot.
- Collect the data: The next step is to determine what data you have and what you would need to collect.
- Design one or more models and combine them as the solution: Once the objective, input levers, and data are set, the final step is to design one or more models whose predictions can be combined to create a solution comprising the modeler, simulator, and optimizer.
In relation to the above, check out the related post on the drivetrain approach for machine learning, which also walks through examples of designing machine learning solutions using that approach.
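The value-complexity mapping mentioned in the hypothesis step can be sketched in a few lines of Python. The hypotheses and their 1-5 scores below are purely illustrative assumptions, not taken from any real project.

```python
# Hypothetical sketch: ranking candidate hypotheses on a value-complexity map.
# Higher value is better; lower complexity is better.

def rank_hypotheses(hypotheses, top_n=3):
    """Sort hypotheses by value (descending), breaking ties by complexity (ascending)."""
    ranked = sorted(hypotheses, key=lambda h: (-h["value"], h["complexity"]))
    return ranked[:top_n]

# Illustrative candidate hypotheses with assumed 1-5 scores
candidates = [
    {"name": "Churn prediction",    "value": 5, "complexity": 3},
    {"name": "Demand forecasting",  "value": 4, "complexity": 2},
    {"name": "Dynamic pricing",     "value": 5, "complexity": 5},
    {"name": "Ticket auto-routing", "value": 2, "complexity": 1},
]

top_three = rank_hypotheses(candidates)
for h in top_three:
    print(h["name"])
```

In practice the scores would come from workshops with stakeholders, and the ranking rule (here: value first, then simplicity) is one reasonable choice among several.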
Q2. How would you go about deploying a machine learning model in the cloud and serve predictions through APIs?
Here are the steps for deploying machine learning models in the cloud. The points below represent a couple of options for deploying models in the Amazon cloud (AWS).
- Deployment using Python Flask App
- Deploy the model file (say, a Python pickle file) to Amazon S3 storage.
- Create a Python Flask-based app that loads the model for serving predictions. The Flask app can be dockerized and deployed using Amazon Elastic Container Service (ECS).
- Expose the Python Flask app through a REST API. The REST API can be exposed using Amazon API Gateway.
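The Flask option above can be sketched as a minimal prediction service. To keep the sketch runnable, a trivial stand-in "model" replaces a real pickle downloaded from S3; the bucket, key, and threshold logic are all hypothetical.

```python
# Minimal sketch of a Flask prediction service serving a pickled model.
# In production the model file would be downloaded from Amazon S3 at startup,
# e.g. boto3.client("s3").download_file("my-bucket", "model.pkl", "model.pkl")
# (bucket/key names here are hypothetical).
import pickle
from flask import Flask, jsonify, request

class ThresholdModel:
    """Stand-in for a real trained model loaded from a pickle file."""
    def predict(self, features):
        # Toy rule: classify a row as 1 if its feature sum exceeds 10
        return [1 if sum(row) > 10 else 0 for row in features]

# Simulate loading the model from a pickle file
model = pickle.loads(pickle.dumps(ThresholdModel()))

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)   # e.g. {"features": [[1, 2, 3]]}
    preds = model.predict(payload["features"])
    return jsonify({"predictions": preds})
```

The app would typically be run behind a WSGI server such as gunicorn inside the Docker container, with API Gateway routing requests to the ECS service.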
- Deployment using Amazon Sagemaker
- Train the model using Amazon SageMaker Studio.
- Deploy the trained model to an endpoint right from within SageMaker; the endpoint can then be invoked, for example via an AWS Lambda function sitting behind API Gateway.
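Once a SageMaker endpoint is live, serving a prediction is a single API call. The sketch below is hedged: the endpoint name `churn-model-endpoint` and the JSON payload shape are hypothetical, and the runtime client is passed in so production code can use `boto3.client("sagemaker-runtime")` while tests can substitute a stub.

```python
# Sketch: invoking a deployed SageMaker endpoint for predictions.
import json

def get_prediction(runtime_client, endpoint_name, features):
    """Send a JSON payload to a SageMaker endpoint and parse the response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"features": features}),
    )
    return json.loads(response["Body"].read())

# Production usage (requires AWS credentials and a deployed endpoint):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   get_prediction(runtime, "churn-model-endpoint", [[1.0, 2.0]])
```

A Lambda function fronting the endpoint would typically wrap this call and return the result through API Gateway.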
Q3. What will be your governance strategy for machine learning-based solutions?
The governance strategy for machine learning-based solutions is about capturing data related to KPIs, tracking and monitoring those KPIs, and reporting them to stakeholders from time to time. KPIs can be leading as well as lagging. Lagging KPIs, also called value metrics, measure the business impact; leading KPIs measure the performance of the models so that appropriate actions can be taken if model accuracy dips below a particular threshold. While lagging KPIs are primarily tracked by product managers / BAs, leading KPIs such as model performance can be tracked by data science architects. One can use a red-amber-green (RAG) system to represent model performance and maintain a playbook of actions for each status. Note that the threshold accuracy ranges mentioned below are hypothetical and can vary based on your requirements.
- If the model accuracy is above 85% or so, the model can be tagged as green. Nothing needs to be done here.
- If the model accuracy stays in the range of, say, 70-85%, the model can be tagged as amber. One should examine the reason for the dip in accuracy and take appropriate action, such as re-training the model.
- If the model accuracy dips below 70%, the model can be tagged as red. In this case, the model should be replaced with the last best model, an alternate rules-based solution should be deployed, or there should be a provision for exception handling.
Q4. Talk about a cloud-based platform that the data science team could use for training machine learning models.
One can design the data science workbench using Amazon SageMaker Studio (an IDE for machine learning). It is a great tool and provides a cost-effective platform for training machine learning models. The best part is that it integrates easily with a data lake built on Amazon S3. There are viable options on other cloud platforms as well, such as Azure and Google Cloud.