In this post, you will learn about the data science concepts that product managers / business analysts need in order to create useful machine learning based solutions. These topics come up day to day when working with data science / machine learning teams, and understanding them well will help product managers / business analysts frame and solve machine learning based problems. The topics covered are:
- Understanding the difference between AI, machine learning, data science, deep learning
- Which problems are machine learning problems?
- Could AI problems be called machine learning problems?
- What are machine learning models?
- What does training / fitting a model mean?
- What are features or what is feature engineering?
- Defining business metrics is key
- What is model monitoring / retraining?
- What is model tuning or model hyperparameter tuning?
- Machine learning terminologies
Understand the difference between AI, Machine Learning & Data Science
Product managers / business analysts must understand the following terms clearly and without ambiguity.
- Artificial Intelligence (AI): AI is a broad term that covers machine learning, deep learning etc. Simply speaking, AI refers to computer systems that mimic human intelligence in order to solve a problem at hand. To solve a problem, humans either draw on past experience or apply a complex set of rules. In a similar manner, AI systems that use past experience (data) to solve a problem (make predictions) are called machine learning based systems. However, an AI system could just as well apply a set of complex rules, as humans do.
- Machine learning (ML): Machine learning can be defined as computer systems that learn from past experience (data) in order to solve a problem at hand (make predictions, classify or segment data etc). The performance of ML systems can be measured at regular intervals based on the quality of their predictions, and the systems can be retrained with new data (experiences) accordingly.
- Data Science: Data science can be defined as the discipline of forming hypotheses about relationships and patterns in data and testing them with mathematical modeling techniques. It includes solving problems (business-related or otherwise) in different domains using machine learning algorithms and computing technologies. Statistical tests are key to data science.
- Deep Learning: Deep learning is a class of machine learning techniques that use multi-layered neural networks, loosely inspired by the human brain, to solve different problems.
Which Problems are Machine Learning Problems?
It is of utmost importance for product managers to identify which problems are machine learning problems. Here are some rules of thumb for deciding whether a problem is a machine learning problem or not:
- Determine whether the solution to the problem is based on a set of rules that do not change frequently with the data, and whether those rules are straightforward. If the rules are straightforward, few in number and stable, you may not need a machine learning based solution.
- If the solution depends on rules that are hard to determine and that change as the data distribution changes, one may opt for a machine learning based solution.
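To make the two rules of thumb concrete, here is a minimal sketch in Python. The scenario (flagging orders for manual review), the function names and the numbers are all made up for illustration. The first function is a fixed business rule; the second learns its cut-off from historical examples, which is the essence of a machine learning approach.

```python
# Hypothetical scenario: should an order be sent for manual review?

def rule_based_review(amount):
    """A fixed, simple business rule: no machine learning needed."""
    return amount > 1000.0


def learn_threshold(amounts, needs_review):
    """Pick the cut-off that misclassifies the fewest historical orders."""
    best_threshold, fewest_errors = 0.0, len(amounts) + 1
    for candidate in sorted(amounts):
        # An order is predicted "needs review" when amount >= candidate.
        errors = sum(
            (amount >= candidate) != label
            for amount, label in zip(amounts, needs_review)
        )
        if errors < fewest_errors:
            best_threshold, fewest_errors = candidate, errors
    return best_threshold


# Made-up historical data: the rule is inferred from examples, not hand-written.
history_amounts = [120.0, 300.0, 450.0, 700.0, 820.0, 990.0]
history_labels = [False, False, False, True, True, True]

learned_threshold = learn_threshold(history_amounts, history_labels)
print(learned_threshold)
```

If the historical data changes, the learned cut-off changes with it, while the hand-written rule has to be edited by someone.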
Could AI problems be called machine learning problems?
Not necessarily. AI problems that are solved by applying machine learning techniques can be termed machine learning problems. However, there are also problems that can be solved with a large set of complex rules, which need a computer only because there are too many rules for humans to apply by hand. Such problems do not involve learning from data.
What are Machine Learning Models?
“Models” is the term frequently used for the computing entity that serves the predictions. It must be clearly understood that “models” are nothing but “mathematical models”. Machine learning is used to learn parametric or non-parametric mathematical models. Parametric models have a fixed form, and training them means determining a fixed set of coefficients / parameters. Examples are linear regression and logistic regression. Non-parametric models do not assume a fixed form with a fixed number of parameters; their structure is determined by the training data. Examples are decision trees and random forests. The terms “model” and “algorithm” are often used interchangeably, although strictly speaking the algorithm is the training procedure and the model is what it produces.
What does training / fitting a model mean?
Training or fitting a model means using past / historical data to determine the model with a machine learning algorithm. For example, training a linear regression model means using data to determine the coefficients of the linear equation.
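As an illustration of what "determining the coefficients" means, here is a small sketch that fits a straight line to made-up data using the ordinary least squares formulas. The data (house size versus price) and the function name are assumptions for the example; a data science team would normally use a library rather than hand-written code.

```python
def fit_linear_regression(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: house size (square metres) and price (thousands).
sizes = [50.0, 60.0, 80.0, 100.0]
prices = [150.0, 180.0, 240.0, 300.0]

# "Training" is the step that turns data into coefficients.
slope, intercept = fit_linear_regression(sizes, prices)

# "Prediction" is applying the learned coefficients to a new input.
predicted_price = slope * 70.0 + intercept
print(slope, intercept, predicted_price)
```

The two numbers returned by the function are the "model"; everything else is the algorithm used to find them.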
What are features or what is feature engineering?
“Features” and “feature engineering” are among the most frequently used terms when you deal with a data science team. Here is a summary of what they mean:
- Features are the attributes that are used to make predictions. Features can be raw (straight out of the database) or derived (computed from raw features). For example, height in centimetres can be a raw feature, whereas a category such as tall, medium or short is a derived feature obtained by comparing the height (cm) against different thresholds.
- Feature engineering is the process of creating and selecting the features that are most useful for training a machine learning model. Feature engineering involves some of the following activities:
  - Deriving features from raw features
  - Extracting features from raw / derived features
  - Determining feature importance
  - Selecting the most appropriate features based on different techniques
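The height example above can be sketched in a few lines of Python. The thresholds (160 cm and 180 cm) and the record layout are assumptions chosen only for illustration.

```python
def height_category(height_cm):
    """Derive a categorical feature from the raw height in centimetres."""
    # Assumed thresholds; a real project would agree these with the business.
    if height_cm < 160:
        return "short"
    if height_cm < 180:
        return "medium"
    return "tall"


# Raw features, as they might come straight out of a database.
people = [
    {"name": "A", "height_cm": 152},
    {"name": "B", "height_cm": 171},
    {"name": "C", "height_cm": 188},
]

# Feature engineering step: add the derived feature to every record.
for person in people:
    person["height_category"] = height_category(person["height_cm"])

print([person["height_category"] for person in people])
```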
Defining Business Metrics / KPIs is Key
When a problem has been identified as a machine learning problem, the first thing product managers / business analysts should do is define the business metrics or key performance indicators (KPIs) that will be evaluated at regular intervals to assess the performance of the machine learning models. In my experience, the metrics often remain vague until the models are deployed. This is one of the primary reasons why the business is unable to determine the ROI of machine learning deployments, and as a result the projects are likely to get shelved.
Business metrics can be defined in terms of technical metrics or independently of them. For instance, suppose you are designing a worklist prioritization system in which a machine learning recommendation model scores the work items in the worklist. A business metric could be the number of analysis hours saved due to worklist prioritization. Alternatively, business metrics can be based on technical metrics such as the accuracy, precision or recall with which worklist items are classified by, say, business criticality.
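For readers who have not met these technical metrics before, the sketch below computes accuracy, precision and recall from a list of actual outcomes and model predictions. The data is made up: `True` stands for "work item is business critical", and the function name is hypothetical.

```python
def classification_metrics(actual, predicted):
    """Compute accuracy, precision and recall for a yes/no classifier."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a and p for a, p in pairs)
    false_pos = sum((not a) and p for a, p in pairs)
    false_neg = sum(a and (not p) for a, p in pairs)
    true_neg = sum((not a) and (not p) for a, p in pairs)
    return {
        # Share of all items that were classified correctly.
        "accuracy": (true_pos + true_neg) / len(pairs),
        # Of the items flagged as critical, how many really were critical.
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        # Of the truly critical items, how many the model managed to flag.
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }


# Made-up evaluation data for eight work items.
actual = [True, True, True, False, False, False, False, True]
predicted = [True, True, False, False, False, True, True, True]

metrics = classification_metrics(actual, predicted)
print(metrics)
```

Which of the three matters most is a business decision: missing a critical item hurts recall, while flagging too many harmless items hurts precision.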
What is Model Monitoring / Retraining?
Once deployed in production, a model starts serving predictions. However, one needs to check how many of those predictions matched the actual values. This is termed model monitoring: evaluating model performance against pre-defined metrics at regular intervals.
If a model is found not to be performing well, it is scheduled for retraining. Besides training on newer data, retraining could involve some of the following:
- New algorithms can be used for model retraining
- New set of features can be used
- New values of hyperparameters can be used
- Different ensemble of models can be used
- One or more models can be used for different data segments
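A monitoring check can be as simple as the sketch below: compare predictions with actual values for each period and flag the periods where the metric falls below the agreed level. The weekly data, the 0.7 accuracy target and the function names are assumptions for illustration only.

```python
def accuracy(actual, predicted):
    """Share of predictions that matched the actual values."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


def weeks_needing_attention(weekly_results, min_accuracy):
    """Return the weeks whose accuracy fell below the agreed target."""
    return [
        week
        for week, (actual, predicted) in weekly_results.items()
        if accuracy(actual, predicted) < min_accuracy
    ]


# Made-up monitoring log: (actual outcomes, model predictions) per week.
weekly_results = {
    "week 1": ([1, 0, 1, 1, 0], [1, 0, 1, 1, 0]),
    "week 2": ([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]),
    "week 3": ([1, 1, 0, 1, 0], [0, 1, 1, 0, 0]),
}

# Weeks returned here are candidates for scheduling a retraining.
flagged = weeks_needing_attention(weekly_results, min_accuracy=0.7)
print(flagged)
```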
What is Model Tuning or Model Hyperparameter Tuning?
Model tuning means tuning the model's hyperparameters. Hyperparameters are settings that are chosen before training rather than learned from the data, and they influence the quality of the resulting model. For example, algorithms such as logistic regression have regularization-related hyperparameters. For non-parametric models such as those trained using random forest, the hyperparameters include the number of trees, the maximum depth of each tree etc.
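The sketch below shows the idea of tuning with a simple grid search. It uses a deliberately tiny model (a line through the origin with an L2 regularization penalty) rather than logistic regression or random forest, so that it fits in a few lines. The data, the candidate penalty values and the function names are assumptions for illustration.

```python
def fit_ridge_slope(xs, ys, penalty):
    """Learn the slope of y = slope * x; `penalty` is the regularization hyperparameter."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


def mean_squared_error(xs, ys, slope):
    """Average squared gap between actual values and predictions."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data, split into a training set and a held-out validation set.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.3, 4.1, 6.4, 7.9]
valid_x, valid_y = [5.0, 6.0], [9.8, 11.9]

# Grid search: train once per candidate value, score on the validation set.
candidate_penalties = [0.0, 0.1, 1.0, 10.0]
validation_errors = {}
for penalty in candidate_penalties:
    slope = fit_ridge_slope(train_x, train_y, penalty)  # parameter, learned
    validation_errors[penalty] = mean_squared_error(valid_x, valid_y, slope)

# The tuned hyperparameter is the candidate with the lowest validation error.
best_penalty = min(validation_errors, key=validation_errors.get)
print(best_penalty, validation_errors[best_penalty])
```

Note the split of roles: the slope is a parameter learned from the training data, while the penalty is a hyperparameter picked from outside by comparing validation results.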
Machine Learning Terminologies
The following are some of the most important terms. Understanding them will help product managers / business analysts work with data science teams without much difficulty.
- Supervised learning: Machine learning problems where labels for classification or regression are present. The labels are the actual values of the outcome that needs to be predicted. Thus, models in supervised learning problems are trained with both features and labels.
- Unsupervised learning: Machine learning problems where labels (actual values relating to the predictions) are not present. Typically, clustering, i.e. grouping data into clusters, is an unsupervised learning problem.
- Regression problems: Machine learning problems where it is required to predict numerical values. For example, predicting housing prices.
- Classification problems: Machine learning problems where it is required to classify data into two or more categories. For example, deciding whether an email is spam or not.
- Clustering problems: Machine learning problems where it is required to segment data into two or more clusters.
- AutoML: Short for automated machine learning. AutoML automates the process of finding the best-performing model for a given training data set.
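To illustrate what "no labels" means in unsupervised learning, here is a small sketch that splits made-up customer spend figures into two clusters using a simplified one-dimensional version of k-means. Nobody tells the code which customers are "low" or "high" spenders; the groups emerge from the data. The function name and the data are hypothetical, and the sketch assumes the input has at least two distinct values.

```python
def two_means(values, iterations=10):
    """Split numbers into two clusters around two centres (1-D k-means, k=2)."""
    # Start with the smallest and largest value as the two centres.
    low_centre, high_centre = min(values), max(values)
    low_group, high_group = [], []
    for _ in range(iterations):
        # Assign every value to its nearest centre.
        low_group = [v for v in values if abs(v - low_centre) <= abs(v - high_centre)]
        high_group = [v for v in values if abs(v - low_centre) > abs(v - high_centre)]
        # Move each centre to the average of its group.
        low_centre = sum(low_group) / len(low_group)
        high_centre = sum(high_group) / len(high_group)
    return low_group, high_group


# Made-up monthly spend per customer; note that there are no labels.
monthly_spend = [20.0, 25.0, 200.0, 30.0, 220.0, 210.0]

low_spenders, high_spenders = two_means(monthly_spend)
print(low_spenders, high_spenders)
```

In a supervised version of the same problem, each customer would come with a known label, and the model would be trained to reproduce those labels.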