What is data science? This is a question that many people are asking, and for good reason. Data science is a relatively new field, and it covers a lot of ground. In this blog post, we will discuss what data science is, and we will give some examples of how it can be used to solve problems. Stay tuned, because by the end of this post you will have a clear understanding of what data science is and why it matters!
What is Data Science?
Before defining data science, let's first understand what science is.
Science can be defined as a systematic and logical approach to discovering how things are in the universe. It is a way of acquiring knowledge by using observation and experimentation to describe and explain natural phenomena.
What is data science?
Data science, in simple terms, can be defined as an approach to acquiring knowledge about different things by using data, algorithms, and technology. These things can belong to different business domains, including healthcare, finance, retail, banking, and insurance. At times, this knowledge is also referred to as insights or actionable insights. The familiar scientific method of setting a hypothesis, experimenting to test it, and accepting or rejecting it also applies to data science. Data science can be seen as an extension of science, but with greater power due to the leverage provided by technology: big data, elastic computing infrastructure to store and process data, programming tools, and so on.
The following are some of the key activities in data science:
- Understand problems: Breaking down problems into sub-problems can be helpful. Questioning techniques such as the 5 Whys and the Socratic method can be very helpful. Interacting with end users can also help you understand problems better.
- Identify hypotheses to work with: A hypothesis can, at times, be understood as the solution you think will solve the problem. Such hypotheses can be tested using statistical tests, by running simulations based on the outcomes of predictive or optimization models, etc. You need to test the solution, find out whether it really solves the problem, and then establish it for enterprise-wide adoption. For example, running one or more marketing campaigns for the customers who are likely to churn can help avoid customer churn. The hypotheses can be around some of the following: (A) deciding which form or forms of campaign would be most effective, and (B) identifying the customers who are likely to churn. In one simulation exercise, a set of predictive models can be used to make predictions, working with the hypothesis that those predictions are correct, e.g., that the predicted customers are indeed likely to churn. If the predictions are not effective, you will need to come up with another hypothesis that predicts another set of customers who could churn. In this case, different kinds of predictive models may need to be built and tested.
- Identify and collect data: Identify the key levers for the hypotheses/solutions and acquire the related data (internal or external). Levers are the key attributes that can impact business outcomes when acted upon. Levers can be represented in the form of raw or derived data. Most of this data can be found within the organization. However, one must not shy away from acquiring an external dataset, even at a cost, in order to avoid data-related bias.
- Perform data pre-processing, including data cleaning: Data in its original form may require some processing to prepare it for analysis.
- Perform hypothesis testing (statistical tests, KPIs, simulations, etc.): At times, one can test the hypotheses by performing statistical tests. Alternatively, one may need to track KPIs for a certain period of time before accepting the hypothesis or solution as truth. This is where a dashboard can prove very helpful. One can also use optimization techniques to prescribe the optimal solution and run the simulation for a certain period of time to establish the values of the optimization parameters as truth. One can also use predictive engines or models for estimation purposes and run the simulation exercise (take action) to measure the effectiveness of decisions taken based on the predictions. Continuous monitoring holds the key to hypothesis testing.
- Establish new truths to be adopted: As a result of hypothesis testing, or of running simulations with optimization or prediction engines, you will be able to establish the truth given the evidence in the form of data. This truth can then be adopted enterprise-wide until a new truth emerges from the ever-changing data. For example, suppose a prediction engine predicts the customers who are likely to churn and, when a particular form of marketing campaign is run on them, customer churn is found to be reduced. The predictive engine and the form of marketing can then be accepted as truth for as long as customer churn is avoided as desired.
- Continuous monitoring to align solutions/hypotheses: Once the hypotheses have been established as truth, it is of utmost importance to continuously monitor the data and align the solution to ever-changing data. For example, a customer’s behavior might change over time and, in turn, the prediction engine may require re-tuning to predict more accurately. The form of the marketing campaign that was effective earlier for customers who are likely to churn might not be as effective now and would require a different form or mix of campaigns to be run.
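To make the hypothesis-testing step more concrete, here is a minimal Python sketch of the churn campaign example. It assumes the customers predicted to churn were split into two groups, one that received the campaign and one that did not, and it uses a one-sided two-proportion z-test, which is only one of many statistical tests that could fit here. The function name `churn_campaign_z_test` and all the counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def churn_campaign_z_test(churned_control, n_control, churned_campaign, n_campaign):
    """One-sided two-proportion z-test (hypothetical helper).

    Null hypothesis: the campaign group churns at least as much as the control group.
    Alternative hypothesis: the campaign group churns less than the control group.
    Returns the z statistic and the one-sided p-value.
    """
    p_control = churned_control / n_control
    p_campaign = churned_campaign / n_campaign
    # Pooled churn rate under the null hypothesis of no difference
    pooled = (churned_control + churned_campaign) / (n_control + n_campaign)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_campaign))
    if std_err == 0:
        raise ValueError("Cannot run the test when nobody or everybody churned.")
    z = (p_control - p_campaign) / std_err
    # Probability of seeing a z at least this large if the null hypothesis holds
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Made-up counts: 120 of 1000 control customers churned,
# versus 85 of 1000 customers who received the campaign.
z, p_value = churn_campaign_z_test(120, 1000, 85, 1000)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.4f}")
```

If the p-value falls below the significance level chosen beforehand (0.05 is a common choice), the evidence supports the hypothesis that the campaign reduces churn, and it could be accepted as the working truth until new data says otherwise.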
What are the key skills of a data scientist?
A data scientist needs to be good with some of the following:
- Good knowledge of the business domain: A data scientist can get help from product managers or business analysts in this area.
- Expert knowledge of statistics: A must-have skill that enables data scientists to design hypotheses, formulate them as null and alternative hypotheses, and perform hypothesis testing in order to establish the truth to work with.
- Expert knowledge of programming: A must-have skill that enables data scientists to test hypotheses and build predictive models. Python and R are the most popular programming languages; others include Julia, Scala, and Java.
- Advanced knowledge of data visualization (desirable): Data scientists should also be adept with at least one data visualization tool, such as Tableau, QlikView, or D3.js, to communicate data insights effectively. These tools help them build visualizations quickly. They can, however, do the same thing with Python or R.
- Advanced knowledge of machine learning and optimization algorithms: Knowledge of machine learning algorithms enables data scientists to find the right patterns in data.
- Intermediate knowledge of cloud services: Knowing how to operate some of the tools on at least one of the cloud platforms, such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud, would prove very helpful in working with data. Cloud services for elastic storage and computing infrastructure help in working with vast amounts of data and processing it efficiently.
- Good knowledge of big data technology: A decent knowledge of big data technologies such as Hadoop and Spark may help a great deal.
What are the key outputs of data scientists?
Here are some of the key outputs of projects that data scientists work on:
- Hypotheses formulation and analysis
- Establish new truths supported by hypothesis testing: These new truths get adopted and adapted in the organization. The hypotheses can be propositions that are tracked on a dashboard, and the prediction and optimization outputs can be tracked as part of the simulation.
- Continuously monitor and re-align the solution to the ever-changing data
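As a rough illustration of what continuous monitoring could look like for a churn prediction engine, the Python sketch below compares the precision of each new batch of predictions against a baseline and flags the batches where the model may need re-tuning. The choice of precision as the metric, the 0.10 tolerance, the helper names `precision` and `needs_retuning`, and the monthly data are all assumptions made for this example.

```python
def precision(predicted_churn, actual_churn):
    """Share of customers flagged as likely churners who actually churned."""
    flagged = [actual for pred, actual in zip(predicted_churn, actual_churn) if pred]
    if not flagged:
        return 0.0
    return sum(flagged) / len(flagged)


def needs_retuning(baseline_precision, recent_precision, tolerance=0.10):
    """Flag the model when precision drops more than `tolerance` below the baseline."""
    return (baseline_precision - recent_precision) > tolerance


# Made-up monthly batches: (label, predicted to churn?, actually churned?)
batches = [
    ("month 1", [True, True, True, False, True], [True, True, False, False, True]),
    ("month 2", [True, True, False, True, True], [True, False, False, True, True]),
    ("month 3", [True, True, True, True, False], [False, True, False, False, True]),
]

# Assumption: the first month, when the model was accepted, serves as the baseline
baseline = precision(batches[0][1], batches[0][2])

alerts = [
    label
    for label, predicted, actual in batches
    if needs_retuning(baseline, precision(predicted, actual))
]
print("Batches that suggest re-tuning:", alerts)
```

In practice, the same kind of check could feed a dashboard, so that a drop in prediction quality prompts a fresh round of hypothesis testing.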
Data science is all about understanding problems better and testing hypotheses with data to derive insights that support well-informed decisions, thereby improving organizational performance. Thanks for reading! Do you have any specific examples or case studies in mind? Please feel free to share your thoughts in the comments section below.