In this post, you will learn about the roles of data science and data engineering teams and the key differences between them. As data science or data engineering stakeholders, it is important to understand whether you need one or both teams to achieve high-quality datasets and data pipelines as well as high-performing machine learning models.
When an organization starts building data analytics products, primarily based on predictive analytics, it typically sets up a (mostly) centralized data science team consisting of data scientists. The data science team works with one or more product teams to gather predictive analytics use cases and then builds the machine learning (ML) models. The product teams own the data, and the data science team owns the data pipelines and the specific datasets used for ML modeling.
The challenges that the data science team faces are the following:
- While data scientists are expected to create high-performing ML models, they also end up doing tasks that belong to data engineering.
- They also end up doing software engineering tasks such as integrating with products and building and deploying ML models.
While it is great to have data scientists who can handle data engineering and software engineering tasks, overall software and data quality may suffer when one team carries all three concerns: high-performing machine learning models; high-quality data and data pipelines (data engineering); and a streamlined build and deployment process for machine learning models, along with a well-defined integration protocol for serving predictions to different products.
This is why data engineering responsibilities need to be clearly defined and segregated. The goal is to have a separate data engineering team that works alongside the data science team.
Data Science Team Responsibilities
A data science team made up of data scientists would have some of the following as its key responsibilities:
- Training / testing machine learning models. This involves sub-tasks such as data preparation, exploratory data analysis, hypothesis testing, feature engineering / selection, model / algorithm selection (including hyper-parameter tuning), and model training
- Model performance monitoring
- Model retraining
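As a rough illustration of how the monitoring and retraining responsibilities connect, here is a minimal sketch using only the Python standard library. The function names, the baseline accuracy of 0.90, and the 0.05 tolerance are assumptions made for this example and are not taken from any specific tool; in practice the metric and the threshold would depend on the use case.

```python
from dataclasses import dataclass


@dataclass
class MonitoringResult:
    baseline_accuracy: float
    recent_accuracy: float
    needs_retraining: bool


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def check_model(baseline_accuracy, y_true, y_pred, max_drop=0.05):
    """Flag the model for retraining if accuracy fell by more than max_drop."""
    recent = accuracy(y_true, y_pred)
    return MonitoringResult(
        baseline_accuracy=baseline_accuracy,
        recent_accuracy=recent,
        needs_retraining=(baseline_accuracy - recent) > max_drop,
    )


# Hypothetical labels and predictions collected after deployment.
recent_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
recent_predictions = [1, 0, 0, 1, 0, 0, 0, 1, 1, 1]

result = check_model(
    baseline_accuracy=0.90,  # Assumed accuracy measured at training time.
    y_true=recent_labels,
    y_pred=recent_predictions,
)
print(result)
```

Here 7 of the 10 recent predictions are correct, so the drop from the assumed baseline exceeds the tolerance and the model is flagged for retraining.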
The following are some of the key skills required of data scientists:
- Strong knowledge of mathematics and statistics concepts
- Good understanding / experience with machine learning algorithms
- Decent understanding of business domain concepts
- Data storytelling skills
- Knowledge of cloud services (AWS, Azure, Google Cloud, etc.) related to training machine learning models. For example, Amazon SageMaker Studio
Data science team structure: Most companies set up a centralized (horizontal) data science team with small sub-teams / PODs dedicated to building / training machine learning models for different products.
Data Engineering Team Responsibilities
A data engineering team could have data engineers and architects who do some of the following:
- Data modeling: Create / maintain data models used for training machine learning models. The data model design can follow different topologies, such as designing data for training one individual model, or designing data that represents a business domain (and one or more related products).
- Data pipelines: Create / maintain data pipelines using different data / big data / ETL technologies. These pipelines move data from one type of data storage (internal or external) to another while processing it at regular / scheduled intervals. In addition, data engineers are also required to perform tasks related to data encryption / decryption (data security).
- Data warehouse solutions: Design / develop data warehouse solutions that host the data used for building machine learning models.
- Software engineering: Create / maintain software engineering frameworks that support the data models / data pipelines. In addition, data engineers may also be required to create / maintain build / deployment pipelines for machine learning models.
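The data pipeline responsibility above can be sketched as a small extract, transform, load (ETL) run. This is only an illustration under stated assumptions: the CSV columns, the `orders` table, and the function names are hypothetical, an in-memory string stands in for the source system, and SQLite stands in for the target data store. A real pipeline would use the organization's own ETL tooling and a scheduler.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a product database (assumed columns).
RAW_CSV = """order_id,customer_id,amount
1,c1,10.50
2,c2,
3,c1,4.25
4,c3,not_a_number
"""


def extract(raw_text):
    """Read rows from the source system (here, an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Keep rows with a valid numeric amount and cast columns to their types."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue  # Skip rows with missing or malformed amounts.
        clean.append((int(row["order_id"]), row["customer_id"], amount))
    return clean


def load(conn, rows):
    """Write the cleaned rows to the target store (here, a SQLite table)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer_id TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn, raw_text):
    """One pipeline run; a scheduler would call this at regular intervals."""
    rows = transform(extract(raw_text))
    load(conn, rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = run_pipeline(conn, RAW_CSV)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(loaded, total)
```

Using `INSERT OR REPLACE` keyed on `order_id` makes a repeated run over the same data idempotent, which is one common design choice for pipelines that run on a schedule.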
The following are some of the key skills of data engineering team members:
- Data warehouse solutions: Strong knowledge / experience of designing / developing data warehouse solutions that host the data used for building machine learning models.
- Big data technologies: Strong knowledge of data processing tools and frameworks, including big data technologies such as Spark.
- Cloud services for data processing: Given that most workloads are now built and deployed in the cloud, data engineers need knowledge / experience of cloud services for big data processing. For example, Amazon EMR
- Data pipelines: Knowledge / experience of working with software engineering tools and frameworks to create data pipelines. The goal is to achieve data quality checks and anomaly detection while ensuring data security. Knowledge of ETL tools and data encryption / decryption proves to be beneficial.
- Build / Deployment: Knowledge / experience of working with cloud services that can host machine learning models (for example, Amazon ECS, Azure Kubernetes Service, Google Kubernetes Engine)
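To make the data quality checks and anomaly detection mentioned under the data pipelines skill more concrete, here is a minimal sketch using only the Python standard library. The field names, the sample rows, and the z-score threshold of 2.0 are assumptions chosen for illustration; a z-score rule is just one simple way to flag outliers and is not necessarily what a given team would use.

```python
import statistics


def quality_issues(rows, required_fields):
    """Return (row_index, field) pairs where a required field is missing or empty."""
    issues = []
    for i, row in enumerate(rows):
        for field in required_fields:
            if row.get(field) in (None, ""):
                issues.append((i, field))
    return issues


def anomalies(values, threshold=2.0):
    """Return indexes of values more than `threshold` sample standard
    deviations away from the mean (a simple z-score rule)."""
    if len(values) < 2:
        return []
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


# Hypothetical batch arriving in a pipeline (assumed field names).
rows = [
    {"user_id": "u1", "daily_events": 100},
    {"user_id": "u2", "daily_events": 102},
    {"user_id": "u3", "daily_events": 98},
    {"user_id": "", "daily_events": 101},  # Missing user_id.
    {"user_id": "u5", "daily_events": 99},
    {"user_id": "u6", "daily_events": 100},
    {"user_id": "u7", "daily_events": 103},
    {"user_id": "u8", "daily_events": 97},
    {"user_id": "u9", "daily_events": 100},
    {"user_id": "u10", "daily_events": 500},  # Unusually large value.
]

missing = quality_issues(rows, ["user_id", "daily_events"])
outliers = anomalies([row["daily_events"] for row in rows])
print(missing, outliers)
```

In this sample batch, the check reports the empty `user_id` in the fourth row and flags the last row's event count as an outlier. A pipeline could reject, quarantine, or alert on such rows before loading them.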
Data engineering team structure: A data engineering team can have one of the following structures:
- If the organization is very large with multiple product teams, one could have small data engineering teams reporting to the product teams, plus one centralized data engineering team responsible for maintaining architectural standards, best practices, and tools & frameworks R&D / POCs in relation to data.
- If the organization is small / medium, one could have a centralized data engineering team with small sub-teams doing data engineering tasks for specific products and one sub-team responsible for architecture standards and best practices.