Big Data – Functional & Technology Architecture for Beginners

This article presents a view relating the key functional areas of a Big Data reference architecture to the relevant technologies. The diagram and the accompanying description should be of use to Big Data beginners (developers, architects, business analysts, etc.) who want a high-level view of the functional and technology aspects of Big Data. Please feel free to comment or suggest anything important that I may have missed.
The following diagram represents the functional and technology landscape view of Big Data. Its objectives are the following:

  • Associate functional areas with technologies
  • Reflect demand vs. supply: boxes/text in green represent skills that are readily available, while red represents skills that are difficult to find
  • Reflect the difficulty of each functional/technology area as easy, intermediate, or difficult (green, amber, red)
[Figure: Big Data Functional Technology Architecture]

The diagram above shows the two core areas of Big Data, listed below. We shall look into the technology as well as the people aspects of each of these core areas in detail later in this article.

  • Data Engineering
  • Data Science

Data Engineering

Data engineering includes the following key functional areas, with the key technologies for each mentioned alongside:

  • Collect Data: This is about collecting or gathering data from different data sources (internal or external). For example, data could be collected from one or more RDBMS databases, or it could be streaming data such as log files. Technologies such as the following can be used to collect data (see the first sketch after this list):
    • Sqoop
    • Flume
    • Scribe
    • Storm
  • Store Data: Once collected, data needs to be stored for further processing. Different technologies (frameworks) such as the following can be used to handle data storage (also illustrated in the first sketch after this list):
    • HDFS (Hadoop Distributed File System)
    • HBase (NoSQL datastore)
    • MongoDB (NoSQL datastore)
    • Cassandra (NoSQL datastore)
    • CouchDB (NoSQL datastore)
  • Transform, Simplify and Analyze Data: Once gathered and stored, the data needs to be processed further, transforming it into forms suitable for analytics. Hadoop MapReduce jobs are run on the stored data, and the results are written to datastores such as HBase (see the MapReduce sketch after this list). From there, the data analysis phase starts, in which tools such as the following come into the picture (a Hive example also follows):
    • Hive
    • Pig
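To make the collect and store steps concrete, below is a minimal Python sketch that tails an application log file and appends each batch of new lines to a file in HDFS over WebHDFS, using the hdfs package from PyPI. It is only a toy stand-in for what Flume or Scribe do at scale; the NameNode URL, user, and file paths are hypothetical.

```python
# Toy collect-and-store sketch: tail a local log file and append new
# lines to a file in HDFS over WebHDFS. A stand-in for what a
# Flume/Scribe pipeline does at production scale.
# Assumes `pip install hdfs`; the endpoint, user, and paths below
# are hypothetical.
import time

from hdfs import InsecureClient

client = InsecureClient('http://namenode.example.com:50070', user='hadoop')

def tail_to_hdfs(local_path, hdfs_path):
    with open(local_path) as f:
        f.seek(0, 2)  # start at the end of the file, like `tail -f`
        while True:
            lines = f.readlines()
            if lines:
                # append=True assumes the target file already exists in HDFS
                client.write(hdfs_path, data=''.join(lines),
                             encoding='utf-8', append=True)
            else:
                time.sleep(1)  # wait for new log lines

tail_to_hdfs('/var/log/myapp/app.log', '/data/raw/logs/app.log')
```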
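The transform step is typically implemented as a MapReduce job. Below is the classic word-count pair written for Hadoop Streaming, which lets you write the mapper and reducer as plain Python scripts that read stdin and write stdout; the file names and paths are illustrative.

```python
#!/usr/bin/env python
# mapper.py -- Hadoop Streaming feeds input splits to stdin;
# emit one "word<TAB>1" record per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t1' % word)
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop Streaming sorts mapper output by key before
# the reduce phase, so all counts for a word arrive contiguously.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, _, n = line.rstrip('\n').partition('\t')
    if word == current_word:
        count += int(n)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, count))
        current_word, count = word, int(n)
if current_word is not None:
    print('%s\t%d' % (current_word, count))
```

Such a job is submitted with the hadoop-streaming jar, e.g. `hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -input /data/raw/logs -output /data/wordcount -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (paths are illustrative).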
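On the analysis side, Hive lets you query data sitting in HDFS with SQL-like statements. Here is a minimal sketch using the PyHive client; the HiveServer2 host, table, and column names are hypothetical.

```python
# Minimal Hive query sketch using PyHive (`pip install pyhive`).
# The HiveServer2 host, table, and column names are hypothetical.
from pyhive import hive

conn = hive.Connection(host='hiveserver.example.com', port=10000,
                       username='hadoop', database='default')
cursor = conn.cursor()
# Query an (assumed) table produced by the word-count job above
cursor.execute('SELECT word, total FROM wordcount ORDER BY total DESC LIMIT 10')
for word, total in cursor.fetchall():
    print(word, total)
```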

All of the above tasks require a data engineer with good knowledge of the Hadoop technology stack. One may note that this part is comparatively easier than data science.

Data Science

Once the data engineering phases are done, the data analysis phase starts, in which some of the following technologies (frameworks) come in very handy:

  • R programming language
  • Mahout
  • Pig
  • Hive
  • Java/Python libraries

The person working in the data analysis phase needs to be strong in the following skills (see the sketch after this list):

  • Machine learning algorithms
  • Mathematics & statistics
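As a flavour of the machine learning part of the work, here is a minimal supervised-learning sketch in Python using scikit-learn on its bundled Iris dataset; the choice of dataset and model is purely illustrative.

```python
# Minimal supervised-learning sketch with scikit-learn
# (`pip install scikit-learn`). Dataset and model choice are
# illustrative, not a recommendation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small labelled dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit a simple classifier and evaluate it on the held-out data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print('accuracy:', accuracy_score(y_test, model.predict(X_test)))
```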

Such a person is often called a “Data Scientist” and is very much in demand, as finding someone with the above skills is a difficult task.

Ajitesh Kumar

I have recently been working in the area of data analytics, including data science and machine learning / deep learning. I am also passionate about different technologies, including programming languages such as Java/JEE, Javascript, Python, R, Julia, etc., and technologies such as Blockchain, mobile computing, cloud-native technologies, application security, cloud computing platforms, big data, etc. I would love to connect with you on LinkedIn. Check out my latest book, titled First Principles Thinking: Building winning products using first principles thinking.