Big Data – Top 6 Frameworks Required to Get Started

This article lists the six software frameworks (or tools) that I consider the bare minimum for getting started with Big Data proof-of-concept (POC) projects. It may be of interest to those who are new to Big Data and want to understand which tools and frameworks they need to set up first. I am sure more could be added to this list, but my objective is to cover only the minimum set. Please feel free to comment or suggest if I have missed any important points.
The following are the key functional areas that need to be taken care of in order to get started with Big Data POC projects:

  • Data processing
  • Data storage
  • Data coordination
  • Data access
  • Data transfer to/from RDBMS

The following is the minimum set of tools that needs to be installed and configured in your Big Data environment to take care of the above-mentioned functional areas:

  • HDFS
  • Hadoop – Map/Reduce
  • HBase
  • ZooKeeper
  • Hive or Pig or R
  • Sqoop or Flume

 

HDFS – Hadoop Distributed File System

The first challenge in processing large amounts of data is storing it. Say you are crawling the Web for a particular type of data set, and there are tons of data out there. Once crawled, given that the data is Big (volume, variety, velocity), RDBMS-based databases may not be sufficient to store and process such large data sets. This is where HDFS comes into the picture. HDFS is part of the Apache Hadoop project, whose other key part is Map/Reduce. As per the Hadoop page on HDFS, "The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware." Another good definition, from the Hortonworks page on HDFS, describes it as a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers. HDFS, MapReduce, and YARN form the core of Apache Hadoop. If you are starting on Big Data projects, this is one technology you need to be very familiar with.
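
To make the idea of spanning a file across commodity servers a little more concrete, here is a toy Python sketch of the two things HDFS does with a file: it cuts the file into fixed-size blocks and keeps several copies of each block on different nodes. This is not the HDFS API. The tiny block size, the replication factor, the node names, and the round-robin placement are all illustrative assumptions; real HDFS uses much larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int) -> dict:
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplification: real HDFS placement also takes racks into account.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        # Start one node further along for each block so copies spread out.
        placement[block_id] = [
            nodes[(block_id + k) % len(nodes)] for k in range(replication)
        ]
    return placement


# Hypothetical 10-byte "file" and four-node cluster, for illustration only.
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], 3)
```

With this placement, losing any single node still leaves at least two copies of every block, which is the intuition behind the "reliable" part of the definition above.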

Further explanation of HDFS is out of the scope of this blog.

 

Hadoop Map/Reduce

Hadoop Map/Reduce is part of the Apache Hadoop project and is primarily used to process data. Apart from HDFS, described above, it is one of the most important technologies for getting started with Big Data projects. Note that Hadoop Map/Reduce and HDFS are typically deployed on the same set of servers: both run on each box rather than on separate boxes. This is why it is often said that Map/Reduce moves the “compute” to the data and not the other way around. In traditional programming, we first retrieve the data from remote databases and then process it. In the Hadoop world, Map/Reduce, being co-located with HDFS, knows where the data is stored and moves the task to the HDFS data node that holds it.
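
For readers who have not seen the programming model itself, the following is a minimal single-process Python sketch of the classic word-count job. It only imitates the three stages (map, shuffle, reduce) in memory; real Hadoop jobs are typically written against the Hadoop Java API and run distributed across the cluster. The function names and the sample input are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


# Hypothetical two-line input.
counts = word_count(["big data is big", "compute moves to data"])
```

In a real cluster, each mapper would run on a node that already holds its block of input, which is the "move the compute to the data" idea described above.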

 

HBase

HBase is one of the key elements of any Big Data technology architecture. It serves the need for real-time read/write access to data stored on HDFS, including from Hadoop Map/Reduce jobs. It powers several data-driven websites, including the likes of Facebook. HBase is a distributed, column-oriented database that runs on top of HDFS. It can be accessed in any of the following ways:

  • Java API
  • REST
  • Apache Avro
  • Thrift APIs
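
To give a feel for what "column-oriented" means here, the sketch below is a toy in-memory model of an HBase-style table in Python. It is not the HBase client API: the class name `ToyTable`, its methods, and the sample rows are hypothetical. It only imitates three properties of the data model: columns are addressed as `family:qualifier` within predeclared column families, each cell keeps timestamped versions, and rows are scanned in row-key order.

```python
import time


class ToyTable:
    """In-memory stand-in for an HBase-style table (illustration only)."""

    def __init__(self, families):
        self.families = set(families)  # column families are declared up front
        self.rows = {}  # row key -> {"family:qualifier": [(ts, value), ...]}

    def put(self, row_key, column, value, ts=None):
        """Store a new version of one cell."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        ts = time.time_ns() if ts is None else ts
        versions = self.rows.setdefault(row_key, {}).setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda version: version[0], reverse=True)  # newest first

    def get(self, row_key, column):
        """Return the newest value of a cell, or None if the cell is absent."""
        versions = self.rows.get(row_key, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start_row, stop_row):
        """Yield (row_key, newest cells) for start_row <= key < stop_row, in key order."""
        for key in sorted(self.rows):
            if start_row <= key < stop_row:
                yield key, {col: vers[0][1] for col, vers in self.rows[key].items()}


# Hypothetical usage: two versions of one cell, plus a second row.
users = ToyTable(["info"])
users.put("user1", "info:name", "Ann", ts=1)
users.put("user1", "info:name", "Anna", ts=2)
users.put("user2", "info:name", "Bob", ts=1)
```

Reading `users.get("user1", "info:name")` returns the newer of the two versions, while the older one stays available in the version list.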


 

ZooKeeper

ZooKeeper handles the coordination required between software such as Hadoop Map/Reduce, HBase, and HDFS when processing Big Data.
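
"Coordination" is easier to picture with an example. One common pattern built on this kind of service is leader election: every candidate registers a node that receives an increasing sequence number, the holder of the lowest number acts as leader, and a node disappears when its owner's session ends. The Python sketch below imitates that pattern in memory. It is not a ZooKeeper client; the class `ToyCoordinator` and the process names are hypothetical.

```python
class ToyCoordinator:
    """In-memory imitation of leader election with sequential, session-bound nodes."""

    def __init__(self):
        self._next_seq = 0
        self._members = {}  # sequence number -> client name

    def join(self, client):
        """Register a client and hand back its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._members[seq] = client
        return seq

    def leave(self, seq):
        """Drop a client's node, as would happen when its session ends."""
        self._members.pop(seq, None)

    def leader(self):
        """The client holding the lowest sequence number is the leader."""
        if not self._members:
            return None
        return self._members[min(self._members)]


# Hypothetical usage: two candidate master processes.
coordinator = ToyCoordinator()
first = coordinator.join("master-a")
second = coordinator.join("master-b")
leader_before = coordinator.leader()
coordinator.leave(first)  # master-a crashes or disconnects
leader_after = coordinator.leader()
```

When the first process goes away, leadership passes to the next one without the two processes ever talking to each other directly; the shared service is the only thing they need to agree on.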

 

Hive or Pig or R

Apache Pig is a platform for writing data analysis programs over large data sets stored in Hadoop clusters; its scripting language is called Pig Latin. Apache Hive provides an SQL-like language (HiveQL) for querying data stored in HDFS. Alternatively, one could use the R language (console-based) to do the data analysis.
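
To show the kind of question an SQL-like language lets you ask, the sketch below runs a simple aggregate query in Python. Note the swap: it uses the standard-library `sqlite3` module as a local stand-in, not Hive, so nothing here touches HDFS. The table name, column names, and sample rows are hypothetical; the point is only that a Hive user would express this analysis as a short declarative query rather than as hand-written Map/Reduce code.

```python
import sqlite3

# Hypothetical page-view records; in a Hive setup these would live in files on HDFS.
rows = [("home", "alice"), ("home", "bob"), ("pricing", "alice")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, user_name TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)

# A plain aggregate query: count the views per page, most viewed first.
query = """
    SELECT page, COUNT(*) AS views
    FROM page_views
    GROUP BY page
    ORDER BY views DESC, page
"""
views_per_page = conn.execute(query).fetchall()
conn.close()
```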

 

Apache Sqoop or Apache Flume

Apache Sqoop transfers data between an RDBMS and HDFS in both directions. Apache Flume handles streaming data and moves it into a Hadoop cluster. When you are starting with Big Data, you could choose one of them depending on whether you are going to extract data from an RDBMS-based system (Sqoop) or from a source of streaming data, such as a log server (Flume).
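
As a rough illustration of the RDBMS-to-HDFS direction, the Python sketch below assembles the argument list for a basic `sqoop import` run. The helper function, the JDBC URL, the user name, the table, and the HDFS target directory are all hypothetical placeholders; only the command-line options themselves (`--connect`, `--username`, `-P`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop options. The script only builds the command and does not execute it.

```python
import shlex


def sqoop_import_args(jdbc_url, username, table, target_dir, num_mappers=4):
    """Build the argument list for a basic `sqoop import` run."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",  # prompt for the password instead of passing it inline
        "--table", table,
        "--target-dir", target_dir,  # HDFS directory that receives the rows
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


args = sqoop_import_args(
    jdbc_url="jdbc:mysql://db.example.com/shop",  # hypothetical database
    username="etl_user",                          # hypothetical account
    table="orders",                               # hypothetical table
    target_dir="/data/shop/orders",               # hypothetical HDFS path
)
command_line = shlex.join(args)  # shell-safe string form, e.g. for logging
```

On a machine where Sqoop is installed and configured, the list could be handed to `subprocess.run(args, check=True)` to perform the actual import.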

 

Ajitesh Kumar