- Data processing
- Data storage
- Data coordination
- Data access
- Data transfer to/from RDBMS
The following is the minimum set of tools that you need to install and configure in your Big Data environment to cover the functional areas listed above.
- Hadoop – Map/Reduce
- Hive or Pig or R
- Sqoop or Flume
HDFS – Hadoop Distributed File System
The first challenge in processing a large amount of data is storing it. Say you are crawling the Web for a particular type of data set, and there are tons of data out there. Once crawled, given that the data is Big (volume, variety, velocity), RDBMS-based databases may not be sufficient to store and process such large data sets. This is where HDFS comes into the picture.
HDFS is part of the Apache Hadoop project, whose other key part is Map/Reduce. As per the Hadoop page on HDFS, "The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware." Another good definition, found on the Hortonworks page on HDFS, describes the Hadoop Distributed File System (HDFS) as a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers. HDFS, MapReduce, and YARN form the core of Apache Hadoop. If you are starting on Big Data projects, this is one technology you need to be very familiar with.
You may want to check the following video for a quick introduction to HDFS. It may not be the best video on HDFS, but I liked it at the time of writing this blog.
Further explanation of HDFS is out of the scope of this blog.
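To make the idea of storage spread over many commodity servers a bit more concrete, here is a minimal sketch in plain Python. It does not talk to a real cluster. It assumes the 128 MB block size and replication factor of 3 that are the defaults in Hadoop 2 and later, and it uses a simple round-robin placement that is purely my own simplification (real HDFS placement is rack-aware). The function and node names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed default block size (Hadoop 2 and later)
REPLICATION = 3                 # assumed default replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) pairs, one per block of the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    This is a toy policy for illustration only, not the HDFS placement policy.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for i in range(num_blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


# A hypothetical 300 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(300 * 1024 * 1024)
layout = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

With these numbers the file ends up as three blocks (two full ones and a 44 MB remainder), and losing any single node still leaves two copies of every block.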
Hadoop Map/Reduce is part of the Apache Hadoop project and is primarily used to process the data. To get started with Big Data projects, this is one of the most important technologies apart from HDFS, described above. Note that Hadoop Map/Reduce and HDFS are deployed on the same set of servers: a single box runs both Map/Reduce and HDFS, rather than each running on separate boxes. In the video below, it is nicely said that Map/Reduce moves the "compute" to the data and not the other way around. In traditional programming, we first retrieve the data from remote databases and then process it. In the Hadoop world, Map/Reduce, because it is co-located with HDFS, knows where the data is and can therefore move the task to the HDFS data node that holds it.
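The two ideas in this section, the map/shuffle/reduce flow and moving the task to the node that holds the data, can be sketched in plain Python. This is an in-memory illustration using the classic word count, not the Hadoop API; all function names are hypothetical, and `pick_node` is a deliberately naive stand-in for what a real scheduler does.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


def pick_node(block_replicas, idle_nodes):
    """Prefer an idle node that already stores the block ("move compute to data").

    Falls back to any idle node, or None if no node is idle.
    """
    for node in idle_nodes:
        if node in block_replicas:
            return node
    return idle_nodes[0] if idle_nodes else None


print(word_count(["big data", "big cluster"]))  # {'big': 2, 'data': 1, 'cluster': 1}
```

On a real cluster the map calls run in parallel on many nodes and the shuffle moves data over the network, but the shape of the computation is the same.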
HBase is one of the key elements of any Big Data technology architecture. It serves the need for reading and writing data in HDFS in real time, including from Hadoop Map/Reduce jobs. It serves several data-driven websites, including the likes of Facebook. HBase is a distributed, column-oriented database that runs on top of HDFS. It can be accessed in any of the following ways:
- Java API
- Apache Avro
- Thrift APIs
The following video can give you a head start with HBase.
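If "column-oriented" sounds abstract, the following toy model may help. It is a plain Python sketch of the HBase data model as I understand it (rows sorted by row key, columns addressed as family:qualifier, each cell versioned by timestamp), and not the HBase Java API or any client library. The class name `ToyTable` and the sample row keys are hypothetical.

```python
import time
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase-style table (illustration only)."""

    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        # row key -> "family:qualifier" -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts=None):
        """Write one cell version; the column must belong to a known family."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        ts = time.time_ns() if ts is None else ts
        cells = self.rows[row][column]
        cells.append((ts, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)

    def get(self, row, column):
        """Return the newest value of a cell, or None if it does not exist."""
        cells = self.rows.get(row, {}).get(column)
        return cells[0][1] if cells else None

    def scan(self, start_row, stop_row):
        """Yield (row key, newest values) for start_row <= key < stop_row, in key order."""
        for row in sorted(self.rows):
            if start_row <= row < stop_row:
                yield row, {col: cells[0][1] for col, cells in self.rows[row].items()}


table = ToyTable(["info"])
table.put("user1", "info:name", "Ann", ts=1)
table.put("user1", "info:name", "Anna", ts=2)  # a newer version of the same cell
```

Reading `info:name` for `user1` now returns the newer value, while the older version is still stored underneath it.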
ZooKeeper handles the coordination required between software such as Hadoop Map/Reduce, HBase, and HDFS when processing Big Data.
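One common example of such coordination is leader election: several processes agree on which of them is in charge, and a new leader takes over when the current one goes away. The sketch below is an in-memory Python stand-in for that idea, loosely modelled on the well-known pattern in which each participant gets an increasing sequence number and the lowest live number wins. It is not the ZooKeeper client API, and the class and method names are hypothetical.

```python
class ToyCoordinator:
    """In-memory illustration of leader election by lowest sequence number."""

    def __init__(self):
        self._next_seq = 0
        self._members = {}  # sequence number -> client name

    def join(self, client):
        """Register a client and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._members[seq] = client
        return seq

    def leave(self, seq):
        """Remove a client, as would happen when its session ends."""
        self._members.pop(seq, None)

    def leader(self):
        """The live member with the lowest sequence number leads."""
        if not self._members:
            return None
        return self._members[min(self._members)]


coordinator = ToyCoordinator()
first = coordinator.join("master-a")
second = coordinator.join("master-b")
coordinator.leave(first)  # "master-b" is now the leader
```

In a real deployment the coordinator itself is replicated across several servers so that it does not become a single point of failure.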
Hive or Pig or R
Apache Pig is a platform whose scripting language, Pig Latin, is used to write data analysis programs for large data sets stored in Hadoop clusters. Apache Hive provides an SQL-like language (HiveQL) for querying data stored in HDFS. Alternatively, one could use the R language from its console to do the data analysis.
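To give a feel for the kind of question these tools answer, here is a small Python sketch of a typical aggregation. The table `page_views` and its rows are made up for illustration; in Hive the same question would be phrased as an SQL-like query along the lines of the one quoted in the docstring, and the tool would run it over data in HDFS instead of an in-memory list.

```python
from collections import Counter

# Hypothetical rows of a table page_views(user, page).
PAGE_VIEWS = [
    ("ann", "/home"),
    ("bob", "/home"),
    ("ann", "/about"),
]


def views_per_page(rows):
    """Count views per page.

    Roughly what this SQL-like query asks for:
        SELECT page, COUNT(*) FROM page_views GROUP BY page
    """
    return Counter(page for _user, page in rows)


counts = views_per_page(PAGE_VIEWS)
```

The point of Hive and Pig is that you describe the result you want at this level, and the tool takes care of turning it into jobs that run on the cluster.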
Apache Sqoop or Apache Flume
Apache Sqoop transfers data between an RDBMS and HDFS in both directions. Apache Flume handles streaming data and moves it into the Hadoop cluster. When you are starting with Big Data, you can choose one of them depending on whether you are going to extract data from an RDBMS-based system (Sqoop) or from a streaming data source such as a log server (Flume).
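As a rough picture of what an RDBMS-to-Hadoop import does, the sketch below reads every row of a relational table and writes it out as comma-delimited text, which is the kind of file you would then find in HDFS. It uses Python's built-in sqlite3 and csv modules as stand-ins for the real database and for HDFS; it does not use Sqoop itself, and the `customers` table and `export_table` function are hypothetical.

```python
import csv
import io
import sqlite3


def export_table(conn, table, out):
    """Write all rows of `table` to `out` as comma-delimited text; return the row count.

    The table name is interpolated directly into the query, which is acceptable
    for a sketch but not for untrusted input.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out, lineterminator="\n")
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


# An in-memory database standing in for the source RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

# A string buffer standing in for a file in HDFS.
buffer = io.StringIO()
exported = export_table(conn, "customers", buffer)
```

A real import splits the table across several parallel tasks and writes one file per task, but each file has this same one-row-per-line shape.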