This article presents beginner-friendly notes on how distributed computing works using Hadoop. As Hadoop is inspired by Google's GFS, Map-Reduce, and BigTable papers, I have tried to refer to GFS/Map-Reduce/BigTable in this article wherever appropriate. One must note that the distributed computing paradigm has become mainstream given the large-scale Big Data projects under way in several companies. Please feel free to point out any discrepancies in my understanding and help me correct the mistakes.
Simply speaking, distributed computing refers to the computing paradigm in which processing happens on multiple different boxes holding the data, and the results are then aggregated appropriately to produce the final output. In the traditional computing paradigm, by contrast, the data is gathered from different data sources, brought to the application server running the program, and then processed by the program running on that application server. In the distributed computing paradigm, the computation is moved to the boxes which store the data. The instructions for execution are sent/scheduled appropriately to all of the boxes that hold both the computation unit (jar files) and the data.
Following are key components of distributed computing accomplished using Hadoop:
- Key Technologies Enabling Distributed Computing
- Multiple Physical/Virtual Boxes
- Distributed Storage (Filesystem): Distribute the Data (GFS/HDFS)
- Distributed Processing: Distribute the Computation (Map-reduce)
- Distributed Database for Random IO (HBase or BigTable)
- Coordination of Distributed Applications
Key Technologies Enabling Distributed Computing
Traditional technologies (tools & frameworks) could not meet the requirements of distributed computing. That gave birth to the following technologies, which helped the community at large adopt distributed computing in a fairly easy manner:
- Hadoop Distributed FileSystem (HDFS). It is inspired by Google File System (GFS)
- Hadoop Map-Reduce. It is inspired by Google Map-Reduce.
- HBase. It is inspired by Google BigTable
- Zookeeper for coordination of distributed computing. Its Google counterpart is the Chubby distributed lock service.
Multiple Physical/Virtual Boxes
Multiple servers (physical or virtual boxes) host both the computation units and the data. These boxes host the following types of processes (each running in its own JVM):
- Data Storage Unit (represented as a JVM) which processes the instructions for writing/updating the data on the box.
- Data Computation Unit (a set of jar files) running in a separate JVM from the data storage unit.
- Distributed database for random reads and writes, running in a separate JVM.
- Distributed computing coordination service running in a separate JVM.
Distributed Storage (Filesystem): Distribute the Data (GFS/HDFS)
GFS stands for Google File System and HDFS stands for Hadoop Distributed File System. HDFS is designed based on the paper published on GFS and has a similar architecture. Both of them represent large-scale distributed filesystems. The data is broken into chunks and spread across different servers (boxes). This is unlike a traditional filesystem such as the Network File System (NFS), where the entire data is stored on a centralized server and accessed from that server. This approach was pioneered by Google in the form of GFS. GFS, in brief, consists of the following key components:
- Master server: This is the server which takes care of concerns such as storing the metadata of the data stored on the different chunk servers, coordinating different forms of data mutation (writes/updates), namespace information, etc. To ensure high availability, the master server is provided redundancy using what are termed shadow masters. One other key aspect of the master server is the operation log, which records the metadata changes for the data chunks written to the different chunk servers.
- Chunk servers: These are the servers which are responsible for reading and writing large chunks of data. Data chunks are replicated on 3 machines by default, with the master server ensuring that the replicas exist.
GFS reads and writes data in the following manner:
- Read data from GFS: A GFS client wanting to read data from GFS gets the information about the chunk servers (hosting the relevant data chunks) from the master server and then contacts those chunk servers directly to retrieve the data.
- Write/Update data in GFS: Once the information about the chunk servers (primary and secondary replicas) is obtained from the master, the client pushes the data to all the replicas, and the write is then committed on the primary and secondary replicas.
HDFS was written taking inspiration from GFS. Following are the key components of HDFS:
- Name Node: Like the master server in GFS, there is a Name Node which stores the metadata about the data blocks (data chunks in GFS) stored on the data nodes.
- Data Node: Like the chunkservers in GFS, there are multiple Data Nodes which hold the data blocks (data chunks in GFS). A minimal client-side sketch of reading and writing a file on HDFS follows this list.
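To make the Name Node / Data Node interaction concrete, here is a minimal sketch (not a production-ready program) using the standard Hadoop FileSystem Java API to write and then read back a small file. The NameNode address and file path are assumptions for illustration only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        // The NameNode address below is hypothetical; point fs.defaultFS at your own cluster.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hello.txt"); // hypothetical path

        // Write: the client asks the Name Node where blocks should go, then streams bytes to Data Nodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello distributed world".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client again consults the Name Node for block locations, then reads from Data Nodes.
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buffer = new byte[1024];
            int read = in.read(buffer);
            System.out.println(new String(buffer, 0, read, StandardCharsets.UTF_8));
        }

        fs.close();
    }
}
```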
Distributed Processing: Distribute the Computation (Map-reduce)
The different servers hosting the data also host the computation units (jar files) that run in JVMs on those servers. This approach was pioneered by Google in the form of the map-reduce programming paradigm. There are the following two key aspects of the map-reduce programming paradigm (a minimal word-count sketch follows the list):
- Map: The chunkservers host what are termed Mapper programs. The mapper program executes the instructions and produces output in the form of key-value pairs. These key-value pairs may then be locally combined into distinct key-listOfValues pairs; this optional step is often termed the Combiner.
- Reduce: The mapped/combined data is then shuffled and moved to the servers running the Reduce program, often termed reducer servers, where it is finally reduced to produce the final output.
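Here is a minimal sketch of the classic word-count job using the Hadoop MapReduce Java API, assuming the input and output paths are supplied on the command line; it is illustrative rather than a tuned, production-ready job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts for each word; also reused as the combiner for local aggregation.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // the optional Combiner step mentioned above
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```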
Distributed Database for Random IO (BigTable/HBase)
The distributed filesystem (GFS or HDFS) is optimized for write operations wherein the data is appended to an existing data chunk or block, and these data chunks or blocks can be streamed for reading. However, in real-world scenarios, what is often needed is low-latency random reads and writes. This is what is achieved using a distributed database system termed BigTable (pioneered by Google) or HBase (open source; similar to BigTable). These distributed databases can handle petabytes of data. The coolest thing about these database systems is auto-sharding, which allows tables to be dynamically redistributed as the data size becomes too large. Let's take a quick look at both BigTable and HBase. Following are the key components of BigTable:
- Tablet: A table is represented as one or more tablets.
- Tablet Server: One or more tablets reside on a server that also hosts the data accessible via the chunkserver and also hosts the map-reduce computation unit. From that perspective, these servers are also called Tablet Servers. Thus, the same server hosts the following three components:
- Chunkserver hosting the data
- Map-reduce computation unit
- Tablet server consisting of multiple tablets, each of which consists of one or more SSTables which hold the data chunks
- SSTable: An SSTable represents an immutable, sorted file of key-value pairs. It holds one or more data chunks along with an index. One or more SSTables form a Tablet.
Following are key components of HBase:
- Regions: The table data resides in one or more logical units called Regions. A region is essentially a contiguous, sorted range of rows that are stored together.
- Region Servers: The servers hosting one or more regions are called Region Servers.
- Master Server: The master server is responsible for coordinating the cluster and executing administrative operations. A minimal sketch of random reads and writes using the HBase client API follows this list.
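To illustrate the low-latency random reads and writes discussed above, here is a minimal sketch using the standard HBase Java client API. The table name, column family, row key, and ZooKeeper quorum address are assumptions for illustration, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomIoExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zookeeper-host"); // hypothetical quorum address

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) { // hypothetical table

            // Random write: put a single cell identified by row key, column family, and qualifier.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Random read: fetch the same cell back by row key.
            Get get = new Get(Bytes.toBytes("user-42"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```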
Distributed Application Coordination Services
Having talked about distributed storage and the distributed processing/computation unit, what is needed next is a service which takes care of some of the following aspects of distributed applications:
- Coordination
- Synchronization
- Serialization
- Reliability
- Atomicity
Some of the above key aspects of distributed applications are associated with some of the following challenges:
- Error proneness in general
- Race conditions with synchronization
- Partial failures
Apache Zookeeper is the most popular framework used as a distributed coordination service. Google's counterpart distributed coordination service is named Chubby; a Chubby instance is often referred to as a Chubby cell.
The most crucial use case of a distributed coordination service such as Zookeeper is electing a leader among a set of otherwise equivalent servers. Take a look at the following examples in this regard (a minimal leader-election sketch follows the list):
- GFS/HDFS may use this coordination service to elect the master server.
- BigTable/HBase may use the coordination service in some of the following ways:
- Elect a master server
- Allow the master server to discover the slave servers that it controls
- Allow clients to find the master server
- GFS/HDFS may also use this service to store small amounts of metadata
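As an illustration of leader election, here is a minimal sketch using the raw ZooKeeper Java API and ephemeral sequential znodes: each candidate creates a sequential znode under a shared election path, and the candidate holding the lowest sequence number considers itself the leader. The connect string and election path are assumptions, and a complete implementation (for example, one built on Apache Curator) would also watch the predecessor znode so followers can react when the leader fails.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

import java.util.Collections;
import java.util.List;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connect string; the /election parent znode is assumed to already exist.
        ZooKeeper zk = new ZooKeeper("zookeeper-host:2181", 15000, event -> { });

        // Each candidate creates an ephemeral sequential znode; it disappears if the candidate dies.
        String myNode = zk.create("/election/candidate-",
                new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL_SEQUENTIAL);

        // The candidate whose znode has the smallest sequence number is the leader.
        List<String> candidates = zk.getChildren("/election", false);
        Collections.sort(candidates);
        String smallest = "/election/" + candidates.get(0);

        if (myNode.equals(smallest)) {
            System.out.println("I am the leader: " + myNode);
        } else {
            System.out.println("I am a follower: " + myNode + " (leader is " + smallest + ")");
        }
        // A complete implementation would watch the predecessor znode and re-run the check on its deletion.
    }
}
```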