Hadoop Cluster - Architecture, Core Components and Work-flow
In this article, we will explain:
The architecture of Hadoop Cluster
Core Components of Hadoop Cluster
Work-flow of How File is Stored in Hadoop
Confused Between Hadoop and Hadoop Cluster?
Hadoop is an open-source framework that supports the processing of large data sets in a distributed computing environment.
Hadoop consists of MapReduce, the Hadoop Distributed File System (HDFS), and a number of related projects such as Apache Hive, HBase and ZooKeeper. MapReduce and HDFS are the main components of Hadoop.
Normally, any set of loosely or tightly connected computers that work together as a single system is called a cluster.
In simple words, a computer cluster used for Hadoop is called a Hadoop cluster.
A Hadoop cluster is a special type of computational cluster designed for storing and analyzing vast amounts of unstructured data in a distributed computing environment. These clusters run on low-cost commodity computers.
Hadoop clusters are often referred to as "shared nothing" systems because the only thing that is shared between nodes is the network that connects them.
Large Hadoop clusters are arranged in several racks. Network bandwidth between nodes in the same rack is much higher than bandwidth across racks, so traffic between nodes in the same rack is much more desirable than traffic across racks.
A Real-World Example:
Consider Yahoo's Hadoop cluster.
They have more than 10,000 machines running Hadoop and store nearly 1 petabyte of user data.
Hadoop Cluster Architecture:
Let's try to understand Hadoop Cluster architecture with the help of an example.
What would a typical Hadoop cluster setup for 4,500 nodes look like?
In this case the Hadoop cluster would consist of:
110 different racks
Each rack would have around 40 slave machines
At the top of each rack there is a rack switch
Each slave machine (a rack server in a rack) has cables coming out of it from both ends
The cables are connected to the rack switch at the top, which means each top-of-rack switch will have around 80 ports
There are 8 global core switches
Each rack switch has uplinks connected to the core switches, connecting all the racks with uniform bandwidth and forming the cluster
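The back-of-the-envelope arithmetic above can be sketched in a few lines (the rack and port counts are the illustrative figures from this example, not fixed Hadoop limits):

```python
# Rough sizing for the 4,500-node example above.
# These per-rack figures are illustrative assumptions from the example.
racks = 110
machines_per_rack = 40

slaves = racks * machines_per_rack   # ~4,400 slave machines, close to the 4,500-node target
ports = machines_per_rack * 2        # each machine cabled at both ends -> ~80 switch ports

print(slaves, ports)
```
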
In the cluster, a few machines act as the NameNode and the JobTracker. They are referred to as Masters. These masters have a different hardware configuration, favoring more DRAM and CPU and less local storage.
The majority of the machines act as DataNodes and TaskTrackers and are referred to as Slaves. These slave nodes have lots of local disk storage and moderate amounts of CPU and DRAM.
Core Components of Hadoop Cluster:
A Hadoop cluster has 3 components: Client, Masters, and Slaves.
The role of each component is described below.
Let's try to understand these components one by one.
Client:
The Client is neither master nor slave. Rather, it plays the role of loading data into the cluster, submitting MapReduce jobs that describe how the data should be processed, and then retrieving the data to see the response after job completion.
The Masters consist of 3 components: NameNode, Secondary NameNode and JobTracker.
The NameNode does NOT store the files, only the files' metadata. In a later section we will see that it is actually the DataNodes which store the files.
The NameNode oversees the health of the DataNodes and coordinates access to the data stored on them.
The NameNode keeps track of all file-system-related information, such as:
Which section of file is saved in which part of the cluster
Last access time for the files
User permissions, such as which users have access to a file
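The metadata the NameNode tracks can be pictured as a mapping from file paths to block locations, timestamps, and permissions. The following is a toy model only (the field names and layout are illustrative assumptions, not the real NameNode data structures):

```python
# Toy model of NameNode metadata: file -> blocks -> DataNode locations.
# Field names are illustrative, not actual HDFS internals.
namenode_metadata = {
    "/user/demo/Sample.txt": {
        "blocks": {
            "Block-A": ["DataNode-2", "DataNode-3", "DataNode-4"],
            "Block-B": ["DataNode-1", "DataNode-3", "DataNode-4"],
            "Block-C": ["DataNode-1", "DataNode-2", "DataNode-3"],
        },
        "last_access": "2014-06-01T10:00:00",   # last access time
        "permissions": {"owner": "demo", "mode": "rw-r--r--"},
    }
}

# Which section of the file lives on which DataNodes:
print(namenode_metadata["/user/demo/Sample.txt"]["blocks"]["Block-A"])
```

Note that the file contents themselves appear nowhere in this structure; only locations and attributes do.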
The JobTracker coordinates the parallel processing of data using MapReduce.
Secondary Name Node:
Don't get confused by the name "Secondary". The Secondary NameNode is NOT a backup or high-availability node for the NameNode.
So what does the Secondary NameNode do?
The job of the Secondary NameNode is to contact the NameNode periodically, after a certain time interval (by default, 1 hour).
The NameNode keeps all filesystem metadata in RAM and has no capability of its own to merge that metadata onto disk. So if the NameNode crashes, everything in RAM is lost and you have no clean backup of the filesystem. What the Secondary NameNode does is contact the NameNode every hour, pull a copy of the metadata out of it, shuffle and merge this information into a clean file, and send it back to the NameNode, while keeping a copy for itself. Hence the Secondary NameNode is not a backup; rather, it does the job of housekeeping.
In case of NameNode failure, the saved metadata can be used to rebuild it easily.
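This housekeeping cycle can be sketched as follows. This is a deliberate simplification (the real Secondary NameNode merges an on-disk fsimage with an edit log; the class and method names here are made up for illustration):

```python
import copy

# Toy checkpoint cycle: pull metadata from the "NameNode", keep a local
# copy, and hand a clean merged snapshot back.
class NameNode:
    def __init__(self):
        # all filesystem metadata lives in RAM on the NameNode
        self.metadata = {"/Sample.txt": ["Block-A", "Block-B", "Block-C"]}
        self.checkpoint = None        # clean copy returned by housekeeping

class SecondaryNameNode:
    def __init__(self):
        self.saved_copy = None        # the copy it keeps for itself

    def run_checkpoint(self, namenode):
        snapshot = copy.deepcopy(namenode.metadata)  # pull metadata out
        self.saved_copy = snapshot                   # keep a copy
        namenode.checkpoint = snapshot               # send clean file back

nn, snn = NameNode(), SecondaryNameNode()
snn.run_checkpoint(nn)          # in real HDFS this runs hourly by default
print(nn.checkpoint == snn.saved_copy)
```

The key point the sketch captures: the Secondary NameNode never serves client requests; it only consolidates metadata so recovery after a crash is easier.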
Slave nodes are the majority of machines in a Hadoop cluster and are responsible for:
Storing the data
Running the computation
Each slave runs both a DataNode and a TaskTracker daemon, which communicate with their masters.
The TaskTracker daemon is a slave to the JobTracker, and the DataNode daemon is a slave to the NameNode.
Hadoop- Typical Workflow in HDFS:
Let's try to find the answers to the questions below.
Take the example of an input file, Sample.txt.
How does Sample.txt get loaded into the Hadoop cluster?
The Client machine does this step: it loads Sample.txt into the cluster by breaking it into smaller chunks, known as "Blocks" in the Hadoop context, and puts these blocks on different machines (DataNodes) throughout the cluster.
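Splitting a file into fixed-size blocks is easy to sketch. HDFS defaults to 64 MB blocks in Hadoop Gen 1 and 128 MB in Gen 2; the tiny block size below is only so the example output is visible:

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a byte string into fixed-size chunks ("blocks")."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# A 30-byte stand-in for Sample.txt, split with a toy 12-byte block size.
# Real HDFS blocks default to 64 MB (Gen 1) or 128 MB (Gen 2).
sample = b"0123456789" * 3
blocks = split_into_blocks(sample, block_size=12)
print([len(b) for b in blocks])   # -> [12, 12, 6]
```

The last block is simply whatever is left over, which is also how HDFS handles a file that is not an exact multiple of the block size.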
Next, how does the Client know to which DataNodes it should load the blocks?
This is where the NameNode comes into the picture. The NameNode uses its Rack Awareness intelligence to decide which DataNodes to provide. For each data block (in this case Block-A, Block-B and Block-C), the Client contacts the NameNode, and in response the NameNode sends an ordered list of 3 DataNodes.
For example, in response to the Block-A request, the NameNode may send DataNode-2, DataNode-3 and DataNode-4.
Similarly, for Block-B the DataNode list is DataNode-1, DataNode-3, DataNode-4, and for Block-C the list is DataNode-1, DataNode-2, DataNode-3.
Block A gets stored in DataNode-2, DataNode-3, DataNode-4
Block B gets stored in DataNode-1, DataNode-3, DataNode-4
Block C gets stored in DataNode-1, DataNode-2, DataNode-3
Every block is replicated to more than one DataNode to ensure data recovery in the event of machine failures. That's why the NameNode sends a list of 3 DataNodes for each individual block.
Who does the block replication?
The Client writes each data block directly to one DataNode. The DataNodes then replicate the block to the other DataNodes in the list. Only when one block has been written to all 3 DataNodes does the cycle repeat for the next block.
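The write path just described can be simulated in a few lines. This is a toy model of the fan-out, not the real HDFS write-pipeline protocol (acknowledgements, packet streaming, and failure handling are all omitted):

```python
# Toy write pipeline: the client sends each block to the FIRST DataNode in
# the list the NameNode returned; DataNodes replicate it to the rest.
datanodes = {f"DataNode-{i}": {} for i in range(1, 5)}   # name -> stored blocks

placements = {              # ordered lists as returned by the NameNode
    "Block-A": ["DataNode-2", "DataNode-3", "DataNode-4"],
    "Block-B": ["DataNode-1", "DataNode-3", "DataNode-4"],
    "Block-C": ["DataNode-1", "DataNode-2", "DataNode-3"],
}

def write_block(block_id, data, pipeline):
    first, *rest = pipeline
    datanodes[first][block_id] = data    # client writes to the first node only
    for node in rest:                    # DataNodes replicate onward
        datanodes[node][block_id] = data

for block_id, pipeline in placements.items():   # next block starts only after
    write_block(block_id, b"...", pipeline)     # the previous one is replicated

print(sorted(n for n, s in datanodes.items() if "Block-A" in s))
```
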
Few Important Notes:
In Hadoop Gen 1 there is only one NameNode, whereas in Gen 2 there is an active-passive model for the NameNode, in which one more node, the "Passive Node", comes into the picture.
The default setting for Hadoop is to keep 3 copies of each block in the cluster. This setting can be configured with the "dfs.replication" parameter in the hdfs-site.xml file.
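For example, the replication factor is set in hdfs-site.xml like this (the value 3 shown here is the stock default):

```xml
<!-- hdfs-site.xml: number of copies kept for each block -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```
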
Note that the Client writes the block directly to the DataNode, without any intervention from the NameNode in this process.