Hadoop Cluster - Architecture, Core Components and Work-flow


In this article we will explain:
    • The architecture of a Hadoop cluster
    • The core components of a Hadoop cluster
    • The workflow of how a file is stored in Hadoop

Confused Between Hadoop and Hadoop Cluster?

Hadoop:
Hadoop is an open-source framework that supports the processing of large data sets in a distributed computing environment.
Hadoop consists of MapReduce, the Hadoop Distributed File System (HDFS) and a number of related projects such as Apache Hive, HBase and ZooKeeper. MapReduce and HDFS are the main components of Hadoop.

For more information on Hadoop, check the article What is Hadoop?

Hadoop Cluster:
Any set of loosely or tightly connected computers that work together as a single system is called a cluster. In simple words, a computer cluster used for Hadoop is called a Hadoop cluster.

A Hadoop cluster is a special type of computational cluster designed for storing and analyzing vast amounts of unstructured data in a distributed computing environment. These clusters run on low-cost commodity computers.

Hadoop Cluster - Architecture and Components

Hadoop clusters are often referred to as "shared nothing" systems because the only thing that is shared between nodes is the network that connects them.

Large Hadoop clusters are arranged in several racks. Network traffic between nodes in the same rack is much more desirable than traffic across racks, because bandwidth within a rack is greater than bandwidth between racks.

A Real-World Example:
Yahoo, for instance, has more than 10,000 machines running Hadoop and stores nearly 1 petabyte of user data.

Hadoop Cluster Architecture:

Let's try to understand the Hadoop cluster architecture with the help of an example:
what would a typical Hadoop cluster setup look like for 4,500 nodes?

In this case the Hadoop cluster would consist of:
    • Around 110 different racks (roughly 4,500 nodes at 40 nodes per rack)
    • Each rack would hold around 40 slave machines
    • At the top of each rack there is a rack switch
    • Each slave machine (a rack server in a rack) has cables coming out of it at both ends
    • The cables connect to the rack switch at the top, which means the top-of-rack switch needs around 80 ports
    • There are 8 global core switches
    • Each rack switch has uplinks to the core switches, connecting all the racks with uniform bandwidth and forming the cluster
    • A few machines in the cluster act as the NameNode and the JobTracker. They are referred to as masters. These masters have a different hardware configuration, favoring more DRAM and CPU and less local storage.
    • The majority of the machines act as DataNodes and TaskTrackers and are referred to as slaves. These slave nodes have lots of local disk storage and moderate amounts of CPU and DRAM.

Core Components of Hadoop Cluster:

A Hadoop cluster has 3 types of components:
    1. Client
    2. Masters
    3. Slaves
The role of each component is explained below.

Let's try to understand these components one by one.

Client:

The client is neither a master nor a slave. Its role is to load data into the cluster, submit MapReduce jobs describing how that data should be processed, and then retrieve the results once the jobs complete.
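To make this concrete, here is a minimal sketch in Java of a client loading a file into the cluster through the HDFS API. The NameNode address and the paths are assumptions for illustration, not values from this article.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsLoadClient {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumption: the NameNode address; in practice this comes from core-site.xml.
            conf.set("fs.default.name", "hdfs://namenode-host:9000");
            FileSystem fs = FileSystem.get(conf);
            // The client pushes a local file into the cluster; HDFS transparently
            // splits it into blocks and spreads them across the DataNodes.
            fs.copyFromLocalFile(new Path("Sample.txt"), new Path("/user/demo/Sample.txt"));
            fs.close();
        }
    }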

Masters:

The masters consist of 3 components: the NameNode, the Secondary NameNode and the JobTracker.

NameNode:
The NameNode does NOT store the files, only the files' metadata. In a later section we will see that it is actually the DataNodes that store the files.

The NameNode oversees the health of the DataNodes and coordinates access to the data stored on them.
The NameNode keeps track of all filesystem-related information, such as:
    • Which section of a file is saved in which part of the cluster
    • The last access time for each file
    • User permissions, i.e. which users have access to a file
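As a hedged illustration, clients can read this same metadata back through the filesystem API. The sketch below (hypothetical path, classic Hadoop API) prints a file's owner, permissions, last access time and block locations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FileMetadataDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/Sample.txt");  // hypothetical file
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Owner: " + status.getOwner()
                    + ", permissions: " + status.getPermission()
                    + ", last access: " + status.getAccessTime());
            // Which section (block) of the file is stored on which DataNodes:
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("Block at offset " + block.getOffset()
                        + " lives on " + String.join(", ", block.getHosts()));
            }
            fs.close();
        }
    }
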
JobTracker:
The JobTracker coordinates the parallel processing of data using MapReduce.

To know more about the JobTracker, please read the article All You Want to Know about MapReduce (The Heart of Hadoop).
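
For a flavor of what the JobTracker receives, here is a minimal, hedged sketch of a job submission using the classic org.apache.hadoop.mapred API. No mapper or reducer is set, so the job falls back to the identity classes; the input and output paths are assumptions.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SubmitIdentityJob {
        public static void main(String[] args) throws Exception {
            JobConf job = new JobConf(SubmitIdentityJob.class);
            job.setJobName("identity-demo");
            // With the default TextInputFormat, keys are byte offsets and values are lines.
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(job, new Path("/user/demo/Sample.txt"));  // assumed input
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/out"));        // assumed output
            // JobClient hands the job to the JobTracker, which schedules map and reduce
            // tasks on TaskTrackers close to the input blocks and monitors their progress.
            JobClient.runJob(job);
        }
    }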

Secondary NameNode:
Don't be confused by the name "Secondary": the Secondary NameNode is NOT a backup or high-availability node for the NameNode.

So what does the Secondary NameNode do?

The job of the Secondary NameNode is to contact the NameNode periodically, after a certain time interval (by default, 1 hour).
The NameNode keeps all filesystem metadata in RAM and has no capability to merge that metadata into a clean on-disk image by itself. So if the NameNode crashes, everything held in RAM is lost and there is no consolidated copy of the filesystem image. What the Secondary NameNode does is contact the NameNode every hour and pull a copy of the metadata out of it. It shuffles and merges this information into a clean file, sends it back to the NameNode, and keeps a copy for itself. Hence the Secondary NameNode is not a backup; rather, it does the job of housekeeping.
In case of a NameNode failure, this saved metadata can be used to rebuild it.
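
As a rough sketch (assuming a Hadoop 1.x cluster), the one-hour interval mentioned above corresponds to the fs.checkpoint.period property, measured in seconds and normally set in the configuration files rather than in code:

    import org.apache.hadoop.conf.Configuration;

    public class CheckpointPeriodDemo {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Assumed Hadoop 1.x property name; default is 3600 seconds = 1 hour
            // between Secondary NameNode checkpoints.
            long period = conf.getLong("fs.checkpoint.period", 3600);
            System.out.println("Checkpoint period: " + period + " seconds");
        }
    }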

Slaves:

Slave nodes make up the majority of machines in a Hadoop cluster and are responsible for:
    • Storing the data
    • Processing the computations


Each slave runs both a DataNode and a TaskTracker daemon, which communicate with their masters. The TaskTracker daemon is a slave to the JobTracker, and the DataNode daemon is a slave to the NameNode.

Hadoop: Typical Workflow in HDFS



Let's try to find the answers to the following questions, taking Sample.txt as the example input file.
    1. How does Sample.txt get loaded into the Hadoop cluster?

      The client machine performs this step: it loads Sample.txt into the cluster by breaking it into smaller chunks, known as "blocks" in the Hadoop context, and puts these blocks on different machines (DataNodes) throughout the cluster.

    2. Next, how does the client know which DataNodes to load the blocks onto?

      This is where the NameNode comes into the picture. The NameNode uses its rack-awareness intelligence to decide which DataNodes to offer. For each data block (in this case Block-A, Block-B and Block-C), the client contacts the NameNode, and in response the NameNode sends an ordered list of 3 DataNodes.

      For example, in response to the Block-A request, the NameNode may send DataNode-2, DataNode-3 and DataNode-4.

      Similarly, for Block-B the DataNode list is DataNode-1, DataNode-3 and DataNode-4, and for Block-C it is DataNode-1, DataNode-2 and DataNode-3. Hence:
        • Block-A gets stored on DataNode-2, DataNode-3 and DataNode-4
        • Block-B gets stored on DataNode-1, DataNode-3 and DataNode-4
        • Block-C gets stored on DataNode-1, DataNode-2 and DataNode-3
      Every block is replicated to more than one DataNode to ensure that data can be recovered if a machine fails. That is why the NameNode sends a list of 3 DataNodes for each individual block.
    3. Who does the block replication?

      The client writes each data block directly to one DataNode. The DataNodes then replicate the block to the other DataNodes on the list. Only when a block has been written to all 3 DataNodes does the cycle repeat for the next block. A sketch of this write path follows below.
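
Here is the promised sketch of the write path, again hedged: the path, buffer size and block size are illustrative assumptions, and the five-argument create() call is the classic FileSystem API through which a client requests 3 replicas per block.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicatedWriteDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path dst = new Path("/user/demo/Sample.txt");   // hypothetical target path
            short replication = 3;                          // 3 replicas per block
            long blockSize = 64L * 1024 * 1024;             // 64 MB, the classic default
            // The client writes each block to the first DataNode on the NameNode's list;
            // the DataNodes then pipeline the replicas on to the second and third nodes.
            FSDataOutputStream out = fs.create(dst, true, 4096, replication, blockSize);
            out.writeBytes("sample data\n");
            out.close();   // close() returns after the pipeline has acknowledged the blocks
            fs.close();
        }
    }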

A Few Important Notes:

    • In Hadoop generation 1 there is only one NameNode, whereas generation 2 introduces an active-passive NameNode model, in which one more node, the passive (standby) NameNode, comes into the picture.
    • By default Hadoop keeps 3 copies of each block in the cluster. This setting can be configured with the "dfs.replication" parameter in the hdfs-site.xml file.
    • Note that the client writes each block directly to a DataNode, without any intervention from the NameNode in this process.
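
To illustrate the second note: the replication factor can also be set programmatically, either as a client-side default or per file after the fact. A hedged sketch with an assumed path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationConfigDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("dfs.replication", 3);   // client-side default, mirrors hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            // Lower an existing file's replication factor to 2; the NameNode will
            // schedule removal of the extra copies in the background.
            fs.setReplication(new Path("/user/demo/Sample.txt"), (short) 2);
            fs.close();
        }
    }
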
Continue reading:
Top 10 Myths about Hadoop
Hadoop Basic Course






11 thoughts on “Hadoop Cluster – Architecture and Core Components”

  1. jessie c says:

    Nice platform to learn Hadoop

  2. Jeet says:

    Very Very helpful and clear understanding..Excellent tutorial

  3. Bhagyashri Salunkhe says:

    Good tutorial … Easy to understand … Thanks.

  4. xplorerdev says:

    Excellent stuff. Very well explained.

    I have a general deployment question now: I want to use my Tableau reports to use Hadoop Cluster. I want to use Spark Execution Engine. At the same time, HDFS/MapReduce/Scala/Hive/HiveServer2/Thriftserver/HiveOnSpark/HiveOnMR etc also comes into pictures.

    Could you briefly pictureize the Tableau-Hadoop-Spark-Hive environment/components/services, I would need to deploy for viewing my Tableau reports?

    Best Regards

  5. Anjum Suri says:

    Hi

    Good explanation…I want to extend my knowledge, reading and understanding of Hadoop Architecture and ideally would want to move towards ‘Hadoop Administrator’

    Can you please advise where should I start from…? I am not at all into programming but can do a bit of shell scripting. I am basically an Oracle Database Administrator on Linux/UNIX environments. Can you please send me some links where I can find relevant documentation…? Please advise…?

    Thanks
    Anjum

  6. Shafi says:

    Hi,

    After reading this i got a clear clarity about clusters, master nodes and slave nodes.

    Thanks​ for this tutorial, it’s really helpful

  7. Ananthi says:

    Great and informative article.. i clearly understand the hadoop cluster architecture from this article.
