Hadoop Distributed File System (HDFS) is a distributed file system designed to hold very large volumes of data (terabytes, even petabytes).
Conceptually, HDFS is a block-structured file system where
- Individual files are broken into blocks of a fixed size.
- These blocks are stored across a cluster of one or more machines with data storage capacity.
- Individual machines in the cluster are referred to as DataNodes.
- HDFS is responsible for distributing the data across the DataNodes (see the sketch after this list).
- HDFS handles administrative jobs such as adding nodes to and removing nodes from the cluster.
- HDFS is also responsible for recovering from DataNode failures.
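To make the block layout concrete, here is a minimal sketch using the Hadoop FileSystem Java API that prints each block of a file and the DataNodes holding it. The class name and the path /data/sample.txt are illustrative, and it assumes a Hadoop client configuration (core-site.xml / hdfs-site.xml) is available on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; replace with a file that exists in your cluster.
        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per fixed-size block, listing the DataNodes that hold it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```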
Components of HDFS:
The main components of HDFS are the Name Node, the Data Node, and the Backup Node.
Name Node:
It is the admin/master of the system.
- It manages the blocks which are present on the Data Nodes.
- The Name Node stores the metadata and runs on reliable, high-quality hardware.
- It is a single point of failure: if the Name Node goes down, the entire Hadoop cluster becomes inaccessible.
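Because this metadata lives entirely in the Name Node, a simple directory listing never touches a Data Node. The following sketch (class name illustrative, same client-configuration assumption as above) prints the per-file metadata the Name Node returns:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowNameNodeMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // listStatus is answered from the Name Node's metadata alone;
        // no Data Node is contacted for this call.
        for (FileStatus st : fs.listStatus(new Path("/"))) {
            System.out.printf("%s repl=%d blockSize=%d owner=%s%n",
                    st.getPath(), st.getReplication(), st.getBlockSize(), st.getOwner());
        }
        fs.close();
    }
}
```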
Data Node:
These are the slaves, which are deployed on each machine in the cluster.
- This is the place where actual data is stored.
- It is responsible for serving read and write requests from clients.
- Each Data Node holds a part of the overall data and processes the part of the data which it holds.
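The read/write path can be sketched with the client API: the client asks the Name Node where blocks live, then streams bytes directly to and from the Data Nodes. A minimal, illustrative example (the path /tmp/demo.txt and the class name are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWriteDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/demo.txt"); // hypothetical path

        // Write: the client obtains target Data Nodes from the Name Node,
        // then streams the bytes directly to those Data Nodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: block locations come from the Name Node, but the data
        // itself is served by the Data Nodes that store the blocks.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```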
Backup Node:
This is responsible for performing periodic checkpoints of the file system metadata. In the event of a Name Node failure, you can restart the Name Node using the latest checkpoint.
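One common way this recovery is performed (a sketch, assuming dfs.namenode.checkpoint.dir points at the directory holding the checkpoint) is to start the Name Node with the -importCheckpoint option, which loads the most recent checkpoint image:

```
hdfs namenode -importCheckpoint
```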