MapReduce is a programming model for efficient computation over large data sets. In contrast to traditional parallelism, MapReduce brings the computation to the location of the data.
Conceptually, MapReduce works in 2 steps:
- First is the mapper phase, where the map job takes a collection of data and converts it into another set of data, in which individual elements are broken into <key, value> pairs.
- Next is the reducer phase, where the reduce job takes the output of the map as its input and combines those <key, value> pairs into a smaller set of <key, value> pairs as output.
The input and output types of a Map/Reduce job are:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
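The two phases above can be sketched as plain Java methods (a minimal illustration, not Hadoop's actual API; the class and method names here are made up for the example):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;

public class Phases {
    // map: (k1, v1) -> list of intermediate (k2, v2) pairs.
    // Here the trivial mapper emits the line itself with a count of 1.
    static List<SimpleEntry<String, Integer>> map(Integer lineNo, String line) {
        return List.of(new SimpleEntry<>(line, 1));
    }

    // reduce: (k2, [v2, v2, ...]) -> a final (k3, v3) pair.
    // Here the reducer sums all the values grouped under one key.
    static SimpleEntry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return new SimpleEntry<>(key, sum);
    }

    public static void main(String[] args) {
        System.out.println(map(1, "Hello"));                 // prints [Hello=1]
        System.out.println(reduce("Hello", List.of(1, 1)));  // prints Hello=2
    }
}
```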
Confused? Let's understand this with the help of an example.
Take the simple word count example.
Word count simply counts the number of occurrences of each word in a set of input files.
The input can be a collection of thousands of files, documents, etc. For now, take a small set of 3 files.
First file has content: "Hello Bob, How are you?"
Second file has content: "I see you Bob."
Third file has content: "I want to talk to you."
Now let's see how the MapReduce approach works:
Each line would be distributed over individual mapper instances.
"Hello Bob, How are you?" - To mapper instance 1
"I see you Bob." - To mapper instance 2
"I want to talk to you." - To mapper instance 3
In the map job, each sentence is split into words, and each word is emitted as an initial key-value pair with the count 1, for example:
<Hello, 1>
<Bob, 1>
One such pair is produced for every word in the line.
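As a minimal sketch in plain Java (no Hadoop dependencies; the class name and the punctuation-stripping rule are my own choices for the example), the map step for a single line might look like this:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;

public class WordMapper {
    // Split one input line into words and emit a <word, 1> pair per word.
    // Punctuation is stripped so "Bob," and "Bob" become the same key.
    static List<SimpleEntry<String, Integer>> map(String line) {
        List<SimpleEntry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            String word = token.replaceAll("[^A-Za-z]", "");
            if (!word.isEmpty()) pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(map("Hello Bob, How are you?"));
        // prints [Hello=1, Bob=1, How=1, are=1, you=1]
    }
}
```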
In the reduce phase, the pairs are grouped by key and the values for each key are added together.
So for our three files, the reducer phase would produce results such as:
<Bob, 2>
<you, 3>
<I, 2>
<Hello, 1>
and so on for the remaining words. You see, this gives the number of occurrences of each word across the input files.
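Putting the phases together, here is a self-contained Java simulation of the whole word count flow over the three example files (an illustrative sketch of the map/shuffle/reduce logic on one machine, not an actual Hadoop job):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCount {
    static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase + shuffle: emit a 1 per word and group the 1s by word.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.split("\\s+")) {
                String word = token.replaceAll("[^A-Za-z]", "");
                if (!word.isEmpty())
                    groups.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        // Reduce phase: sum the grouped values for each key.
        Map<String, Integer> counts = new TreeMap<>();
        groups.forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> files = List.of(
            "Hello Bob, How are you?",
            "I see you Bob.",
            "I want to talk to you.");
        System.out.println(wordCount(files));
        // prints {Bob=2, Hello=1, How=1, I=2, are=1, see=1, talk=1, to=2, want=1, you=3}
    }
}
```

In a real cluster the shuffle step moves each key's values across the network so that all values for one key arrive at the same reducer; the `TreeMap` grouping here stands in for that.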
Have a look at the image below to understand the flow of the mapper and reducer with one more example.
You can develop MapReduce applications in Java or any JVM-based language.
Components of MapReduce:
The main components of MapReduce are as described below:
JobTracker: It is the service in Hadoop which sends MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least nodes in the same rack.
TaskTracker: TaskTrackers are the slaves which are deployed on each machine. They are responsible for running the map and reduce tasks as instructed by the JobTracker.
JobHistoryServer: It is a daemon that serves historical information about completed applications.
How exactly a MapReduce job works (workflow):
Let's try to understand the workflow of MapReduce in a bit more detail.
- Client applications submit MapReduce jobs to the JobTracker.
- The JobTracker talks to the NameNode to determine the location of the data.
- The JobTracker locates TaskTracker nodes with available slots at or near the data.
- The JobTracker submits the work to the chosen Tasktracker nodes.
- The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and their work is rescheduled on a different TaskTracker.
- When the work is completed, the JobTracker updates its status.
- Client applications can poll the JobTracker for information.
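The heartbeat check in the workflow above can be sketched as follows (illustrative only; the class and method names are invented for this example, though the 10-minute expiry is, to my understanding, the Hadoop 1.x default for declaring a TaskTracker lost):

```java
import java.util.HashMap;
import java.util.Map;

public class HeartbeatMonitor {
    // Assumed default expiry interval; configurable in a real cluster.
    static final long TIMEOUT_MS = 10 * 60 * 1000;

    // Last heartbeat timestamp (ms) recorded per TaskTracker.
    final Map<String, Long> lastHeartbeat = new HashMap<>();

    void heartbeat(String tracker, long nowMs) {
        lastHeartbeat.put(tracker, nowMs);
    }

    // A tracker is considered failed if it never reported,
    // or its last heartbeat is older than the timeout.
    boolean isFailed(String tracker, long nowMs) {
        Long last = lastHeartbeat.get(tracker);
        return last == null || nowMs - last > TIMEOUT_MS;
    }

    public static void main(String[] args) {
        HeartbeatMonitor jt = new HeartbeatMonitor();
        jt.heartbeat("tracker-1", 0);
        System.out.println(jt.isFailed("tracker-1", 5_000));        // prints false
        System.out.println(jt.isFailed("tracker-1", 11 * 60_000));  // prints true
    }
}
```

When a tracker is deemed failed, the JobTracker reschedules that tracker's in-flight tasks on another TaskTracker, as described in the steps above.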
The JobTracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.