In our previous article, MapReduce - Heart of Hadoop, we talked about:
- What is MapReduce?
- MapReduce Engine
Now let’s move one step ahead and try to understand the MapReduce concept in more detail.
MapReduce works on the parallel programming concept. Let’s first understand parallel programming, and then see how MapReduce builds on it.
What is Parallel Programming?
In a parallel program, the processing is broken up into several parts, and each part is executed concurrently.
The important point to note here is that each part should have no dependency on the other parts: no communication should be needed between the parts while they execute in parallel.
Confused? Let’s take an example…
Suppose you are writing a program that needs to access some common data. When you execute several instances of that program at the same time, conflicts can occur: one instance may be changing some data while another instance is reading it. You have to handle these cases in your program code. Here you’re dealing with Concurrency.
But if each program instance works on data which no other instance needs, then you have Parallelism.
In other words, programs whose data is isolated and not dependent on other data can be executed in a parallel manner.
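To make the difference concrete, here is a minimal Python sketch of parallelism: each chunk of data is isolated, so the workers never need to coordinate or lock anything. (The chunks and the summing workload are hypothetical illustrative values.)

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical workload: summing independent chunks of numbers.
# Each chunk can be processed without touching any other chunk,
# so no locking or communication is needed -- this is parallelism.
chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def process_chunk(chunk):
    # Works only on its own isolated data.
    return sum(chunk)

with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(process_chunk, chunks))

# Combine the independent partial results at the end.
total = sum(partial_sums)
```

If the chunks shared data (say, all of them updated one common counter while running), we would be back in concurrency territory and would need synchronization.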
Parallel programs are not only faster but they can also be used to solve problems on large datasets using non-local resources.
MapReduce Works on Parallel Programming Concept:
MapReduce is inspired by this parallel programming concept and is a mechanism for processing large amounts of raw data, for example web request logs, website URLs, etc.
This data is so large that it must be distributed across thousands of machines in order to be processed in a reasonable time. This distribution implies parallel computing, since the same computation (program) is performed on each CPU, but with a different dataset. The datasets do not depend on each other during execution.
MapReduce is an abstraction that allows Google engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing and fault tolerance.
3 Stages of MapReduce – Map, Shuffle and Reduce:
The MapReduce programming model comes with 3 simple stages.
The Shuffle stage is handled automatically by Hadoop; we just need to implement the Map and Reduce parts.
- Input for Map stage is always a <Key, Value> pair and it produces a set of intermediate key/value pairs as an output.
- All intermediate values associated with the same intermediate key are grouped together and passed through to the reduce stage.
- The Reduce stage accepts an intermediate key and a set of values for that key, and merges these values together to form a possibly smaller set of values.
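The three stages above can be sketched as a tiny, single-machine Python function. This is an illustrative simplification of the model, not Hadoop’s actual API; `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map: each input (key, value) pair emits intermediate (key, value) pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle: group all values that share the same intermediate key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce: merge each key's values into a (possibly smaller) result.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

# Example: the classic word count, expressed in these three stages.
records = [(1, "a b a"), (2, "b c")]
result = run_mapreduce(
    records,
    map_fn=lambda key, line: [(word, 1) for word in line.split()],
    reduce_fn=lambda key, counts: sum(counts),
)
```

In real Hadoop the Map and Reduce parts run on many machines and the Shuffle moves data over the network, but the data flow is the same as in this sketch.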
A Real World Example:
Now that we understand the concept of parallelism, MapReduce, and the three stages of MapReduce, let’s take a real-world use case where a problem is solved using the MapReduce programming model.
In an election, we need to find out how many votes each party got in every state.
Solution using MapReduce Parallel Programming:
The state-wise vote count depends on the city-wise vote counts, and all the cities under one state together give the total votes for that state.
Note that since we can calculate the total count for one state without caring about votes coming from cities in other states, we can use a parallel algorithm (MapReduce).
If you did this sequentially instead of in parallel, you would start with an empty list of states, iterate through the vast list of cities, and for each city look up its state and update (add to) that state’s vote count.
Think about it from a performance point of view as well: on a vast list of cities, that would be slow. Luckily we can use parallel programming here: distribute the dataset state-wise and then work on each state in parallel to calculate the total votes coming in from the different cities under that state.
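The sequential approach just described can be sketched as follows (the city records are hypothetical illustrative values):

```python
# Hypothetical input: one (city, state, votes) record per city.
city_records = [
    ("City-A1", "State-A", 10),
    ("City-B1", "State-B", 20),
    ("City-A2", "State-A", 40),
]

# Sequential approach: start with an empty state tally and walk
# through every city one by one, updating that city's state count.
state_totals = {}
for city, state, votes in city_records:
    state_totals[state] = state_totals.get(state, 0) + votes
```

Every record passes through this single loop, so the running time grows with the full list of cities; the MapReduce version below splits the same work across states instead.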
Let’s do it using MapReduce.
Consider a few states like State-A, State-B, State-C … and several cities: City-A1, City-A2, City-A3 … City-B1, City-B2, City-B3 … and so on.
Each city has two attributes: the state it belongs to and its vote count.
Now let’s consider a particular state, State-A.
As input to the Map stage we will have several key-value pairs of the form Key: City, Value: [State, Vote Count]:
Key: City-A1, Value: [State-A, 10]
Key: City-A2, Value: [State-A, 40]
Key: City-A3, Value: [State-A, 50] …
Since we are only interested in the state-wise vote count, the state name becomes the key, so the Map stage will emit:
Key: State-A, Value: 10
Key: State-A, Value: 40
Key: State-A, Value: 50 …
Now the Shuffle stage will take these as input and group all the values for a single key:
Key: State-A, Values: [10, 40, 50]
The Reduce function will apply its logic to the data received from the Shuffle (intermediate) stage and return the total vote count for State-A. The output will be:
Key: State-A, Value: 100
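Putting it all together, the vote-count walkthrough can be expressed in plain Python using the city names and counts from above (a simplified, single-machine illustration, not Hadoop’s actual API):

```python
from collections import defaultdict

# Input: one (key, value) pair per city, as in the walkthrough above.
city_votes = [
    ("City-A1", ("State-A", 10)),
    ("City-A2", ("State-A", 40)),
    ("City-A3", ("State-A", 50)),
]

# Map: re-key each record by state, since we want state-wise totals.
intermediate = [(state, count) for city, (state, count) in city_votes]

# Shuffle: group the vote counts under each state key.
grouped = defaultdict(list)
for state, count in intermediate:
    grouped[state].append(count)

# Reduce: sum the grouped counts to get each state's total.
totals = {state: sum(counts) for state, counts in grouped.items()}
```

With many states, each state’s group can be reduced on a different machine at the same time, since no state’s total depends on any other state’s cities.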
We got the result (total votes per state) by executing the MapReduce task in parallel.
What We Learnt So Far:
In this article we understood:
- The difference between the two terms, concurrency and parallelism
- How parallel programming works in the MapReduce concept
- A real-world example where the MapReduce approach solves a problem in a faster and easier way
MapReduce is NOT ONLY a programming model; it also provides an execution engine to run MapReduce tasks and handle fault tolerance.
Read the next article to learn about this:
MapReduce, A Complete Execution Engine not just a Programming Model