SapHanaTutorial.Com HOME     Learning-Materials Interview-Q&A Certifications Quiz Online-Courses Forum Jobs Trendz FAQs  
     Explore The World of Hana With Us     
About Us
Contact Us
 Apps
X
HANA App
>>>
Hadoop App
>>>
Tutorial App on SAP HANA
This app is an All-In-One package to provide everything to HANA Lovers.

It contains
1. Courses on SAP HANA - Basics, Modeling and Administration
2. Multiple Quizzes on Overview, Modelling, Architeture, and Administration
3. Most popular articles on SAP HANA
4. Series of Interview questions to brushup your HANA skills
Tutorial App on Hadoop
This app is an All-In-One package to provide everything to Hadoop Lovers.

It contains
1. Courses on Hadoop - Basics and Advanced
2. Multiple Quizzes on Basics, MapReduce and HDFS
3. Most popular articles on Hadoop
4. Series of Interview questions to brushup your skills
Apps
HANA App
Hadoop App
';
Search
Stay Connected
Search Topics
Topic Index
+
-
Hadoop Overview
+
-
MapReduce
+
-
YARN
+
-
Miscellaneous

Mapreduce - A Parallel Programming Concept

In our previous article MapReduce - Heart of Hadoop we talked about
  • What is MapReduce?
  • Architecture
  • MapReduce Engine
Now let’s move one step ahead and try to understand MapReduce concept in more detail.

MapReduce works on parallel programming concept. Let’s try to   understand first a parallel programming concept and then how the MapReduce works on that concept.

What is Parallel Programming?

In a parallel program, the processing is broken up into several parts and each of the part gets executed concurrently.
Important point to note here is that “Each part should have no dependency on other parts, No communication should be needed to other parts when executing all parts in parallel manner.

Got confused!! Let’s take an example…
Suppose you are writing a program and your program needs to access some common data. Now when  you are executing   several instances of that program at the same time, there could be conflicts like one instance of program is changing some data while other instance is reading it. So, you have to handle these cases in your program code. Here you’re doing Concurrency.
But if your program instance is working on some data which no other instance needs, then you’re doing here Parallelism.

MapReduce Programming Concept

In other words - Those programs where data is not dependent on each other and is isolated can be executed in parallel programming manner.
Parallel programs are not only faster but they can also be used to solve problems on large datasets using non-local resources.


MapReduce Works on Parallel Programming Concept:

MapReduce is also inspired by this parallel programming concept and is a mechanism for processing large amounts of raw data. For example web request logs, Website URLs etc.
This data is so large; it must be distributed across thousands of machines in order to be processed in a reasonable time. This distribution implies parallel computing since the same computations (program) are performed on each CPU, but with a different dataset. Here dataset are not depending on each other during execution.
MapReduce is an abstraction that allows Google engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing and fault tolerance.

3 Stages of MapReduce – Map, Shuffle and Reduce:

MapReduce programming model comes with 3 simple stages.
The Shuffle part is done automatically by Hadoop, we just need to implement the Map and Reduce parts.
  • Input for Map stage is always a <Key, Value> pair and it produces a set of intermediate key/value pairs as an output.
  • All intermediate values associated with the same intermediate key are grouped together and passed through to the reduce stage.
  • The Reduce stage concepts an intermediate key and a set of values for that key. It merges together these values to form a possibly smaller set of values.

A Real World Example:

Now that   we understood concept of parallelism, MapReduce and three phases of MapReduce.
Let’s takes a real world use case where problem is solved using MapReduce programming model.
Problem: In election we need to find out which election party got how many votes in every state.
Solution using MapReduce Parallel Programming:
State wise votes will depend on city wise vote count and all cities under one state will together give the total votes from that country.
Note that we can calculate total count from one state without caring about votes coming from different cities of other states, we can use parallel algorithm (MapReduce).
Here if  you start doing sequentially instead of doing parallel, you need to start with empty list of states and then iterate through the vast list of cities and for each  city, look at the state, and then update(add) state  vote count.
You can think from performance point of view also, how bad it would be. Luckily we can use parallel programming here and distribute data set per state wise and then work on each states parallel to calculate total votes coming in from different cities under one state.
Let’s do it using MapReduce.
Consider few states like State-A, State-B, State-C ...... and several cities City-A1, City-A2, City-A3 etc... City-B1, City-B2, City-B3…. And so on….

MapReduce Programming Concept

Each city will have two attributes: the state it belongs to, vote count.
Now let’s consider particular state State-A.
As an input to Map phase we will be having several key value pairs:
 <City, [State, Vote Count]>
For example
Key:City-A1, Value :[ State-A,10]
Key: City-A2  Value:[ State-A,40]
Key: City-A3 Value:[ State-A,50]…….
Since, we are only interested in state wise vote count. Here state name will become the key so we will have
Key: State-A, Value :10
Key: State-A, Value:40
Key: State-A, Value:50…….

MapReduce Programming Concept

Now Shuffle phase will take these as input and group all values for single key.
< State-A,[10,40,50]>…….
Reduce function will do the logic on the data received from shuffle (intermediate) step and returns total vote count from state A. The output will be:
< State-A,100>
We got the result (total votes/ per state ) by executing MapReduce task in parallel.

What We Learnt So Far:

In this article we understood
  • The  difference between  two  term concurrency and parallelism
  • How parallel programming works in MapReduce concept
  • A real world example where MapReduce approach solves the problem much faster and easier way.

What Next:

MapReduce is NOT ONLY a programming model but also provides execution Engine to run MapReduce task and handles fault tolerance.
Read the next article to know about this
MapReduce, A Complete Execution Engine not just a Programming Model



Support us by sharing this article.



Explore More
Close X
Close X

One thought on “MapReduce – A Parallel Programming Concept

  1. Srinivas Reddy says:

    Very Good Article .. It helps me a lot.. Thank you 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *

Current day month ye@r *

 © 2017 : saphanatutorial.com, All rights reserved.  Privacy Policy