In our previous articles, What is Hadoop?, MapReduce - The Heart of Hadoop, and Hadoop Cluster - Architecture, Core Components and Work-flow, we explained Hadoop, MapReduce and the Hadoop ecosystem.
In this article we will talk about two important high-level Hadoop programming languages, Hive and Pig, which sit on top of the MapReduce layer in the Hadoop ecosystem.
Why do we need Hive and Pig?
Once you have all your big data loaded into Hadoop, the next step is to work with the data: process it, analyze it and, once you have the output, make business decisions based on the analysis.
In Hadoop, you write MapReduce programs to do data analysis, but writing MapReduce programs requires Java skills and many lines of plain Java code.
Hive, which offers a SQL-like query language and is integrated into the Hadoop ecosystem, solves this problem to a large extent.
Pig Latin (also called Pig) and Hive sit one layer above the MapReduce layer in the Hadoop ecosystem and provide high-level languages for using the MapReduce library.
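To give a feel for what Pig and Hive hide from you, here is a minimal sketch of the map, shuffle and reduce steps behind a simple word count. It is plain Python running in a single process, not Hadoop code, and the function names and sample lines are made up for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word, as a mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts for each word, as a reducer would."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Chain the three phases together."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical sample input
sample_lines = ["pig and hive", "hive on hadoop", "pig on hadoop"]
counts = word_count(sample_lines)
print(counts)
```

Even this toy version needs three separate functions for one aggregation. In a high-level language, the same job is typically a handful of statements, and the tool generates the MapReduce jobs for you.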
What is Pig?
Pig, or Apache Pig, is a high-level platform for creating programs that run on Hadoop. The language for this platform is called Pig Latin, which is a scripting language.
Pig was developed at Yahoo to analyze large data sets and to reduce the time spent writing MapReduce programs.
Pig has two core components:
- A scripting language with SQL-like operations, called Pig Latin
- A runtime environment in which Pig Latin programs are executed
What are the differences between Pig and MapReduce? What are the advantages of Pig?
- Pig is feature complete, so you can do all the required data manipulation in Apache Hadoop with Pig.
- Pig internally creates a sequence of MapReduce jobs that are run by the Hadoop cluster.
- Writing a Pig script takes roughly 5% of the time needed to write the equivalent MapReduce program in Java, but run-time performance drops by about 50%, because hand-written Java MapReduce programs run natively.
- What if something goes wrong midway? Because Pig is a procedural language, you can easily debug each intermediate step.
- For scenarios with complex jobs, Pig would be the best language to work with because of its debugging capability.
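The step-by-step style described in the points above can be sketched in plain Python. This is not Pig Latin and does not run on Hadoop. It only mirrors the shape of a Pig script, where each statement produces a named intermediate result that you can inspect. The sample records and the 5-second threshold are invented for illustration, and the comments name the Pig Latin operator each step loosely corresponds to.

```python
from collections import defaultdict

# Hypothetical input records: (user, page, seconds_on_page)
raw_visits = [
    ("alice", "home", 12),
    ("bob", "home", 3),
    ("alice", "pricing", 40),
    ("carol", "pricing", 25),
    ("bob", "docs", 1),
]

# Step 1 (roughly LOAD): give the input a name.
visits = list(raw_visits)

# Step 2 (roughly FILTER): keep visits that lasted at least 5 seconds.
long_visits = [v for v in visits if v[2] >= 5]

# Step 3 (roughly GROUP ... BY): collect the durations for each page.
by_page = defaultdict(list)
for user, page, seconds in long_visits:
    by_page[page].append(seconds)

# Step 4 (roughly FOREACH ... GENERATE): compute one total per page.
total_seconds = {page: sum(secs) for page, secs in by_page.items()}

# Any intermediate result can be printed on its own (roughly DUMP),
# which is what makes this style easy to debug step by step.
print(long_visits)
print(total_seconds)
```

If the final totals look wrong, you can print `long_visits` or `by_page` to see which step went astray, instead of reasoning about one large query or a full mapper and reducer pair.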
What is Hive?
Hive, or Apache Hive, is a data warehouse system for querying and analyzing large datasets stored in Hadoop files.
Hive provides a SQL-like query language known as HiveQL. Through this SQL-like interface, users can write a limited set of SQL-style commands, which is why Hive is often referred to as Hive SQL.
Initially developed by Facebook, Apache Hive is now used and developed by many other companies.
- Hive provides a data warehouse view of the data loaded in HDFS.
- With Hive you can do ad hoc analysis and perform joins on data sets.
- Remember that Hive only deals with structured data.
- Hive also allows you to write custom mappers and reducers to extend HiveQL's capabilities.
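To illustrate the declarative, SQL-like style of working with structured data, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in. SQLite is not Hive, and HiveQL differs from SQLite's dialect in many details, so treat this only as an illustration of a join plus an aggregation written as a single query. The table names, columns and rows are invented.

```python
import sqlite3

# In-memory database standing in for a warehouse of structured tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, country TEXT);
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'IN'), (2, 'US'), (3, 'US');
    INSERT INTO orders VALUES
        (10, 1, 20.0), (11, 2, 35.0), (12, 3, 15.0), (13, 2, 5.0);
""")

# Ad hoc analysis: join the two data sets and total the revenue per country.
query = """
    SELECT u.country, SUM(o.amount) AS revenue
    FROM orders o
    JOIN users u ON o.user_id = u.user_id
    GROUP BY u.country
    ORDER BY u.country
"""
revenue_by_country = conn.execute(query).fetchall()
print(revenue_by_country)

conn.close()
```

The query says what result is wanted and leaves the execution plan to the engine. That is the part of the experience SQL users find familiar in Hive, where the engine turns such a query into jobs on the cluster.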
Which one shall I use - Hive or Pig? Difference between Hive and Pig?
From a technical point of view, both Pig and Hive are feature complete, so you can do tasks in either tool. However, you will find one tool or the other will be preferred by the different groups that have to use Apache Hadoop. The good part is they have a choice and both tools work together.
Hive, because of its SQL-like query language, is often used as the interface to an Apache Hadoop based data warehouse. Hive is considered friendlier and more familiar to users who are used to querying data with SQL.
Pig fits in through its data flow strengths, where it takes on the tasks of bringing data into Apache Hadoop and working with it to get it into the form needed for querying.
There is no simple way to compare Pig and Hive. However, the points below should give you a fair idea.