Hadoop Real World Example - Using HDFS, HCatalog, Hive and Excel

In previous articles, we learned what Hadoop is, explored the Hadoop ecosystem, and walked through Hadoop Hello World programs using HDFS, Hive and Pig.

In this article, we will show how to use Hadoop to solve a real-life problem.

Prerequisite

  • Hortonworks ODBC driver (64-bit) installed and configured
  • Microsoft Excel

Use case: Click Stream Data and Hadoop

One of the original uses of Hadoop at Yahoo was to store and analyze its massive volume of "Click Stream Data".

Many industries use Hadoop and the Hortonworks Data Platform to analyze click stream data and make better decisions, such as:
  • What is the most efficient path for a site visitor to research a product, and then buy it?
  • What products do visitors tend to buy together, and what are they most likely to buy in the future?
  • Where should I spend resources on fixing or enhancing the user experience on my website?
Let's take a single use case: a company wants to analyze its click stream data in order to reduce its bounce rate and improve its conversion rate.

Click Stream Data:

Click stream data is the data about a customer's clicking behaviour on a web page, for example the user's IP address and the date and time stamp of each click. This data is typically stored in the form of semi-structured web logs.
As the web becomes a main channel for reaching customers and prospects, the click stream data generated by websites has become very important.

Bounce Rate:

Bounce Rate is the percentage of visitors to a website who navigate away from the site after viewing only one page.
"A rising bounce rate is a sure sign that your homepage is boring or outdated".

Conversion Rate:

The conversion rate is the percentage of users who take a desired action. The archetypal example is the percentage of website visitors who buy something on the site.
Example: An e-commerce site is visited by 100,000 people during the month of April. During that month, 2,000 users purchased something from the site. Thus, the site's conversion rate is 2,000/100,000 = 2%.
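The arithmetic above is simple enough to capture in a small helper. This is just an illustration of the definition; the function name is ours, not part of any Hadoop or Excel API:

```python
def conversion_rate(visitors, conversions):
    """Return the conversion rate as a percentage."""
    return 100.0 * conversions / visitors

# The April example from the text: 2,000 buyers out of 100,000 visitors
print(conversion_rate(100_000, 2_000))  # 2.0
```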

Load The Data into Hadoop HDFS

First, we need to load the raw web log data, together with customer and product information, into Hadoop HDFS. The data comes in multiple formats from several sources:
Omniture logs: website log files containing information such as URL, timestamp, IP address, geo-coded IP address, and user ID (SWID).
User logs: CRM user data listing SWIDs (Software User IDs) along with date of birth and gender.
Product logs: CMS data that maps product categories to website URLs.
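To get a feel for what the raw Omniture records look like, here is a minimal sketch of pulling the interesting fields out of one tab-separated log line. The column positions mirror the ones selected in the Hive view later in this article (col_2 = timestamp, col_8 = IP, col_13 = URL, col_14 = SWID); the sample line itself is invented for illustration:

```python
def parse_omniture_line(line):
    """Extract the fields of interest from one raw Omniture TSV record."""
    cols = line.rstrip("\n").split("\t")
    # Hive's col_N naming is 1-based, so subtract 1 for Python's 0-based lists
    return {
        "ts":   cols[1],    # col_2: timestamp
        "ip":   cols[7],    # col_8: IP address
        "url":  cols[12],   # col_13: URL
        "swid": cols[13],   # col_14: user ID (SWID)
    }

# Dummy record with 59 numbered columns, so each value names its own column
sample = "\t".join(str(i) for i in range(1, 60))
record = parse_omniture_line(sample)
print(record["ts"], record["url"])  # 2 13
```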

Download the sample data from the links below.

Upload it into the HDFS File Browser. For detailed steps on how to load the data into HDFS, refer to Hadoop Hello World - Using HDFS, HCatalog and Hive.

After this you should be able to see all three zipped data files in the File Browser: Omniture.0.tsv.gz, users.tsv.gz and products.tsv.gz.



Register the Data with HCatalog

Register the data loaded into HDFS with HCatalog to make it available across all APIs and languages. For detailed steps on how to register the data, refer to Hadoop Hello World - Using HDFS, HCatalog and Hive.
Click Create table from file and select our very first file, "Omniture.0.tsv.gz" [remember to uncheck the Read Column Headers checkbox].



View the Data in HCatalog

Observe the Omniture table by clicking the Browse Data option; you will find user ID (SWID), IP address, timestamp and URL information.


Next, click the Users table to see the SWID, birth date and gender columns.


Finally, in the Products table you can see the mapping of product categories to website URLs.


Refine the Click Stream Data with Hive

Open the Beeswax tool to write Hive scripts. The first script generates an omniture view containing a subset of the data in the Omniture log table.
Click Query Editor and paste the following script.

  CREATE VIEW OMNITURE AS
     SELECT COL_2 TS, COL_8 IP, COL_13 URL, COL_14 SWID,
            COL_50 CITY, COL_51 COUNTRY, COL_53 STATE
     FROM OMNITURELOG;



The omniture view will be created, and you can also see it under Tables.


You can browse the view's data, which will match the result of the query you executed.


Next, create a script that joins the omniture website log data to the CRM data (registered users) and the CMS data (products).
Click Query Editor and paste the script below.

  CREATE TABLE WEBLOGANALYTICS AS
     SELECT TO_DATE(O.TS) LOGDATE, O.URL, O.IP, O.CITY,
            UPPER(O.STATE) STATE, O.COUNTRY, P.CATEGORY,
            CAST(DATEDIFF(FROM_UNIXTIME(UNIX_TIMESTAMP()),
                 FROM_UNIXTIME(UNIX_TIMESTAMP(U.BIRTH_DT, 'dd-MMM-yy'))) / 365 AS INT) AGE,
            U.GENDER_CD GENDER
     FROM OMNITURE O
     INNER JOIN PRODUCTS P ON O.URL = P.URL
     LEFT OUTER JOIN USERS U ON O.SWID = CONCAT('{', U.SWID, '}');
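The trickiest part of this script is the AGE expression: it parses the 'dd-MMM-yy' birth date, takes the difference in days from today, divides by 365 and truncates to an integer. The same logic can be sketched in Python to check your expectations against the Hive output (the function name and sample date are ours, purely for illustration):

```python
from datetime import date, datetime

def age_from_birth_dt(birth_dt, today=None):
    """Mirror the Hive expression: days between today and a
    'dd-MMM-yy' birth date, divided by 365, truncated to int."""
    if today is None:
        today = date.today()
    # %d-%b-%y matches Hive's 'dd-MMM-yy' pattern, e.g. '15-Apr-90'
    born = datetime.strptime(birth_dt, "%d-%b-%y").date()
    return int((today - born).days / 365)

# Pin 'today' so the example is reproducible
print(age_from_birth_dt("15-Apr-90", today=date(2014, 4, 15)))  # 24
```

Note that dividing by 365 slightly over-counts ages across leap years, but it is close enough for the demographic bucketing used in this tutorial.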



Save the query as webloganalytics and execute it.
You can view the data generated by the script as described in the previous steps.

Access and Analyze the Hadoop Data with Excel

You can use Business Intelligence (BI) applications such as Microsoft Excel to access and analyze the data. In our case we are using Microsoft Excel Professional 2013 to analyze the data.

First, we will import the Hortonworks Sandbox data into Microsoft Excel.
In Windows, open a new Excel workbook, then select Data -> From Other Sources -> From Microsoft Query.


On the Choose Data Source pop-up, select the Hortonworks ODBC data source then click OK.


After the connection to the sandbox is established, the Query Wizard appears.
Select the webloganalytics table in the Available tables and columns box, then click the right arrow button to add the entire webloganalytics table to the query.
Click Next to continue, then click Finish on the Query Wizard Finish screen to retrieve the query data from the sandbox and import it into Excel.


On the Import Data dialog box, click OK to accept the default settings and import the data as a table. The imported query data appears in the Excel workbook.


Visualize the Website Click Stream Data Using Excel Power View

In this section we will identify the web pages with the highest bounce rates. Click Insert -> Power View to open a report. The Power View Fields area appears on the right side of the window, with the data table displayed on the left.


A page is considered to have a high bounce rate if it is often the last page a user visits before leaving the website. In the Power View Fields area, select url and make sure all other fields are unselected. Then move the pointer over url, click the down arrow, and select Add to Table as Count.
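The "last page before leaving" idea can also be prototyped outside Excel. Here is a minimal sketch under invented data (the tuples and field names are ours, not from the tutorial's dataset): group clicks per visitor, sort each visitor's clicks by time, and count how often each URL is the final page.

```python
from collections import Counter, defaultdict

def exit_page_counts(clicks):
    """clicks: iterable of (swid, ts, url) tuples.
    Returns how often each URL was a visitor's last page."""
    by_user = defaultdict(list)
    for swid, ts, url in clicks:
        by_user[swid].append((ts, url))
    exits = Counter()
    for visits in by_user.values():
        visits.sort()                 # order each user's clicks by time
        exits[visits[-1][1]] += 1     # final URL = candidate exit/bounce page
    return exits

clicks = [
    ("u1", 1, "/home"), ("u1", 2, "/shoes"),
    ("u2", 1, "/home"),
    ("u3", 1, "/shoes"), ("u3", 2, "/home"), ("u3", 3, "/checkout"),
]
print(exit_page_counts(clicks))  # each of /shoes, /home, /checkout counted once
```

Pages with a high exit count relative to their total visits are the ones worth redesigning first.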


The chart shows that we should focus on optimizing four of our website pages for the market segment of men between the ages of 22 and 30. Now we can redesign these four pages and test the new designs based on our target demographic, thereby reducing the bounce rate and increasing customer retention and sales.

Have a question or doubt? Please post it in the comments.



Support us by sharing this article.



 © 2017 : saphanatutorial.com, All rights reserved.  Privacy Policy