SapHanaTutorial.Com

Data Loading in Hadoop

Hadoop can process huge volumes of data, but analysis is only half the job: getting the data into Hadoop is the other half. Loading data into the Hadoop cluster is the first step in any Big Data deployment and analysis.
In this article we look at the two most widely used data loading tools for Hadoop:
  1. Apache Flume
  2. Apache Sqoop

We will focus on the functional aspects of Flume and Sqoop; the next article will walk through practical examples.

Data Loading into Hadoop- Apache Flume and Sqoop

Why do we need external data loading tools?

In today's world, data is generated at an exponential rate: think of the data produced by stock markets, Facebook, Twitter, web logs, GPS tracking, or e-commerce sites such as eBay.
A one-time load can be done with a plain HDFS command, but handling continuously generated data that way quickly becomes difficult.
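For a one-off load, the standard HDFS shell commands are enough. A minimal sketch, assuming a running cluster (the paths and file names here are just examples):

```sh
hdfs dfs -mkdir -p /data/logs                        # create a target directory in HDFS
hdfs dfs -put /var/log/app/access.log /data/logs/    # copy a local file into the cluster
hdfs dfs -ls /data/logs                              # verify the upload
```

This works fine for static files, but offers no buffering, retries, or incremental pickup of new data, which is exactly the gap tools like Flume and Sqoop fill.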

You can always write your own scripts to load data into Hadoop, but that approach is inefficient and time-consuming. Loading bulk data into Hadoop from external systems, or accessing it from MapReduce applications running on large clusters, is a challenging task in its own right.
Many details need care, such as ensuring the consistency of the data and limiting the consumption of the external system's resources.

To solve this problem, Apache provides different tools for loading different types of data into Hadoop.
Two of the most widely used are Flume and Sqoop. Both are meant for data movement, but they differ in what kind of data they transfer into the Hadoop cluster.


Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS.

Workflow of Flume and its various components
  1. Client
A flow in Flume starts at the Client. The Client transmits events to a Source operating within a Flume Agent. The Client can be a user or any tool that produces data and submits it to a Flume Source.

  2. Event
The unit of data in Flume is called an event, and events flow through the various components of Flume to reach their destination. An event can be anything from a newline-terminated string on stdout to a single log entry from a logs database; it depends on which sources the agent is configured to use.

  1. Sources
Source is the entity through which data enters into Flume A source is responsible to listen and consume events (data) coming from Client (e.g. App servers, Logs databases) and forwards them to one or more channels. Sources examples could be Apache log4j (enable Java applications to write events to files in HDFS via Flume).

  4. Channels
Channels are the means by which Flume agents transfer events from their sources to their sinks. You can think of a channel as a holding area that stores events sent by a source.

Example channel types are memory, jdbc, file, and custom implementations.

Channels fall into two broad types: in-memory queues and durable disk-backed queues. In-memory channels provide high throughput but no recovery if an agent fails. File- or database-backed channels, on the other hand, are durable: they support full recovery and event replay in the case of agent failure.

  5. Sinks
A Sink removes events from the channel and writes them to an external system. It can also forward them to the Source of the next agent in the flow when more than one agent is configured.
Like sources, sinks correspond to a type of output: writes to HDFS or HBase, remote procedure calls to other agents, or any number of other external repositories.
  6. Agent
An Agent is a physical Java virtual machine running Flume; it hosts the sources, sinks, and channels that make up a flow.
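Putting these components together: a Flume agent is configured with a plain properties file. A minimal sketch, assuming a netcat source, an in-memory channel, and an HDFS sink (the agent name, ports, and paths here are illustrative):

```properties
# Agent "a1": one source (r1), one channel (c1), one sink (k1)
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: listen for newline-terminated events on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: in-memory queue (high throughput, but not durable)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel = c1
```

The agent would then be started with something like `flume-ng agent -n a1 -f flume.conf`. Note that switching `c1` from a memory channel to a file channel is a small config change that trades throughput for durability.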


Advanced Components of Flume:

Flume also has some advanced components: Interceptors, Channel Selectors, and Sink Processors.

  1. Interceptors
Interceptors let you inspect and modify events as they pass between a source and a channel; the developer is free to transform, filter, or drop events as required.
  2. Channel Selectors
A channel selector decides which channel an event should go to, since a source may have several channels configured.
  3. Sink Processors
A sink processor is the mechanism by which you group sinks for failover or load balancing, for example spreading events across several sinks when a high volume of events is being generated, or failing over to a backup sink when one fails.
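As a sketch of how a sink processor is configured (property names follow Flume's sink-group convention; the sinks k1 and k2 are assumed to be defined elsewhere in the same file):

```properties
# Failover sink processor: k1 is preferred, k2 takes over when k1 fails
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5

# For load balancing across the sinks instead, use:
# a1.sinkgroups.g1.processor.type = load_balance
```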

Sqoop (SQL to Hadoop):

Sqoop is a connectivity tool that transfers data between structured data stores, such as relational databases (MySQL, Oracle, Teradata) and data warehouses, and Hadoop data stores such as HDFS, Hive, and HBase.
  1. Sqoop is a command-line tool.
  2. Sqoop allows easy import and export of data from structured data stores.
  3. You can import either individual tables or entire databases into HDFS.
  4. Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.
  5. Sqoop internally generates MapReduce code to transfer the data.


Sqoop Import and Export
Sqoop Import brings data into Hadoop; Sqoop Export takes data from Hadoop and puts it back into an external system. Sqoop manages both of these processes through its import and export functions.
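As a sketch, the two directions look like this on the command line (the JDBC URL, credentials, table names, and HDFS paths are examples):

```sh
# Import: copy the "orders" table from MySQL into HDFS, using 4 mappers
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4

# Export: push HDFS data back into a relational table
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl -P \
  --table orders_summary \
  --export-dir /data/orders_summary
```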

Sqoop Connectors
Sqoop uses a connector-based architecture that supports plugins providing connectivity to new external systems. Data transfer between Sqoop and an external storage system is made possible by Sqoop's connectors.

Why Sqoop Connectors?
Every DBMS is designed with the SQL standard in mind, but each differs somewhat in its dialect, and this difference poses challenges when transferring data across systems. Sqoop Connectors are the components that help overcome these challenges.
Sqoop has connectors for a range of popular relational databases, including MySQL, PostgreSQL, Oracle, SQL Server, and DB2. Each of these connectors knows how to interact with its associated DBMS. There is also a generic JDBC connector for any database that supports Java's JDBC protocol.
Workflow of Sqoop

Behind the scenes, the dataset being transferred is sliced into partitions, and a map-only job is launched in which each mapper is responsible for transferring one slice. Each record is handled in a type-safe manner, because Sqoop uses the database metadata to infer the data types.
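The partitioning step can be illustrated with a small sketch. This is not Sqoop's actual code, just a toy version of how the value range of a numeric split column (Sqoop's `--split-by` column) might be divided among mappers:

```python
# Toy sketch: divide the key range [lo, hi) into num_mappers slices,
# the way Sqoop assigns one slice of a table to each mapper.
def split_range(lo, hi, num_mappers):
    """Return half-open (start, end) slices covering [lo, hi)."""
    base, extra = divmod(hi - lo, num_mappers)
    slices, start = [], lo
    for i in range(num_mappers):
        end = start + base + (1 if i < extra else 0)
        slices.append((start, end))
        start = end
    return slices

# Each "mapper" would then run a query like:
#   SELECT ... FROM orders WHERE id >= start AND id < end
print(split_range(0, 10, 4))  # -> [(0, 3), (3, 6), (6, 8), (8, 10)]
```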



5 thoughts on “Data Loading into Hadoop - Apache Flume and Sqoop”

  1. seema says:

    Very good explanation! It really helped me understand the data loading technologies in Hadoop.

    Thanks for providing such a great document.

  2. Vijay says:

    Thanks admin for providing such good material for Hadoop. Each and every topic explained here is easy to understand.

  3. Ankur says:

    What are the commands for loading data?

  4. Aswinimohan says:

    Please post the commands for loading the data, with examples.

  5. kavya says:

    Hi, the article is nice. Sqoop and Flume are both data loading tools in Big Data Hadoop; the main difference is that Sqoop handles batch data while Apache Flume handles real-time data, for example Twitter data.


© 2017 SapHanaTutorial.Com. All rights reserved.