Hadoop can process huge volumes of data, but analysis is only half the job; loading the data into Hadoop is the other half. Getting data into the Hadoop cluster is the first step in any Big Data deployment and analysis.
In this article we will talk about two of the most widely used data loading tools for Hadoop:
- Apache Flume
- Apache Sqoop
We will focus on the functional aspects of Flume and Sqoop; in the next article we will work through practical examples.
Why do we need external data loading tools?
In today's world, data is being generated at an exponential rate. Think of the data produced by the stock market, Facebook, Twitter, web logs, GPS tracking, and e-commerce sites like eBay.
A one-time load can be done with HDFS commands, but handling continuously generated data that way would be a difficult task.
You can always write scripts to load the data into Hadoop, but that approach is inefficient and time consuming. Loading bulk data into Hadoop from external systems, or accessing it from MapReduce applications running on large clusters, is also a challenging task.
We need to take care of many concerns, such as ensuring the consistency of the data and limiting the resources consumed on the external systems.
To solve these problems, Apache has come up with different tools for loading different types of data into Hadoop.
Two of the most widely used are Flume and Sqoop. Both are meant for data movement, but they differ in the kind of data they transfer into the Hadoop cluster.
Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS.
Workflow of Flume and its various components
A flow in Flume starts from the client. The client transmits events to a source operating within a Flume agent. A client can be a user or any tool that produces data and submits it to a Flume source.
A unit of data in Flume is called an event, and events flow through the various components of Flume to reach their destination. An event could be anything from a newline-terminated string on stdout to a single entry from a log database; it all depends on which sources the agent is configured to use.
A source is the entity through which data enters Flume. A source is responsible for listening for and consuming events (data) coming from clients (e.g. application servers, log databases) and forwarding them to one or more channels. One example is a source fed by Apache log4j, which enables Java applications to write events to files in HDFS via Flume.
Channels are the means by which Flume agents transfer events from their sources to their sinks, so you can think of a channel as a holding area that stores events sent from a source.
Examples of channel types include memory, JDBC, file, and other (custom) implementations.
Channels are typically of two kinds: in-memory queues and durable, disk-backed queues. In-memory channels provide high throughput but no recovery if an agent fails. File- or database-backed channels, on the other hand, are durable: they support full recovery and event replay in the case of agent failure.
The sink removes events from the channel and writes them to an external system. It can also forward them to the source of the next agent in the flow if more than one agent is configured.
Like sources, sinks correspond to a type of output: writes to HDFS or HBase, remote procedure calls to other agents, or any number of other external repositories.
An agent is any physical Java virtual machine running Flume; it hosts the sources, sinks, and channels.
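As a sketch, a minimal single-agent configuration wiring a source, a channel, and a sink together might look like the following properties file. The agent and component names (`agent1`, `src1`, `ch1`, `sink1`), the HDFS path, and the netcat port are all arbitrary choices for this illustration:

```properties
# Name the components of this agent (names are arbitrary)
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: listen for newline-terminated events on a local port
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444

# Channel: in-memory queue (high throughput, but not durable)
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# Sink: write events out to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events

# Wire source -> channel -> sink
agent1.sources.src1.channels = ch1
agent1.sinks.sink1.channel = ch1
```

Such a configuration would typically be started with `flume-ng agent --conf-file agent1.conf --name agent1`.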
Advanced Components of Flume:
There are also some advanced components: interceptors, channel selectors, and sink processors.
- Interceptor
Events can be inspected and modified through interceptors. Events can be filtered as they pass between a source and a channel, and the developer is free to modify or drop events as required.
- Channel selector
A channel selector decides which of the configured channels an event should go to, since a source may be wired to more than one channel.
- Sink processor
A sink processor is the mechanism by which you can create failover paths and load-balance events across multiple sinks, for example when a large volume of events is being generated or a sink is failing.
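As a sketch of a sink processor, the fragment below defines a failover group over two sinks (`sink1`, `sink2` and the group name `g1` are illustrative); the higher-priority sink is used until it fails, at which point events are routed to the other:

```properties
# Failover sink processor: sink1 is preferred, sink2 is the backup
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = sink1 sink2
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.sink1 = 10
agent1.sinkgroups.g1.processor.priority.sink2 = 5
agent1.sinkgroups.g1.processor.maxpenalty = 10000
```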
Sqoop (SQL to Hadoop):
Sqoop is a connectivity tool that transfers data between structured data stores, such as relational databases (MySQL, Oracle, Teradata) and data warehouses, and Hadoop: HDFS as well as other Hadoop data stores like Hive and HBase.
Sqoop Import and Export
- Sqoop is a command line tool
- Sqoop allows easy import and export of data from structured data stores.
- You can import either individual tables or entire databases into HDFS.
- Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.
- Sqoop internally generates MapReduce code to transfer the data.
Sqoop import is the process of bringing data into Hadoop; Sqoop export is the process of taking data from Hadoop and putting it back into the external system. Sqoop manages both of these processes through its import and export functions.
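As a sketch of the two directions, the commands below import a table into HDFS and export a result table back out. The connection string, credentials, table names, and paths are all placeholders for this illustration, and the commands assume a running Hadoop cluster with Sqoop installed:

```shell
# Import a single MySQL table into HDFS using 4 parallel mappers
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --username dbuser \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4

# Export the (possibly transformed) data back to a relational table
sqoop export \
  --connect jdbc:mysql://dbhost/shop \
  --username dbuser \
  --table orders_summary \
  --export-dir /data/orders_summary
```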
Sqoop uses a connector-based architecture that supports plugins providing connectivity to new external systems. Data transfer between Sqoop and an external storage system is made possible by Sqoop's connectors.
Why Sqoop Connectors?
Every DBMS is designed with the SQL standard in mind, but each differs to some extent in its dialect. This difference poses challenges when it comes to transferring data across systems, and Sqoop connectors are the components that help overcome them.
Sqoop has connectors for working with a range of popular relational databases, including MySQL, PostgreSQL, Oracle, SQL Server, and DB2. Each of these connectors knows how to interact with its associated DBMS. There is also a generic JDBC connector for connecting to any database that supports Java's JDBC protocol.
Workflow of Sqoop
What happens behind the scenes is that the dataset being transferred is sliced into partitions, and a map-only job is launched in which each mapper is responsible for transferring one slice. Each record is handled in a type-safe manner, since Sqoop uses the database metadata to infer the data types.
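This slicing can be illustrated with a small sketch (an illustration of the idea, not Sqoop's actual code): given the minimum and maximum values of the split column, the key range is divided into roughly equal contiguous sub-ranges, one per mapper.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive key range [lo, hi] into contiguous
    sub-ranges, one per mapper, mimicking how Sqoop partitions
    a table on its split column."""
    total = hi - lo + 1
    size = total // num_mappers
    remainder = total % num_mappers
    ranges = []
    start = lo
    for i in range(num_mappers):
        # Spread any remainder across the first few splits
        end = start + size - 1 + (1 if i < remainder else 0)
        ranges.append((start, end))
        start = end + 1
    return ranges

# Each mapper would then issue a bounded query such as:
#   SELECT * FROM orders WHERE id >= start AND id <= end
print(split_ranges(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Because the sub-ranges are disjoint and cover the whole key range, the mappers can run in parallel without transferring any row twice.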