Trending word/technology over the internet among the techies
now days. Many are predicting about its big future. Many are blogging to give
its paid/free trainings. What exactly is it? In which programming domain Hadoop
lays? Is Hadoop a new era of storage? Is Hadoop a database? SQL/NoSQL? It that
a cloud computing?
Many questions come into mind when we read a wiki about
Hadoop. To be very simple over the Hadoop, It’s neither a database nor a
storage methodology. It’s an improved era of data ware housing and data
processing for analytics.
Let me come up with an example..
Let’s think about an international retail store. Millions of
customers purchasing billions of products of their choices and fulfilling their
needs on daily basis. All the billing including the process from purchase to
sales is managing by a real time centralize application which is generating huge
amount of logs by keeping
transaction details for each customer and respective purchased products
globally. Huge Amount?
Yes, Very Huge!
MBs (Megabytes)? GB s(Gigabytes)?
Nope, more than that. TBs (Terabytes) , PBs (Petabytes). Yes
tracking history for each sale record worldwide will reach the log file size up
to TBs and PBs. This log file is the Big
Data. Big Data?
Big Data, A or Any bunch of data (might be complex) on which data processing can’t be done by traditional
(legacy) methods cause if size, i.e. Big Data.
Really that’s just not possible to process the file to find
out a desired result or analytical calculation by just those common worldwide
methodologies. Analytical Calculation?
-To calculate the consumption of a soft drink on the
parameter of flavor. Or to calculate the best-selling toothpaste brand in Asia
will process the whole sales log file. These were the examples of analytical calculation.
To process Terabyte, Petabyte sized files for the analytical
calculation will take time of days meanwhile selling trends will get changed.
So the produced results will be either useless or wrong to estimate the future
sell. But to process this much file size on a single machine is just practically
impossible with mostly used commodity hardware.
Commodity Hardware: Home basis
configured, easily affordable or not expensive.
An alternative way to produce these results by using Terabyte,
Petabyte files is to divide the file in to pieces (large file into many small
files) and produce result by processing those pieces and combine the results.
But the time taken by this processing will produce the result just after it
required.
So, An idea was introduced to take over the problem i.e. Parallel Processing also known as Distributed Processing.
“The file will still divided into small chunks (pieces) but
this time piece(s) will be distributed among multiple systems (commodity
hardware) and processed parallel to produce a combined cumulative single file result”.
Resultant, Due to parallel processing over the file pieces
it saves lots of load (over a single machine) and reduces huge processing time
(hours to secs). This is the concept behind BigData Processing using Hadoop.
That is, a group of computers works under same network
together. One computer works as Master to guide/operate/drive/monitor
processing over rest of the computers who works as slave to process over chunks
of files.
Technically, Hadoop consist of two parts, 1. Storage
(HDFS). 2. Processing (Map-Reduce).
In this post, I
just tried to explain Hadoop (as idea)
in general technical words without going into its technicalities and technical
terms. In next post I will be introducing with its core functioning and
technical terms.
It helps me to understand Big Data and hadoop basic concepts. It cleared my many doubt as well. Great post. Curiously waiting for your next post.
ReplyDeleteThanks