Tuesday, August 13, 2013

Hadoop: An Introduction to the Idea!

Hadoop is a trending word/technology among techies on the internet nowadays. Many are predicting a big future for it. Many are blogging to advertise paid or free training on it. What exactly is it? In which programming domain does Hadoop lie? Is Hadoop a new era of storage? Is Hadoop a database? SQL or NoSQL? Is it cloud computing?
Many questions come to mind when we read a wiki page about Hadoop. To put it very simply, Hadoop is not a database, and it is more than just a storage methodology. It's an improved era of data warehousing and data processing for analytics.
Let me explain with an example.
Let’s think about an international retail store. Millions of customers purchase billions of products of their choice to fulfil their needs on a daily basis. All the billing, including the whole process from purchase to sale, is managed by a real-time centralized application, which generates a huge amount of logs by keeping the transaction details for each customer and the respective purchased products globally. Huge amount?
Yes, very huge!
MBs (megabytes)? GBs (gigabytes)?
Nope, more than that: TBs (terabytes), PBs (petabytes). Yes, tracking the history of every sale record worldwide will grow the log file up to TBs and PBs in size. This log file is the Big Data. Big Data?
Big Data is any collection of data (possibly complex) that can’t be processed by traditional (legacy) methods because of its size.
It’s really just not practical to process such a file to find a desired result or analytical calculation with those common, widely used methods. Analytical calculation?
- Calculating the consumption of a soft drink by flavor, or finding the best-selling toothpaste brand in Asia, means processing the whole sales log file. These are examples of analytical calculations.
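To make "analytical calculation" concrete, here is a minimal sketch of the toothpaste question on a tiny in-memory log. The record layout (region, category, brand, units) and the brand names are made up for illustration; a real sales log would look different and be far too large to hold in a list like this.

```python
from collections import Counter

# Hypothetical sales-log records: (region, category, brand, units_sold).
SALES_LOG = [
    ("Asia", "toothpaste", "BrandA", 120),
    ("Asia", "toothpaste", "BrandB", 300),
    ("Europe", "toothpaste", "BrandA", 500),
    ("Asia", "soft drink", "ColaX", 900),
    ("Asia", "toothpaste", "BrandA", 250),
]


def best_selling_brand(records, region, category):
    """Return (brand, total_units) for the top brand, or None if no match."""
    totals = Counter()
    for rec_region, rec_category, brand, units in records:
        # Every record has to be inspected: this is the "whole log" scan.
        if rec_region == region and rec_category == category:
            totals[brand] += units
    return totals.most_common(1)[0] if totals else None


print(best_selling_brand(SALES_LOG, "Asia", "toothpaste"))
```

The point to notice is the loop: answering one simple question still requires touching every record in the log.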
Processing terabyte- or petabyte-sized files for such an analytical calculation would take days, and meanwhile the selling trends would change. So the results produced would be either useless or wrong for estimating future sales. On top of that, processing files of this size on a single machine is practically impossible with the commonly used commodity hardware.
Commodity hardware: ordinary, off-the-shelf machines that are easily affordable, i.e. not expensive.
An alternative way to produce these results from terabyte- or petabyte-sized files is to divide the file into pieces (one large file into many small files), process those pieces one after another and combine the results. But on a single machine the total amount of work stays the same, so the result will still arrive only after it was needed.
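The split-and-combine approach can be sketched as follows. This is only an illustration: the "flavor,units" line format and the function names are assumptions of mine, and the chunks are processed one after another on a single machine, which is exactly why this alone does not save time.

```python
from collections import Counter

# Hypothetical log lines in the form "flavor,units_sold".
LOG_LINES = ["cola,3", "orange,2", "cola,5", "lemon,1", "orange,4"]


def split_into_chunks(lines, chunk_size):
    """Divide the big list of lines into smaller pieces."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def count_flavors(chunk):
    """Produce a partial result (units per flavor) for one piece."""
    counts = Counter()
    for line in chunk:
        flavor, units = line.split(",")
        counts[flavor] += int(units)
    return counts


def process_sequentially(lines, chunk_size):
    """Process the pieces one by one and combine the partial results."""
    total = Counter()
    for chunk in split_into_chunks(lines, chunk_size):
        total.update(count_flavors(chunk))  # Counter.update adds the counts
    return total


print(process_sequentially(LOG_LINES, chunk_size=2))
```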
So an idea was introduced to overcome the problem: parallel processing, here in the form of distributed processing across several machines.
“The file will still be divided into small chunks (pieces), but this time the pieces will be distributed among multiple systems (commodity hardware) and processed in parallel to produce a single, combined result file”.
As a result, parallel processing over the file pieces takes a lot of load off any single machine and can cut the processing time drastically. This is the concept behind Big Data processing using Hadoop.
That is, a group of computers works together on the same network. One computer works as the master to guide, drive and monitor the processing on the rest of the computers, which work as slaves and process the chunks of the file.
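The master/slave arrangement can be imitated on one machine. In the sketch below, worker threads stand in for the separate computers, which is an assumption made purely so the example runs anywhere; it is not how Hadoop itself is implemented, and the "brand,units" line format and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical log lines in the form "brand,units_sold".
LOG_LINES = ["BrandA,120", "BrandB,300", "BrandA,250", "BrandC,80", "BrandB,40"]


def worker(chunk):
    """Slave/worker role: total the units per brand within one chunk."""
    counts = Counter()
    for line in chunk:
        brand, units = line.split(",")
        counts[brand] += int(units)
    return counts


def master(lines, chunk_size, num_workers):
    """Master role: split the log, hand out chunks, merge partial results."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Each chunk is handed to whichever worker is free.
        for partial in pool.map(worker, chunks):
            total.update(partial)
    return total


print(master(LOG_LINES, chunk_size=2, num_workers=3))
```

The master never looks inside a chunk; it only divides the work and merges what comes back, which is the division of roles described above.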
Technically, Hadoop consists of two parts: 1. Storage (HDFS, the Hadoop Distributed File System). 2. Processing (MapReduce).
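To give a rough feel for the processing part, here is a toy imitation of the map, group-by-key and reduce steps in plain Python. It is a simplified sketch and not Hadoop's actual API (Hadoop jobs are typically written against its Java interfaces and run over data stored in HDFS); the function names and the "flavor,units" records are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical sales records in the form "flavor,units_sold".
RECORDS = ["cola,3", "orange,2", "cola,5", "lemon,1", "orange,4"]


def map_phase(record):
    """Map: turn one input record into a (key, value) pair."""
    flavor, units = record.split(",")
    return (flavor, int(units))


def shuffle(pairs):
    """Group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return (key, sum(values))


pairs = [map_phase(record) for record in RECORDS]
result = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(result)
```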
In this post, I have just tried to explain Hadoop (as an idea) in general words, without going into its technicalities and technical terms. In the next post I will introduce its core functioning and technical terms.