Ever wonder how you could develop large scale high volume data intensive computing the likes of a search engine? Never gave it much thought have you considering that it is maybe something way out of reach. Or maybe you are wondering that you won’t get the kind of computing power. With the internet growing as large as ever how do you think that Google or any other search engine manages to index and process individual web pages? Do you think that a lot of computing power is going to cut it?
So how does one manage to develop an ecosystem so large that can work on a large amount of data, which again is unstructured to carry out meaningful operation on them? The answer is what we do with any problem: bring it down to size. The answer to such huge requirements in computing power is to break the system down to smaller parts that can be worked on with a number of smaller easily available computing powers. This is distributed processing of applications. Today we are going to talk about one such distributed application development framework: Hadoop.
Hadoop is a library or framework that allows distributed processing of large data sets across vast clusters of computers. It works up from a single server to thousands of smaller computers that have local computation and storage. What is different with this is that instead of waiting for availability, the framework can itself detect failures and itself handle them at the application layer. This means that applications developed with this framework and running in its runtime are independent of a single machine’s computing capabilities. You can run applications across a number of nodes with data involving thousands of Terabytes.
Hadoop was created by Doug Cutting and Michael J. Cafarella. Doug, who was working at Yahoo at the time, named it after his son’s toy elephant. It was originally developed to support distribution for the Nutch search engine project. Hadoop was inspired by Google’s MapReduce, a software framework in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster.
We are going to start a series dedicated to Hadoop so that you can get a better idea of this great application development framework. So stay tuned till next time.