The Story of Hadoop: Why Should I Care?

by Xavier

You might have heard or seen the term Big Data. The term refers to data sets that are too large or complex to be dealt with through traditional processing applications.

In fact, the information within these data sets is so enormous that it can’t be stored or processed on one server. Instead, retrieving the data might take calls to several machines. Even then, processing time can still be incredibly slow.

Distributed Computing

This is where Hadoop comes in. First developed in 2005 by a pair of software engineers, the Apache platform uses a distributed model to store large data sets across clusters of computers. In turn, the machines in a cluster work together to run programs and to recover when one of them fails.
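To make the idea of a distributed model a little more concrete, here is a minimal sketch in Python. It is not Hadoop’s code or its real placement policy. The function names (`place_blocks`, `surviving_blocks`), the node names, and the round-robin rule are all assumptions made up for illustration. It only shows the general principle: cut the data into blocks, keep several copies of each block on different machines, and the data survives when one machine fails.

```python
def place_blocks(data, block_size, nodes, replicas=3):
    """Split data into fixed-size blocks and assign each block to several nodes.

    Hypothetical helper for illustration only; real systems use far more
    sophisticated placement rules.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = {}
    for index in range(len(blocks)):
        # Round-robin: block i is copied to nodes i, i+1, ... (wrapping around).
        placement[index] = [nodes[(index + r) % len(nodes)] for r in range(replicas)]
    return blocks, placement


def surviving_blocks(placement, failed_node):
    """Return the indexes of blocks that still have a copy after one node fails."""
    return [
        index
        for index, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]


if __name__ == "__main__":
    blocks, placement = place_blocks(b"abcdefghij", 4, ["n1", "n2", "n3", "n4"], replicas=2)
    print(placement)
    print(surviving_blocks(placement, "n2"))
```

Because every block lives on two different nodes in this example, losing any single node still leaves at least one copy of each block.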

So, how did we get to this point in the world of digital information? Did it appear without notice, or did the concept of large data sets gradually form?

Let’s get into some history on the creation of Big Data and its connections with Hadoop.

Beyond The Information Age

The concept of Big Data is older than the Information Age. Individuals and groups have dealt with large amounts of information for centuries.

For instance, John Graunt had to deal with large volumes of information when he studied death records during the bubonic plague outbreaks of the 17th century. When he compiled the data into logical groups, he created a set of statistics. Graunt eventually became known as the father of demography.

Problems with large volumes of data kept arising after this, and so did solutions. In 1881, Herman Hollerith began designing a tabulating machine that read punch cards, prompted by the years it had taken to count the 1880 Census by hand. His machine went on to tabulate the 1890 Census. In 1927, Fritz Pfleumer invented a method for storing data on a strip of magnetic tape.

As more data was collected, the means to store and sort it changed. There wasn’t any choice as the information became increasingly complicated. For example, NASA and other space agencies had to perform enormous numbers of calculations to launch successful programs.

Move Into Popular Culture

However, none of this matched the volume of data that accumulated once computers became available to the public. It reached enormous sizes when those users discovered the internet. Add smart devices, artificial intelligence, and the Internet of Things (IoT), and “Big” has grown many times over.

Consider what falls under this label. Social media is a large piece of it. Credit card companies and other groups that handle Personally Identifiable Information (PII) also produce large amounts of information. Banks and other financial firms generate well into the trillions of bytes of data in a single hour.

The Official Term

It wasn’t until 2005 that this phenomenon was given the name we know today. The term was coined that year by Roger Magoulas, a director of market research at O’Reilly Media. At that time, he used it to describe a set of information that was nearly impossible to process with traditional business tools. Those tools include Relational Database Management Systems (RDBMS) like Oracle.

What could a business or government entity do at that point? Even without the flood of information from mobile devices, there was still a large volume of data to compile and analyze. This is where two software designers, Doug Cutting and Mike Cafarella, came into play.

Computer Clusters And Large Data

In 2002, these engineers started work on the Apache Nutch project. Their goal was to build a new search engine that could quickly index one billion pages of information. After extensive research, they determined that running Nutch at that scale would be too expensive. So, the developers went back to the drawing board.

Over the next two years, the team studied potential solutions. They found two technical papers from Google that helped. One was on the Google File System (GFS) and the other was on MapReduce. Together they described ways to store large data sets across many machines and to process and index them in parallel without slowdowns.
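If MapReduce sounds abstract, a tiny example helps. The sketch below counts words using the same three steps the MapReduce idea is built on: map, group by key, and reduce. It runs on a single machine and is only a toy model, not Hadoop’s API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are assumptions chosen for this illustration. In a real cluster, the map and reduce steps would run on many machines at once.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group step: collect all emitted values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps in sequence on a list of documents (single machine)."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

The appeal of this design is that each document can be mapped independently and each word can be reduced independently, so the work divides naturally across a cluster.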

This is when Cutting and Cafarella decided to apply these two ideas and create an open source product that would help everyone index such large amounts of data. In 2005, they created the first version of the product, then realized it needed to run on computer clusters to work properly. A year later, Cutting joined Yahoo and brought the Nutch project with him.

It’s here he got to work. Cutting split the distributed computing parts out of Nutch to create the framework for Hadoop. He took the name from a toy elephant his son owned.

Building on the ideas in GFS and MapReduce, Cutting designed the open source platform to run on thousands of computer nodes. In 2007, it was successfully tested on 1,000 nodes. In 2011, the software was able to sort a petabyte of data, equal to 1,000 terabytes, in 17 hours. The product became available to everyone that same year.
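To get a feel for what that sort means, a quick back-of-the-envelope calculation helps. The snippet below uses only the two figures given above (one petabyte, 17 hours) and assumes decimal units, where a petabyte is 1000^5 bytes. The function name `sort_throughput_gb_per_s` is made up for this illustration.

```python
PETABYTE_BYTES = 1000 ** 5  # decimal petabyte, i.e. 1,000 terabytes
TERABYTE_BYTES = 1000 ** 4
GIGABYTE_BYTES = 1000 ** 3


def sort_throughput_gb_per_s(total_bytes=PETABYTE_BYTES, hours=17):
    """Average data rate, in gigabytes per second, for sorting total_bytes in the given hours."""
    seconds = hours * 3600
    return total_bytes / seconds / GIGABYTE_BYTES


if __name__ == "__main__":
    print(PETABYTE_BYTES // TERABYTE_BYTES)       # terabytes in a petabyte
    print(round(sort_throughput_gb_per_s(), 1))   # average GB per second
```

That works out to roughly 16 gigabytes sorted every second, averaged across the whole cluster.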

Of course, this is not the end of the story when it comes to solutions for indexing large data sets. Technology continues to change, especially when outside influences send more of us to our computers. There will come a time when something more powerful than multiple storage nodes is required.

Until then, we thank those who have already gone through the steps to help all of us retrieve large amounts of data in the quickest and most efficient way possible.