Simply defining Hadoop is a difficult task. Since Hadoop doesn’t fit into common software categories such as operating systems, databases, or message queues, defining it usually requires long sentences. However, long sentences would prevent this introduction from achieving its purpose, which is to explain Hadoop and its MapReduce framework to those who know absolutely nothing about it. So, how do we explain something without describing what it is? By describing what it does and how it works.
What does Hadoop do?
Most developers go through their entire careers without creating a truly distributed application. They create applications that interact with distributed databases, distributed message brokers, and maybe even distributed application servers, but their applications are not distributed ones.
Hadoop gives developers the power to create distributed applications and execute distributed processes, and it does so by itself running as a completely distributed system.
Why is my application not a distributed one?
The same concept may have lots of different definitions. A distributed application, in this context, is one that takes an input or a request and executes distributed processes to generate an output or a response.
For example, a REST API application may run on a distributed application server and can be deployed across many physical servers (or many virtual machines). It can communicate with distributed databases and any other distributed system, but when a request arrives, the code that handles it executes in a single process on a single server.
Why do I need to run distributed processes?
Usually, developers don’t need to think about distributed processes. They rarely write code that handles more than a handful of messages, a few thousand records, or more than one file at a time, regardless of the number of rows in a table, the number of messages in a topic, or the number of files on the filesystem.
But what if there is a requirement to process an entire table, or all the files you have on a filesystem?
Distributed processes become the solution when the volume of data that has to be processed cannot be handled by a single process executing on a single thread running on a single server.
Are you talking about Big Data?
Maybe. Notice that most companies that use some sort of computer system to support their operations already accumulate huge volumes of information from those systems.
Regardless of their business segment, they may have millions of operations and transactions registered somewhere. However, they process these operations and transactions only at an individual level. Their systems cannot extract valuable insight from everything they have accumulated because they are not capable of processing all of that data.
How does Hadoop handle huge volumes of data?
In order to understand how Hadoop processes data, it is important to understand how it stores data. The answer is the Hadoop Distributed File System, or simply HDFS: a cluster of servers, each running HDFS services, that together store the data Hadoop will process. If the volume of data increases, the solution is to add more servers to the cluster. An HDFS cluster can literally have thousands of nodes.
The strategy Hadoop uses to let developers create processes over this huge volume of data is to distribute the code they provide and execute it across all nodes. Each node executes the code, processing only the data stored on that same node.
Notice how this is completely different from other programming models. On Hadoop, data is not transported to where the code will be executed. Instead, the code is transported to and executed where the data is stored.
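From a developer’s point of view, HDFS still looks like a single filesystem; Hadoop decides which nodes actually hold the data. As a rough illustration (a sketch using the standard HDFS Java client API, with made-up paths, not code from this article), copying a file into HDFS and listing a directory looks like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsTour {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // client for the configured HDFS cluster

            // Copy a local file into HDFS; the cluster splits it into blocks
            // and spreads the blocks (and their replicas) across the nodes.
            fs.copyFromLocalFile(new Path("/tmp/products.csv"), new Path("/data/products.csv"));

            // List what is stored under /data, regardless of which nodes hold the blocks.
            for (FileStatus status : fs.listStatus(new Path("/data"))) {
                System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
            }
            fs.close();
        }
    }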
How do I create distributed applications with Hadoop?
You can do so by using Hadoop MapReduce. This framework gives developers the ability to write code that will be executed on the HDFS nodes.
Basically, developers need to implement two classes when writing a MapReduce job. As the framework’s name suggests, there will be a mapper class and a reducer class.
The mapper will be executed on all HDFS nodes, no exceptions. There are more details involved, but the basic idea is that Hadoop will read HDFS, look for the files expected by the mapper, and invoke it, passing in the content of those files.
The mapper will then process, filter, and sort the data in the files and forward it to the reducer. The reducer can perform additional operations on the data received from the mapper, but the essential point is that the output generated by the reducer will be saved to new files on HDFS.
Reducers also execute on the nodes, but since the data generated by the map phase is considerably smaller than the original data stored on HDFS, only a few reducer instances may be required during the execution of a MapReduce job, and they will run on only some of the cluster’s nodes.
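To make this more concrete, here is a minimal sketch of the two classes for the classic word-count job used in the Apache MapReduce tutorial (a simplified example, not a program from this article): the mapper receives lines from files stored on HDFS and emits (word, 1) pairs, and the reducer receives all the values emitted for each word and writes the totals back to HDFS.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: invoked with one line of an input file at a time; emits (word, 1) pairs.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // forwarded to the reducers, grouped by word
                }
            }
        }
    }

    // Reducer: receives every count emitted for one word and writes the total to HDFS.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

A small driver class (not shown) would configure the job and point it at input and output directories on HDFS.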
Once the processed result is saved to HDFS, external applications can read it and provide users with the information they are looking for.
Can you give a real and practical example?
Yes. Let's take a look at a retail company with a couple of different systems managing information for about a million products. Each piece of information about each product (description, images, pricing, inventory, etc.) is managed by one of those systems. All these systems use some sort of relational database, but there is no system or database centralizing all the product information.
This company needs to summarize all the information about all of its products multiple times per day and send it to some partners as a text file.
The approach that most developers would choose is to create a single application that connects to all the databases, reads the records from the necessary tables, and saves them to a file. Notice that this application would need to group the information about each product before saving it all to the file.
The problem is that each database stores millions of records, and all of them are needed to create the output file. Reading these records one by one is not a viable solution, as it would take hours to complete, and the files need to be generated with up-to-date information and sent to the partners every 15 minutes. Keeping all the records in memory is not viable either, because the amount of memory required would be huge and likely more than any server could provide.
Therefore, this company used Hadoop to solve its issues. First, they ran processes that connected to the databases and copied information from them to HDFS. These processes were spread over hundreds of nodes in the Hadoop cluster, and each process read only certain records from the tables, not all of them. When these processes finished, thousands of files containing product information were saved and distributed across the HDFS nodes.
However, this information was not yet summarized. It was merely a copy of what was stored in the original databases. In order to create the final file expected by the partners, other processes were executed to read the files stored on HDFS, group the information about the products, and save everything to a single file. Again, these processes did not run on a single server; instead, they ran on the hundreds of servers in the Hadoop cluster.
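The company’s actual code is not part of this article, but a grouping job along these lines could look roughly like this sketch (the class names and the record layout, a product ID followed by a fragment of product data, are hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: each line copied from one of the source databases is assumed to look like
    // "productId;fragment-of-product-data"; the product ID becomes the key.
    public class ProductMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(";", 2);
            if (fields.length == 2) {
                context.write(new Text(fields[0]), new Text(fields[1]));
            }
        }
    }

    // Reducer: receives every fragment of information for one product, no matter which
    // system or node it came from, and writes a single summarized line per product.
    class ProductReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text productId, Iterable<Text> fragments, Context context)
                throws IOException, InterruptedException {
            StringBuilder summary = new StringBuilder();
            for (Text fragment : fragments) {
                summary.append(fragment.toString()).append('|');
            }
            context.write(productId, new Text(summary.toString()));
        }
    }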
By running in this distributed model, the files took only 5 minutes to generate instead of the expected 15, and they contained summarized, up-to-date information about millions of products.
Is MapReduce the only solution?
No, not anymore.
In Hadoop 1, MapReduce was not only a framework provided to developers for writing distributed applications, but also the underlying mechanism used to distribute and execute those applications across the Hadoop nodes.
In Hadoop 2, these two responsibilities were split. MapReduce still exists as a framework, but managing distributed applications is now a task handled by YARN.
YARN enabled other frameworks to run on Hadoop and extract information from HDFS. A very common alternative to MapReduce is Spark. Basically, with Spark you process data by creating a pipeline of operations. Data can be processed in many steps, and unlike MapReduce, you are not limited to only two processing phases. Another option is Tez, which is intended to replace MapReduce.
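To give a taste of that pipeline style, here is a minimal Spark sketch in Java (the paths and column names are made up, and this is not code from this article): each call below adds one step to the pipeline, and Spark runs the whole chain across the cluster when the final write is triggered.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ProductPipeline {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("ProductPipeline").getOrCreate();

            // Read CSV files from HDFS, drop rows without an ID, count rows per ID,
            // and write the result back to HDFS. Column _c0 is the first CSV column.
            Dataset<Row> products = spark.read().csv("hdfs:///data/products");
            products.filter(products.col("_c0").isNotNull())
                    .groupBy("_c0")
                    .count()
                    .write()
                    .csv("hdfs:///data/product-counts");

            spark.stop();
        }
    }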
There are plenty of other tools available as well, such as HBase, a database that runs on top of Hadoop; Sqoop, which makes it easier to transfer data between relational databases and Hadoop; and Pig, which offers a high-level language for writing programs that analyze data stored in Hadoop. These are just some of the options from a much longer list.
What should I do next?
If you want to learn more about Hadoop, here are my recommendations:
Try to get familiar with HDFS. The Hadoop documentation explains how to install and use it. Installation and configuration are good starting steps that will help you learn more about how HDFS works; however, if you wish to skip these steps and expedite the process, Cloudera and Hortonworks are two companies that provide virtual machines with everything already set up. Just choose one, download it, and start playing around with the HDFS commands.
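For example, a first session with the HDFS shell could look something like this (the paths are just examples):

    hdfs dfs -mkdir -p /user/sandbox/input           # create a directory on HDFS
    hdfs dfs -put products.csv /user/sandbox/input   # copy a local file into the cluster
    hdfs dfs -ls /user/sandbox/input                 # list the files stored there
    hdfs dfs -cat /user/sandbox/input/products.csv   # print the file content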
You can also implement your own MapReduce job. The tutorial provided by Apache is a good starting point, but also try to write your own mapper and reducer from scratch, using a scenario from your business where they could function as a solution.