MapReduce is a programming model for processing large amounts of data. It works best when you have a relatively simple program, but data is spread across thousands of servers. MapReduce was invented and popularized by Google. I’ll talk about MapReduce in general and Hadoop in particular.
How MapReduce works?
Imagine you have a lot of data, way more than fits on disk. As a matter of fact, MapReduce works best with distributed file systems like HDFS. MapReduce requires you to write two pieces of code: map and reduce. Duh! First, your input data is split into records. These can be database records or, more commonly, lines from a large file. In the map phase, input records are transformed. For each input, you can produce as many key-value pairs as you want. In the reduce phase, pairs with the same key are collected and combined. The implementation of the reduce phase is also up to you.
Hello world example: word count
The “Hello, world” example of MapReduce is counting distinct words in a large file. So large it spans across hundreds of disks. In the map phase, for each line, we produce one pair for each distinct word found on that line. Key is the word and value is simply number one. One occurrence of a given word. Then the MapReduce framework shuffles the data around so that pairs for the same key are transferred to the same node. At this point the reduce phase kicks in. Its responsibility is to merge, summarize pairs for the same key. In our case it’s simple: sum up values for each key individually.
The map and reduce phases are trivial to implement. The most complex part is provided by the framework, like Hadoop. It’s shuffling. Shuffling sorts and moves data around the cluster. This allows grouping, summarizing and sorting input. Shuffling can also be the most time consuming and network intensive operation.
That’s one of the reasons of MapReduce’s success.
Writing map and reduce code is fairly straightforward.
The heavy-lifting of moving data around is done for us.
Also MapReduce implementation monitors for failures.
If either step fails due to node failure, it is repeated somewhere else.
Especially the map phase is trivial to parallelize, so MapReduce jobs scale really well.
If all you need is to perform a single, independent action on each record, MapReduce works like a charm.
For example, a bank sending personalized PDF statements to tens of millions of customers.
You could write a
for loop for that.
Yet, MapReduce adds almost unlimited parallelism and seamlessly handles failures.
Last, but not least, frameworks like Hadoop try really hard to run jobs locally.
What does it mean?
Data is typically distributed on hundreds, if not thousands of disks.
Hadoop tries to run mappers on the same machine as input data.
Or at least on the
same datacenter rack.
Also, one can implement an additional so-called
This piece of code is similar to the reducer, but runs locally.
Thanks to this, shuffling transfers preaggregated data.
And we save a lot of bandwidth.
So if MapReduce is so awesome, why is Google no longer actively using it?
Turns out this simplistic approach of two steps, map and reduce is… too simplistic.
for loop, counting words, simple aggregations - these are straightforward.
But trying to implement something more sophisticated becomes cumbersome.
Often we must pipeline multiple MapReduce jobs.
The output of the first one becomes the input of the second one.
And so on.
The programming model is simply too low-level.
For example, Apache Hive supports SQL-like queries over big data.
Underneath it produces Hadoop jobs on the fly for you.
These days more elegant and expressive frameworks are popular.
Most commonly: Spark.
- MapReduce on Wikipedia
- How Hadoop Works Internally – Inside Hadoop
- Hadoop Combiner – Best Explanation to MapReduce Combiner
- Apache Hive