MapReduce in a Bottle
You probably have heard about MapReduce, the method to highly parallelize certain algorithms so that they can be run on large processing farms. Several more-or-less good introductions of how this works exist, but most of them suffer from lots of technological noise, distracting from the core concepts.
Here - mostly for my own understanding - I try to recapture what MapReduce does, and I try to be as minimalistic as possible, even throwing away first MapReduce's biggest feature: parallelism.
Map and Reduce
If you are one of the nerdy Python hackers then you know map and reduce already: map iterates over a list and generates something from each list item. The result is again a list. And reduce takes a whole list and somehow combines the list into one value. The beardy Perl hackers (ab)use grep to get the functionality of reduce, or they resort to a separate package List::Util.
MapReduce Operator
In contrast, mapreduce is an operation which transforms a hash (dictionary) into some other hash, and amazingly enough this is something which a number of problems are all about.
Let us assume that we wanted to count the words in a text. So the plan is to start out with
'1' => 'this is something',
'2' => 'this is something else',
'3' => 'something else completely',
};
and arrive at
'completely' => 1,
'else' => 2,
'is' => 2,
'this' => 2,
'something' => 3
}
If you wanted to automate this with mapreduce, then it could look like this:
The dots (...) should indicate that we need the MapReduce algorithm to tell the details how it should behave in certain situations.
MapReduce Algorithm
So what is the overall algorithm? Obviously it receives the incoming hash as parameter (h1 below):
$map and $reduce are the user-provided functions which will be called in corresponding phases. In the map phase ...
while (my ($k, $v) = each %h1) {
my %h2 = &$map ($k => $v);
map { push @{ $h3{$_} }, $h2{$_} } keys %h2;
}
... all key/value pairs of the original hash (h1) are visited. For each such pair a user-provided map function will create a new (usually small) hash (h2). That will be embedded into a larger hash (h3). The keys of that hash are completely independent from the original hash. The values might be single values, or more generally a list. This is why I use a push above.
In the reduce phase ...
... h3 is revisited, each key/value is sent to reduce. That will mostly ignore the key, but will concentrate to produce a single value from the list behind the key. That key together with the value then make it into h4. All done, this will be the end result.
That Really Works?
Back to our word counting problem: The map function we have to provide for mapreduce will get a key/value such as
and is supposed to produce a new hash:
While we ignore the key (it is only the boring line number), we split the string and for every word we create a hash with the word as key and 1 as value:
'this' => 1,
'something' => 1
That hash will be embedded in h3. If we now manage to look at one key in h3 and sum up all occurrences in [1, 1, 1, ...], then we arrive at the total count of each word:
my ($k, $v) = (shift, shift);
my $sum = 0;
map { $sum += $_ } @$v;
return $sum;
};
For our all confusion I have made egregious use of the Perl map function. But otherwise the inner workings of $mapper and $reducer are irrelevant.
Putting It Together
In Perl/Python/... it is easy to pass around functions ...
my $producer = sub { ... };
my $A = { ... };
my $B = mapreduce ($A, $mapper, $reducer);
... or even leave these anonymous ...
my $B = mapreduce ($A, sub { ... }, sub { ... });
... if you are a variable-hater like me.
Going Parallel
There must be gain from all this pain, and that is that the whole setup can be heavily parallelized.
Obviously the first parallelization can be derived from mapreduce doing something in two phases, as they could be run in parallel as pipeline.
But each map and each reduce invocation is independent from any other, implying that all these can be done in parallel, possible farmed out to hundreds if not thousands of nodes. That is what Google alledgedly does.
And this is what I will look into next, even though my farm is somewhat smaller.
AsTMa= Snippet
homepage: http://labs.google.com/papers/mapreduce.html
paper: http://labs.google.com/papers/mapreduce-osdi04.pdf
wikipedia: http://en.wikipedia.org/wiki/MapReduce
is-implemented-by hadoop,
which isa software
written-in java
homepage: http://hadoop.apache.org/core/ ,
is-implemented-by perl-mapreduce-IWOODHEAD
which isa software
written in perl
download: http://backpan.perl.org/authors/id/I/IW/IWOODHEAD/MapReduce-0.03.tar.gz

Post new comment