MapReduce in a Bottle

You have probably heard about MapReduce, the method of heavily parallelizing certain algorithms so that they can be run on large processing farms. Several more-or-less good introductions to how this works exist, but most of them suffer from a lot of technological noise which distracts from the core concepts.

Here - mostly for my own understanding - I try to recapture what MapReduce does, and I try to be as minimalistic as possible, even throwing away MapReduce's biggest feature first: parallelism.

Map and Reduce

If you are one of the nerdy Python hackers, then you know map and reduce already: map iterates over a list and generates something from each list item; the result is again a list. And reduce takes a whole list and somehow combines it into one value. The beardy Perl hackers (ab)use grep to get the functionality of reduce, or they resort to a separate package, List::Util.
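
In Perl that looks, for example, like this (a minimal sketch; reduce comes from the core module List::Util):

use List::Util qw(reduce);

my @words   = qw(this is something);
my @lengths = map    { length $_ } @words;      # list in, list out: (4, 2, 9)
my $total   = reduce { $a + $b  } @lengths;     # list in, one value out: 15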

MapReduce Operator

In contrast, mapreduce is an operation which transforms one hash (dictionary) into another hash, and amazingly enough this is exactly what a whole class of problems is all about.

Let us assume that we wanted to count the words in a text. So the plan is to start out with

my $A = {
         '1' => 'this is something',
         '2' => 'this is something else',
         '3' => 'something else completely',
        };

and arrive at

{
  'completely' => 1,
  'else' => 2,
  'is' => 2,
  'this' => 2,
  'something' => 3
}

If you wanted to automate this with mapreduce, then it could look like this:

my $B = mapreduce ($A, ..., ...);

The dots (...) indicate that we need to tell the mapreduce algorithm in detail how it should behave in certain situations.

MapReduce Algorithm

So what is the overall algorithm? Obviously it receives the incoming hash as a parameter (h1 below):

sub mapreduce {
    my %h1     = %{ shift() };  # %{ shift } would be parsed as %shift, so the parens are needed
    my $map    = shift;
    my $reduce = shift;

$map and $reduce are the user-provided functions which will be called in the corresponding phases. In the map phase ...

    my %h3;
    while (my ($k, $v) = each %h1) {
        my %h2 = &$map ($k => $v);
        map { push @{ $h3{$_} }, $h2{$_} } keys %h2;
    }

... all key/value pairs of the original hash (h1) are visited. For each such pair the user-provided map function creates a new (usually small) hash (h2). That is then merged into a larger hash (h3). The keys of that hash are completely independent of those of the original hash. The values might be single values or, more generally, a list; this is why I use a push above.

In the reduce phase ...

    my %h4;
    while (my ($k, $v) = each %h3) {
        $h4{$k} = &$reduce ($k => $v);
    }
    return \ %h4;
}

... h3 is revisited and each key/value pair is sent to reduce. That will mostly ignore the key and concentrate on producing a single value from the list behind that key. The key together with this value then makes it into h4. All done; this will be the end result.

That Really Works?

Back to our word counting problem: The map function we have to provide for mapreduce will get a key/value pair such as

'1' => 'this is something'

and is supposed to produce a new hash:

my $mapper = sub {
                  my ($k, $v) = (shift, shift);
                  my %count;
                  $count{$_}++ for split /\s+/, $v;  # count each word within this line
                  return %count;
             };

While we ignore the key (it is only the boring line number), we split the string and count every word within the line; the counting (instead of a constant 1) guards against losing a word which appears twice in one line. The result is a hash with each word as key and its per-line count as value:

'is'        => 1,
'this'      => 1,
'something' => 1

That hash will be merged into h3.
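
For our example text, h3 then looks like this (the key order is arbitrary):

'this'       => [1, 1],
'is'         => [1, 1],
'something'  => [1, 1, 1],
'else'       => [1, 1],
'completely' => [1]

If we now manage to look at one key in h3 and sum up all occurrences in [1, 1, 1, ...], then we arrive at the total count of each word: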

my $reducer = sub {
                   my ($k, $v) = (shift, shift);
                   my $sum = 0;
                   map { $sum += $_ } @$v;
                   return $sum;
              };
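
Invoked by hand, the reducer would, for instance, yield:

my $n = $reducer->('something' => [1, 1, 1]);   # $n is now 3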

For all our confusion I have made egregious use of the Perl map function. But otherwise the inner workings of $mapper and $reducer are irrelevant.

Putting It Together

In Perl/Python/... it is easy to pass around functions ...

my $mapper  = sub { ... };
my $reducer = sub { ... };

my $A = { ... };
my $B = mapreduce ($A, $mapper, $reducer);

... or even leave these anonymous ...

my $A = { ... };
my $B = mapreduce ($A, sub { ... }, sub { ... });

... if you are a variable-hater like me.
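
For completeness, here is the whole word-counting exercise as one self-contained script, assembled from the snippets above (Data::Dumper only serves to show the result):

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

sub mapreduce {
    my %h1     = %{ shift() };
    my $map    = shift;
    my $reduce = shift;

    my %h3;
    while (my ($k, $v) = each %h1) {
        my %h2 = &$map ($k => $v);
        map { push @{ $h3{$_} }, $h2{$_} } keys %h2;
    }

    my %h4;
    while (my ($k, $v) = each %h3) {
        $h4{$k} = &$reduce ($k => $v);
    }
    return \ %h4;
}

my $A = {
         '1' => 'this is something',
         '2' => 'this is something else',
         '3' => 'something else completely',
        };

my $B = mapreduce ($A,
                   sub {                            # map one line into word counts
                        my ($k, $v) = (shift, shift);
                        my %count;
                        $count{$_}++ for split /\s+/, $v;
                        return %count;
                   },
                   sub {                            # reduce the counts into a sum
                        my ($k, $v) = (shift, shift);
                        my $sum = 0;
                        map { $sum += $_ } @$v;
                        return $sum;
                   });

print Dumper $B;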

Going Parallel

There must be some gain for all this pain, and it is this: the whole setup can be heavily parallelized.

Obviously a first parallelization can be derived from mapreduce working in two phases, as these could be run concurrently as a pipeline.

But each map and each reduce invocation is independent of any other, implying that all of these can be done in parallel, possibly farmed out to hundreds if not thousands of nodes. That is what Google allegedly does.

And this is what I will look into next, even though my farm is somewhat smaller.
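
Just as a foretaste, here is a minimal sketch of how the map phase alone could be forked out per key, using the CPAN module Parallel::ForkManager; the worker count and the merging via run_on_finish are my own choices, not part of the minimal version above:

use Parallel::ForkManager;

sub map_phase_parallel {
    my ($h1, $map) = @_;
    my %h3;
    my $pm = Parallel::ForkManager->new (4);        # at most 4 worker processes
    $pm->run_on_finish (sub {                       # merge each h2 in the parent
        my ($pid, $exit, $ident, $signal, $core, $h2) = @_;
        push @{ $h3{$_} }, $h2->{$_} for keys %$h2;
    });
    for my $k (keys %$h1) {
        $pm->start and next;                        # parent moves on to the next key
        my %h2 = &$map ($k => $h1->{$k});           # child performs one map call
        $pm->finish (0, { %h2 });                   # ship h2 back to the parent
    }
    $pm->wait_all_children;
    return %h3;
}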

AsTMa= Snippet

mapreduce isa technology
homepage: http://labs.google.com/papers/mapreduce.html
paper: http://labs.google.com/papers/mapreduce-osdi04.pdf
wikipedia: http://en.wikipedia.org/wiki/MapReduce
  is-implemented-by hadoop,
           which isa software
           written-in java
           homepage: http://hadoop.apache.org/core/ ,
  is-implemented-by perl-mapreduce-IWOODHEAD
           which isa software
           written-in perl
           download: http://backpan.perl.org/authors/id/I/IW/IWOODHEAD/MapReduce-0.03.tar.gz
