Announce: Perl Parallel::MapReduce on CPAN

Following the release-early-release-often tradition, I have uploaded Parallel::MapReduce to CPAN. It follows the ideas of embedding MapReduce more deeply into the language and of offering a pipelining feature.

Needless to say, this is all very pre-alpha, so use it at your own risk, and let me know what you think about the roadmap.
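
To make the two ideas mentioned above (MapReduce as a language-level construct, plus pipelining) a bit more concrete, here is a minimal in-process sketch in plain Perl. It is not the Parallel::MapReduce API: the names map_reduce and pipeline, and the stage layout with map and reduce keys, are made up for illustration, and nothing here runs in parallel.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the Parallel::MapReduce API:
# runs a single map/reduce stage over a hash reference, in-process.
sub map_reduce {
    my ($map, $reduce, $input) = @_;
    my %groups;
    # Map phase: every (key, value) pair may emit any number of (key, value) pairs.
    for my $k (sort keys %$input) {
        my @emitted = $map->($k, $input->{$k});
        while (my ($ek, $ev) = splice(@emitted, 0, 2)) {
            push @{ $groups{$ek} }, $ev;
        }
    }
    # Reduce phase: collapse the list of values collected under each key.
    my %output;
    $output{$_} = $reduce->($_, $groups{$_}) for keys %groups;
    return \%output;
}

# Pipelining (as assumed here): the output hash of one stage is the input of the next.
sub pipeline {
    my ($input, @stages) = @_;
    my $data = $input;
    $data = map_reduce($_->{map}, $_->{reduce}, $data) for @stages;
    return $data;
}

# Stage 1: count words. Stage 2: group the words by their count.
my $count_words = {
    map    => sub { my ($id, $text) = @_; map { ($_ => 1) } split ' ', $text },
    reduce => sub { my ($word, $ones) = @_; scalar @$ones },
};
my $group_by_count = {
    map    => sub { my ($word, $n) = @_; ($n => $word) },
    reduce => sub { my ($n, $words) = @_; join ',', sort @$words },
};

my $docs   = { 1 => 'a b a', 2 => 'b c' };
my $result = pipeline($docs, $count_words, $group_by_count);
print "$_: $result->{$_}\n" for sort keys %$result;
# prints:
# 1: c
# 2: a,b
```

The point of the sketch is only that stages are ordinary data (two code references each), so chaining them is a loop rather than a job-submission exercise.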

Look Ma! No files!

One notable deviation from the usual implementation is that I do not use a replicating file system (such as GFS or HDFS) to propagate data from the master to the workers and back.

I actually do not use a file system at all, but instead store hash data in a pool of memcached daemons.

I am not sure whether, or how well, this scales, but given that a Perl-based MR installation will probably not have 100,000 machines, a pool of memcached instances might do just fine. We will see.
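
As a rough illustration of what spreading hash data over a pool can look like, the sketch below uses one in-memory Perl hash per server as a stand-in for a memcached daemon, so it runs without any daemon installed. The server addresses, the function names (server_for, store_hash, fetch_hash) and the MD5-modulo placement scheme are all assumptions made for this example; the module, or the memcached client it uses, may distribute keys quite differently.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical pool: each address maps to a plain hash standing in for one daemon.
my @servers = ('127.0.0.1:11211', '127.0.0.1:11212', '127.0.0.1:11213');
my %pool    = map { ($_ => {}) } @servers;

# Assumed placement scheme: hash the key, take it modulo the pool size.
sub server_for {
    my ($key) = @_;
    my $n = hex(substr(md5_hex($key), 0, 8));
    return $servers[ $n % @servers ];
}

# Write every (key, value) pair to the server chosen for its key;
# return the sorted list of keys so a reader can find the data again.
sub store_hash {
    my ($data) = @_;
    $pool{ server_for($_) }{$_} = $data->{$_} for keys %$data;
    return [ sort keys %$data ];
}

# Rebuild a hash from a list of keys by asking the same servers again.
sub fetch_hash {
    my ($keys) = @_;
    my %data = map { ($_ => $pool{ server_for($_) }{$_}) } @$keys;
    return \%data;
}

my $keys = store_hash({ apple => 3, pear => 5, plum => 8 });
my $copy = fetch_hash($keys);
print "$_ => $copy->{$_}\n" for sort keys %$copy;
```

Because placement depends only on the key, any process that knows the server list can locate a value without a central index, which is what lets a store like this play the role a replicated file system plays elsewhere.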


Remoting of processes is done via SSH, so there is a second dependency on external software. The advantage is that I can use the created SSH tunnel to exchange data between master and worker processes.

One deployment scenario I have in mind is abusing boxes I have SSH accounts on, so this approach might come at zero additional cost.

Look Ma! No Enterprisiness!

Sincere apologies to all those who nowadays expect enterprisey software: there are no 10^6 lines of code (only about 1000, including documentation) and no 550 classes (only 6).

