Announce: Perl Parallel::MapReduce on CPAN
Following the release-early, release-often tradition, I have uploaded Parallel::MapReduce to CPAN. It follows the ideas of embedding MapReduce more deeply into the language and of offering a pipelining feature.
Needless to say, this is all very pre-alpha, so tread carefully, use it at your own risk, and let me know what you think about the roadmap.
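To make "embedding into the language" and "pipelining" a bit more concrete, here is a minimal, purely in-process sketch. It is not the Parallel::MapReduce API. The `mapreduce` helper and the sample data are assumptions made up for illustration: map and reduce are ordinary Perl closures, input and output are plain hashes, and one stage's output is fed straight into the next.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the Parallel::MapReduce API: runs one
# map/reduce pass over a hash reference, entirely in-process.
sub mapreduce {
    my ($map, $reduce, $input) = @_;

    # Map phase: each input pair yields a flat list of (key, value) pairs,
    # which are grouped by key.
    my %groups;
    for my $k (sort keys %$input) {
        my @pairs = $map->($k, $input->{$k});
        while (my ($k2, $v2) = splice @pairs, 0, 2) {
            push @{ $groups{$k2} }, $v2;
        }
    }

    # Reduce phase: collapse the list of values collected for each key.
    my %output;
    $output{$_} = $reduce->($_, $groups{$_}) for keys %groups;
    return \%output;
}

# Made-up sample input: document id => text.
my $docs = {
    1 => 'this is something',
    2 => 'this is something else',
    3 => 'something else completely',
};

# Stage 1: word count.
my $counts = mapreduce(
    sub { my ($k, $v) = @_; map { ($_ => 1) } split ' ', $v },
    sub { my ($k, $vs) = @_; my $sum = 0; $sum += $_ for @$vs; $sum },
    $docs,
);

# Stage 2 consumes stage 1's output: group words by their frequency.
my $by_count = mapreduce(
    sub { my ($word, $n) = @_; ($n => $word) },
    sub { my ($n, $words) = @_; join ',', sort @$words },
    $counts,
);

print "$_: $by_count->{$_}\n" for sort keys %$by_count;
```

Because every stage takes a hash and returns a hash, chaining stages is just function composition. A distributed version would have to ship the closures and the hash slices to the workers, which this sketch deliberately leaves out.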
Look Ma! No files!
One notable deviation from the usual implementation is that I do not use a replicating file system (such as GFS or HDFS) to propagate data from the master to the workers and back.
I actually do not use a file system at all, but instead store the hash data in a pool of memcached daemons.
I am not sure whether, and how well, this scales, but given that a Perl-based MR installation will probably not have 100,000 machines, a pool of memcached instances might do just fine. We will see.
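As a rough picture of what spreading hash data over a pool means, the following sketch shards keys across several stores by hashing the key. Everything here is an assumption for illustration: the pool members are in-memory hashes standing in for memcached daemons, the addresses are made up, and `server_for`, `store` and `fetch` are hypothetical helpers. Real memcached client libraries do their own server selection, and the module itself may well do this differently.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Stand-ins for memcached daemons: one in-memory hash per pool member.
# The addresses are made up for illustration.
my @pool = map { +{ addr => $_, data => {} } }
           qw(10.0.0.1:11211 10.0.0.2:11211 10.0.0.3:11211);

# Pick a pool member from the first 4 bytes of the key's MD5 digest,
# so the same key always lands on the same member.
sub server_for {
    my ($key) = @_;
    my $n = unpack 'N', md5($key);
    return $pool[ $n % @pool ];
}

# Hypothetical helpers: write and read one entry via its pool member.
sub store { my ($k, $v) = @_; server_for($k)->{data}{$k} = $v; return }
sub fetch { my ($k) = @_; return server_for($k)->{data}{$k} }

# Made-up hash data to distribute.
my %slice = (
    1 => 'this is something',
    2 => 'this is something else',
    3 => 'something else completely',
);
store($_, $slice{$_}) for keys %slice;

printf "%s holds %d key(s)\n", $_->{addr}, scalar keys %{ $_->{data} }
    for @pool;
```

Master and workers only need to agree on the pool list and the hashing rule to find each other's data, so no shared file system is involved. Note that plain modulo hashing reshuffles most keys whenever the pool size changes.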
Remoting of processes is done via SSH, so there is a second dependency on external software. The advantage is that I can use the created SSH tunnel to exchange data between master and worker processes.
One deployment scenario I have in mind is abusing boxes I already have SSH accounts on, so this approach might come at zero additional cost.
Look Ma! No Enterprisiness!
Sincere apologies to all those who nowadays expect enterprisy software: there are no 10^6 lines of code (only about 1000, including documentation) and no 550 classes either (only 6).