
pdmr

Table of Contents

  1. pdmr : The original goal was to build a perl version of Google's MapReduce. It works! Now I have more to do.


    The original goal was to build a perl version of Google's MapReduce. The scope has since expanded somewhat into a quick and dirty parallel framework.

    The basic idea of MapReduce has its roots in a long-standing computing technique. There are two phases to the solution. The first is Map, which just means take a bunch of objects, say an array of them, and do something to each individual object without reference, deference, or interference by any other object. It is much like a mechanic checking the tire pressure on all the tires of all the cars in a dealer's lot. The tires don't conspire to change each other's pressures. And if you had one, an army of mechanics could check the pressure of every tire at the same time.

    The next little bit is the reduce: taking the new list of objects, results, or whatever from the map operation and winnowing it down to something more useful than a pile of raw results. This technique is often used for large fields of numbers. Taking the above example, let's find the average tire pressure in the lot. To find the average, we first need to find the sum of all the tire pressures. The sum operation is the reduce step of the MapReduce. It squishes all the measurements together, and after dividing by the number of tires, we get a possibly useful average tire pressure.
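    The tire example can be sketched in a few lines of plain, single-machine perl. This is only an illustration of the two phases and does not use pdmr; the tire data and the check_pressure helper are made up for the example.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical lot: two cars, four tires each, pressures in psi.
my @tires = (
    { car => 'A', psi => 32 }, { car => 'A', psi => 31 },
    { car => 'A', psi => 30 }, { car => 'A', psi => 33 },
    { car => 'B', psi => 29 }, { car => 'B', psi => 30 },
    { car => 'B', psi => 31 }, { car => 'B', psi => 30 },
);

# The "mechanic": looks at exactly one tire and nothing else.
sub check_pressure {
    my ($tire) = @_;
    return $tire->{psi};
}

# Map: every tire is measured independently of every other tire.
my @pressures = map { check_pressure($_) } @tires;

# Reduce: squish all the measurements into one sum, then divide.
my $total   = sum(@pressures);
my $average = $total / @pressures;

printf "%d tires, average pressure %.2f psi\n", scalar @pressures, $average;
```

    Because check_pressure never looks at a second tire, the map line is the part that could be farmed out to many workers at once.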

    The interesting part of reduce is that it too has some possibilities for parallel optimization. After you squint your eyes at it for a while, MapReduce starts turning into the hammer you'd like to hit all your data with. It's often the right idea.
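    One way to see the parallel possibilities in reduce: because addition is associative and commutative, the sum can be computed as partial sums that are then summed again. The sketch below runs sequentially on one machine and does not use pdmr; the worker count and the round-robin split are assumptions made for the illustration.

```perl
use strict;
use warnings;
use List::Util qw(sum);

my @pressures = (32, 31, 30, 33, 29, 30, 31, 30, 35, 34, 33, 34);

# Deal the readings out round-robin, one chunk per hypothetical worker.
my $workers = 3;
my @chunks;
push @{ $chunks[ $_ % $workers ] }, $pressures[$_] for 0 .. $#pressures;

# Each worker reduces only its own chunk (done one after another here).
my @partial_sums = map { sum(@$_) } @chunks;

# A final, much smaller reduce combines the partial results.
my $total = sum(@partial_sums);

print "partial sums: @partial_sums, total: $total\n";
```

    The same trick applies to any reduce operation where the grouping and order of the inputs do not change the answer, such as max, min, or merging counts.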

    Problems that map reduce well

    Enough about the general problem, on to the specific solution.

    There are a few really cool features that I have stumbled on. The first is mobile code and perl. By implementing the system in perl and keeping to the modules that most systems already have installed, a cluster can be dynamically overlaid on any machines that someone has ssh access to, be it a corporate network of shared machines or an instructional lab. This nicely avoids a fair amount of the initial startup cost of cluster installation, so a proof-of-concept project can be done quickly and easily.

    Another interesting feature is the 'arbitrary data'-ness of the entire thing. The message passing system just passes around serialized perl, so anything can be sent. Unlike MapReduce or other solutions, any standard perl data structure can be passed around, be it a scalar with an image in it or a hash of hashes used as an inverted index.
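    To illustrate what passing a nested structure as bytes looks like, here is a round trip through the core Storable module. This is an assumption for illustration only: the text above does not say which serializer pdmr uses (it could just as well be Data::Dumper output that is eval'd on the other side), and the index contents are made up.

```perl
use strict;
use warnings;
use Storable qw(nfreeze thaw);

# A hash of hashes acting as a tiny inverted index: word -> document -> count.
my %index = (
    perl   => { 'doc1.txt' => 3, 'doc2.txt' => 1 },
    reduce => { 'doc2.txt' => 2 },
);

# Sender side: flatten the structure into a byte string
# (nfreeze uses network byte order, so it is portable across machines).
my $wire = nfreeze(\%index);

# Receiver side: rebuild an independent copy from those bytes.
my $copy = thaw($wire);

print "perl appears $copy->{perl}{'doc1.txt'} times in doc1.txt\n";
```

    The byte string in $wire is what would travel over a socket or an ssh pipe; the receiving node never needs to know the shape of the data in advance.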

    Novel features:

    1. Perl and ssh based.
    2. Mobile code approach. The only thing a cluster node requires is ssh access and a recent version of perl that is probably already on the machine.
    3. Rendering
    4. Some Mathematics operations

    Example :

     
    	
    	# Hosts reachable over ssh that will act as worker nodes.
    	@nodelist = qw( 10.0.0.2 10.0.0.3 10.0.0.4 10.0.0.5 10.0.0.6 10.0.0.7 );
    
    	$node_collection = init_workerpile(\@nodelist);
    
    	$resultref = pd_map_reduce($node_collection, \@nodelist, \&workfunction, \&reducefunction, [ @data ]);
    
    		# <...> As many pd_map_reduce calls as you'd like.
    
    	# Shut down the pile that init_workerpile returned.
    	deinit_workerpile($node_collection);
    
    

    or

     
    
    
            $word_count_func = sub { 
                   my ($args,$datakey,$data) = @_; 
                   my %ret;
    
                   # Count each whitespace-delimited word in this buffer.
                   $ret{$1}++ while $data =~ /(\S+)/g;
    
                   \%ret;
            }; 
    
            # $merge_func (not shown here) combines the per-buffer count hashes.
            $resultref = pd_map_reduce($node_collection, \@nodelist, $word_count_func, $merge_func, [ @text_buffers ]);
    
    
            print "Word counts from all text buffers\n";
            for my $word (sort keys %{ $resultref->[0] }) {
                   print "$word  :   $resultref->[0]{$word}\n";
            }
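
    The word count example relies on a $merge_func that is not shown. As a rough idea of what such a reduce step has to do, here is a local, single-machine sketch that does not call pdmr at all. The names $count_words and $merge_counts and their argument lists are hypothetical; the calling convention pdmr actually uses for its reduce function is not documented here.

```perl
use strict;
use warnings;

# Simplified local stand-in for the map step: one buffer in, one count hash out.
my $count_words = sub {
    my ($data) = @_;
    my %ret;
    $ret{$1}++ while $data =~ /(\S+)/g;
    return \%ret;
};

# Hypothetical merge step: fold any number of count hashes into one.
my $merge_counts = sub {
    my @partials = @_;
    my %total;
    for my $partial (@partials) {
        $total{$_} += $partial->{$_} for keys %$partial;
    }
    return \%total;
};

my @text_buffers = ('the quick brown fox', 'the lazy dog', 'the end');
my $merged = $merge_counts->( map { $count_words->($_) } @text_buffers );

print "$_ : $merged->{$_}\n" for sort keys %$merged;
```

    Since adding counts gives the same answer in any order, partial merges could be done on several nodes before a final merge.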
    
    
    The cool part, of course, is that this can be done on multiple resources at once. Even if the individual map operation is less than the best algorithm for the job, throwing enough hardware at the problem can get it done much more quickly, and often at a cheaper price point than the big iron and dedicated software solution.

    So far, only simple test systems have been built out of pdmr.

    Systems implemented:

    Future plans:
