random memes }

Wide finder - combine for top N

[wrapup - added later]

As the last exercise showed, doing the full combine of all the reduced data was just too fat to get acceptable performance. This iteration is to re-write combine.pl to limit the amount of data retained to not much more than the final "top N" result.

The notion here is to only keep a sample of the largest values found, adding to the sample only values above a minimum. When the pool of samples grows too large, sort the sample values, and trim to just N (the final number of samples), and choose the smallest as the new minimum value. This keeps sort operations small, and few. The vast majority of values (well over 99%) are rejected with a simple comparison, so this pass is essentially O(N) and very cheap per value.

    sub sampleTop {
        my ( $n, $kv ) = @_;
        my $iLast  = $n - 1;
        my $nLimit = 100 + $n;
        my $min    = 0;
        my @sample = ();
        while ( my ( $k, $v ) = each(%$kv) ) {
            next if $v < $min;
            push( @sample, [ $v, $k ] );
            next if ( scalar @sample ) < $nLimit;
            @sample = sort { $b->[0] <=> $a->[0] } @sample;
            $#sample = $iLast;
            $min = $sample[$#sample]->[0];
        }
        @sample = sort { $b->[0] < => $a->[0] } @sample;
        $#sample = $iLast;
        return @sample;
    }

Unfortunately, this step is necessarily a single process.

The single-process combine result and time for the full (reduced) data on the Sun test box:

    bannister@wfind01$ time scripts/combine.pl tmp/_reduce_only

    Top 10 URIs by total response bytes
            919814823566: /ongoing/ongoing.atom
            393012328499: /ongoing/potd.png
            297110748615: /ongoing/ongoing.rss
            95967470509: /ongoing/rsslogo.jpg
            70619295535: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
            46373582976: /talks/php.de.pdf
            43559176904: /ongoing/When/200x/2006/05/16/J1d0.mov
            42428609673: /ongoing/When/200x/2007/12/14/Shonen-Knife.mov
            38415215289: /ongoing/
            35603054785: /ongoing/moss60.jpg


    Top 10 URIs returning 404 (Not Found)
            54271: /ongoing/ongoing.atom.xml
            28030: /ongoing/ongoing.pie
            27365: /ongoing/favicon.ico
            26084: /ongoing/Browser-Market-Share.png
            24631: /ongoing/When/200x/2004/04/27/-//W3C//DTD%20XHTML%201.1//EN
            24078: /ongoing/Browsers-via-search.png
            24004: /ongoing/Search-Engines.png
            22637: /ongoing/ongoing.atom'
            22619: //ongoing/ongoing.atom'
            20587: /ongoing/Feeds.png


    Top 10 URIs by hits on articles
            614255: /ongoing/When/200x/2005/05/01/Hammer_sickle_clean.png
            561720: /ongoing/When/200x/2003/07/17/noIE.gif
            321873: /ongoing/When/200x/2004/12/12/-tn/Browser-Market-Share.png
            252828: /ongoing/When/200x/2004/02/18/Bump.png
            242520: /ongoing/When/200x/2004/12/12/-tn/Browsers-via-search.png
            241340: /ongoing/When/200x/2004/12/12/-tn/Search-Engines.png
            219569: /ongoing/When/200x/2003/09/18/NXML
            204202: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
            168652: /ongoing/When/200x/2003/03/16/XML-Prog
            137457: /ongoing/When/200x/2006/03/30/IMG_4613.png


    Top 10 client IPs by hits on articles
            366634: msnbot.msn.com
            192147: cmbg-cache-2.server.ntli.net
            161867: crawler14.googlebot.com
            145264: crawl-66-249-72-173.googlebot.com
            132805: crawl-66-249-72-172.googlebot.com
            131051: cmbg-cache-1.server.ntli.net
            100298: crawl-66-249-72-72.googlebot.com
            95580: wfp2.almaden.ibm.com
            90831: sv-crawlfw3.looksmart.com
            84546: crawler10.googlebot.com


    Top 10 referrers by hits on articles
            993394: http://www.google.com/reader/view/
            243013: http://planet.xmlhack.com/
            195861: http://tbray.org/ongoing/
            194726: http://planetsun.org/
            181280: http://planetjava.org/
            158613: http://slashdot.org/
            129256:
            117228: http://www.chat.kg/
            112469: http://planet.intertwingly.net/
            89177: http://www.planetjava.org/


    real    10m50.437s
    user    10m32.404s
    sys     0m17.107s

The above time may be slightly pessimistic, as there was some fairly heavy processing (a gcc-build apparently) going on at the same time. With so many CPUs there was no competition for CPU. There might have been some competition for disk usage.

This same test took 3 minutes (elapsed) on my local x86 box - about a 3x difference. Since the earlier test showed the reduce processing will take ~28 minutes, the ~11 minutes taken by combine is reasonably similar.

Good enough. On to the full test run... :)