[wrapup - added later]

As the last exercise showed, doing the full combine of all the reduced data retained far too much data to get acceptable performance. This iteration rewrites combine.pl to limit the amount of data retained to not much more than the final “top N” result.

The notion here is to keep only a sample of the largest values found, adding to the sample only values at or above a running minimum. When the pool of samples grows too large, sort the sampled values, trim to just N (the final number of samples), and take the smallest retained value as the new minimum. This keeps sort operations small, and few. The vast majority of values (well over 99%) are rejected with a single comparison, so this pass is essentially linear in the number of input values, and very cheap per value.

sub sampleTop {
    my ( $n, $kv ) = @_;
    my $iLast  = $n - 1;
    my $nLimit = 100 + $n;    # trim the pool when it reaches this size
    my $min    = 0;           # smallest value worth keeping, so far
    my @sample = ();
    while ( my ( $k, $v ) = each(%$kv) ) {
        next if $v < $min;    # cheap rejection - the common case
        push( @sample, [ $v, $k ] );
        next if ( scalar @sample ) < $nLimit;
        # Pool is full: sort descending, trim to $n, raise the minimum.
        @sample = sort { $b->[0] <=> $a->[0] } @sample;
        $#sample = $iLast;
        $min = $sample[$#sample]->[0];
    }
    @sample = sort { $b->[0] <=> $a->[0] } @sample;
    $#sample = $iLast if $#sample > $iLast;
    return @sample;
}
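To illustrate, here is a minimal stand-alone sketch. The routine is repeated (with the "<=>" comparison typo fixed) so the sketch runs on its own; the input hash, the N of 3, and the output format are made up for illustration:

```perl
use strict;
use warnings;

# sampleTop, as above, repeated so this sketch is self-contained.
sub sampleTop {
    my ( $n, $kv ) = @_;
    my $iLast  = $n - 1;
    my $nLimit = 100 + $n;
    my $min    = 0;
    my @sample = ();
    while ( my ( $k, $v ) = each(%$kv) ) {
        next if $v < $min;
        push( @sample, [ $v, $k ] );
        next if ( scalar @sample ) < $nLimit;
        @sample = sort { $b->[0] <=> $a->[0] } @sample;
        $#sample = $iLast;
        $min = $sample[$#sample]->[0];
    }
    @sample = sort { $b->[0] <=> $a->[0] } @sample;
    $#sample = $iLast if $#sample > $iLast;
    return @sample;
}

# Made-up data: 500 URIs with byte counts 1000 .. 500000.
my %bytes = map { ( "/uri/$_" => $_ * 1000 ) } 1 .. 500;
my @top = sampleTop( 3, \%bytes );
printf "%12d: %s\n", $_->[0], $_->[1] for @top;
# prints the three largest: 500000, 499000, 498000
```

Note that any value larger than the final cutoff can never be rejected, since the running minimum is always at or below the true Nth-largest value seen so far; the trimming only discards values that have already fallen out of the top N.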

Unfortunately, this step is necessarily a single process, as it has to see all of the reduced data before it can pick the overall top N.
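For context, a hypothetical sketch of the merge that feeds sampleTop. The "count TAB key" line format and the file handling here are assumptions for illustration, not the actual combine.pl:

```perl
use strict;
use warnings;

# Hypothetical combine loop: sum per-chunk reduced counts (assumed one
# "count<TAB>key" pair per line; the real reduce output format may
# differ) into a single hash, which sampleTop can then scan once.
sub combineCounts {
    my (@files) = @_;
    my %total;
    for my $file (@files) {
        open( my $in, '<', $file ) or die "open $file: $!";
        while ( my $line = <$in> ) {
            chomp $line;
            my ( $count, $key ) = split( /\t/, $line, 2 );
            $total{$key} += $count;
        }
        close($in);
    }
    return \%total;
}
```

The merge itself is cheap (one hash update per line); the cost is in reading all the reduced files through one process.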

The single-process combine result and time for the full (reduced) data on the Sun test box:

bannister@wfind01$ time scripts/combine.pl tmp/_reduce_only

Top 10 URIs by total response bytes
        919814823566: /ongoing/ongoing.atom
        393012328499: /ongoing/potd.png
        297110748615: /ongoing/ongoing.rss
        95967470509: /ongoing/rsslogo.jpg
        70619295535: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
        46373582976: /talks/php.de.pdf
        43559176904: /ongoing/When/200x/2006/05/16/J1d0.mov
        42428609673: /ongoing/When/200x/2007/12/14/Shonen-Knife.mov
        38415215289: /ongoing/
        35603054785: /ongoing/moss60.jpg


Top 10 URIs returning 404 (Not Found)
        54271: /ongoing/ongoing.atom.xml
        28030: /ongoing/ongoing.pie
        27365: /ongoing/favicon.ico
        26084: /ongoing/Browser-Market-Share.png
        24631: /ongoing/When/200x/2004/04/27/-//W3C//DTD%20XHTML%201.1//EN
        24078: /ongoing/Browsers-via-search.png
        24004: /ongoing/Search-Engines.png
        22637: /ongoing/ongoing.atom'
        22619: //ongoing/ongoing.atom'
        20587: /ongoing/Feeds.png


Top 10 URIs by hits on articles
        614255: /ongoing/When/200x/2005/05/01/Hammer_sickle_clean.png
        561720: /ongoing/When/200x/2003/07/17/noIE.gif
        321873: /ongoing/When/200x/2004/12/12/-tn/Browser-Market-Share.png
        252828: /ongoing/When/200x/2004/02/18/Bump.png
        242520: /ongoing/When/200x/2004/12/12/-tn/Browsers-via-search.png
        241340: /ongoing/When/200x/2004/12/12/-tn/Search-Engines.png
        219569: /ongoing/When/200x/2003/09/18/NXML
        204202: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
        168652: /ongoing/When/200x/2003/03/16/XML-Prog
        137457: /ongoing/When/200x/2006/03/30/IMG_4613.png


Top 10 client IPs by hits on articles
        366634: msnbot.msn.com
        192147: cmbg-cache-2.server.ntli.net
        161867: crawler14.googlebot.com
        145264: crawl-66-249-72-173.googlebot.com
        132805: crawl-66-249-72-172.googlebot.com
        131051: cmbg-cache-1.server.ntli.net
        100298: crawl-66-249-72-72.googlebot.com
        95580: wfp2.almaden.ibm.com
        90831: sv-crawlfw3.looksmart.com
        84546: crawler10.googlebot.com


Top 10 referrers by hits on articles
        993394: http://www.google.com/reader/view/
        243013: http://planet.xmlhack.com/
        195861: http://tbray.org/ongoing/
        194726: http://planetsun.org/
        181280: http://planetjava.org/
        158613: http://slashdot.org/
        129256:
        117228: http://www.chat.kg/
        112469: http://planet.intertwingly.net/
        89177: http://www.planetjava.org/


real    10m50.437s
user    10m32.404s
sys     0m17.107s

The above time may be slightly pessimistic, as some fairly heavy processing (a gcc build, apparently) was going on at the same time. With so many CPUs there was no competition for the CPU, but there may have been some competition for the disk.

This same test took about 3 minutes (elapsed) on my local x86 box - roughly a 3.5x difference. Since the earlier test showed the reduce processing will take ~28 minutes, the ~11 minutes taken by combine is of a similar order.

Good enough. On to the full test run… :)