random memes }

Wide finder - combine and sort

[wrapup - added later]

Took another crack at the combine process for my implementation of wide finder. My theme is to try and come up with the something easily re-used and re-purposed, rather than the fastest possible specialized solution. The first round used the generic Perl sort operation, with an inline (two key) sort function. For small sizes the built-in Perl sort works well. When fed the full 45GB example data and attempting to sort the counts of client IPs - between a large memory footprint and a too-slow sort - for data that large Perl's sort is not the right solution.

Tried the old trick of transforming multiple keys into a single key for sorting. Back in the 1980's the problem I occasionally faced was how to efficiently sort data extracted from large sets of log files. The Unix sort command was (and is) pretty efficient, but the data was often not in a form sort could handle. On Unix the answer was simple - use awk and/or sed to re-shape the data to be fed into sort. Seemed like a pretty obvious approach at the time. Seems very unlikely I was the first to use a transform to turn awkward and/or multi-key data into a single-key sortable form.

This was long before the first version of Perl was posted to USENET. Funny thing - someone wrote a paper on "efficient Perl sorting". Funny because this "new" approach is rather familiar.

Tried and discarded a transformed-to-single-key variant for Perl sort - as it did not perform well.

The data to be sorted for problems like the wide finder exercise can be large enough to often-but-not-always fit in memory. Counting on an in-memory sort makes me uncomfortable (keeping to the "general purpose" meme). Perhaps better to use the Unix sort command. The Unix sort command does a smart disk-based sort-merge for large data, and is quite lean in it's use of memory. Also, we can pick up a small amount of concurrent execution by piping from combine to sort.

The relevant code from combine (which may change later):

    sub processSection {
        my ($c,$kv) = @_;
        local *OUT;
        open( OUT, "| sort -k 2nr,2 -k 3 | head -n 10" )
            or die "Cannot sort results!\n";
        while ( my ( $k, $v ) = each(%{$kv}) ) {
            print OUT $c . " " . $v . " " . $k . "\n";
        close OUT;

    while ( my ( $c, $kv ) = each %v ) {

Got acceptable (though not inspiring) times on an x86 box, so ran the already-reduced data through combine on the Sun box.

    bannister@wfind01$ time scripts/combine.pl tmp/_reduce_only
    H 614255 /ongoing/When/200x/2005/05/01/Hammer_sickle_clean.png
    H 561720 /ongoing/When/200x/2003/07/17/noIE.gif
    H 321873 /ongoing/When/200x/2004/12/12/-tn/Browser-Market-Share.png
    H 252828 /ongoing/When/200x/2004/02/18/Bump.png
    H 242520 /ongoing/When/200x/2004/12/12/-tn/Browsers-via-search.png
    H 241340 /ongoing/When/200x/2004/12/12/-tn/Search-Engines.png
    H 219569 /ongoing/When/200x/2003/09/18/NXML
    H 204202 /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
    H 168652 /ongoing/When/200x/2003/03/16/XML-Prog
    H 137457 /ongoing/When/200x/2006/03/30/IMG_4613.png
    N 54271 /ongoing/ongoing.atom.xml
    N 28030 /ongoing/ongoing.pie
    N 27365 /ongoing/favicon.ico
    N 26084 /ongoing/Browser-Market-Share.png
    N 24631 /ongoing/When/200x/2004/04/27/-//W3C//DTD%20XHTML%201.1//EN
    N 24078 /ongoing/Browsers-via-search.png
    N 24004 /ongoing/Search-Engines.png
    N 22637 /ongoing/ongoing.atom'
    N 22619 //ongoing/ongoing.atom'
    N 20587 /ongoing/Feeds.png
    R 993394 http://www.google.com/reader/view/
    R 243013 http://planet.xmlhack.com/
    R 195861 http://tbray.org/ongoing/
    R 194726 http://planetsun.org/
    R 181280 http://planetjava.org/
    R 158613 http://slashdot.org/
    R 129256
    R 117228 http://www.chat.kg/
    R 112469 http://planet.intertwingly.net/
    R 89177 http://www.planetjava.org/
    I 366634 msnbot.msn.com
    I 192147 cmbg-cache-2.server.ntli.net
    I 161867 crawler14.googlebot.com
    I 145264 crawl-66-249-72-173.googlebot.com
    I 132805 crawl-66-249-72-172.googlebot.com
    I 131051 cmbg-cache-1.server.ntli.net
    I 100298 crawl-66-249-72-72.googlebot.com
    I 95580 wfp2.almaden.ibm.com
    I 90831 sv-crawlfw3.looksmart.com
    I 84546 crawler10.googlebot.com
    B 919814823566 /ongoing/ongoing.atom
    B 393012328499 /ongoing/potd.png
    B 297110748615 /ongoing/ongoing.rss
    B 95967470509 /ongoing/rsslogo.jpg
    B 70619295535 /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
    B 46373582976 /talks/php.de.pdf
    B 43559176904 /ongoing/When/200x/2006/05/16/J1d0.mov
    B 42428609673 /ongoing/When/200x/2007/12/14/Shonen-Knife.mov
    B 38415215289 /ongoing/
    B 35603054785 /ongoing/moss60.jpg

    real    79m59.721s
    user    82m9.882s
    sys     0m42.510s

Given the same run completed in under 7 minutes on my x86 box, the lack of concurrency and slower Sun CPU is not(!) working in our favor. Not sure why this came in so slow (I doubt the Sun chip is in general the 10x slower this seems to show), but clearly performance is too far in the wrong direction.

So much for general purpose - we need to specialize combine for better performance. :)

As an aside, I tried a using fork() to run the sort for each section of the report in parallel. Given that the client IP count data is much larger than the data for the other sections, for this problem the gain would not be great, but at least we would get a bit more concurrency. The Perl code to process each report section in a child process:

    sub forkWorker {
        return unless 0 == fork;

    while ( my ( $c, $kv ) = each %v ) {

Oddly enough this did not work on either Linux or Solaris. I was counting on copy-on-write fork() to keep the physical memory use of each Perl child process quite small. On my Linux box (with 3GB real memory), execution slowed to a crawl as the disk chattered away madly. Seems Linux does not do copy-on-write. I expected better from Solaris, but instead got:

    bannister@wfind01$ time scripts/combine.pl tmp/_reduce_only
    Cannot sort results!

Oh well ... this wasn't going to get the performance we wanted in any case ...

Next iteration will have to specialize combine for the "top 10" nature of this problem. Will have to limit the size of the data sorted to get acceptable (and reliable) performance.

Prior: Wide finder - first round