Wide finder - combine for top N
As the last exercise showed, doing the full combine of all the reduced data was just too fat to get acceptable performance. This iteration is to re-write combine.pl to limit the amount of data retained to not much more than the final "top N" result.
The notion here is to only keep a sample of the largest values found, adding to the sample only values above a minimum. When the pool of samples grows too large, sort the sample values, and trim to just N (the final number of samples), and choose the smallest as the new minimum value. This keeps sort operations small, and few. The vast majority of values (well over 99%) are rejected with a simple comparison, so this pass is essentially O(N) and very cheap per value.
sub sampleTop {
my ( $n, $kv ) = @_;
my $iLast = $n - 1;
my $nLimit = 100 + $n;
my $min = 0;
my @sample = ();
while ( my ( $k, $v ) = each(%$kv) ) {
next if $v < $min;
push( @sample, [ $v, $k ] );
next if ( scalar @sample ) < $nLimit;
@sample = sort { $b->[0] <=> $a->[0] } @sample;
$#sample = $iLast;
$min = $sample[$#sample]->[0];
}
@sample = sort { $b->[0] < => $a->[0] } @sample;
$#sample = $iLast;
return @sample;
}
Unfortunately, this step is necessarily a single process.
The single-process combine result and time for the full (reduced) data on the Sun test box:
bannister@wfind01$ time scripts/combine.pl tmp/_reduce_only
Top 10 URIs by total response bytes
919814823566: /ongoing/ongoing.atom
393012328499: /ongoing/potd.png
297110748615: /ongoing/ongoing.rss
95967470509: /ongoing/rsslogo.jpg
70619295535: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
46373582976: /talks/php.de.pdf
43559176904: /ongoing/When/200x/2006/05/16/J1d0.mov
42428609673: /ongoing/When/200x/2007/12/14/Shonen-Knife.mov
38415215289: /ongoing/
35603054785: /ongoing/moss60.jpg
Top 10 URIs returning 404 (Not Found)
54271: /ongoing/ongoing.atom.xml
28030: /ongoing/ongoing.pie
27365: /ongoing/favicon.ico
26084: /ongoing/Browser-Market-Share.png
24631: /ongoing/When/200x/2004/04/27/-//W3C//DTD%20XHTML%201.1//EN
24078: /ongoing/Browsers-via-search.png
24004: /ongoing/Search-Engines.png
22637: /ongoing/ongoing.atom'
22619: //ongoing/ongoing.atom'
20587: /ongoing/Feeds.png
Top 10 URIs by hits on articles
614255: /ongoing/When/200x/2005/05/01/Hammer_sickle_clean.png
561720: /ongoing/When/200x/2003/07/17/noIE.gif
321873: /ongoing/When/200x/2004/12/12/-tn/Browser-Market-Share.png
252828: /ongoing/When/200x/2004/02/18/Bump.png
242520: /ongoing/When/200x/2004/12/12/-tn/Browsers-via-search.png
241340: /ongoing/When/200x/2004/12/12/-tn/Search-Engines.png
219569: /ongoing/When/200x/2003/09/18/NXML
204202: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
168652: /ongoing/When/200x/2003/03/16/XML-Prog
137457: /ongoing/When/200x/2006/03/30/IMG_4613.png
Top 10 client IPs by hits on articles
366634: msnbot.msn.com
192147: cmbg-cache-2.server.ntli.net
161867: crawler14.googlebot.com
145264: crawl-66-249-72-173.googlebot.com
132805: crawl-66-249-72-172.googlebot.com
131051: cmbg-cache-1.server.ntli.net
100298: crawl-66-249-72-72.googlebot.com
95580: wfp2.almaden.ibm.com
90831: sv-crawlfw3.looksmart.com
84546: crawler10.googlebot.com
Top 10 referrers by hits on articles
993394: http://www.google.com/reader/view/
243013: http://planet.xmlhack.com/
195861: http://tbray.org/ongoing/
194726: http://planetsun.org/
181280: http://planetjava.org/
158613: http://slashdot.org/
129256:
117228: http://www.chat.kg/
112469: http://planet.intertwingly.net/
89177: http://www.planetjava.org/
real 10m50.437s
user 10m32.404s
sys 0m17.107s
The above time may be slightly pessimistic, as there was some fairly heavy processing (a gcc-build apparently) going on at the same time. With so many CPUs there was no competition for CPU. There might have been some competition for disk usage.
This same test took 3 minutes (elapsed) on my local x86 box - about a 3x difference. Since the earlier test showed the reduce processing will take ~28 minutes, the ~11 minutes taken by combine is reasonably similar.
Good enough. On to the full test run... :)