[wrapup - added later]

Output from running my implementation of Tim Bray’s Wide Finder 2 on the Sun test box:

bannister@wfind01$ time ./feed-workers -n 30 -r `which perl` -s scripts/reduce.pl logs/O.all |
    time scripts/combine.pl > _x30_reduce_combine
Fri Jun 13 04:03:32 2008
Scanning: logs/O.all
Done with: logs/O.all
Worker #21875 ended with status: 0
Worker #21874 ended with status: 0
Worker #21873 ended with status: 0
Worker #21872 ended with status: 0
Worker #21871 ended with status: 0
Worker #21870 ended with status: 0
Worker #21869 ended with status: 0
Worker #21868 ended with status: 0
Worker #21867 ended with status: 0
Worker #21866 ended with status: 0
Worker #21865 ended with status: 0
Worker #21864 ended with status: 0
Worker #21863 ended with status: 0
Worker #21862 ended with status: 0
Worker #21861 ended with status: 0
Worker #21860 ended with status: 0
Worker #21859 ended with status: 0
Worker #21858 ended with status: 0
Worker #21857 ended with status: 0
Worker #21856 ended with status: 0
Worker #21855 ended with status: 0
Worker #21854 ended with status: 0
Worker #21853 ended with status: 0
Worker #21852 ended with status: 0
Worker #21851 ended with status: 0
Worker #21850 ended with status: 0
Worker #21849 ended with status: 0
Worker #21848 ended with status: 0
Worker #21847 ended with status: 0
Worker #21846 ended with status: 0
Fri Jun 13 04:36:03 2008
Elapsed (ms): 1950971, total (MB): 43178
Scanned 22 MB/s

real    34:06.9
user    10:32.7
sys        11.1

real    34m6.930s
user    609m12.577s
sys     12m33.298s

Total elapsed time was a bit over 34 minutes to process the full 45GB log file.
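
As a quick sanity check on the numbers in the run output, the arithmetic can be redone by hand. The sketch below is in Python purely for illustration (it is not part of the implementation), and it assumes the second block of timings (the `34m6.930s` format) covers the whole pipeline, workers included. The helper name `to_seconds` is hypothetical.

```python
def to_seconds(stamp):
    """Convert a 'XmY.Zs' time value (as printed above) to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)

# Figures copied from the run output above.
elapsed_s = 1950971 / 1000      # "Elapsed (ms): 1950971"
total_mb = 43178                # "total (MB): 43178"
throughput = total_mb / elapsed_s

real = to_seconds("34m6.930s")
user = to_seconds("609m12.577s")
sys_time = to_seconds("12m33.298s")

# CPU time divided by wall-clock time: roughly how many CPUs were
# kept busy on average (assumption: these timings include all workers).
busy = (user + sys_time) / real

print(f"{throughput:.1f} MB/s scanned")
print(f"{busy:.1f} CPUs busy on average")
```

The throughput agrees with the reported "Scanned 22 MB/s", and under the stated assumption the 30 workers kept about 18 CPUs' worth of work going over the run.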

The emphasis here is on a more general-purpose and reusable solution, rather than something over-specialized to this one example problem, as described in the first round. Implementation in brief:

  1. The feed-workers process reads data from disk (at the fastest possible rate), and writes the data to a specified number of child processes.
  2. The reduce process parses the log file and computes subtotals for the fields of interest. This is the heaviest processing, and is best suited for distributing across many CPUs.
  3. The combine process reads the reduce subtotals and computes the final totals for each value of interest.
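
The three steps above can be sketched as follows. This is a minimal illustration in Python, not the actual C++ and Perl code: it runs the "workers" one after another instead of as child processes, it deals lines out round-robin (how feed-workers really divides the data is not described here), and it assumes an Apache-style log line where the URI, status, and byte count are the 7th, 9th, and 10th whitespace-separated fields. The function names and the sample log lines are made up for the example.

```python
from collections import Counter

def feed(lines, workers):
    """Stand-in for feed-workers: deal lines out to N workers.
    Round-robin is an assumption, chosen only for simplicity."""
    return [lines[i::workers] for i in range(workers)]

def reduce_chunk(lines):
    """Stand-in for reduce: parse lines and compute subtotals."""
    bytes_by_uri = Counter()
    not_found = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 10:
            continue  # skip lines that do not look like log entries
        uri, status, size = fields[6], fields[8], fields[9]
        if size.isdigit():  # size is "-" when no body was sent
            bytes_by_uri[uri] += int(size)
        if status == "404":
            not_found[uri] += 1
    return bytes_by_uri, not_found

def combine(subtotals):
    """Stand-in for combine: add up the subtotals from every worker."""
    total_bytes, total_404 = Counter(), Counter()
    for bytes_by_uri, not_found in subtotals:
        total_bytes.update(bytes_by_uri)
        total_404.update(not_found)
    return total_bytes, total_404

# Made-up sample lines, for illustration only.
sample = [
    'a.example - - [13/Jun/2008:04:03:32 -0700] "GET /ongoing/ongoing.atom HTTP/1.1" 200 1000 "-" "agent"',
    'b.example - - [13/Jun/2008:04:03:33 -0700] "GET /ongoing/ongoing.pie HTTP/1.1" 404 - "-" "agent"',
    'c.example - - [13/Jun/2008:04:03:34 -0700] "GET /ongoing/ongoing.atom HTTP/1.1" 200 500 "-" "agent"',
    'd.example - - [13/Jun/2008:04:03:35 -0700] "GET /ongoing/ongoing.pie HTTP/1.1" 404 - "-" "agent"',
]

subtotals = [reduce_chunk(chunk) for chunk in feed(sample, 3)]
total_bytes, total_404 = combine(subtotals)

print("Top URIs by total response bytes")
for uri, count in total_bytes.most_common(10):
    print(f"{count}: {uri}")
```

Because the subtotals are plain per-key sums, the final totals do not depend on how the lines were divided among workers, which is why only reduce and combine need to change for a different report.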

If you want to adapt this “wide-finder” to another purpose, you need only look at reduce and combine. Approximate line counts for each component:

<table width=auto>
<tr><th>lines</th><th>component</th><th>language</th></tr>
<tr><td>349</td><td>feed-workers</td><td>C++</td></tr>
<tr><td>24</td><td>reduce</td><td>Perl</td></tr>
<tr><td>36</td><td>combine</td><td>Perl</td></tr>
</table>

Final result (that might even be correct):

Top 10 URIs by total response bytes
919814823566: /ongoing/ongoing.atom
393012328499: /ongoing/potd.png
297110748615: /ongoing/ongoing.rss
95967470509: /ongoing/rsslogo.jpg
70619295535: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
46373582976: /talks/php.de.pdf
43559176904: /ongoing/When/200x/2006/05/16/J1d0.mov
42428609673: /ongoing/When/200x/2007/12/14/Shonen-Knife.mov
38415215289: /ongoing/
35603054785: /ongoing/moss60.jpg

Top 10 URIs returning 404 (Not Found)
54271: /ongoing/ongoing.atom.xml
28030: /ongoing/ongoing.pie
27365: /ongoing/favicon.ico
26084: /ongoing/Browser-Market-Share.png
24631: /ongoing/When/200x/2004/04/27/-//W3C//DTD%20XHTML%201.1//EN
24078: /ongoing/Browsers-via-search.png
24004: /ongoing/Search-Engines.png
22637: /ongoing/ongoing.atom'
22619: //ongoing/ongoing.atom'
20587: /ongoing/Feeds.png

Top 10 URIs by hits on articles
614255: /ongoing/When/200x/2005/05/01/Hammer_sickle_clean.png
561720: /ongoing/When/200x/2003/07/17/noIE.gif
321873: /ongoing/When/200x/2004/12/12/-tn/Browser-Market-Share.png
252828: /ongoing/When/200x/2004/02/18/Bump.png
242520: /ongoing/When/200x/2004/12/12/-tn/Browsers-via-search.png
241340: /ongoing/When/200x/2004/12/12/-tn/Search-Engines.png
219569: /ongoing/When/200x/2003/09/18/NXML
204202: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
168652: /ongoing/When/200x/2003/03/16/XML-Prog
137457: /ongoing/When/200x/2006/03/30/IMG_4613.png

Top 10 client IPs by hits on articles
366634: msnbot.msn.com
192147: cmbg-cache-2.server.ntli.net
161867: crawler14.googlebot.com
145264: crawl-66-249-72-173.googlebot.com
132805: crawl-66-249-72-172.googlebot.com
131051: cmbg-cache-1.server.ntli.net
100298: crawl-66-249-72-72.googlebot.com
95580: wfp2.almaden.ibm.com
90831: sv-crawlfw3.looksmart.com
84546: crawler10.googlebot.com

Top 10 referrers by hits on articles
993394: http://www.google.com/reader/view/
243013: http://planet.xmlhack.com/
195861: http://tbray.org/ongoing/
194726: http://planetsun.org/
181280: http://planetjava.org/
158613: http://slashdot.org/
117228: http://www.chat.kg/
112469: http://planet.intertwingly.net/
89177: http://www.planetjava.org/
55593: http://www.bloglines.com/myblogs_display?all=1