Wide finder - final result
These results come from running my implementation of Tim Bray's Wide Finder 2 on the Sun test box.
bannister@wfind01$ time ./feed-workers -n 30 -r which perl
-s scripts/reduce.pl logs/O.all |
time scripts/combine.pl > _x30_reduce_combine
Fri Jun 13 04:03:32 2008
Scanning: logs/O.all
Done with: logs/O.all
Worker #21875 ended with status: 0
Worker #21874 ended with status: 0
Worker #21873 ended with status: 0
Worker #21872 ended with status: 0
Worker #21871 ended with status: 0
Worker #21870 ended with status: 0
Worker #21869 ended with status: 0
Worker #21868 ended with status: 0
Worker #21867 ended with status: 0
Worker #21866 ended with status: 0
Worker #21865 ended with status: 0
Worker #21864 ended with status: 0
Worker #21863 ended with status: 0
Worker #21862 ended with status: 0
Worker #21861 ended with status: 0
Worker #21860 ended with status: 0
Worker #21859 ended with status: 0
Worker #21858 ended with status: 0
Worker #21857 ended with status: 0
Worker #21856 ended with status: 0
Worker #21855 ended with status: 0
Worker #21854 ended with status: 0
Worker #21853 ended with status: 0
Worker #21852 ended with status: 0
Worker #21851 ended with status: 0
Worker #21850 ended with status: 0
Worker #21849 ended with status: 0
Worker #21848 ended with status: 0
Worker #21847 ended with status: 0
Worker #21846 ended with status: 0
Fri Jun 13 04:36:03 2008
Elapsed (ms): 1950971, total (MB): 43178
Scanned 22 MB/s
real 34:06.9 user 10:32.7 sys 11.1
real 34m6.930s user 609m12.577s sys 12m33.298s
Processing the full 45 GB log file took a bit over 34 minutes of wall-clock time.
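As a rough check on the figures in the transcript, here is a minimal sketch of the arithmetic behind them. The function names are made up for illustration. It also assumes that the second timing line (`real 34m6.930s user 609m12.577s sys 12m33.298s`) covers the whole pipeline including the 30 workers, which the transcript does not state outright.

```cpp
#include <cassert>

// Throughput in MB/s from the "total (MB)" and "Elapsed (ms)" figures.
double mb_per_second(double total_mb, double elapsed_ms) {
    assert(elapsed_ms > 0);
    return total_mb / (elapsed_ms / 1000.0);
}

// Convert a "609m12.577s" style reading into seconds.
double to_seconds(double minutes, double seconds) {
    return minutes * 60.0 + seconds;
}

// Average number of CPUs kept busy: CPU time divided by wall-clock time.
double average_busy_cpus(double cpu_seconds, double real_seconds) {
    assert(real_seconds > 0);
    return cpu_seconds / real_seconds;
}
```

Calling `mb_per_second(43178, 1950971)` gives about 22.1, which agrees with the reported "Scanned 22 MB/s". Under the assumption above, `average_busy_cpus(to_seconds(609, 12.577) + to_seconds(12, 33.298), to_seconds(34, 6.930))` comes out at roughly 18, meaning about 18 CPUs were busy on average across the run.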
The emphasis here is on a more general-purpose and reusable solution, as described in the first round, rather than something over-specialized to this one example problem. The implementation in brief:
- The feed-workers process reads data from disk (at the fastest possible rate), and writes the data to a specified number of child processes.
- The reduce process parses the log file and computes subtotals for the fields of interest. This is the heaviest processing, and is best suited for distributing across many CPUs.
- The combine process reads the reduce subtotals and computes the final totals for each value of interest.
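The fan-out step in the first bullet can be sketched as below. This is not the actual feed-workers source, only a small POSIX illustration of one parent writing to several child processes over pipes. Dealing records out line by line in round-robin order is an assumption, as are the names `feed_workers`, `count_lines`, and `write_all`. Each toy worker reports its line count through its exit status purely so the example can be checked; a real worker would run the reduce script and write its subtotals to its output.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical worker body: drain the pipe and count newline characters.
static int count_lines(int fd) {
    char buf[4096];
    int lines = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        for (ssize_t i = 0; i < n; ++i) {
            if (buf[i] == '\n') ++lines;
        }
    }
    return lines;
}

// Write the whole buffer, retrying after short writes.
static bool write_all(int fd, const std::string& data) {
    std::size_t done = 0;
    while (done < data.size()) {
        ssize_t n = write(fd, data.data() + done, data.size() - done);
        if (n <= 0) return false;
        done += static_cast<std::size_t>(n);
    }
    return true;
}

// Fork n_workers children, deal the records out round-robin (an assumption),
// and return each child's exit status (here: the number of lines it read).
std::vector<int> feed_workers(const std::vector<std::string>& records, int n_workers) {
    std::vector<int> pipes;   // write ends, one per worker
    std::vector<pid_t> pids;
    for (int w = 0; w < n_workers; ++w) {
        int fds[2];
        if (pipe(fds) != 0) break;
        pid_t pid = fork();
        if (pid == 0) {
            close(fds[1]);
            for (int fd : pipes) close(fd);  // drop write ends inherited from earlier workers
            _exit(count_lines(fds[0]));      // toy status; only meaningful for small counts
        }
        close(fds[0]);
        if (pid < 0) { close(fds[1]); break; }
        pipes.push_back(fds[1]);
        pids.push_back(pid);
    }
    for (std::size_t i = 0; !pipes.empty() && i < records.size(); ++i) {
        if (!write_all(pipes[i % pipes.size()], records[i] + "\n")) break;
    }
    for (int fd : pipes) close(fd);  // closing the write ends signals end of input
    std::vector<int> statuses;
    for (pid_t pid : pids) {
        int status = 0;
        waitpid(pid, &status, 0);
        statuses.push_back(WIFEXITED(status) ? WEXITSTATUS(status) : -1);
    }
    return statuses;
}
```

Feeding seven records to three workers with `feed_workers(records, 3)` returns one status per child, in the same spirit as the "Worker #... ended with status" lines in the transcript. Closing the inherited write ends in each child matters: otherwise a worker never sees end of input, because some other process still holds its pipe open.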
If you want to adapt this "wide-finder" to another purpose, you need only look at reduce and combine. Approximate line counts for each component:
lines | component | language |
---|---|---|
349 | feed-workers | C++ |
24 | reduce | Perl |
36 | combine | Perl |
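To show the shape of the two stages you would adapt, here is a minimal sketch of a reduce step, a combine step, and a top-N selection. The real scripts are Perl and are not reproduced here; this is a C++ illustration only. The record format `<uri> <bytes>` is a simplification invented for the example, not the real log format, and the names `Totals`, `reduce_lines`, `combine`, and `top_n` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Totals = std::unordered_map<std::string, std::uint64_t>;

// "reduce" stage: subtotal bytes per URI for one worker's share of the input.
// Assumed record format (not the real log format): "<uri> <bytes>".
Totals reduce_lines(const std::vector<std::string>& lines) {
    Totals subtotals;
    for (const std::string& line : lines) {
        std::istringstream in(line);
        std::string uri;
        std::uint64_t bytes = 0;
        if (in >> uri >> bytes) subtotals[uri] += bytes;  // malformed lines are skipped
    }
    return subtotals;
}

// "combine" stage: add up the subtotals produced by every worker.
Totals combine(const std::vector<Totals>& parts) {
    Totals totals;
    for (const Totals& part : parts) {
        for (const auto& kv : part) totals[kv.first] += kv.second;
    }
    return totals;
}

// The n largest entries, biggest first; ties are broken by key.
std::vector<std::pair<std::string, std::uint64_t>> top_n(const Totals& totals, std::size_t n) {
    std::vector<std::pair<std::string, std::uint64_t>> rows(totals.begin(), totals.end());
    n = std::min(n, rows.size());
    std::partial_sort(rows.begin(), rows.begin() + n, rows.end(),
                      [](const auto& a, const auto& b) {
                          return a.second != b.second ? a.second > b.second
                                                      : a.first < b.first;
                      });
    rows.resize(n);
    return rows;
}
```

The point of the split is that `reduce_lines` can run independently on each worker's share of the input, while `combine` only has to merge small tables of subtotals. Adding up per-worker subtotals gives the same totals as a single pass over the whole file, because addition does not care how the input was partitioned.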
Final result (that might even be correct):
Top 10 URIs by total response bytes
919814823566: /ongoing/ongoing.atom
393012328499: /ongoing/potd.png
297110748615: /ongoing/ongoing.rss
95967470509: /ongoing/rsslogo.jpg
70619295535: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
46373582976: /talks/php.de.pdf
43559176904: /ongoing/When/200x/2006/05/16/J1d0.mov
42428609673: /ongoing/When/200x/2007/12/14/Shonen-Knife.mov
38415215289: /ongoing/
35603054785: /ongoing/moss60.jpg

Top 10 URIs returning 404 (Not Found)
54271: /ongoing/ongoing.atom.xml
28030: /ongoing/ongoing.pie
27365: /ongoing/favicon.ico
26084: /ongoing/Browser-Market-Share.png
24631: /ongoing/When/200x/2004/04/27/-//W3C//DTD%20XHTML%201.1//EN
24078: /ongoing/Browsers-via-search.png
24004: /ongoing/Search-Engines.png
22637: /ongoing/ongoing.atom'
22619: //ongoing/ongoing.atom'
20587: /ongoing/Feeds.png

Top 10 URIs by hits on articles
614255: /ongoing/When/200x/2005/05/01/Hammer_sickle_clean.png
561720: /ongoing/When/200x/2003/07/17/noIE.gif
321873: /ongoing/When/200x/2004/12/12/-tn/Browser-Market-Share.png
252828: /ongoing/When/200x/2004/02/18/Bump.png
242520: /ongoing/When/200x/2004/12/12/-tn/Browsers-via-search.png
241340: /ongoing/When/200x/2004/12/12/-tn/Search-Engines.png
219569: /ongoing/When/200x/2003/09/18/NXML
204202: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
168652: /ongoing/When/200x/2003/03/16/XML-Prog
137457: /ongoing/When/200x/2006/03/30/IMG_4613.png

Top 10 client IPs by hits on articles
366634: msnbot.msn.com
192147: cmbg-cache-2.server.ntli.net
161867: crawler14.googlebot.com
145264: crawl-66-249-72-173.googlebot.com
132805: crawl-66-249-72-172.googlebot.com
131051: cmbg-cache-1.server.ntli.net
100298: crawl-66-249-72-72.googlebot.com
95580: wfp2.almaden.ibm.com
90831: sv-crawlfw3.looksmart.com
84546: crawler10.googlebot.com

Top 10 referrers by hits on articles
993394: http://www.google.com/reader/view/
243013: http://planet.xmlhack.com/
195861: http://tbray.org/ongoing/
194726: http://planetsun.org/
181280: http://planetjava.org/
158613: http://slashdot.org/
117228: http://www.chat.kg/
112469: http://planet.intertwingly.net/
89177: http://www.planetjava.org/
55593: http://www.bloglines.com/myblogs_display?all=1