Wider finder – final result
Results from running my implementation of Tim Bray’s Wide Finder 2 on the Sun test box.
bannister@wfind01$ time ./feed-workers -n 30 -r `which perl` -s scripts/reduce.pl logs/O.all |
time scripts/combine.pl > _x30_reduce_combine
Fri Jun 13 04:03:32 2008
Scanning: logs/O.all
Done with: logs/O.all
Worker #21875 ended with status: 0
Worker #21874 ended with status: 0
Worker #21873 ended with status: 0
Worker #21872 ended with status: 0
Worker #21871 ended with status: 0
Worker #21870 ended with status: 0
Worker #21869 ended with status: 0
Worker #21868 ended with status: 0
Worker #21867 ended with status: 0
Worker #21866 ended with status: 0
Worker #21865 ended with status: 0
Worker #21864 ended with status: 0
Worker #21863 ended with status: 0
Worker #21862 ended with status: 0
Worker #21861 ended with status: 0
Worker #21860 ended with status: 0
Worker #21859 ended with status: 0
Worker #21858 ended with status: 0
Worker #21857 ended with status: 0
Worker #21856 ended with status: 0
Worker #21855 ended with status: 0
Worker #21854 ended with status: 0
Worker #21853 ended with status: 0
Worker #21852 ended with status: 0
Worker #21851 ended with status: 0
Worker #21850 ended with status: 0
Worker #21849 ended with status: 0
Worker #21848 ended with status: 0
Worker #21847 ended with status: 0
Worker #21846 ended with status: 0
Fri Jun 13 04:36:03 2008
Elapsed (ms): 1950971, total (MB): 43178
Scanned 22 MB/s
real 34:06.9
user 10:32.7
sys 11.1
real 34m6.930s
user 609m12.577s
sys 12m33.298s
Total elapsed time was a bit over 34 minutes to process the full 45GB log file.
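As a rough check on the figures in that output, the arithmetic can be sketched in a few lines of C++. The function names here are mine, not part of feed-workers, and the CPU figure assumes the second set of timings (609 minutes of user time) covers the whole pipeline rather than a single process.

```cpp
#include <cassert>

// Convert a "XmY.Zs" style timing into seconds.
double to_seconds(double minutes, double seconds) {
    return minutes * 60.0 + seconds;
}

// Scan rate in MB/s from the "Elapsed (ms)" and "total (MB)" figures.
double scan_rate_mb_per_s(double total_mb, double elapsed_ms) {
    assert(elapsed_ms > 0);
    return total_mb / (elapsed_ms / 1000.0);
}

// Average number of busy CPUs: CPU time consumed divided by wall-clock time.
double average_busy_cpus(double cpu_seconds, double wall_seconds) {
    assert(wall_seconds > 0);
    return cpu_seconds / wall_seconds;
}
```

Plugging in 43178 MB and 1950971 ms gives a little over 22 MB/s, matching the reported scan rate. Dividing user plus sys time (609m12.577s + 12m33.298s) by the 34m6.930s of real time gives roughly 18 CPUs busy on average, under the assumption stated above.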
The emphasis here is on a more general-purpose and reusable solution, rather than something over-specialized to this one example problem, as described in the first round.
Implementation in brief:
- The feed-workers process reads data from disk (at the fastest possible rate), and writes the data to a specified number of child processes.
- The reduce process parses the log file and computes subtotals for the fields of interest. This is the heaviest processing, and is best suited for distributing across many CPUs.
- The combine process reads the reduce subtotals and computes the final totals for each value of interest.
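The split between reduce and combine described above can be illustrated with a minimal C++ sketch. This is not the actual code: the real reduce and combine are short Perl scripts running as separate processes, whereas this sketch keeps everything in memory and tracks only one statistic (response bytes per URI). The names `reduce_line`, `combine`, and `top_n` are hypothetical, and the parser assumes Apache-style log lines of the form `host - - [date] "GET /uri HTTP/1.1" status bytes ...`.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Totals = std::map<std::string, std::uint64_t>;
using Row = std::pair<std::string, std::uint64_t>;

// Reduce step (hypothetical): add one log line's response bytes to this
// worker's per-URI subtotals. Lines that do not match the assumed
// Apache-style layout are skipped.
void reduce_line(const std::string& line, Totals& subtotals) {
    const auto open = line.find('"');
    if (open == std::string::npos) return;
    const auto close = line.find('"', open + 1);
    if (close == std::string::npos) return;

    // The quoted request looks like: GET /uri HTTP/1.1
    std::istringstream request(line.substr(open + 1, close - open - 1));
    std::string method, uri;
    if (!(request >> method >> uri)) return;

    // After the request come the status code and the byte count.
    std::istringstream rest(line.substr(close + 1));
    int status = 0;
    std::string bytes_field;
    if (!(rest >> status >> bytes_field)) return;

    // The byte count may be "-" when no body was sent; skip anything
    // that is not a plain number short enough to fit in 64 bits.
    const bool all_digits = bytes_field.size() <= 18 &&
        std::all_of(bytes_field.begin(), bytes_field.end(),
                    [](unsigned char c) { return std::isdigit(c) != 0; });
    if (!all_digits) return;

    subtotals[uri] += std::stoull(bytes_field);
}

// Combine step (hypothetical): merge the subtotals from every worker.
Totals combine(const std::vector<Totals>& per_worker) {
    Totals totals;
    for (const auto& subtotals : per_worker)
        for (const auto& entry : subtotals)
            totals[entry.first] += entry.second;
    return totals;
}

// Pick the n largest entries, biggest first.
std::vector<Row> top_n(const Totals& totals, std::size_t n) {
    assert(n > 0);
    std::vector<Row> rows(totals.begin(), totals.end());
    n = std::min(n, rows.size());
    std::partial_sort(rows.begin(), rows.begin() + n, rows.end(),
                      [](const Row& a, const Row& b) { return a.second > b.second; });
    rows.resize(n);
    return rows;
}
```

Each worker would call `reduce_line` on the lines it receives and hand its `Totals` to `combine`; `top_n(totals, 10)` then yields a "top 10" list. Because the subtotals are simple sums, the workers need no coordination while they run, which is why the heavy parsing spreads easily across CPUs.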
If you want to adapt this “wide-finder” to another purpose, you need only look at reduce and combine.
Approximate line counts for each component:
| lines | component | language |
|---|---|---|
| 349 | feed-workers | C++ |
| 24 | reduce | Perl |
| 36 | combine | Perl |
Final result (that might even be correct):
Top 10 URIs by total response bytes
919814823566: /ongoing/ongoing.atom
393012328499: /ongoing/potd.png
297110748615: /ongoing/ongoing.rss
95967470509: /ongoing/rsslogo.jpg
70619295535: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
46373582976: /talks/php.de.pdf
43559176904: /ongoing/When/200x/2006/05/16/J1d0.mov
42428609673: /ongoing/When/200x/2007/12/14/Shonen-Knife.mov
38415215289: /ongoing/
35603054785: /ongoing/moss60.jpg
Top 10 URIs returning 404 (Not Found)
54271: /ongoing/ongoing.atom.xml
28030: /ongoing/ongoing.pie
27365: /ongoing/favicon.ico
26084: /ongoing/Browser-Market-Share.png
24631: /ongoing/When/200x/2004/04/27/-//W3C//DTD%20XHTML%201.1//EN
24078: /ongoing/Browsers-via-search.png
24004: /ongoing/Search-Engines.png
22637: /ongoing/ongoing.atom'
22619: //ongoing/ongoing.atom'
20587: /ongoing/Feeds.png
Top 10 URIs by hits on articles
614255: /ongoing/When/200x/2005/05/01/Hammer_sickle_clean.png
561720: /ongoing/When/200x/2003/07/17/noIE.gif
321873: /ongoing/When/200x/2004/12/12/-tn/Browser-Market-Share.png
252828: /ongoing/When/200x/2004/02/18/Bump.png
242520: /ongoing/When/200x/2004/12/12/-tn/Browsers-via-search.png
241340: /ongoing/When/200x/2004/12/12/-tn/Search-Engines.png
219569: /ongoing/When/200x/2003/09/18/NXML
204202: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
168652: /ongoing/When/200x/2003/03/16/XML-Prog
137457: /ongoing/When/200x/2006/03/30/IMG_4613.png
Top 10 client IPs by hits on articles
366634: msnbot.msn.com
192147: cmbg-cache-2.server.ntli.net
161867: crawler14.googlebot.com
145264: crawl-66-249-72-173.googlebot.com
132805: crawl-66-249-72-172.googlebot.com
131051: cmbg-cache-1.server.ntli.net
100298: crawl-66-249-72-72.googlebot.com
95580: wfp2.almaden.ibm.com
90831: sv-crawlfw3.looksmart.com
84546: crawler10.googlebot.com
Top 10 referrers by hits on articles
993394: http://www.google.com/reader/view/
243013: http://planet.xmlhack.com/
195861: http://tbray.org/ongoing/
194726: http://planetsun.org/
181280: http://planetjava.org/
158613: http://slashdot.org/
117228: http://www.chat.kg/
112469: http://planet.intertwingly.net/
89177: http://www.planetjava.org/
55593: http://www.bloglines.com/myblogs_display?all=1