Preston L. Bannister { random memes }

2008.06.12

Wider finder – final result

Filed under: General — Preston @ 9:21 pm

[wrapup - added later]

These results are from running my implementation of Tim Bray’s Wide Finder 2 on the Sun test box.

bannister@wfind01$ time ./feed-workers -n 30 -r `which perl` -s scripts/reduce.pl logs/O.all |
    time scripts/combine.pl > _x30_reduce_combine
Fri Jun 13 04:03:32 2008
Scanning: logs/O.all
Done with: logs/O.all
Worker #21875 ended with status: 0
Worker #21874 ended with status: 0
Worker #21873 ended with status: 0
Worker #21872 ended with status: 0
Worker #21871 ended with status: 0
Worker #21870 ended with status: 0
Worker #21869 ended with status: 0
Worker #21868 ended with status: 0
Worker #21867 ended with status: 0
Worker #21866 ended with status: 0
Worker #21865 ended with status: 0
Worker #21864 ended with status: 0
Worker #21863 ended with status: 0
Worker #21862 ended with status: 0
Worker #21861 ended with status: 0
Worker #21860 ended with status: 0
Worker #21859 ended with status: 0
Worker #21858 ended with status: 0
Worker #21857 ended with status: 0
Worker #21856 ended with status: 0
Worker #21855 ended with status: 0
Worker #21854 ended with status: 0
Worker #21853 ended with status: 0
Worker #21852 ended with status: 0
Worker #21851 ended with status: 0
Worker #21850 ended with status: 0
Worker #21849 ended with status: 0
Worker #21848 ended with status: 0
Worker #21847 ended with status: 0
Worker #21846 ended with status: 0
Fri Jun 13 04:36:03 2008
Elapsed (ms): 1950971, total (MB): 43178
Scanned 22 MB/s

real    34:06.9
user    10:32.7
sys        11.1

real    34m6.930s
user    609m12.577s
sys     12m33.298s

Total elapsed time to process the full 45 GB log file was a bit over 34 minutes, at a scan rate of about 22 MB/s.
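As a quick sanity check on those figures, the arithmetic can be written out. This is only a sketch: the function names are made up for this example, and it assumes the timings printed in the `34m6.930s` format cover the whole pipeline (the post does not say which `time` printed which block).

```cpp
#include <cassert>

// Scan rate in MB/s, from the "Elapsed (ms)" and "total (MB)" figures
// that feed-workers prints at the end of a run.
double scan_rate_mb_per_s(double total_mb, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return total_mb / (elapsed_ms / 1000.0);
}

// Average number of CPUs kept busy: CPU time (user + sys) divided by
// wall-clock time, both in seconds.
double average_busy_cpus(double cpu_seconds, double wall_seconds) {
    assert(wall_seconds > 0.0);
    return cpu_seconds / wall_seconds;
}
```

With the numbers from the run, scan_rate_mb_per_s(43178, 1950971) comes to roughly 22.1, which matches the reported 22 MB/s. Under the assumption above, average_busy_cpus(36552.577 + 753.298, 2046.930) comes to roughly 18.2, so on average about 18 CPUs' worth of work was in flight across the 30 workers.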

As described in the first round, the emphasis here is on a more general-purpose and reusable solution, rather than something over-specialized to this one example problem. Implementation in brief:

  1. The feed-workers process reads data from disk (at the fastest possible rate),
    and writes the data to a specified number of child processes.
  2. Each reduce process parses its share of the log data and computes subtotals for the fields of interest.
    This is the heaviest processing, and the part best suited to distribution across many CPUs.
  3. The combine process reads the reduce subtotals
    and computes the final totals for each value of interest.
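
Steps 2 and 3 above can be pictured with a small in-process sketch. This is not the actual reduce.pl or combine.pl (those are Perl and are not shown in this post). It is C++ for illustration only, it assumes the input is in Apache combined log format, and the names parse_line, reduce_lines, combine and top_n are made up for this example. The process plumbing done by feed-workers is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Totals = std::unordered_map<std::string, std::uint64_t>;
using Entry = std::pair<std::string, std::uint64_t>;

// Hypothetical parser: pulls the URI and response size out of one line,
// assuming Apache combined log format. Returns false if the line does not fit.
bool parse_line(const std::string& line, std::string& uri, std::uint64_t& bytes) {
    // The request is the first quoted field, e.g. "GET /path HTTP/1.1".
    std::size_t q1 = line.find('"');
    if (q1 == std::string::npos) return false;
    std::size_t q2 = line.find('"', q1 + 1);
    if (q2 == std::string::npos) return false;
    std::istringstream request(line.substr(q1 + 1, q2 - q1 - 1));
    std::string method;
    if (!(request >> method >> uri)) return false;
    // The status code and byte count follow the closing quote.
    std::istringstream rest(line.substr(q2 + 1));
    std::string status, size;
    if (!(rest >> status >> size)) return false;
    if (size == "-") { bytes = 0; return true; }  // "-" means no body was sent
    if (size.size() > 18) return false;           // would not fit safely in 64 bits
    if (!std::all_of(size.begin(), size.end(),
                     [](unsigned char c) { return std::isdigit(c) != 0; })) {
        return false;
    }
    bytes = std::stoull(size);
    return true;
}

// Reduce step: one worker folds its share of the lines into subtotals.
Totals reduce_lines(const std::vector<std::string>& lines) {
    Totals subtotals;
    std::string uri;
    std::uint64_t bytes = 0;
    for (const std::string& line : lines) {
        if (parse_line(line, uri, bytes)) subtotals[uri] += bytes;
    }
    return subtotals;
}

// Combine step: merge the subtotals from every worker into final totals.
Totals combine(const std::vector<Totals>& parts) {
    Totals totals;
    for (const Totals& part : parts) {
        for (const auto& entry : part) totals[entry.first] += entry.second;
    }
    return totals;
}

// Pick the n largest totals, largest first (ties broken by key).
std::vector<Entry> top_n(const Totals& totals, std::size_t n) {
    std::vector<Entry> ranked(totals.begin(), totals.end());
    n = std::min(n, ranked.size());
    assert(n <= ranked.size());  // partial_sort needs its middle iterator in range
    std::partial_sort(ranked.begin(), ranked.begin() + n, ranked.end(),
                      [](const Entry& a, const Entry& b) {
                          if (a.second != b.second) return a.second > b.second;
                          return a.first < b.first;
                      });
    ranked.resize(n);
    return ranked;
}
```

The point of the split is that reduce_lines can run on any slice of the input independently, and combine only ever sees the much smaller subtotals. A call such as top_n(combine(parts), 10) would then give a "top 10 by total response bytes" list.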

If you want to adapt this “wide-finder” to another purpose, you need only look at reduce and combine. Approximate line counts for each component:

lines  component     language
  349  feed-workers  C++
   24  reduce        Perl
   36  combine       Perl
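
To make the point about adapting only reduce and combine a little more concrete, here is one way to picture it. This is a hypothetical C++ sketch, not how the Perl scripts are organized: the Sample struct, the Extractor type and the functions reduce_with, not_found_by_uri and hits_by_client are all invented for this example, and the parsing again assumes Apache combined log format. The idea is that the problem-specific part is a small function that says which key to credit for a line, while everything around it stays the same.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

using Totals = std::unordered_map<std::string, std::uint64_t>;

// What an extractor reports for one log line.
struct Sample {
    std::string key;           // what to credit, e.g. a URI or a client host
    std::uint64_t amount = 0;  // how much to add to that key's subtotal
};

// Returns true and fills the Sample when the line is of interest.
using Extractor = std::function<bool(const std::string&, Sample&)>;

// Generic reduce: only the extractor knows what is being counted.
Totals reduce_with(const Extractor& extract, const std::vector<std::string>& lines) {
    assert(static_cast<bool>(extract));  // an empty std::function cannot be called
    Totals subtotals;
    Sample sample;
    for (const std::string& line : lines) {
        if (extract(line, sample)) subtotals[sample.key] += sample.amount;
    }
    return subtotals;
}

// Hypothetical extractor: count requests that returned 404, keyed by URI.
bool not_found_by_uri(const std::string& line, Sample& out) {
    std::size_t q1 = line.find('"');
    if (q1 == std::string::npos) return false;
    std::size_t q2 = line.find('"', q1 + 1);
    if (q2 == std::string::npos) return false;
    std::istringstream request(line.substr(q1 + 1, q2 - q1 - 1));
    std::string method, uri;
    if (!(request >> method >> uri)) return false;
    std::istringstream rest(line.substr(q2 + 1));
    std::string status;
    if (!(rest >> status) || status != "404") return false;
    out.key = uri;
    out.amount = 1;
    return true;
}

// Hypothetical extractor: count hits per client, keyed by the first field.
bool hits_by_client(const std::string& line, Sample& out) {
    std::istringstream fields(line);
    if (!(fields >> out.key)) return false;
    out.amount = 1;
    return true;
}
```

Calling reduce_with(not_found_by_uri, lines) or reduce_with(hits_by_client, lines) produces different subtotals from the same input, and in both cases a combine step only has to add up values that share a key.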

Final result (that might even be correct):

Top 10 URIs by total response bytes
        919814823566: /ongoing/ongoing.atom
        393012328499: /ongoing/potd.png
        297110748615: /ongoing/ongoing.rss
        95967470509: /ongoing/rsslogo.jpg
        70619295535: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
        46373582976: /talks/php.de.pdf
        43559176904: /ongoing/When/200x/2006/05/16/J1d0.mov
        42428609673: /ongoing/When/200x/2007/12/14/Shonen-Knife.mov
        38415215289: /ongoing/
        35603054785: /ongoing/moss60.jpg

Top 10 URIs returning 404 (Not Found)
        54271: /ongoing/ongoing.atom.xml
        28030: /ongoing/ongoing.pie
        27365: /ongoing/favicon.ico
        26084: /ongoing/Browser-Market-Share.png
        24631: /ongoing/When/200x/2004/04/27/-//W3C//DTD%20XHTML%201.1//EN
        24078: /ongoing/Browsers-via-search.png
        24004: /ongoing/Search-Engines.png
        22637: /ongoing/ongoing.atom'
        22619: //ongoing/ongoing.atom'
        20587: /ongoing/Feeds.png

Top 10 URIs by hits on articles
        614255: /ongoing/When/200x/2005/05/01/Hammer_sickle_clean.png
        561720: /ongoing/When/200x/2003/07/17/noIE.gif
        321873: /ongoing/When/200x/2004/12/12/-tn/Browser-Market-Share.png
        252828: /ongoing/When/200x/2004/02/18/Bump.png
        242520: /ongoing/When/200x/2004/12/12/-tn/Browsers-via-search.png
        241340: /ongoing/When/200x/2004/12/12/-tn/Search-Engines.png
        219569: /ongoing/When/200x/2003/09/18/NXML
        204202: /ongoing/When/200x/2004/08/30/-big/IMGP0851.jpg
        168652: /ongoing/When/200x/2003/03/16/XML-Prog
        137457: /ongoing/When/200x/2006/03/30/IMG_4613.png

Top 10 client IPs by hits on articles
        366634: msnbot.msn.com
        192147: cmbg-cache-2.server.ntli.net
        161867: crawler14.googlebot.com
        145264: crawl-66-249-72-173.googlebot.com
        132805: crawl-66-249-72-172.googlebot.com
        131051: cmbg-cache-1.server.ntli.net
        100298: crawl-66-249-72-72.googlebot.com
        95580: wfp2.almaden.ibm.com
        90831: sv-crawlfw3.looksmart.com
        84546: crawler10.googlebot.com

Top 10 referrers by hits on articles
        993394: http://www.google.com/reader/view/
        243013: http://planet.xmlhack.com/
        195861: http://tbray.org/ongoing/
        194726: http://planetsun.org/
        181280: http://planetjava.org/
        158613: http://slashdot.org/
        117228: http://www.chat.kg/
        112469: http://planet.intertwingly.net/
        89177: http://www.planetjava.org/
        55593: http://www.bloglines.com/myblogs_display?all=1