For Tim Bray's Wide Finder, what you need is something that will lift data off the disk at the maximum possible rate, and distribute the processing across as many CPUs as needed. The end-to-end result cannot be generated faster than the data can be read from disk. The first part of the problem is to get the data off disk as fast as possible.
My notion was to find the smallest/simplest bit of code that would allow any reasonable scripting language (Perl, AWK, Python, Ruby, whatever) to hit optimal elapsed times on a many-CPU box (my take on Tim's notion). Ripped out most of the unneeded bits from the previous test program (wide-finder-0), wrote a more efficient buffer rather than line-oriented reader, added in getopt() (so the script run is no longer fixed), and created a simple feed-workers utility.
Usage: feed-workers [ options ] files... Where options are: -n N number of worker processes -r run name of executable to run as a worker (default is /usr/bin/perl) -s script name of script (passed as argument to worker) The lines read from given list of files are distributed equally among the workers.
The feed-workers program is the one bit of help you need to efficiently read from the disk and distribute processing across CPUs running scripts written in your favorite higher-order language. The full "wide finder" looks like:
feed-workers -n 4 -s scripts/reduce.pl logs/*| scripts/combine.pl
The results of a test run on a dual-CPU box:
preston@brutus:~/feed-workers$ time ./feed-workers -n 4 -s scripts/reduce.pl logs/* | scripts/combine.pl > _counts Thu Nov 29 17:22:09 2007 Scanning: logs/log-00 Scanning: logs/log-01 Scanning: logs/log-02 Scanning: logs/log-03 Scanning: logs/log-04 Scanning: logs/log-05 Scanning: logs/log-06 Scanning: logs/log-07 Scanning: logs/log-08 Scanning: logs/log-09 Scanning: logs/log-10 Scanning: logs/log-11 Scanning: logs/log-12 Scanning: logs/log-13 Scanning: logs/log-14 Scanning: logs/log-15 Scanning: logs/log-16 Scanning: logs/log-17 Scanning: logs/log-18 Scanning: logs/log-19 Worker #29298 ended with status: 0 Worker #29297 ended with status: 0 Worker #29296 ended with status: 0 Worker #29295 ended with status: 0 Thu Nov 29 17:23:36 2007 Elapsed (ms): 86565, total (MB): 4096 Scanned 47 MB/s real 1m26.738s user 0m48.311s sys 0m14.649s
This is run on a vanilla AMD Athlon 64 X2 4200 box with \~3GB usable memory . Since the test files total 4GB, the files cannot be cached.
preston@brutus:~/feed-workers$ uname -a Linux brutus 2.6.20-16-generic #2 SMP Sun Sep 23 18:31:23 UTC 2007 x86_64 GNU/Linux
The following measurements are to get some notion of overhead and performance limits.
Just running the same input through /bin/cat without feed-workers.
preston@brutus:~/feed-workers$ time cat logs/* > /dev/null real 1m23.065s user 0m0.096s sys 0m3.960s
Running just the Perl scripts:
preston@brutus:~/feed-workers$ time scripts/reduce.pl logs/* | scripts/combine.pl > __counts real 1m27.342s user 0m42.895s sys 0m4.288s
Feeding the same input to
/bin/cat instead of a Perl script.
preston@brutus:~/feed-workers$ time ./feed-workers -n 4 -r /bin/cat logs/* > /dev/null Thu Nov 29 17:26:22 2007 [snip] Thu Nov 29 17:27:50 2007 Elapsed (ms): 88373, total (MB): 4096 Scanned 46 MB/s real 1m28.698s user 0m0.440s sys 0m14.601s
Running the "wide finder" with only a single worker process.
preston@brutus:~/feed-workers$ time ./feed-workers -n 1 -s scripts/reduce.pl logs/* | scripts/combine.pl > _counts Fri Nov 30 08:14:59 2007 [snip] Fri Nov 30 08:16:22 2007 Elapsed (ms): 82939, total (MB): 4096 Scanned 49 MB/s real 1m23.009s user 0m43.723s sys 0m14.681s
Looked at together we get a consistent picture.
real u+s command
83 4 cat 87 47 reduce.pl 89 15 feed-workers + cat (x 4) 87 63 feed-workers + reduce.pl (x 4) 83 58 feed-workers + reduce.pl (x 1)
All times rounded to whole seconds. Given the limited tests, variance up to 10% may not be significant.
Since the elapsed times for single-process 'cat' and 'reduce.pl' are nearly identical, we can conclude that we are disk-limited, and throwing more CPUs at the problem will not help. Given the user+system CPU time is about half the elapsed time, as a benchmark, this example "wide finder" problem is a little too lightweight. To really show off a "wide finder", we want a processing task heavy enough that no one single CPU can process that data at disk rates. We probably want a task 3 or 4 times "heavier" in terms of CPU usage.
Looking at the 'feed-workers + cat' and 'feed-workers + reduce.pl' times, the elapsed times are about the same as the single process times. Looks like feed-workers disk reader is using about 13% of one CPU, so we could likely handle eight times the I/O rate before a single process feed-workers became a bottleneck (without any further optimization). This is a good thing, as you would need a very high-end disk subsystem before you would have to think about multiple readers. Also the reader would only become a bottleneck if the worker processes could keep up with the increased rate. Taken together we can expect a single reader to be optimal in almost all interesting cases.
Looks like starting a Perl (via fork/exec) costs us (very roughly) about 4/3 of second. A bit fat, but not too significant out of the entire time.
Interesting that the 'feed-workers + reduce.pl (x 1)' time slightly beats the 'reduce.pl' time, and matches the 'cat' time. This is exactly the result you hope for on a dual-CPU box.
There might be some room for optimization in the feed-workers code, but looks like we are already into diminishing returns for time invested. The overhead added feeding multiple workers is small, and can keep up with much higher than usual I/O rates. A solution based on feed-workers - given enough CPUs - can process data as fast as it can be read off disk (in the general case), and perform efficiently for file sizes much, much larger than memory.
- The example "wide finder" problem is too lightweight to show advantage on a multi-core CPU.
- A single instance of the generic feed-workers reader should prove optimal for almost all cases.
- The initial reader implementation is efficient enough, as we get a small performance improvement even on a dual-CPU box.
- For one sort of problem, processing line-oriented records where order does not matter, this a sufficient solution for distributing across many CPUs without specialized code.