Wide finder - first round
Took my feed-workers tool from my example of solving the prior wide-finder exercise, and updated the scripts to match(?) the new strawman wide-finder benchmark. Updated the scripts over the weekend (yes, I'm sufficiently warped to find this sort of exercise amusing). Got onto the Sun 32-CPU box on Tuesday, and did some full scale runs on the 45GB data. Caught a couple minor bugs on my part, and got results mostly in line with my expectations, with one major exception.
Note that my aim here is not to get the fastest possible time, or to illustrate the use of some new exotic programming language or technique. Rather I am interested in using the simplest possible general purpose tool (feed-workers) in combination with commonly available Unix tools (Perl in this instance) to solve the problem in the least programmer time while taking advantage of a large number of available CPUs (as in the Sun Fire T2000 box). Note that the reduce process fully parses each line of the log - no shortcuts assuming log lines are space-delimited - which means the script can easily be re-used for other log-scanning tasks (and should be dead easy to read and modify).
Counting the (approximate) lines of feed-workers code:
$ cat *.cpp *.h | tr -cd ';}' | wc -c
Counting the lines of one-pass report Perl code:
$ tr -cd ';' < scripts/report.pl | wc -c
Counting the lines of reduce Perl code:
$ tr -cd ';' < scripts/reduce.pl | wc -c
Counting the lines of combine Perl code:
$ tr -cd ';' < scripts/combine.pl | wc -c
Once written, you never have to look at the sources for feed-workers again. When changing the log scanner you need only read reduce and combine, each of which are amounts only to a rather sparse page of code - thus easy to modify and easy to re-use.
In the first set of measurements I was interested in CPU usage, not I/O speed, so I made sure the input files were cached in memory before collecting times. I also did more than one run for each test, just to verify that times were consistent (as they were).
First run - single process Perl script processing a small (100K line) log file.
bannister@wfind01$ time scripts/report.pl logs/O.100k > _report
Next, split the Perl script into two scripts. The first reduces the data to a smaller form, and the second generates a report from the reduced form. We are using two CPUs here, but we get slightly slower performance - and this makes sense. The reduce script only generates output after completely processing the input, so the two processes are essentially sequential. The generation and consumption of the intermediate (reduced) form should cost us a little, and it does.
First we time the reduce script:
bannister@wfind01$ time scripts/reduce.pl logs/O.100k | scripts/combine.pl > _reduce_combine
Next we time the combine script:
bannister@wfind01$ scripts/reduce.pl logs/O.100k | time scripts/combine.pl > _reduce_combine
As expected, the majority of the processing is in the reduce script, and the time needed for the combine script is small.
Note that splitting the original script into two scripts requires the generation and consumption of an intermediate form, but both scripts are still quite short. If you have spent a considerable amount of time analyzing large log files (as have I) then this is a common technique used, not for performance, but to break a complex transformation into more manageable chunks.
Put simply, if you do this sort of work, the approach used here should be familiar.
Next, use the same two scripts as the prior run, but invoke the first via feed-workers. There are three active processes now, but we expect and get pretty much the same times as the prior test.
bannister@wfind01$ time ./feed-workers -n 1 -r `which perl` -s scripts/reduce.pl logs/O.100k |
scripts/combine.pl > _feed_x1_reduce_combine
Tue Jun 3 01:28:50 2008
Up to this point we really have not used more than one CPU. The prior measurements were to give us a baseline, and to show the processing load was in line with expectations. Now we want to show the effect of spreading out the heavy reduce processing over a greater number of CPUs.
Times for reduce processing on 4 worker processes:
bannister@wfind01$ time ./feed-workers -n 4 -r `which perl` -s scripts/reduce.pl logs/O.100k |
scripts/combine.pl > _feed_x4_reduce_combine
Tue Jun 3 01:30:16 2008
Times for reduce processing on 10 worker processes:
bannister@wfind01$ time ./feed-workers -n 10 -r `which perl` -s scripts/reduce.pl logs/O.100k |
scripts/combine.pl > _feed_x10_reduce_combine
Tue Jun 3 01:31:08 2008
Times for reduce processing on 20 worker processes:
bannister@wfind01$ time ./feed-workers -n 20 -r `which perl` -s scripts/reduce.pl logs/O.100k |
scripts/combine.pl > _feed_x20_reduce_combine
Tue Jun 3 01:31:46 2008
Times for reduce processing on 30 worker processes:
bannister@wfind01$ time ./feed-workers -n 30 -r `which perl` -s scripts/reduce.pl logs/O.100k |
scripts/combine.pl > _feed_x30_reduce_combine
Tue Jun 3 01:32:16 2008
Times for reduce processing on 40 worker processes:
bannister@wfind01$ time ./feed-workers -n 40 -r `which perl` -s scripts/reduce.pl logs/O.100k |
scripts/combine.pl > _feed_x40_reduce_combine
Tue Jun 3 01:32:47 2008
Times for reduce processing on 50 worker processes:
bannister@wfind01$ time ./feed-workers -n 50 -r `which perl` -s scripts/reduce.pl logs/O.100k |
scripts/combine.pl > _feed_x50_reduce_combine
Tue Jun 3 01:33:27 2008
To be honest I was not exactly sure what sort of result we would get for the above on a Sun Fire T2000. Sun claims the single-chip UltraSPARC T1 processor contains 8 cores and supports 32 threads. That might mean 32 active threads in a multi-threaded application like Java, but what does it mean for a multi-process application? Guess we have an answer. :)
The above is pretty much exactly what you might expect. We get large improvements in elapsed time spreading the processing over 4 and 10 CPUs, and smaller improvements up to 30 CPUs. As the number of CPUs used increases, the amount of "wasted" processing also increases (as reflected in the "user" times). Elapsed times are slightly worse for 40 and 50 reduce processes - past the number of "threads" supported by the 32-way UltraSPARC T1.
Upshot: scaling the number of CPUs used improves elapsed time up to the number of physical CPUs, netting about a 3x overall speedup (for this problem) over the original single-process version. (Remember to add in the ~1.3 seconds needed for the combine processing, which is largely sequential to the reduce processing.)
Note also that for larger input data we might expect the overhead for many-processes to matter less. For the above 100K line input, we are very much into diminishing returns above 10-way replication on reduce.
The above is all about distributing the same problem over a varying number of processors. The next set of tests is about increasing the amount of data processed.
Bumping the amount of data processed from 100 thousand to 1 million lines, we get:
bannister@wfind01$ time ./feed-workers -n 30 -r `which perl` -s scripts/reduce.pl logs/O.1m |
scripts/combine.pl > _x30_1m
Tue Jun 3 01:35:10 2008
Elapsed time for reduce processing with 10x data takes only 7x as long. This makes sense as the overhead for the startup of a larger number of processes gets amortized over a longer run for each sub-process. Looks good thus far, so tried the full data set...
bannister@wfind01$ time ./feed-workers -n 30 -r `which perl` -s scripts/reduce.pl logs/O.all |
scripts/combine.pl > _reduce_all
Tue Jun 3 21:34:20 2008
... manually aborted ...
I had to manually abort this test run. The multi-process reduce processing was complete, but the combine processing was running ... and running ... and running ....... so apparently something in combine was not scaling well to large data sets. Something to address in the next iteration.
Measured just the reduce processing on the full data set:
$ time ./feed-workers -n 30 -r `which perl` -s scripts/reduce.pl logs/O.all > _reduce_all
Tue Jun 3 21:34:20 2008
Tue Jun 3 22:02:00 2008
Elapsed (ms): 1660139, total (MB): 43178
Scanned 26 MB/s
Note that the input data is no longer cached in memory for the full data set, so the processing rate now includes the time needed for reading the data off physical disk. Clearly the reduce processing is nicely distributable across multiple processes (and CPUs).
Object of the next iteration will be to find where combine processing is expensive.