Wide finder - first round

2008-06-08

Took my feed-workers tool from my example of solving the prior wide-finder exercise, and updated the scripts to match(?) the new strawman wide-finder benchmark. Updated the scripts over the weekend (yes, I'm sufficiently warped to find this sort of exercise amusing). Got onto the Sun 32-CPU box on Tuesday, and did some full scale runs on the 45GB data. Caught a couple minor bugs on my part, and got results mostly in line with my expectations, with one major exception.

Note that my aim here is not to get the fastest possible time, or to illustrate the use of some new exotic programming language or technique. Rather I am interested in using the simplest possible general purpose tool (feed-workers) in combination with commonly available Unix tools (Perl in this instance) to solve the problem in the least programmer time while taking advantage of a large number of available CPUs (as in the Sun Fire T2000 box). Note that the reduce process fully parses each line of the log - no shortcuts assuming log lines are space-delimited - which means the script can easily be re-used for other log-scanning tasks (and should be dead easy to read and modify).

Counting the (approximate) lines of feed-workers code:

    $ cat *.cpp *.h | tr -cd ';}' | wc -c
    349

Counting the lines of one-pass report Perl code:

    $ tr -cd ';' < scripts/report.pl | wc -c
    23

Counting the lines of reduce Perl code:

    $ tr -cd ';' < scripts/reduce.pl | wc -c
    23

Counting the lines of combine Perl code:

    $ tr -cd ';' < scripts/combine.pl | wc -c
    29

Once written, you never have to look at the sources for feed-workers again. When changing the log scanner you need only read reduce and combine, each of which are amounts only to a rather sparse page of code - thus easy to modify and easy to re-use.

In the first set of measurements I was interested in CPU usage, not I/O speed, so I made sure the input files were cached in memory before collecting times. I also did more than one run for each test, just to verify that times were consistent (as they were).

First run - single process Perl script processing a small (100K line) log file.

    bannister@wfind01$ time scripts/report.pl logs/O.100k > _report

    real    0m13.767s
    user    0m13.656s
    sys     0m0.108s

Next, split the Perl script into two scripts. The first reduces the data to a smaller form, and the second generates a report from the reduced form. We are using two CPUs here, but we get slightly slower performance - and this makes sense. The reduce script only generates output after completely processing the input, so the two processes are essentially sequential. The generation and consumption of the intermediate (reduced) form should cost us a little, and it does.

First we time the reduce script:

    bannister@wfind01$ time scripts/reduce.pl logs/O.100k | scripts/combine.pl > _reduce_combine

    real    0m14.584s
    user    0m14.552s
    sys     0m0.125s

Next we time the combine script:

    bannister@wfind01$ scripts/reduce.pl logs/O.100k | time scripts/combine.pl > _reduce_combine

    real       14.5
    user        1.3
    sys         0.0

As expected, the majority of the processing is in the reduce script, and the time needed for the combine script is small.

Note that splitting the original script into two scripts requires the generation and consumption of an intermediate form, but both scripts are still quite short. If you have spent a considerable amount of time analyzing large log files (as have I) then this is a common technique used, not for performance, but to break a complex transformation into more manageable chunks.

Put simply, if you do this sort of work, the approach used here should be familiar.

Next, use the same two scripts as the prior run, but invoke the first via feed-workers. There are three active processes now, but we expect and get pretty much the same times as the prior test.

    bannister@wfind01$ time ./feed-workers -n 1 -r `which perl` -s scripts/reduce.pl logs/O.100k |
        scripts/combine.pl > _feed_x1_reduce_combine
    Tue Jun  3 01:28:50 2008
    Scanning: logs/O.100k

    real    0m13.241s
    user    0m12.547s
    sys     0m0.402s

Up to this point we really have not used more than one CPU. The prior measurements were to give us a baseline, and to show the processing load was in line with expectations. Now we want to show the effect of spreading out the heavy reduce processing over a greater number of CPUs.

Times for reduce processing on 4 worker processes:

    bannister@wfind01$ time ./feed-workers -n 4 -r `which perl` -s scripts/reduce.pl logs/O.100k |
        scripts/combine.pl > _feed_x4_reduce_combine
    Tue Jun  3 01:30:16 2008
    Scanning: logs/O.100k

    real    0m5.046s
    user    0m13.063s
    sys     0m0.408s

Times for reduce processing on 10 worker processes:

    bannister@wfind01$ time ./feed-workers -n 10 -r `which perl` -s scripts/reduce.pl logs/O.100k |
        scripts/combine.pl > _feed_x10_reduce_combine
    Tue Jun  3 01:31:08 2008
    Scanning: logs/O.100k

    real    0m3.669s
    user    0m14.378s
    sys     0m0.535s

Times for reduce processing on 20 worker processes:

    bannister@wfind01$ time ./feed-workers -n 20 -r `which perl` -s scripts/reduce.pl logs/O.100k |
        scripts/combine.pl > _feed_x20_reduce_combine
    Tue Jun  3 01:31:46 2008
    Scanning: logs/O.100k

    real    0m3.322s
    user    0m17.300s
    sys     0m0.747s

Times for reduce processing on 30 worker processes:

    bannister@wfind01$ time ./feed-workers -n 30 -r `which perl` -s scripts/reduce.pl logs/O.100k |
        scripts/combine.pl > _feed_x30_reduce_combine
    Tue Jun  3 01:32:16 2008
    Scanning: logs/O.100k

    real    0m3.086s
    user    0m20.378s
    sys     0m0.997s

Times for reduce processing on 40 worker processes:

    bannister@wfind01$ time ./feed-workers -n 40 -r `which perl` -s scripts/reduce.pl logs/O.100k |
        scripts/combine.pl > _feed_x40_reduce_combine
    Tue Jun  3 01:32:47 2008
    Scanning: logs/O.100k

    real    0m3.090s
    user    0m21.265s
    sys     0m1.250s

Times for reduce processing on 50 worker processes:

    bannister@wfind01$ time ./feed-workers -n 50 -r `which perl` -s scripts/reduce.pl logs/O.100k |
        scripts/combine.pl > _feed_x50_reduce_combine
    Tue Jun  3 01:33:27 2008
    Scanning: logs/O.100k

    real    0m3.140s
    user    0m21.528s
    sys     0m1.459s

To be honest I was not exactly sure what sort of result we would get for the above on a Sun Fire T2000. Sun claims the single-chip UltraSPARC T1 processor contains 8 cores and supports 32 threads. That might mean 32 active threads in a multi-threaded application like Java, but what does it mean for a multi-process application? Guess we have an answer. :)

The above is pretty much exactly what you might expect. We get large improvements in elapsed time spreading the processing over 4 and 10 CPUs, and smaller improvements up to 30 CPUs. As the number of CPUs used increases, the amount of "wasted" processing also increases (as reflected in the "user" times). Elapsed times are slightly worse for 40 and 50 reduce processes - past the number of "threads" supported by the 32-way UltraSPARC T1.

Upshot: scaling the number of CPUs used improves elapsed time up to the number of physical CPUs, netting about a 3x overall speedup (for this problem) over the original single-process version. (Remember to add in the ~1.3 seconds needed for the combine processing, which is largely sequential to the reduce processing.)

Note also that for larger input data we might expect the overhead for many-processes to matter less. For the above 100K line input, we are very much into diminishing returns above 10-way replication on reduce.

The above is all about distributing the same problem over a varying number of processors. The next set of tests is about increasing the amount of data processed.

Bumping the amount of data processed from 100 thousand to 1 million lines, we get:

    bannister@wfind01$ time ./feed-workers -n 30 -r `which perl` -s scripts/reduce.pl logs/O.1m |
        scripts/combine.pl > _x30_1m
    Tue Jun  3 01:35:10 2008
    Scanning: logs/O.1m

    real    0m21.246s
    user    3m5.751s
    sys     0m4.180s

Elapsed time for reduce processing with 10x data takes only 7x as long. This makes sense as the overhead for the startup of a larger number of processes gets amortized over a longer run for each sub-process. Looks good thus far, so tried the full data set...

    bannister@wfind01$ time ./feed-workers -n 30 -r `which perl` -s scripts/reduce.pl logs/O.all |
        scripts/combine.pl > _reduce_all
    Tue Jun  3 21:34:20 2008
    Scanning: logs/O.all
    ... manually aborted ...

I had to manually abort this test run. The multi-process reduce processing was complete, but the combine processing was running ... and running ... and running ....... so apparently something in combine was not scaling well to large data sets. Something to address in the next iteration.

Measured just the reduce processing on the full data set:

    $ time ./feed-workers -n 30 -r `which perl` -s scripts/reduce.pl logs/O.all > _reduce_all
    Tue Jun  3 21:34:20 2008
    Scanning: logs/O.all
    Tue Jun  3 22:02:00 2008
    Elapsed (ms): 1660139, total (MB): 43178
    Scanned 26 MB/s

    real    27m40.180s
    user    597m45.533s
    sys     12m1.064s

Note that the input data is no longer cached in memory for the full data set, so the processing rate now includes the time needed for reading the data off physical disk. Clearly the reduce processing is nicely distributable across multiple processes (and CPUs).

Object of the next iteration will be to find where combine processing is expensive.