(Pulling prior examples forward)

How can you spread out processing of very large logs (or similar sequential texts) over all your CPUs?

That was the question Tim Bray asked in his Wide Finder exercise. This was my answer.

Take this as a reflection of how I think: rather than write a complex solution tailored exactly to the problem at hand, I looked for the simplest, most flexible solution.

Pushed the code to GitHub.

I made the code public in the hope that others might use it. Judging from the emails received in the years since, at least a few folks have found it useful. This code also served as the base for an optimal CSV file reader (link:TBD).
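
The core idea is simple: a feeder reads the input in large blocks, cuts each block at the last newline so no line is ever split, and hands whole-line chunks to N worker processes that each run the same script; a combiner then merges their partial results. The actual feed-workers is a compiled program; what follows is only a rough Perl sketch of the idea, not the real implementation:

    use strict;
    use warnings;

    my $nworkers = 4;
    my @workers;
    for (1 .. $nworkers) {
        # each worker runs "perl reduce.pl", reading its share of lines on stdin;
        # its output goes to our stdout, to be merged by a combiner afterwards
        open(my $fh, '|-', 'perl', 'reduce.pl') or die "cannot start worker: $!";
        push @workers, $fh;
    }

    my ($buf, $next) = ('', 0);
    while (read(STDIN, my $block, 1 << 20)) {          # read 1 MB blocks
        $buf .= $block;
        my $cut = rindex($buf, "\n");
        next if $cut < 0;                              # no complete line yet
        # hand everything up to the last newline to the next worker, round-robin
        print { $workers[$next] } substr($buf, 0, $cut + 1);
        $buf  = substr($buf, $cut + 1);
        $next = ($next + 1) % $nworkers;
    }
    print { $workers[$next] } $buf if length $buf;     # any trailing partial line
    close $_ for @workers;                             # wait for the workers to finish

The interesting part is cutting at line boundaries: each worker then sees only ordinary, complete log lines, so the per-line script does not need to know it is running in parallel.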

### Test on current hardware

Running on current hardware (a MacBook Pro with four CPUs and 16 GB of memory), using the logs from my web host as input:

$ wc -lc logs/_access.log
  125948 25645720 logs/_access.log

At about 24 MB, the input file is small enough for the OS to cache in memory, so this is pure single-CPU processing:

$ time wide-finder/scripts/reduce.pl < logs/_access.log | wide-finder/scripts/combine.pl > _out
real    0m0.856s
user    0m0.859s
sys     0m0.021s
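
The two Perl scripts are not shown here, but the split is the usual map-reduce shape: each reduce.pl tallies counts per key from the lines it is handed, and combine.pl sums the partial tallies, which is why the work can be spread over several independent workers. A hypothetical minimal pair in that spirit (assuming a common access-log layout; the real scripts may match different fields) might be:

    #!/usr/bin/perl
    # reduce (sketch): count requests per path from access-log lines on stdin,
    # then emit one "count<TAB>path" line per key
    use strict;
    use warnings;
    my %count;
    while (my $line = <STDIN>) {
        my @field = split ' ', $line;
        $count{ $field[6] }++ if @field > 6;    # request path in common log format
    }
    print "$count{$_}\t$_\n" for keys %count;

and

    #!/usr/bin/perl
    # combine (sketch): sum the per-worker tallies and print the ten largest
    use strict;
    use warnings;
    my %total;
    while (my $line = <STDIN>) {
        chomp $line;
        my ($n, $key) = split /\t/, $line, 2;
        $total{$key} += $n if defined $key;
    }
    my @keys = sort { $total{$b} <=> $total{$a} } keys %total;
    print "$total{$_}\t$_\n" for @keys[0 .. ($#keys < 9 ? $#keys : 9)];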

Using feed-workers with a single worker should be slightly slower than the above; the increase in time is the (small) overhead of the feeder. The CPU-bound Perl scripts remain the limit.

$ time wide-finder/Release/feed-workers -n 1 -r `which perl` -s wide-finder/scripts/reduce.pl logs/_access.log | wide-finder/scripts/combine.pl > _2
TIME Thu Jan  1 12:59:15 2015
Scanning: logs/_access.log
Done with: logs/_access.log
Worker #88301 ended with status: 0
TIME Thu Jan  1 12:59:16 2015
Elapsed (ms): 888, total (MB): 24
Scanned 27 MB/s

real    0m0.909s
user    0m0.912s
sys     0m0.043s

Since the reduce processing is CPU-bound, it is better to spread it over the available CPUs.

$ time wide-finder/Release/feed-workers -n 4 -r `which perl` -s wide-finder/scripts/reduce.pl logs/_access.log | wide-finder/scripts/combine.pl > _2
TIME Thu Jan  1 12:59:22 2015
Scanning: logs/_access.log
Done with: logs/_access.log
Worker #88308 ended with status: 0
Worker #88307 ended with status: 0
Worker #88306 ended with status: 0
Worker #88305 ended with status: 0
TIME Thu Jan  1 12:59:23 2015
Elapsed (ms): 352, total (MB): 24
Scanned 69 MB/s

real    0m0.373s
user    0m1.193s
sys     0m0.052s

Note the user time has increased (Perl is not cheap), while the real time is less than half: 0.373 s versus 0.856 s for the plain pipeline, roughly a 2.3x speedup on this small, cached input.
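
Judging from these invocations, feed-workers takes the number of workers (-n), the program each worker runs (-r), a script argument for that program (-s), and the input file, and it feeds each worker its share of the input on stdin. A hypothetical run of some other per-line filter over two workers would then look like this (my_filter.pl and big.log are stand-ins, not files from the repository):

$ wide-finder/Release/feed-workers -n 2 -r `which perl` -s my_filter.pl big.log | sort > _out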

### Test on large input

The prior numbers were purely CPU-bound, with no time spent reading from disk. With a much larger input file (50 GB) the contents cannot be cached, which is also more indicative of real-world processing (where the input is not already in memory).

$ time wide-finder/Release/feed-workers -n 4 -r `which perl` -s wide-finder/scripts/reduce.pl logs/O.50gb | wide-finder/scripts/combine.pl > _2
TIME Thu Jan  1 13:10:47 2015
Scanning: logs/O.50gb
Done with: logs/O.50gb
Worker #88348 ended with status: 0
Worker #88347 ended with status: 0
Worker #88346 ended with status: 0
Worker #88345 ended with status: 0
TIME Thu Jan  1 13:18:44 2015
Elapsed (ms): 477083, total (MB): 50098
Scanned 105 MB/s

real    7m57.111s
user    30m59.569s
sys     0m59.584s

The result is nearly 4x faster, matching the number of actual CPUs: about 31 minutes of user (CPU) time completed in under 8 minutes of elapsed time. (I did try larger and smaller worker counts, with worse results; Intel’s “hyperthreading” does not help this workload.)

For reference, scanning the file without any processing (the workers simply run cat):

$ time wide-finder/Release/feed-workers -n 4 -r `which cat` logs/O.50gb > /dev/null
TIME Thu Jan  1 13:53:38 2015
Scanning: logs/O.50gb
Done with: logs/O.50gb
Worker #88454 ended with status: 0
Worker #88453 ended with status: 0
Worker #88452 ended with status: 0
Worker #88451 ended with status: 0
TIME Thu Jan  1 13:54:44 2015
Elapsed (ms): 65583, total (MB): 50098
Scanned 763 MB/s

real    1m5.591s
user    0m3.190s
sys     0m57.351s

The MacBook Pro has a fast flash drive. This run shows that feed-workers itself is efficient: with no per-line processing it is I/O bound even on very fast storage, with nearly all of the time spent in the kernel (sys) moving data. It also means the 105 MB/s of the full run above was limited by the Perl processing, not by the drive.