I must be getting cranky in my old age. The discussion around Tim Bray’s “Wide Finder” bugs me.
ongoing · The Wide Finder Project
In my Finding Things chapter of Beautiful Code, the first complete program is a little Ruby script that reads the ongoing Apache logfile and figures out which articles have been fetched the most. It’s a classic example of the culture, born in Awk, perfected in Perl, of getting useful work done by combining regular expressions and hash tables. I want to figure out how to write an equivalent program that runs fast on modern CPUs with low clock rates but many cores; this is the Wide Finder project.
First, the example problem chosen by Tim Bray is crap, but the underlying question is good. The example can be made to run at near-full I/O speed, and this fact is distracting too many folks. The underlying question – how to spread a common problem over lots of CPUs – is a worthy topic.
Second, given the above context, optimizing the regex use is irrelevant (though fun for folks with a twisted sense of fun – of which I am one).
Third, there are an amazing number of folk who still seem to be clueless when it comes to optimizing disk I/O. Memory-mapped I/O is not magic. Operating Systems optimize sequential I/O – have seen writings from both Sun and Microsoft (and older), but have no references to offer. Very likely stdio is faster than C++ streams (some history there). Seeks are not free – so reading off more than one place on disk is usually bad for performance.
In sum we can expect a single process/thread using stdio/fopen/fgets could suck up all (or near all) the I/O throughput and deliver near-optimal results. More than one thread doing disk reads is non-optimal for most machines (like 90%+). Any approach that starts with more than one thread doing disk reads is probably a Really Bad Idea. (Think about the physics of a disk drive – it is not complicated.)
Still, a guess is not as good as a measurement. As a first cut, wrote a simple program to measure reading a large file using stdio (no other processing), and to measure the effect of applying a regex as in Tim’s example.
preston@mite:~/workspace/wide-finder-0$ time Debug/wide-finder-0 log-* Scanning: log-00 Scanning: log-01 Scanning: log-02 Scanning: log-03 Scanning: log-04 Scanning: log-05 Scanning: log-06 Scanning: log-07 Scanning: log-08 Scanning: log-09 Scanning: log-10 Scanning: log-11 Scanning: log-12 Scanning: log-13 Scanning: log-14 Scanning: log-15 Scanning: log-16 Scanning: log-17 Scanning: log-18 Scanning: log-19 Elapsed (ms): 26120 total (MB): 4077 lines: 20000000 Scanned ?? MB/s Scanning: log-00 Scanning: log-01 Scanning: log-02 Scanning: log-03 Scanning: log-04 Scanning: log-05 Scanning: log-06 Scanning: log-07 Scanning: log-08 Scanning: log-09 Scanning: log-10 Scanning: log-11 Scanning: log-12 Scanning: log-13 Scanning: log-14 Scanning: log-15 Scanning: log-16 Scanning: log-17 Scanning: log-18 Scanning: log-19 Elapsed (ms): 258260 total (MB): 4077 lines: 20000000 matched: 14571780 Scanned 15 MB/s Final elapsed (ms): 284380 real 6m20.018s user 4m18.864s sys 0m25.522s preston@mite:~/workspace/wide-finder-0$
Sources are at: http://svn.bannister.us/public/wide-finder-0
Using stdio with a simple fgets() loop is able to suck in over ??MB/s on my not-cutting-edge hardware. Looks I/O not CPU bound. The input files total bigger than memory, so this is real not cached disk reads. My guess is that a single CPU on a Niagara box will do at least as well (or my faith in the Sun folk is misplaced).
What this tells us is – as expected – a single CPU doing disk reads is right for this problem.
The second part adds applying a regex as in Tim’s example. Throughput drops to 15MB/s and process is (very!) CPU bound. For the purpose of Tim’s exercise, this is excellent! The next logical step would be to find a way to distribute the CPU-hungry regex processing across multiple CPUs.
Just to be clear … for the purpose of Tim’s topic, none of the following is relevant.
- Comparisons between the performance of my box and another box.
- Optimizing the file reader. (It is already I/O bound.)
- Optimizing the regex processing. (Do I have to explain this?)
What is relevant is distributing the regex processing across multiple CPUs. My next iteration would be for the reader to place lines into some sort of in-memory queue (or simple pipes), with a scalable number of threads (~1 per CPU or as needed for the task) reading from the queue. For this problem, outputs from each regex-thread by a single(?) post-processing thread (merging the counted request strings). This is where the problem gets interesting. It is easy to come up with very specialized solutions. What would be more generally useful is a pipeline something like:
(one instance of reader ) -> (N instances of regex/counter) -> (one instance of merge process)
The reader should probably be C++ code (so we can get full I/O throughput), but I’m tempted – just to prove Tim’s point – to write the intermediate and backend processes in AWK.
Lacking access to a T2 (or the like), I can write the code, but not do any really interesting experiments. Oh well.