Wide Finder 2 - over to the dark side

2008-06-18

My first solution for Tim Bray's Wide Finder 2 problem was aimed at ease of re-use before performance. Simple Perl scripts (run in a large number of worker processes) did the bulk of the processing.

Got my times, posted the results, and was going to leave it at that.

The damn problem got stuck in my head. Clearly a multi-threaded shared-memory solution would be much more efficient. How would a more specialized solution perform? Found I was thinking about the specialized solution, how to minimize synchronization issues, and was curious how it would perform.

My preference would be to use Java for a multi-threaded shared-memory solution. I have not had cause to use Java for anything similar, so did not know exactly what to expect. Turned out I was in for some surprises.

The first surprise was file I/O. Open a FileInputStream, loop on read() passing a 1MB buffer each call ... and I only get ~32MB/s (from cache) on the Sun test box. Not good. The native code feed-workers reader manages ~124MB/s (from cache) and ~85MB/s (from disk) while piping through cat. Java should manage about the same, but does not - am I missing something?

The second surprise was byte to character transformation. Seems you have to create a lot of garbage to do a simple UTF-8 to Unicode conversion. The Java garbage collector may be pretty good, but better not to create huge masses of garbage in the first place. Maybe I missed an API somewhere? Wrote a simple buffer-to-buffer byte to character converter, to stay away from large GC's. This sort of byte-bashing would be far better in native code.

The third surprise was java.util.regex performance (or lack). Add in regex matching, and suddenly we are down to ~1MB/s processing. Not just a problem with performance - on one long input line regex processing overflowed the Java stack! Ouch.

Not really a surprise - the standard Java general purpose hash tables are not well suited for the required usage: adds and update, but no deletes. Wrote a specialized hash table for which synchronization is only needed on leaf nodes when doing an add (very little chance of one thread blocking another). Still need to properly benchmark this bit.

The poor performance of Java regular expression processing was a disappointment. (I think we have Henry Spencer to blame ... indirectly.) Found the Java Regular expression library benchmarks interesting. Sorting out the java.util.regex or better entries:

Regular expression library Time taken for 10,000 iterations

java.util.regex.Pattern 1.4 609ms kmy.regex.util.Regex 0.1.2 468ms com.karneim.util.collection.regex.Pattern1.1.1 297ms dk.brics.automaton.RegExp1.7-1 172ms

The results are a few years old, so timings may have changed. Also the log parsing regex may benchmark differently. Any sort of definite statement would require a benchmark.

The two fastest listed regular expression libraries: dk.brics.automaton and Jrexx (com.karneim.util.collection.regex), do matching only, no group extractions - and so are not useful for my purpose. For the all the faster entries, documentation is sparse.

There is recent traffic among the JRuby folk about regex performance, the upshot of which is the recent integration of Joni into JRuby. Without a proper benchmark, there is no way of knowing how Joni compares, and the code is pretty much undocumented.

Regex processing can be spread over many concurrent threads, and is thus less of a bottleneck. The worry is the poor single-thread file read performance through FileInputStream. Hunted around the Internet looking to see if anyone got better results. Pretty dismaying how much garbage information is out there. Lots of programmers who think single-char read() calls are a good idea (or small-buffer reads). Did not find any substantially better results.

I expect the overhead for calling from Java to be a higher, but not enough to matter when reading a megabyte in a single call. Somewhat disappointed that no one has worked through the simple exercise of allowing Java to process bulk data efficiently. Tempted to dig into the Java sources - there must be something funky in the FileInputStream.readBytes() native method.

From the test run with 32 thread threads scanning the log and updating the hash table:

Time for - grok (32 workers) File: /export/home/bannister/wfj/logs/O.10m Size: 2029140422 Read: 2029140422 Time: 57216 Rate: 33 MB/s