Example - general purpose Trie in Java

2009-10-04

Run across mention of the Trie data structure in a slightly random discussion. Had no notion of how performance of a Trie compares with the usual hash table, so wrote a general-purpose Trie implementation in Java (sources in an Eclipse project) with a bias toward performance.

The results are not encouraging. From a test run:

=== words 14 ms - 98569 words loaded

=== hash 3837/second - hash map loaded {4007 ms, 15376764 operations = 260 ns/op} 7953/second - hash map re-loaded {4003 ms, 31837787 operations = 125 ns/op} 12958/second - access each item in hash map {4001 ms, 51847294 operations = 77 ns/op}

=== trie (wide) 343/second - loaded trie (wide) {4022 ms, 1379966 operations = 2914 ns/op} 2089/second - re-loaded trie (wide) {4010 ms, 8378365 operations = 478 ns/op} 2165/second - access each item in trie (wide) {4005 ms, 8674072 operations = 461 ns/op} 41533124 slots of trie (wide) 225890 nodes of trie (wide)

=== trie (tall) 2254/second - loaded trie (tall) {4022 ms, 9068348 operations = 443 ns/op} 2446/second - re-loaded trie (tall) {4029 ms, 9856900 operations = 408 ns/op} 2429/second - access each item in trie (tall) {4017 ms, 9758331 operations = 411 ns/op} 6982694 slots of trie (tall) 446075 nodes of trie (tall)

=== string to UTF8 conversion 4708/second - word to UTF8 (stock) {4019 ms, 18925248 operations = 212 ns/op} 10503/second - word to UTF8 (fast) {4007 ms, 42088963 operations = 95 ns/op}

The test data is a file of 98569 words found in Linux installation. The words are read from a file, and loaded into an array of strings. The words are loaded into a HashMap (a library class) for comparison, and loaded into both a "tall" and "wide" Trie (differing space/time trade-offs).

The notion used to generate a "fast" Trie is for each node in the Trie to index off a byte in the UTF8 encoding of key String. Along the way, found that the stock String to UTF8 conversion to be a bit slow, so wrote a faster version. Even with the faster encoding, measured Trie performance is not good.

(Bit of a shock to find the stock String-to-UTF8 conversion slow in Java, as this is used a lot - especially in web applications - and may be significant enough to effect benchmarks. Java has been around long enough, so that I expected better. In fact, I expected UTF8 to be in native code in the JVM. Byte-bashing code is better suited to C/C++.)

If strings were already in UTF8 (or ASCII) format - as in C/C++ code - a similar Trie implementation would perform somewhat better, but still use too much memory. Whether a Trie in C++ could more closely approach the performance of a hash table is a bit of an open question.

Not sure why the "wide" Trie (using one node to index one UTF8 byte) is slightly slower than the "tall" Trie (using two nodes to index 4-bits each from one UTF8 byte). Expected the "wide" version to be faster. Suspect the larger memory use is enough to bust the CPU cache, and on smaller data the "wide" variant may be faster. Also saw a lot of jitter in the times - presumably due to Java GC activity (garbage collection) - though the relative results were consistent.

Clearly this "fast, general-purpose" Trie implementation is not overly useful - at ~5 times slower than a hash table, and huge use of memory. Alternate forms can be more space-efficient, but are likely to be slower.

As a guess, a Trie is not going to find much use in my code. :)

[Note that I revisited this topic, later.]