A short diversion into hash tables and searching (blame Tim Bray for mentioning an old article, “On the Goodness of Binary Search”).

Most hash tables are keyed by character strings. A long time ago, in an old CACM article, I ran across a hash function for character strings that is simple enough to re-create from memory and efficient enough to hold up very well in testing.

//  Simple hash function derived from:
//  Communications of the ACM
//  Volume 33, Number 6, June, 1990
//  Peter K. Pearson
//  "Fast hashing of variable-length text strings"

static unsigned char T[] = {
    64, 67, 45, 157, 253, 237, 236, 112, 17, 69, 61, 182, 173, 44, 235, 153,
    102, 223, 251, 95, 166, 136, 160, 25, 60, 198, 146, 62, 87, 200, 71, 169,
    220, 5, 131, 133, 138, 99, 73, 48, 129, 96, 130, 139, 233, 246, 248, 41,
    150, 175, 98, 214, 74, 177, 66, 219, 105, 78, 65, 32, 94, 16, 240, 14, 68,
    174, 37, 81, 238, 107, 88, 135, 13, 180, 132, 28, 155, 222, 228, 70, 92,
    232, 163, 168, 103, 167, 190, 91, 206, 205, 24, 179, 161, 255, 83, 31, 208,
    12, 148, 189, 79, 106, 51, 137, 122, 159, 178, 224, 72, 191, 225, 11, 18,
    193, 8, 126, 84, 50, 114, 140, 247, 215, 9, 221, 171, 152, 196, 59, 242, 35,
    244, 56, 213, 164, 76, 209, 245, 226, 197, 75, 158, 216, 147, 211, 52, 186,
    1, 195, 201, 46, 172, 165, 85, 170, 185, 0, 254, 7, 80, 144, 141, 6, 89,
    217, 128, 86, 30, 121, 113, 184, 199, 252, 110, 57, 188, 54, 231, 183, 187,
    156, 63, 47, 39, 118, 124, 90, 207, 43, 116, 2, 26, 212, 202, 234, 203, 218,
    93, 210, 34, 55, 29, 101, 192, 21, 40, 49, 249, 82, 100, 230, 204, 20, 125,
    239, 145, 108, 111, 27, 176, 4, 53, 241, 229, 3, 143, 77, 15, 149, 23, 154,
    58, 36, 109, 123, 134, 119, 162, 104, 243, 127, 22, 42, 10, 19, 227, 115,
    117, 33, 250, 38, 97, 151, 120, 194, 142, 181
};

//  h is the seed (0-255) - or the hash so far, if you are chaining
//  several strings into one hash value; s is a NUL-terminated string.
static inline unsigned hashOf(unsigned h, const char* s)
{
    for (int c = *s++; c; c = *s++) {
        h = T[h ^ (255 & c)];
    }
    return h;
}

The table T[] is just the values 0-255 shuffled randomly (give me a few minutes and I’ll write you an AWK script to generate as many unique T[]’s as you need). The hash function is very cheap to compute, and, as the original article notes, the distribution of hash values is excellent.
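In lieu of that AWK script, here is a rough sketch of the same idea in C - nothing more than a Fisher-Yates shuffle of the values 0-255, printed in a form you can paste in as T[] (seed the random number generator however you like):

//  Emit a freshly shuffled 0-255 permutation suitable for use as T[].
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    unsigned char t[256];
    for (int i = 0; i < 256; i++)
        t[i] = (unsigned char)i;
    srand((unsigned)time(NULL));
    for (int i = 255; i > 0; i--) {
        int j = rand() % (i + 1);           // modulo bias is harmless here
        unsigned char tmp = t[i];
        t[i] = t[j];
        t[j] = tmp;
    }
    for (int i = 0; i < 256; i++)
        printf("%d%s", t[i], (i == 255) ? "\n" : ", ");
    return 0;
}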

The limitation of the hash function is that it only generates hash values in the range 0-255. Turns out this works rather well (at least for my purposes) in building a hash table.

Start with the observation that - if you use hash tables a lot - most hash tables are small. What you want is a hash table implementation that is inexpensive and efficient at small sizes, and that also scales well to reasonably large sizes. My approach is to use the cheap hash function above to select one of 256 hash buckets, where each bucket is a binary tree. (If you really need to save memory and the number of items hashed is small, you can reduce the number of buckets to a smaller power of two by simply masking the hash value.)
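Concretely, the shape of the thing is roughly this - a minimal sketch, not the exact implementation (the node layout is invented for illustration, the trees are left unbalanced for brevity, and error handling is omitted):

#include <stdlib.h>
#include <string.h>

struct node {
    struct node *left, *right;
    char        *key;
    void        *value;
};

struct table {
    struct node *bucket[256];   // or fewer: index with hashOf(0, key) & (nbuckets - 1)
};

//  Walk the tree in the bucket selected by the hash of the key.
static void *lookup(const struct table *t, const char *key)
{
    const struct node *n = t->bucket[hashOf(0, key)];
    while (n) {
        int cmp = strcmp(key, n->key);
        if (cmp == 0)
            return n->value;
        n = (cmp < 0) ? n->left : n->right;
    }
    return NULL;                // not found
}

//  Insert (or replace) a key/value pair; strdup() is POSIX.
static void insert(struct table *t, const char *key, void *value)
{
    struct node **np = &t->bucket[hashOf(0, key)];
    while (*np) {
        int cmp = strcmp(key, (*np)->key);
        if (cmp == 0) {
            (*np)->value = value;
            return;
        }
        np = (cmp < 0) ? &(*np)->left : &(*np)->right;
    }
    struct node *n = malloc(sizeof *n);
    n->left = n->right = NULL;
    n->key = strdup(key);
    n->value = value;
    *np = n;
}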

My approach is similar to Tim’s in that I use a binary tree, but front-ended by a hash and a lookup into a 256-slot array. For small to moderate sizes this tends to win over almost everything else I’ve tested. For large numbers of items (the last time I measured, I think the cross-over was around a hundred thousand items) this is not the best solution … but its performance is only mildly behind the better solutions. For the smaller numbers of items I usually find in my hash tables, this simpler implementation is faster and uses less memory.

Algorithmically this solution is roughly O(1) for smaller tables, shifting to O(log(N)) for large tables. Keep in mind that the actual runtime (when doing name lookups) is likely dominated by string comparison time. Spreading the items across 256 buckets trims about log2(256) = 8 levels off each tree, so you are replacing (roughly) eight string comparisons with one cheap hash computation - a good deal.

Take one common degenerate case where the saved string comparisons are a huge win - looking up data associated with filenames. If you have a map of absolute filenames (the equivalent of “find / -type f” on Unix or “dir/s/b/a-d \” on Windows) you have a large number of long strings with long common prefixes, and every string comparison has to crawl through that shared prefix before it finds a difference. You save a lot of CPU by avoiding those ~8 comparisons on every lookup.

As an aside - I tried further minimizing the number of string comparisons by computing a second hash (same algorithm with a different seed), storing the hash in each tree node, and comparing the hash before performing a string compare. To my surprise the result actually benchmarked slower than without the secondary hash. (Naturally, your results may vary…).

Note that if you know up-front that your hash table is going to be huge, the best hash function I found in my testing is from Bob Jenkins (who has clearly done his own research).

Heh. Found an old USENET posting (and I wonder how long the link into Google Groups will remain valid…).