Extending the Pearson hash function to larger values

2009-08-18

My favorite hash function is without doubt the Pearson hash. In my measurements, for my usage, custom hash tables built using a Pearson hash to index into a power-of-two-sized table, have always performed better than the best alternatives. (For not-small tables each table slot is the root of a binary tree.) Certainly this result is going to be somewhat specific to the use-case and current CPU architecture, but each time performance was important, my solution - using binary trees in 256 hash buckets, using the Pearson hash - won the benchmarks.

The main limitation of the Pearson hash is that it only computes an 8-bit value (i.e. the range 0..255). For some problems a bigger range of values would be good. The means offered in Pearson's original paper (as I recall - threw out the old magazine long ago, and the paper is not on the web) was to run the algorithm more than once to add additional 8-bit chunks to the hash.

Just now I'm feeling like a bit of an idiot, as it seems the Pearson hash could be easily extended up to 16-bits (or a bit beyond) with a ridiculously small change.

The original Pearson algorithm:

static unsigned char T[] = {
    // The values 0..255 shuffled into random order.
};

static inline unsigned hashOf(unsigned h,const char* s) {
    for (int c = *s++; c; c = *s++) {
        h = T[h ^ (255 & c)];
    }
    return h;
}

The algorithm modified to generate a 16-bit hash:

static unsigned short T[] = {
    // The values 0..65535 shuffled into random order.
};

static inline unsigned hashOf(unsigned h,const char* s) {
    for (int c = *s++; c; c = *s++) {
        h = T[h ^ (255 & c)];
    }
    return h;
}

Note that only the type and size of the table T changed!

With the large cache sizes of current CPUs, the 128KB table will fit nicely into cache, and perform very well. On CPUs with smaller caches, you might want to use a smaller table. Also clearly the table size for a 32-bit hash would be huge. You might want some sort of test for degenerate patterns in T[] (though I'd guess the probability of getting a bad table is very small).

This notion is so simple, I am a bit surprised that no one (so far as I can tell) suggested this approach before.

For large numbers of keys, each bit added to the hash value saves the equivalent of a key comparison. Adding 8-bits to the key saves 8 key comparisons. If the expense added to the hash function is less than the cost of one key comparison (certainly true of string compares), then you are ahead from the first bit added.