Breakage - or misapplication?

2006-06-15

So .... there's this thread across weblogs discussing the "breakage" of binary search when the search algorithm uses an array, and the index into the array approaches the maximum value for a 32-bit signed integer ... and I am having trouble classifying this as a bug.

Let's talk about sizes a bit. Programs on 32-bit machines can address at most 4GB of memory - with most desktop machines limited to less. Given that the smallest possible meaningful object is an int (which requires 4 bytes), no 32-bit program could create an array over ~1 billion items - which makes the "bug" found impossible.

Of course, doing a binary search of anything close to a billion integers is ... dubious. Other algorithms could yield much better results.

For integers, address-based lookup on a bitmap would be O(1) and require at most 0.5GB storage. (There is also a huge saving in the initial sort, and any re-sort on changes.) On pure performance this is the winner. Just on the basis of storage this is the better solution (over an array) when the number of integers approach 128 million.
For smaller (but still large) numbers of integers, you could "snap" the value - use half for address-based lookup (replacing up to 16 comparisons), and half for binary search (over at most 64K values). Storage needed is a few times 256KB (pointers plus heap overhead) plus 2 bytes per value - half the storage needed for single array. Performance is less than (1), but less storage is needed if the number of values is less than ~250 million.
For even smaller numbers of integers, I might be tempted to "snap" out the low byte for use in address-based lookup into 256 buckets, and binary search an array of integers in each bucket. Storage is slightly more than a single array, but we've saved ~8 comparisons on lookup. For a long-lived application where the set of values to be searched may change over time, we are going to save some time when maintaining the sort order on smaller buckets. We might want to benchmark to discover the cross-over points in performance.

In most applications what you are searching over is objects, not integers. Objects are referenced through pointers, contain more than an int probably cost a couple words of heap overhead. So in the usual case you limited to ~250 million objects (or much less).

Now with 64-bit machines and a matching 64-bit JVM we have the possibility of Java programs that could address more than 4GB of memory. We are talking about server-class machines. (The number of desktops configured with more the 4GB is statistically insignificant - much smaller than the number of crooked politicians). On a server-class machine if you are going to the trouble of loading billions of objects into memory, odds are you are going to do a lot of searches, and performance matters.

While binary search scales well, with billions of objects some form of hashing and address-based lookup is certain work very well and chop a huge chunk off the runtime - and the time saved is likely critical to the application. When time is critical, and data and the number of repetitions is large, you also need to think about the impact of sparse data on the processor cache. If the frequently-accessed data can fit within the processor cache, odds are you gain another healthy boost in performance.

With long-lived applications you also need to be concerned with the performance cost of maintaining your data structures when the values searched change over time. A large single array is almost certainly an expensive choice.

Taking the above into account, my suggested "fix" for binarySearch would be something like:

        IF count of objects remotely close to 1 billion
        THEN throw AbuseOfAlgorithmException() AND terminate programmer.

Seems to me the "bug" is more with the programmer's choices than the algorithm.