I am currently faced with a set of tasks that calls for the batch manipulation of potentially tens of millions of records (more in the future) over a total pool of data that could approach or even exceed the total addressable memory on a 32-bit machine. Clearly I want to do whatever I can to provide good performance.
This is not a one-shot deal - the software will eventually end up at hundreds or thousands of sites. I cannot call for significant hardware upgrades just for this one task.
I have to assume that the total size of the problem may well be a few to several times the available physical memory. This rules out solutions that call for all of the data to reside in memory at one time. For anything other than straight sequential processing, this calls for a database of one form or another.
On the other hand, only one program (written in Java) will need access to the data. There are no cross-language conversion issues. There are limited concurrency issues. There is limited need for transactions. This means much of the overhead of a big-name database (Oracle, MS SQL), or even of a lightweight database (HSQL), is not really needed.
Seems like I am caught in a middle ground.
Tools like Prevayler offer one sort of approach to building an application database. This is a terrific approach to solving some sorts of problems. Aside from the need for memory, this would be a great solution to my current problem - but data bigger than memory pretty much rules out Prevayler as an approach.
What is really disturbing is reading the discussion about Prevayler. There are folks who somehow think the whole idea is evil. There are the folks responsible for the code who seem unable to understand or articulate the tradeoffs involved. The last is especially disturbing, as you have to wonder whether this lack gets expressed in the implementation.
Looked at ozone, which seems like a nice piece of work, but it appears to add more overhead than I need (especially in comparison to Prevayler).
Guess I am also not convinced that Java serialization is a good idea in the face of a need for upwards/downwards compatibility. Sometimes you need to read data created by either an older or newer version than the current software. Seems I could override the default implementation of readObject() … or it might be better to avoid the default serialization framework altogether.
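One way to sidestep the default layout while staying inside the serialization framework is to write an explicit stream-version number and read fields by hand. This is only a sketch of the idea - the `DataRecord` class, its fields, and the version numbering are all hypothetical, not anything from my actual application:

```java
import java.io.*;

// Hypothetical record class. An explicit version number written into the
// stream lets a newer reader cope with data written by an older writer
// (and, with care, the reverse), instead of relying on the default layout.
class DataRecord implements Serializable {
    private static final long serialVersionUID = 1L;
    private static final int STREAM_VERSION = 2; // bump when the layout changes

    // transient: kept out of the default mechanism, written by hand below
    transient int id;
    transient String label; // added in stream version 2; older streams lack it

    DataRecord(int id, String label) {
        this.id = id;
        this.label = label;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeInt(STREAM_VERSION);
        out.writeInt(id);
        out.writeUTF(label);
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        int version = in.readInt();
        id = in.readInt();
        // Field added in version 2: fall back to a default for old data.
        label = (version >= 2) ? in.readUTF() : "";
    }
}

public class VersionedSerialization {
    public static void main(String[] args) throws Exception {
        // Round-trip one record through a byte buffer.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(new DataRecord(42, "answer"));
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray()))) {
            DataRecord r = (DataRecord) in.readObject();
            System.out.println(r.id + " " + r.label);
        }
    }
}
```

The point of the version check in readObject() is that each reader knows how to interpret every layout up to its own; dropping the framework entirely would mean also hand-rolling the class-descriptor bookkeeping this approach gets for free.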
I am leaning towards a solution specific to this particular application - a Prevayler-like solution for everything but one large array (table), with the large array stored on disk and cached on reference. This has me re-inventing yet another limited database implementation (about which I am not enthused). Might yet punt, go with a lightweight SQL database (like HSQL), and pay the price in performance.
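The "large array on disk, cached on reference" piece might look something like the sketch below. Everything here is assumed for illustration - the class name, the fixed eight-byte record, and the write-through policy are choices made for the example, not a description of the eventual implementation:

```java
import java.io.*;
import java.util.*;

// Sketch: a large array of fixed-size records kept on disk, with an LRU
// cache of recently referenced entries. Each slot is a single long purely
// to keep the example short.
public class DiskArray implements Closeable {
    private static final int RECORD_BYTES = 8;

    private final RandomAccessFile file;
    private final Map<Long, Long> cache;

    public DiskArray(File path, final int cacheCapacity) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
        // access-order LinkedHashMap: evicts the least recently used entry
        this.cache = new LinkedHashMap<Long, Long>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<Long, Long> eldest) {
                return size() > cacheCapacity;
            }
        };
    }

    public void set(long index, long value) throws IOException {
        file.seek(index * RECORD_BYTES);
        file.writeLong(value); // write-through: disk stays authoritative
        cache.put(index, value);
    }

    public long get(long index) throws IOException {
        Long cached = cache.get(index);
        if (cached != null) {
            return cached;
        }
        file.seek(index * RECORD_BYTES);
        long value = file.readLong();
        cache.put(index, value); // fault the record in on reference
        return value;
    }

    public void close() throws IOException {
        file.close();
    }
}
```

A real version would batch reads into pages rather than seeking per record, which is exactly the sort of buffer-management detail that makes "re-inventing a limited database" unappealing.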