
Convert at the edges

My time writing software spans from 16-bit DOS and Windows through 32-bit Windows and Unix. I have written or ported a fair amount of code to run in different environments. Along the way I picked up practices that proved useful.

One practice I call: "Convert at the edges".

A data structure may have a pre-specified fixed form when stored to disk or shipped across the network. You could choose to keep the data in that same fixed form in memory. There seemed to be some logic to this choice - but experience proved otherwise. Long ago, I took this path in one client/server application. Lots of data structures going across the network. Lots of data structures going to and from disk. In memory, you needed accessor functions to read and write fields within the fixed form. The result was a great deal of tedious and not very efficient code.
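A minimal sketch of what that looked like (the record layout and names here are hypothetical, not from that application):

// Hypothetical sketch of keeping the fixed form in memory: the record
// is stored as the same packed big-endian bytes used on the wire.
struct Record {
    unsigned char bytes[8]; // wire layout: id (4 bytes), count (2), flags (2)
};

// Every read or write of a field must decode or encode it in place.
int GetCount(const Record& r)
{
    return (r.bytes[4] << 8) | r.bytes[5];
}

void SetCount(Record& r, int count)
{
    r.bytes[4] = static_cast<unsigned char>((count >> 8) & 0xFF);
    r.bytes[5] = static_cast<unsigned char>(count & 0xFF);
}

Multiply that pair of helpers by every field of every structure, and the tedium adds up quickly.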

Ever since, I have found it much simpler and more effective to perform all conversions at the edges. Data structures stored in memory are kept in the most natural and efficient form for use by the program. Any conversion to more fixed or limited formats is done only when needed. Benefits of using the most natural in-memory form include:

- Simpler code: no accessor functions needed to pack and unpack fields.
- More efficient code: the compiler works with forms natural to the machine.
- Fewer errors and easier porting: conversions and range checks are concentrated in one place, at the edges.
- Freedom to use the machine's "natural" numeric types through the bulk of the code.
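By contrast with the earlier sketch, here is a minimal sketch of converting at the edges (same hypothetical record): the in-memory structure uses plain natural types, and the packing happens exactly once, on the way out.

// Hypothetical sketch of "convert at the edges": in memory the record
// uses the machine's natural types, so fields are used directly.
struct Record {
    int id;
    int count;
    int flags;
};

// Conversion to the fixed big-endian wire/disk form happens only here
// (values assumed non-negative and in range for the sketch).
void PackRecord(const Record& r, unsigned char out[8])
{
    out[0] = static_cast<unsigned char>((r.id >> 24) & 0xFF);
    out[1] = static_cast<unsigned char>((r.id >> 16) & 0xFF);
    out[2] = static_cast<unsigned char>((r.id >> 8) & 0xFF);
    out[3] = static_cast<unsigned char>(r.id & 0xFF);
    out[4] = static_cast<unsigned char>((r.count >> 8) & 0xFF);
    out[5] = static_cast<unsigned char>(r.count & 0xFF);
    out[6] = static_cast<unsigned char>((r.flags >> 8) & 0xFF);
    out[7] = static_cast<unsigned char>(r.flags & 0xFF);
}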

That last benefit was far from immediately obvious ... and is a bit involved.

The strongly typed Ada programming language was very much a current topic when I was in school. Extensive support for user-specified types, numeric types with exactly defined ranges, and built-in range checking - this all seemed to make a great deal of sense. Experience both before and since clearly shows that typing and conversion are the source of common and serious problems (like Mars probes that slam into the planet). In that context, strong type checking at compile and run time seems to make sense. Better to exactly define the values and range of each numeric type ... or at least so it seemed.

There were counter-indications. On old 16-bit machines the most natural form for an integer was a 16-bit word. On current 32-bit machines the most natural form for an integer is a 32-bit word. Compiler-generated code tends to be more efficient for that most natural form than for any of the smaller types. Unsigned integers are a source of occasional program errors. Different compilers on different processors tend to apply slightly different rules when unsigned integers are used. Even when running on only one platform, the conversion rules applied by the compiler in expressions are an occasional surprise. For the most part, the additional range offered by unsigned integers is not worth the increase in program errors.
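One classic example of those conversion-rule surprises, as a tiny self-contained program:

#include <cstdio>

int main()
{
    unsigned int u = 1;
    int i = -1;

    // The usual arithmetic conversions turn i into a huge unsigned
    // value (UINT_MAX), so the "obvious" comparison is false.
    if (i < u)
        std::printf("expected: -1 < 1\n");
    else
        std::printf("surprise: -1 is not less than 1u\n"); // this branch runs
    return 0;
}

Compilers can warn about the mixed signed/unsigned comparison - but only when warnings are turned up, and only in the cases they can see.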

Taken together - in C/C++, using "int" for all integers is more efficient and less error prone. When the disk or network formats require a smaller form, "convert at the edges". In practice I found it much easier to write portable, efficient code with fewer errors by stripping out most of the very specific numeric types (UINT8, UINT16, UINT32 and the like), using generic "int" through the bulk of the code, and performing carefully checked type (and range) conversions when going from memory to disk or network.
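A minimal sketch of such a checked edge conversion (helper name hypothetical):

#include <cassert>
#include <cstdint>

// Hypothetical checked conversion, used only at the disk/network edge.
// The bulk of the program works in plain int; the range check lives here.
std::uint16_t ToU16(int value)
{
    assert(value >= 0 && value <= 0xFFFF); // fail loudly if out of range
    return static_cast<std::uint16_t>(value);
}

A debug build then catches any out-of-range value at the one place it could slip through; a release build might return an error code rather than asserting.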

What brought this to mind was a peek into the OpenOffice developers' world. The question was what limits Calc (the OpenOffice spreadsheet application) places on the number of rows in a spreadsheet. The answer led past this page:

sc: Increasing the row limit above 32000 rows

Position accessing methods

There are a lot of class ScDocument methods (see inc/document.hxx and source/data/documen*.cxx) accessing a cell position by

MethodName( USHORT nCol, USHORT nRow, USHORT nTab );

where at least the nRow parameter will have to be changed to long, but it should be evaluated if all methods taking separated col/row/tab parameters couldn't be changed to take one ScAddress (see below) parameter instead. Similar, the class ScTable methods (see inc/table.hxx and source/core/data/table*.cxx)

MethodName( USHORT nCol, USHORT nRow );

and the class ScColumn methods (see inc/column.hxx and source/core/data/column*.cxx)

MethodName( USHORT nRow );

and the class ScAttrArray methods (see inc/attarray.hxx and source/core/data/attarray.cxx)

MethodName( USHORT nRow );

row parameters have to be changed, also all

MethodName( ..., USHORT nStartRow, USHORT nEndRow, ... );

of all of those classes.

Search( USHORT nRow, short& nIndex );

of ScColumn and ScAttrArray and similar are special in a way that the short& nIndex reference is used to return a position within an array where the position may at most be the number of rows used. This of course has to be changed as well.

My first inclination on reading the above is to strip out all the overly specific numeric types. Doubtless that would make me completely unacceptable to the OO developer community. At the same time, this specific example helps prove my general point. The wide use of very specific numeric types means that a change to raise (say) the number of rows in a spreadsheet requires massive changes to the source code.

In fact, I would argue that at a design level this usage is flat wrong:

MethodName( USHORT nCol, USHORT nRow );

The row and column numbers in a spreadsheet are exactly that - numbers. At this level in the design there is no reason to use anything more specific than an "int". Certainly the in-memory representation of a spreadsheet might be limited to a smaller range. Certainly the on-disk format chosen might be limited to a smaller range. But ... either could change. A different algorithm or structure used for in-memory storage could raise the limit. A different on-disk format could raise the limit. Oddly enough the simpler declaration is much more useful:

MethodName( int nCol, int nRow );

First, compilers tend to generate slightly more efficient code. Second, changes that raise limits for the in-memory or on-disk formats will not require changes to this declaration. Third, the implied range limit makes sense for the machine. On a 32-bit machine "int" limits you to addressing 2 billion rows or columns. Even with sparse storage, you are just not going to be able to fit 2 billion of anything into a machine with (at most) 4GB of memory. On a 64-bit machine you could use the bigger memory to support bigger limits - without changing the code - as long as you were not over-specific in your declarations.
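To make that concrete, a hypothetical sketch (the names and limits are mine, not from the Calc sources): the in-memory API stays in plain int, and only the code that writes a particular file format knows that format's limit.

#include <cassert>

const int kMaxBinaryFormatRow = 32000; // limit of one hypothetical on-disk format
const int kMaxBinaryFormatCol = 255;   // hypothetical column limit

// The in-memory API keeps the natural type; only the save path for a
// given format checks against that format's limits.
void SaveCellPosition(int nCol, int nRow)
{
    assert(nRow >= 0 && nRow <= kMaxBinaryFormatRow);
    assert(nCol >= 0 && nCol <= kMaxBinaryFormatCol);
    // ... write the packed on-disk form ...
}

Raise the in-memory limit later and this signature does not change; only the formats that cannot represent the new range need attention.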

I started out using specific/exact types. After years of practice and varied experience, what I eventually found more effective is simple use of the machine's "natural" types, combined with careful and exact checks "at the edges". The result is code that is easier to write, has fewer errors, and is both portable and efficient.