Segregation - Partition the problem

2011-04-15

My question (in 2002) to Steve Theby, who originally told this story:

You were part of a project deploying a large database application. Response/processing times were not reliable or predictable until each computer was dedicated to a specific kind of transaction.

At least this is my somewhat hazy recollection.

Steve's answer:

Yes...I did work on a very large project that essentially constructed a predictable response framework for American Express credit card processing. I worked on it under a contract with McDonnell Douglas in St. Louis (we had the world's largest computer room with over 100 IBM system engineers permanently on site). The tape hanger personnel used to wear roller skates so they could mount/unmount drives faster (I'm not making this up). We had at least 300 tape drives the size of a four drawer, 30" filing cabinet and a tape archive storing more than 30,000 tapes. Anyway...I digress...

Application

This was a world-wide application responsible for evaluating and authorizing credit transactions. They planned to have several large IBM mainframes located in the US and the UK to balance out the US and European transaction loads. Our work headed us to Brighton, UK for about a one year period. The application was written in COBOL using the CICS transaction monitor plus the IMS shallow network database. CICS gave us a terminal handling front end that allowed the application to scale up with respect to connections and IMS gave us a very quick, indexed hierarchical lookup capability.

Architecture

The initial architecture for the application pushed all transactions through a single machine in a FIFO queue. Although this somewhat worked for small loads (when the machine was below 40% utilization), it failed quite predictably at medium to large loads. Even at small loads we could not reliably predict closure times across transactions. This was primarily because each transaction had different resource requirements in terms of database I/O, CPU and communication transit time and they were mixed together.

The goal was for a single mainframe to handle 2000 terminals. As I recall, the 40% mark was reached at about 200 terminals (a bit shy of expectations). This was a monster machine (something like a 30MHz CPU with 512K of awesome memory and 30MB 2119 removable disk packs). I will say that the thing really ran due to its wonderful I/O channels....something PCs still don't have... I believe that each merchant had a special dialup 1200 baud modem for super fast connectivity.

Our Solution - separate transactions into like sets and route to different machines

We started a transaction QA group that put together a process called CPA for Call Pattern Analysis. This process evaluated each transaction sequence for its CPU, I/O and database resources required. We ended up strongly typing each transaction based on where it fit in this resource profile. If I recall correctly, it went something like this:

  Type        CPU                   I/O                   DB
  ----------- --------------------- --------------------- -----------------
  1           0.0 < seconds < 0.1   0 < bytes < 500       00 < calls < 05
  2           0.1 < seconds < 0.5   500 < bytes < 1000    05 < calls < 20
  3           0.5 < seconds < 3.0   1000 < bytes < 2000   20 < calls < 50
  4 (batch)   3.0 < seconds         2000 < bytes          50 < calls

We then routed transactions of the same type to a designated machine so that we could predict when a transaction would finish (this put all the same size stones in the same hour glass instead of mixing sand, stones and boulders).
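As an illustration only (this is not code from the original project), the typing and routing described above might be sketched in Python as follows. The thresholds come from the table; the rule that a transaction takes the heaviest type it reaches on any one dimension, the handling of values that land exactly on a boundary, and the machine names are all assumptions made for the sketch.

```python
from bisect import bisect_left

# Upper bounds for types 1-3, read from the table; anything larger is type 4.
CPU_SECONDS = [0.1, 0.5, 3.0]
IO_BYTES = [500, 1000, 2000]
DB_CALLS = [5, 20, 50]

# Hypothetical machine names, one per transaction type.
MACHINE_FOR_TYPE = {
    1: "machine-1",
    2: "machine-2",
    3: "machine-3",
    4: "batch-machine",
}


def classify(cpu_seconds, io_bytes, db_calls):
    """Return the transaction type (1-4) for a measured resource profile.

    Assumption: the type is the heaviest class reached on any single
    dimension, and a value exactly on a boundary stays in the lower type.
    """
    return 1 + max(
        bisect_left(CPU_SECONDS, cpu_seconds),
        bisect_left(IO_BYTES, io_bytes),
        bisect_left(DB_CALLS, db_calls),
    )


def route(cpu_seconds, io_bytes, db_calls):
    """Pick the machine dedicated to this transaction's type."""
    return MACHINE_FOR_TYPE[classify(cpu_seconds, io_bytes, db_calls)]


print(route(0.05, 200, 3))   # light on every dimension
print(route(0.05, 200, 30))  # light on CPU and I/O, heavy on database calls
print(route(5.0, 100, 1))    # CPU time alone makes it a batch job
```

Taking the maximum across dimensions is the conservative choice: a transaction that is cheap on CPU but makes 30 database calls still ends up with the other slow transactions, where it cannot stall the fast ones.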

This really helped things out and wound up meeting the expectations of American Express. Even though this meant that some machines were frequently idle, each type of transaction was fairly dealt with and dispatched in a uniform, consistent manner.

Before this arrangement, we had small transactions executing very quickly (< 1 sec) one time and then stalling out for 10 seconds the next time because of a 'boulder' transaction hogging everything.
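The stalling effect is easy to reproduce with a toy model. The sketch below is an illustration, not a reconstruction of the real system: it assumes a single server that handles jobs strictly in arrival order, and the arrival and service times are invented for the example.

```python
def fifo_response_times(jobs):
    """Response time (finish minus arrival) for each job on one FIFO server.

    jobs is a list of (arrival_time, service_time) pairs sorted by arrival.
    """
    free_at = 0.0  # time at which the server next becomes idle
    responses = []
    for arrival, service in jobs:
        start = max(arrival, free_at)  # wait if the server is still busy
        free_at = start + service
        responses.append(free_at - arrival)
    return responses


# Invented workload: three half-second "sand" jobs and one 10 second "boulder".
mixed = [(0.0, 0.5), (1.0, 10.0), (2.0, 0.5), (3.0, 0.5)]
small_only = [(0.0, 0.5), (2.0, 0.5), (3.0, 0.5)]
boulder_only = [(1.0, 10.0)]

print(fifo_response_times(mixed))         # [0.5, 10.0, 9.5, 9.0]
print(fifo_response_times(small_only))    # [0.5, 0.5, 0.5]
print(fifo_response_times(boulder_only))  # [10.0]
```

On the shared machine, the two small jobs that arrive behind the boulder take 9.5 and 9.0 seconds instead of 0.5. Once the boulder has its own machine, every small job finishes in the same half second, and the boulder is no slower than before.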

Hopefully this helps. I looked briefly for some of our old documentation but couldn't find anything (that was 12 years, 10 managers and 3 companies ago).

Stephen Theby (stheby@drsys.com), PRO-IV Tools, Architect 1-714-724-5640 (Irvine, Ca), 1-314-214-4025 (St. Louis, Mo)

Implications

Applied to the present, I believe this story has interesting implications. Start with the fact that the inexpensive PCs of today are remarkably powerful, compared to the machines of 10 or 15 years ago. Almost ridiculously so.

Add to this the lesson from the above story about partitioning the problem, so that similar operations are on one machine.

In the past it made sense to run lots of services (mail, news, database, files, etc.) off one machine, as computers were expensive. This is no longer the case. Computers are cheap. Today you are likely better off using each server for a single purpose, and no other purpose. Once a server is running reliably, lock the box in a closet, and leave it alone.

Note that this model also applies to the desktop.

The Windows PCs of today are in a sense turning into the mainframes of yesteryear. Fully install Windows NT and Microsoft Office and you get hundreds of megabytes of seldom used services. Address books, spelling and grammar checkers, database engines, and lots of other seldom (if ever) used services. Hundreds of megabytes of code and data, most of which will never be used.

Imagine instead that we carve off chunks of functionality to dedicated servers.

Spelling checkers run on a dedicated spelling check server. All that runs on the client is a small bit of code to make requests of the server. Performance is likely better, as a properly configured (and now inexpensive) server is likely able to turn around requests in less time than it would take to load the spelling checker and word lists from local disk. Not incidentally, a spelling check server is likely to have current lists of company specific words, and so is able to provide not only faster but also better answers.

Apply the same model to address books. An address book server (LDAP is of interest here) can have individual, department and company address books. With proper configuration, you again can get both faster and better answers.

Repeat for every activity that is not inherently local to the desktop.

Note that unlike the mainframes of yesteryear, the goal for a "properly configured" dedicated server is to optimize response time. Where once we would try to make use of every spare mainframe cycle, with dedicated servers the goal is to respond to incoming requests as quickly as if the server were idle.
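To put rough numbers on why spare capacity buys response time, here is a small sketch using the textbook single-server queueing model (M/M/1), in which mean response time is the service time divided by one minus utilization. That model is an assumption brought in for illustration; nothing in the story says the real workload behaved this way, and the 0.1 second service time is invented.

```python
def mm1_mean_response(service_time, utilization):
    """Mean response time for an M/M/1 queue: S / (1 - rho).

    Assumes random (Poisson) arrivals and exponentially distributed
    service times on a single server.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


# Invented 0.1 second request at several utilization levels.
for rho in (0.0, 0.4, 0.8, 0.9):
    print(f"utilization {rho:.0%}: {mm1_mean_response(0.1, rho):.2f} s")
```

Under this model an idle server answers in the bare service time, a server at 40% takes about 1.7 times as long, and one at 90% takes about 10 times as long. A server kept mostly idle is the price of answering "as quickly as if the server were idle".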

One way of looking at this notion is as a natural step in the evolution from shared mainframes, through desktop PCs, to network computers.

With mainframes the idea was for all users and tasks to share a single machine.
With desktop PCs the idea was for all tasks for one user to share a single machine.
With network computers and dedicated servers the idea is for any single task to use as many machines as needed to complete quickly.

All this inferred from an old application that today could run off a single server with an attached RAID :-).