Monday, September 21, 2009

On BA Chapter 8 (Tandem Architecture)

Tandem is a fault-tolerant computer system marketed to transaction-processing customers such as banks, ATM networks and stock exchanges. Guardian, the OS running on Tandem's NonStop series machines, was designed in parallel with the hardware in order to provide fault tolerance with minimal overhead.

In many ways, the Tandem/16 was a revolutionary system, yet it had an unexpectedly low impact on the industry and on the design of modern machines. This is most probably because the Tandem/16 was very different from most systems and was developed in a purely commercial setting.

A key design principle of Tandem's fault-tolerant architecture was modularity: both hardware and software are decomposed into modules that act as units of failure, diagnosis, repair and growth. Modularity was very important for Tandem as a fault-tolerant system, because individual modules had to be replaceable online. Furthermore, the isolation that comes with modularity decreases the chances that the failure of one module affects the operation of another. In Tandem, the process model and the messaging system are the two important mechanisms used to implement fault isolation.

Furthermore, each module is designed according to the Fail Fast principle. Through self-checking, each module either works properly or stops the first time it detects a fault. This is imperative for guaranteeing data integrity in the event of a failure.
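To make the Fail Fast idea concrete, here is a minimal sketch in Python. It is not Tandem code: the FailFastCounter class, its redundant check value and the ModuleHalted exception are all made up for illustration. The module keeps two copies of its state, compares them before every operation, and refuses to do any further work once they disagree.

```python
class ModuleHalted(Exception):
    """Raised when the (hypothetical) module has stopped itself."""


class FailFastCounter:
    """Illustrative module: a running total plus a redundant check copy."""

    def __init__(self):
        self.total = 0
        self.check = 0      # redundant copy, used only for self-checking
        self.halted = False

    def _self_check(self):
        # Stop at the first sign of inconsistent state instead of
        # continuing with possibly corrupted data.
        if self.total != self.check:
            self.halted = True
            raise ModuleHalted("state mismatch detected, module stopped")

    def add(self, amount):
        if self.halted:
            raise ModuleHalted("module already stopped")
        self._self_check()
        self.total += amount
        self.check += amount
        return self.total
```

Once the check fails, every later call is rejected as well, so a faulty module can never hand out a wrong result; it can only go silent and leave recovery to another module.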

Another important design principle for Tandem as a fault-tolerant system was Single Failure: when a hardware or software module fails, its functionality is immediately taken over by another one, giving a mean time to repair measured in milliseconds. For instance, for a CPU there is always a second CPU ready to assume duty in case the first one fails. The same goes for running processes, which always run in process pairs: a primary and a backup process.
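The process pair idea can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not Guardian's actual mechanism or API: the names PairedProcess, ProcessPair, request and the dictionary-copy "checkpoint" are all hypothetical. The primary handles each request and then copies its state to the backup; if the primary is found dead, the backup takes over with the last checkpointed state.

```python
class PairedProcess:
    """Hypothetical process holding some state; 'alive' simulates failure."""

    def __init__(self, name):
        self.name = name
        self.state = {}
        self.alive = True

    def apply(self, key, value):
        self.state[key] = value


class ProcessPair:
    """Illustrative primary/backup pair that tolerates a single failure."""

    def __init__(self):
        self.primary = PairedProcess("primary")
        self.backup = PairedProcess("backup")

    def _take_over(self):
        if self.backup is None:
            # A second failure is outside the single-failure assumption.
            raise RuntimeError("both processes of the pair have failed")
        self.primary, self.backup = self.backup, None

    def request(self, key, value):
        if not self.primary.alive:
            self._take_over()
        self.primary.apply(key, value)
        if self.backup is not None:
            # Checkpoint: copy the primary's state to the backup.
            self.backup.state = dict(self.primary.state)
        return self.primary.name
```

A short usage example: after the primary is marked dead, the next request is served by the former backup, which still knows the state from the last checkpoint.

```python
pair = ProcessPair()
pair.request("balance", 100)       # served by the primary
pair.primary.alive = False         # simulate a failure of the primary
pair.request("limit", 500)         # served by the former backup
print(pair.primary.state)
```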

Tandem is also designed to support online maintenance. Hardware and software can be diagnosed and repaired while the rest of the system continues to deliver services to the user. Hardware components, data and programs can be reintegrated into the system without interrupting the service.

The general feeling is that Tandem was a revolutionary machine, but it was the small things that kept Tandem from imposing its vision. Naming issues and certain incompatibilities, such as its unusual concept of interprocess communication, are just a few of the issues that prevented Tandem from being broadly accepted. In the nineties, factors such as computer hardware becoming generally more reliable and significantly faster accelerated the decline of the Tandem architecture.
