Tuesday, November 3, 2009

On Armstrong Thesis Chapter 5

Dugan says that in order “to design and build a fault-tolerant system, you must understand how the system should work, how it might fail, and what kinds of errors can occur”, and that “error detection is an essential component of fault tolerance”. Following Dugan’s advice, Armstrong designed a software structure that detects and tolerates errors. The three important parts of this design are: a supervision hierarchy for tasks, a strategy for programming fault tolerance that applies to this hierarchy, and an implicit mechanism (well-behaved functions) that corresponds to our intuitive idea of what an error is and compensates for the lack of an explicit specification.

The idea of decomposing an application into a hierarchical structure of tasks is of primary importance in designing any type of fault-tolerant system. This is because, in a fault-tolerant system, components should be easy to restart from a stable previous state. Structuring the application as a hierarchy of tasks inherently supports this property. By contrast, a monolithic, tightly coupled application becomes crippled when a fault occurs and cannot be restarted without losing all the work in progress. Tightly coupled operating systems belong in this category as well. Decomposition and fault isolation are so central to fault tolerance that they are applied even in system software: the ability to treat operating system services as separate components enables OSs to tolerate failures, as evidenced by true microkernels.

The other important aspect of programming a fault-tolerant system is the strategy we employ when an error is discovered. Typically, when an error is detected in an isolated component, the recovery procedure tries to handle the error and, most commonly, restarts the failing component. In Erlang, for instance, the error recovery procedure is to restart the worker associated with a task or, failing this, to try something simpler. This approach matches the concept of performing error handling outside the component, so that the error recovery code does not get compromised and the component can be safely brought to an acceptable state. However, one potential issue with this approach is that, even though Erlang provides language support for task supervisors, the supervisors still have to be written by the programmer, who can introduce bugs. With this in mind, I wonder if it is still safe to say that Erlang’s approach is more robust than the traditional ones (e.g. using exception handling to reset to an acceptable state).
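To make the "restart, or failing this do something simpler" strategy concrete, here is a minimal sketch in Python rather than Erlang. It is not Armstrong's implementation: the names `supervise`, `max_restarts` and `AllStrategiesFailed` are made up for illustration, and a raised exception stands in for a crashed worker.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllStrategiesFailed(Exception):
    """Raised when every strategy, including the simplest one, has failed."""


def supervise(strategies: Sequence[Callable[[], T]], max_restarts: int = 2) -> T:
    """Try each strategy in order, from most ambitious to simplest.

    The workers contain no recovery code of their own: they simply raise,
    and the supervisor decides whether to restart or to fall back.
    """
    for strategy in strategies:
        for _attempt in range(1 + max_restarts):
            try:
                return strategy()
            except Exception:
                # The worker crashed; the next loop iteration restarts it.
                continue
    raise AllStrategiesFailed("no strategy succeeded")


# Hypothetical workers: the primary one always crashes, the fallback is simpler.
calls = {"primary": 0}


def primary() -> str:
    calls["primary"] += 1
    raise RuntimeError("primary worker crashed")


def fallback() -> str:
    return "degraded result"


result = supervise([primary, fallback], max_restarts=2)
```

Note that the concern raised above still applies to this sketch: `supervise` is ordinary code, so a bug in it would compromise recovery for every worker it manages.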

Other systems take the idea of moving error recovery outside the component to an extreme, where the developer is completely removed from designing recovery code. For instance, crash-only systems are built of software components that crash safely and recover quickly. Faults are handled by crashing and restarting the faulty component and retrying any requests that have timed out; no recovery sequence other than crash and/or restart is employed. The developer is only concerned with writing components that comply with the crash-only philosophy. The resulting system is often more robust and reliable because crash recovery is a first-class citizen in the development process.
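A rough sketch of the crash-only idea, under some loud assumptions: `CrashOnlyRunner`, `factory` and `max_retries` are hypothetical names, a component is modelled as a plain callable, and any exception (including a simulated timeout) counts as a crash. The only recovery action is to throw the instance away, start a fresh one, and retry the request.

```python
from typing import Callable

Handler = Callable[[str], str]


class CrashOnlyRunner:
    """Runs requests against a component whose only recovery path is restart."""

    def __init__(self, factory: Callable[[], Handler], max_retries: int = 3):
        self._factory = factory
        self._max_retries = max_retries
        self._component = factory()
        self.restarts = 0

    def request(self, payload: str) -> str:
        for _ in range(1 + self._max_retries):
            try:
                return self._component(payload)
            except Exception:
                # No bespoke recovery code: discard the instance, start anew.
                self._component = self._factory()
                self.restarts += 1
        raise RuntimeError("request failed after retries")


def make_flaky_factory() -> Callable[[], Handler]:
    """Hypothetical factory whose first instance always times out."""
    created = {"count": 0}

    def factory() -> Handler:
        created["count"] += 1
        instance_number = created["count"]

        def handle(payload: str) -> str:
            if instance_number == 1:
                raise TimeoutError("simulated timeout")
            return payload.upper()

        return handle

    return factory


runner = CrashOnlyRunner(make_flaky_factory())
reply = runner.request("ping")
```

The component author only has to make `factory` cheap and safe to call at any time; the retry loop never needs to know why the previous instance failed.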

The task hierarchy in Erlang allows for different relationships between tasks. Tasks of similar complexity are arranged on the same level of the hierarchy. OR and AND supervisors model dependencies between tasks on the same level: OR supervisors manage independent tasks, while AND supervisors manage dependent, coordinated processes. The relationship between a supervisor and its parent supervisor describes relationships between tasks on different levels. Recursive restartability (RR) is the ability of a system to tolerate restarts at multiple levels, and Erlang supports it through this supervisor-to-parent relationship. Such systems possess a number of valuable properties that by themselves improve availability. For instance, an RR system’s fine granularity permits partial restarts to be used as a form of bounded healing, reducing the overall time-to-repair and hence increasing availability. On top of these desirable intrinsic properties, we can employ an automated, recursive policy of component revival/rejuvenation to further reduce downtime.
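The difference between the two supervisor kinds, and the way nesting gives restarts at multiple levels, can be sketched as a small tree. This is an illustration only, assuming that an AND supervisor restarts all of its children when any one of them fails and that an OR supervisor restarts only the failed child. The `Worker` and `Supervisor` classes and the `child_failed` method are invented for the example and are not Erlang/OTP APIs.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Worker:
    name: str
    starts: int = 0  # how many times this worker has been (re)started

    def start(self) -> None:
        self.starts += 1


@dataclass
class Supervisor:
    name: str
    mode: str  # "or": restart only the failed child; "and": restart all children
    children: List[Union["Worker", "Supervisor"]] = field(default_factory=list)

    def start(self) -> None:
        # Starting a supervisor starts its whole subtree, recursively.
        for child in self.children:
            child.start()

    def child_failed(self, failed: Union["Worker", "Supervisor"]) -> None:
        if self.mode == "and":
            self.start()    # dependent children: restart every one of them
        else:
            failed.start()  # independent children: restart only the failed one


# Hypothetical system: db and cache depend on each other, web is independent.
db, cache, web = Worker("db"), Worker("cache"), Worker("web")
storage = Supervisor("storage", "and", [db, cache])
root = Supervisor("root", "or", [storage, web])

root.start()                 # initial start of the whole tree
storage.child_failed(db)     # partial restart: db and cache only
root.child_failed(storage)   # higher-level restart: the storage subtree only
```

In both failure cases `web` is never restarted, which is the "bounded healing" effect: the restart is confined to the smallest subtree that contains the fault.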
