Wednesday, October 14, 2009

On CHESS

“Heisenbugs” are bugs resulted from unexpected interference among threads, which are typically extremely difficult to reproduce and eliminate. This type of bugs may show up in systems that otherwise have been running for months and their cause may be as trivial as for example adding some extra debug statements. CHESS is a tool able to reproduce such bugs, which has been integrated into many code bases inside Microsoft.

The core design principle in CHESS is that, when attached to a concurrent program, it takes complete control over the scheduling of threads and asynchronous events, thereby capturing all the interleaving nondeterminism in the program. There are two main benefits resulting from such a design. First, in case of executions resulting in errors, CHESS is able to reproduce the erroneous thread interleaving. And secondly, CHESS offers methods for systematic exploration of possible thread interleavings. Consequently it can force every run of the program to follow a different thread interleaving, thus increasing the chances for finding errors in existing tests. Another consequence is that CHESS can find in simple configurations bugs that would otherwise be exhibited by more complex configurations, thus rendering useless artificial system stress tests. The tradeoff for having this design is that CHESS has to deal with the complexity of the concurrency APIs; it should understand the precise signatures of those threading and synchronization functions.

CHESS’s primary goals are capturing all the nondeterministic choices during the execution of the program and exposing them to a search engine that systematically enumerate the range of all possible executions of the concurrent program. This is the basis of reproducing and replay of a chosen concurrent execution. The challenge here is not only to integrate CHESS with complex APIs but also to make sure that it does not introduce extra behavior that is not related to the concurrent program.

The most important component of CHESS, the scheduler, is implemented by redirecting calls to concurrency primitives to alternate implementations from a wrapper library. Including an implementation of these primitives as part of the program is a design choice that may simplify the schedule which now needs to understand only simpler system primitives that the lock implementation uses. However the tradeoff with this approach is that it does not allow the scheduler to expose all the nondeterminism in the lock implementation.

Another design decision meant to simplify the implementation of the scheduler is to abstract the execution of the concurrent program using Lamport’s happens-before graph, which is a graph that captures the relative execution order of threads in a concurrent program. This approach offers a common framework for reasoning about all synchronization primitives used in the program which abstracts away the timing of instructions in the execution. In order to build the tree, CHESS determines whether a task may be disabled by executing a potentially blocking API call, then it labels each call to API by a triplet of task, synchronization variable and operation, and finally informs the scheduler about the creation and termination of a task.

One CHESS innovation is the way it deals with data races: Instead of using a dynamic data-race detection tool, CHESS schedules the outcome of data-races indirectly by enforcing single threaded execution. Consequently since no two threads can concurrently access memory locations, all data-races occur in the order in which CHESS schedule the threads. This may slow down the execution a bit but it can be compensated by running multiple CHESS instances in parallel, each exploring a different part of the interleaving state space.

No comments:

Post a Comment