Towards Implementing System-Level Fault-Tolerant Solutions in the Gurkh Framework

Fault-Tolerance is a long adopted concept in engineering discipline used to improve dependability of systems. The related intuitive principle is to increase design robustness by adding redundant resources and the mechanisms necessary to make use of them when needed. When these mechanisms are designed to act in order to optimize system-wide performance, we talk about system-level Fault-Tolerance.

Fault-Detection is one of these mechanisms and its importance is key, especially in the computer and software world. ESL is currently looking at using the tools and resources made available by the Gurkh framework to provide real-time system designs with system-level Fault-Tolerance. Safety-Critical Real-Time applications typically have stringent dependability requirements, which are usually captured as reliability and determinism.

ESL's ongoing Draper project is researching methodologies enabling capture of transient software faults. The monitoring chip contains a model of the expected system behavior. This behavior is captured as timing constraints.

The first phase of this work breaks down in three steps that together build a proof-of-concept implementation of an application with system-level Fault-Detection capabilities by using formal methods and System-On-Chip techniques.

The first step consists of deriving an Ada Ravenscar compliant version of the application software along with its corresponding formal model in Uppaal.

The second step of this work is the development of the application's fault model, in both software and timed-automata once more.

The last step of this first phase requires the development of a coordination mechanism to handle the operation model change-over.

The Uppaal models being developed are interfaced with the Uppaal model of the hardware kernel. This allows us to prove the correctness of the integrated system and helps us understand the timing behavior of the bounded sequence of interactions that take place within the system.

The first phase aims at proving that it is possible to carry out fault detection of the system based on a formal model of both the application and the run-time kernel. The next phase aims at providing degraded performance guarantees through multiple fault models.

One of the most challenging problems in hardware/software co-design is to provide operating system support for hardware components. The second phase of the project involves creating dual operating system support for system-level reconfiguration. RavenHaRT acts as the regular operating system and hands control over to the reconfiguration OS when the monitoring chip detects an error. The overall design method developed in the context of the Gurkh project, combined with the ability given to a system to degrade gracefully by means of system-level monitoring and reconfiguration, will allow designers to build more dependable System-on-Chip designs.