Fault-tolerant system architecture characteristics
By definition, a fault-tolerant (FT) system continues to function as designed even when a component fails. Its basic characteristics are:

1. FT systems handle any fault that affects human life or revenue. However, the level of fault-tolerance varies with the criticality of the application.

For example, if a controller in a telephone exchange fails and the system is not fault-tolerant, all subscribers connected to the exchange suffer, leading to customer dissatisfaction as well as loss of revenue due to the loss of data. So, most controllers in a telephone exchange are fault-tolerant. By contrast, if a subscriber line or its interface to the exchange fails, only that subscriber is affected.

Fig. 4: Sequence of events


In nuclear plant and avionics applications, fault-tolerance is likewise implemented according to the criticality of each function.

2. Fault-tolerance is implemented as a combination of hardware and software in the system.

3. FT systems for non-life-critical applications are designed to handle a single fault at any given time. Handling multiple faults is technically feasible, but the cost and complexity of the system grow in direct proportion to the number of faults handled concurrently.

In this article, we will discuss FT systems that handle only single faults. The design philosophy for multiple faults is the same as for single faults; only the complexity of the design differs.

Understanding fault-tolerant architecture
Fault-tolerant architectures are classified into three major categories.
1. Duplication of frequently failing units (typically, power supply units)
2. Duplication of CPUs
3. CON-MON architecture (not a full-fledged FT system)

In all these systems, we need to keep in mind that the solution is a combination of hardware and software. Let us look at each of these in detail.

Fig. 5: All RTOSs/RTKs have a hardware-abstraction layer (HAL) between the hardware and OS, so that porting OS becomes easy

Duplication of power supply units (PSUs). The simplest part of FT architecture is designing a system with duplicated power supplies. This approach works in systems where power densities are high and PSUs fail frequently because of heavy load. This architecture is easy to implement as it does not call for big design changes in the main system. Fig. 2 shows the FT architecture of duplicated PSUs.

In this architecture, when one PSU fails, the other takes over. This means that, although two PSUs are present, only one carries the full load at any time, which stresses the active PSU. This mode is also known as hot-standby mode.
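The takeover behaviour described above can be sketched in a few lines of Python. This is only an illustrative model, not the article's implementation: the class names `PSU` and `HotStandbyPair`, the method names and the 300 W load are all hypothetical, and in a real system the failure would be detected by hardware and reported to software. The sketch assumes a single fault at a time, so a failure of both PSUs is treated as out of scope and raises an error.

```python
from dataclasses import dataclass, field


@dataclass
class PSU:
    """Hypothetical model of one power supply unit."""
    name: str
    healthy: bool = True


@dataclass
class HotStandbyPair:
    """Two PSUs in hot-standby mode: only the active one carries the load."""
    primary: PSU
    standby: PSU
    load_watts: float
    active: PSU = field(init=False)

    def __post_init__(self) -> None:
        # The primary PSU carries the full load until it fails.
        self.active = self.primary

    def report_failure(self, psu: PSU) -> None:
        """Mark a PSU as failed and switch over if it was the active one."""
        psu.healthy = False
        if psu is not self.active:
            return  # the idle unit failed; the active PSU keeps the load
        other = self.standby if psu is self.primary else self.primary
        if not other.healthy:
            # Double fault: outside the single-fault design assumption.
            raise RuntimeError("both PSUs have failed")
        self.active = other

    def load_on(self, psu: PSU) -> float:
        """Return the load carried by a PSU: all of it or none of it."""
        if psu is self.active and psu.healthy:
            return self.load_watts
        return 0.0


if __name__ == "__main__":
    pair = HotStandbyPair(PSU("PSU-A"), PSU("PSU-B"), load_watts=300.0)
    print("active before failure:", pair.active.name)
    pair.report_failure(pair.primary)
    print("active after failure:", pair.active.name)
```

Note that `load_on` returns either the full load or zero, never a share of it. That is the point made in the text: the redundancy protects against a PSU failure, but it does not reduce the stress on whichever unit is active.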
