An Introduction to Fault-Tolerant Embedded Systems

A duplicated-CPU system uses two identical CPU circuit boards with two types of interconnections between them. Each CPU has a watchdog output (WD OUT) and a watchdog input (WD IN), plus a high-speed interlink input (HS-ILNK-IN) and a high-speed interlink output (HS-ILNK-OUT). Since only the controllers are duplicated, both are connected to the common functions that they control (Fig. 3).

These common functions receive control and data from both CPUs, so each common function must be designed to accept control from either CPU. Fault tolerance also cannot be added as an afterthought: an FT system has to be planned from the concept phase, covering both hardware and software.
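As a rough illustration of a common function that either CPU can drive, the sketch below models a shared peripheral in Python. The class name `CommonFunction`, its methods, and the rule that commands from the non-controlling copy are ignored are all assumptions made for this example; the actual selection logic depends on the hardware design.

```python
class CommonFunction:
    """Hypothetical shared peripheral that either CPU copy may drive."""

    def __init__(self, active_copy=0):
        self.active_copy = active_copy  # copy (0 or 1) that currently has control
        self.applied = []               # commands that were actually acted upon

    def select(self, copy_id):
        """Hand control to the given copy (driven by the switch-over logic)."""
        if copy_id not in (0, 1):
            raise ValueError("copy_id must be 0 or 1")
        self.active_copy = copy_id

    def command(self, copy_id, value):
        """Act on a command only if it comes from the controlling copy."""
        if copy_id != self.active_copy:
            return False  # assumption: commands from the other copy are ignored
        self.applied.append(value)
        return True
```

Both copies can call `command()`, but only the one chosen through `select()` has any effect, which is one simple way to keep a faulty copy from disturbing the shared hardware.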

How a duplicated CPU FT system works

When the system starts, both CPUs are healthy and either of them can control the system. A hardware-based dice mechanism, similar to a coin toss, settles the choice: at start-up, the dice circuit randomly makes one CPU active.
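The dice mechanism is a hardware circuit, but its effect can be mimicked in a few lines of Python. The function name `elect_active_copy` and the returned dictionary layout are hypothetical, chosen only for this sketch.

```python
import random


def elect_active_copy(rng=random):
    """Pick copy 0 or copy 1 with equal probability, like the hardware dice."""
    active = rng.randrange(2)  # 0 or 1
    return {"active": active, "standby": 1 - active}
```

Passing a seeded `random.Random` instance as `rng` makes the choice repeatable, which is convenient when testing start-up behaviour.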

Typically, in a duplicated CPU FT system, the CPUs are called copy #0 and copy #1. From start-up, one of them actively controls the system. Both CPUs keep punching the watchdog so that no time-out occurs. At the same time, the system software uses the high-speed interlinks to update all critical data at run time, so that both CPUs stay in identical states. This mode is called duplex mode.
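The run-time update of critical data can be pictured as a mirrored write. In the minimal sketch below, the class `CpuCopy` and its `update`/`receive` methods are illustrative names, and a direct method call stands in for the high-speed interlink.

```python
class CpuCopy:
    """Hypothetical model of one CPU copy holding critical run-time data."""

    def __init__(self, copy_id):
        self.copy_id = copy_id
        self.critical = {}  # critical data that must match on both copies
        self.peer = None    # the other copy, reached over the interlink

    def update(self, key, value):
        """Write locally, then mirror the change to the peer."""
        self.critical[key] = value
        if self.peer is not None:
            self.peer.receive(key, value)  # stands in for an interlink transfer

    def receive(self, key, value):
        """Apply an update that arrived over the interlink."""
        self.critical[key] = value


def link(copy_a, copy_b):
    """Connect two copies so each sees the other as its peer."""
    copy_a.peer = copy_b
    copy_b.peer = copy_a
```

After `link(copy0, copy1)`, every `update()` on one copy leaves both `critical` dictionaries equal, which is the property duplex mode relies on.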

When the active copy develops a fault (Fig. 4), it fails to punch the watchdog and a time-out occurs. The time-out triggers a signal to the other copy, which immediately takes control of the system and raises an alarm indicating that a CPU switch-over has happened. The system now runs on a single CPU; this mode is called simplex mode.
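The switch-over sequence can be simulated with a tick-based watchdog. This is only a sketch: the names `Watchdog` and `DuplicatedSystem`, the three-tick time-out, and the decision to leave a standby failure unhandled are assumptions, not details taken from the system described here.

```python
class Watchdog:
    """Counts down once per tick unless it is punched."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks

    def punch(self):
        self.remaining = self.timeout_ticks

    def tick(self):
        """Advance one tick; return True once the timer has run out."""
        if self.remaining > 0:
            self.remaining -= 1
        return self.remaining == 0


class DuplicatedSystem:
    def __init__(self, active=0, timeout_ticks=3):
        self.active = active
        self.mode = "duplex"
        self.alarm = False
        self.watchdogs = [Watchdog(timeout_ticks), Watchdog(timeout_ticks)]

    def step(self, punched):
        """punched[i] is True if copy i punched its watchdog during this tick."""
        for copy_id, watchdog in enumerate(self.watchdogs):
            if punched[copy_id]:
                watchdog.punch()
                continue
            timed_out = watchdog.tick()
            # Only a time-out of the active copy is modelled here.
            if timed_out and copy_id == self.active and self.mode == "duplex":
                self.active = 1 - copy_id  # the other copy takes control
                self.mode = "simplex"
                self.alarm = True          # announce the switch-over
```

With the default time-out, three consecutive missed punches by the active copy move control to the other copy and raise the alarm.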

Meanwhile, the faulty CPU, reset by its watchdog timer, restarts and runs self-diagnostics to identify whether the problem is related to hardware or software. If the problem lies with the hardware, it displays the fault and alerts the user that a replacement is needed. Throughout, the copy that took over keeps controlling the system as usual, so the main functionality of the system does not suffer. Fig. 4 shows the sequence of events.

It is the system software’s responsibility to log that the faulty unit has been replaced with a good one and that the system has returned to duplex mode.
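The recovery path from simplex back to duplex mode can be outlined as a small state holder with a log. The class `RecoveryManager`, its method names, and the log wording are hypothetical. The sketch also assumes that a software fault is cleared by the restart itself, which the description above does not state.

```python
class RecoveryManager:
    """Hypothetical tracker for the return from simplex to duplex mode."""

    def __init__(self):
        self.mode = "simplex"  # a switch-over has already happened
        self.log = []

    def on_restart(self, hardware_ok):
        """Call after the faulty copy has restarted and run self-diagnostics."""
        if hardware_ok:
            # Assumption: a software fault is cleared by the restart itself.
            self._rejoin("software fault cleared by restart")
            return "rejoined"
        self.log.append("hardware fault: replacement required")
        return "replace"

    def on_replaced(self):
        """Call once the user has swapped in a good unit."""
        self._rejoin("faulty unit replaced")

    def _rejoin(self, reason):
        self.mode = "duplex"
        self.log.append(f"{reason}; system back in duplex mode")
```

A real system would also resynchronise the critical data over the interlink before declaring duplex mode; that step is left out to keep the sketch short.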

At this stage, note that a dual-CPU implementation can work in one of two ways, depending on how the software is implemented.

In the first variation, the FT system is fully controlled by one CPU, and the other CPU takes over when the active one fails. This is known as hot-standby architecture. Here, the software is simple, and only two critical elements need to be handled: the take-over portion of the software, and the update of system/user parameters through the interlink.
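The two critical elements of the hot-standby software can be shown from the standby copy's side. In this sketch, the class `StandbyCopy` and its method names are illustrative, and the parameters are plain dictionary entries.

```python
class StandbyCopy:
    """Hypothetical standby CPU in a hot-standby arrangement."""

    def __init__(self):
        self.params = {}     # last system/user parameters received
        self.active = False  # True once this copy controls the system

    def on_interlink_update(self, params):
        """Store the parameters sent by the active copy over the interlink."""
        self.params.update(params)

    def take_over(self):
        """Assume control, starting from the most recently mirrored parameters."""
        self.active = True
        return dict(self.params)
```

The standby does no control work until `take_over()` is called; it only needs its parameter set to be current at that moment.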

The second variation, which is more complex to implement in software but more precise, is known as load-sharing architecture. In this architecture, both CPUs execute the system software, while only the active one controls the system. The two CPUs hold almost identical system status, and the CPU load is shared for better functionality. This architecture suits mission-critical systems that need faster take-over and control.
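To illustrate why take-over is faster when both CPUs execute the system software, the sketch below runs the same toy computation on both copies and lets only the active copy's result drive the system. The names `ControlCopy` and `run_cycle`, and the running-sum computation, are assumptions made for the example.

```python
class ControlCopy:
    """Hypothetical CPU copy that executes the control software every cycle."""

    def __init__(self, copy_id):
        self.copy_id = copy_id
        self.total = 0  # toy internal state: running sum of input samples

    def execute(self, sample):
        self.total += sample
        return self.total


def run_cycle(copies, active_id, sample):
    """Both copies process the sample; only the active copy's output is used."""
    outputs = [copy.execute(sample) for copy in copies]
    return outputs[active_id]
```

Because the non-controlling copy has already computed the same state, changing `active_id` from one cycle to the next needs no catch-up step.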

The system software is critical regardless of the choice of architecture. Since most FT systems are real-time systems, they use real-time operating systems (RTOSs) or real-time kernels (RTKs), which complicates the development of FT software. Typically, every RTOS/RTK has a hardware-abstraction layer (HAL) (Fig. 5) between the hardware and the OS, so that porting the OS becomes easy. The HAL is written specifically for each processor, so that application developers are completely decoupled from the processors used.
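The decoupling that a HAL provides can be sketched as an abstract interface with one implementation per board. The interface `Hal`, the stand-in `FakeBoardHal`, and the function `application_cycle` are hypothetical names; a real HAL is normally written in C or assembly for the target processor.

```python
from abc import ABC, abstractmethod


class Hal(ABC):
    """Hypothetical hardware-abstraction interface seen by the application."""

    @abstractmethod
    def punch_watchdog(self):
        """Restart the watchdog timer."""

    @abstractmethod
    def send_interlink(self, payload):
        """Send bytes to the other CPU copy."""


class FakeBoardHal(Hal):
    """Stand-in for a processor-specific implementation."""

    def __init__(self):
        self.punches = 0
        self.sent = []

    def punch_watchdog(self):
        self.punches += 1

    def send_interlink(self, payload):
        self.sent.append(payload)


def application_cycle(hal, state):
    """Application code touches the hardware only through the HAL interface."""
    hal.punch_watchdog()
    hal.send_interlink(state)
```

Porting to another processor would mean writing a new `Hal` subclass while `application_cycle` stays unchanged.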

When an FT system is implemented, its performance depends on how the FT software is built into a classical RTOS/RTK application. Integrating it with the HAL gives the fastest response when handling a WD timer time-out. However, updates through the interlink need a complete link-handling driver.
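One task such a link-handling driver typically performs is framing each update so that the receiver can detect damage. The sketch below shows one possible approach using a sequence number, a length field, and a CRC-32 checksum. The frame layout and the names `encode_frame` and `decode_frame` are assumptions made for illustration, not the format of any particular interlink.

```python
import struct
import zlib

# Hypothetical frame layout: 16-bit sequence number, 16-bit payload length,
# payload bytes, then a 32-bit CRC over the header and payload.
HEADER = struct.Struct(">HH")
CRC = struct.Struct(">I")


def encode_frame(seq, payload):
    if len(payload) > 0xFFFF:
        raise ValueError("payload too long for a 16-bit length field")
    header = HEADER.pack(seq & 0xFFFF, len(payload))
    return header + payload + CRC.pack(zlib.crc32(header + payload))


def decode_frame(frame):
    """Return (seq, payload), or raise ValueError if the frame is damaged."""
    if len(frame) < HEADER.size + CRC.size:
        raise ValueError("frame too short")
    seq, length = HEADER.unpack_from(frame)
    body_end = HEADER.size + length
    if len(frame) != body_end + CRC.size:
        raise ValueError("length mismatch")
    (crc,) = CRC.unpack_from(frame, body_end)
    if zlib.crc32(frame[:body_end]) != crc:
        raise ValueError("CRC mismatch")
    return seq, frame[HEADER.size:body_end]
```

A complete driver would add acknowledgement, retransmission, and time-out handling on top of this framing.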