To improve the architecture, modifications are made so that the PSUs share the load equally, and when one of them fails, the other takes 100 per cent of the load. This mode is known as load-sharing mode. In normal operation, each PSU in load-sharing mode is loaded to only half its capacity. This approach needs a current-sharing feature in the power supply, which makes the PSU design more complex.

There is, however, no restriction on the number of PSUs that can be added. In telecom applications, some core systems have three PSUs in current-sharing mode so that no PSU is loaded beyond 60 per cent of its capacity, which helps ensure reliable operation.
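The loading figures above follow from simple arithmetic. The following is a minimal sketch, assuming identical PSUs and ideal (perfectly equal) current sharing; the function name and the wattages are illustrative and not taken from any particular product.

```python
def load_fraction(total_load_w: float, psu_capacity_w: float, healthy_psus: int) -> float:
    """Fraction of its rated capacity that each healthy PSU carries.

    Assumes identical PSUs and ideal (perfectly equal) current sharing.
    """
    if healthy_psus < 1:
        raise ValueError("at least one healthy PSU is required")
    return total_load_w / (healthy_psus * psu_capacity_w)


# Two PSUs sized so that either one alone can carry the whole load.
print(load_fraction(500.0, 500.0, healthy_psus=2))  # each PSU at half capacity
print(load_fraction(500.0, 500.0, healthy_psus=1))  # survivor at full capacity

# Three PSUs loaded to 60 per cent each (illustrative numbers).
print(load_fraction(900.0, 500.0, healthy_psus=3))
# If one of the three fails, the remaining two carry more each.
print(load_fraction(900.0, 500.0, healthy_psus=2))
```

With these assumed numbers, losing one of three PSUs raises the loading of the other two from 60 to 90 per cent, which is still within their rating.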

To implement a duplicated PSU FT system, some system features need to be incorporated, such as:

1. In hot-stand-by mode or current-sharing mode, the main system should be informed of a PSU failure through a hardware signal so that it can raise an alarm and the faulty unit can be repaired.

2. When the system runs in hot-stand-by mode, the main system needs to switch between the PSUs periodically so that both PSUs are tested continuously.

3. In both cases, the system should have the option of PSU hot-plug-in. This ensures smooth running of the main system when a new PSU is introduced.
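The three features above can be sketched as a small supervisor model. This is only an illustration under stated assumptions: the class names, the `healthy` flag standing in for the hardware failure signal, and the two-slot layout are all hypothetical, and a real system would implement this in hardware and firmware.

```python
from dataclasses import dataclass, field


@dataclass
class Psu:
    name: str
    healthy: bool = True  # stands in for the hardware failure signal


@dataclass
class HotStandbySupervisor:
    """Hypothetical supervisor for two PSUs in hot-stand-by mode."""
    psus: list
    active: int = 0                          # slot currently feeding the load
    alarms: set = field(default_factory=set)  # slots with a pending alarm

    def _switch(self) -> None:
        other = 1 - self.active
        if self.psus[other].healthy:  # never switch onto a failed unit
            self.active = other

    def poll(self) -> None:
        """Feature 1: read the failure signals, raise alarms, fail over."""
        for slot, psu in enumerate(self.psus):
            if not psu.healthy:
                self.alarms.add(slot)
        if not self.psus[self.active].healthy:
            self._switch()

    def periodic_switch(self) -> None:
        """Feature 2: swap roles so both PSUs are exercised regularly."""
        self._switch()

    def hot_plug(self, slot: int, new_psu: Psu) -> None:
        """Feature 3: replace the idle PSU while the system keeps running."""
        if slot == self.active:
            raise RuntimeError("cannot replace the PSU that is carrying the load")
        self.psus[slot] = new_psu
        self.alarms.discard(slot)


sup = HotStandbySupervisor([Psu("PSU-A"), Psu("PSU-B")])
sup.periodic_switch()            # routine test: PSU-B becomes active
sup.psus[1].healthy = False      # PSU-B fails
sup.poll()                       # alarm raised, PSU-A takes over
print(sup.active, sorted(sup.alarms))
sup.hot_plug(1, Psu("PSU-B2"))   # replacement inserted while running
```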

Duplicated control unit based FT systems. When the system has to be completely fault-tolerant, duplication beyond the PSU is needed as other faults may still cause problems despite having duplicated PSUs in place. In the duplicated controller architecture, CPUs are duplicated so that, if one CPU or associated logic fails, the other CPU takes over. Duplicated CPU based FT systems are based on a combination of hardware and software.

 Fig. 6: High-level software architecture for FT systems

The FT mechanism relies on two essential features of the processors: a watchdog timer and a high-speed serial or parallel link between the two CPUs. Fig. 3 shows how duplicated CPU based FT systems are implemented.
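The article does not spell out how the watchdog timer and the inter-CPU link work together. One common arrangement, assumed here purely for illustration, is that the active CPU sends periodic heartbeats over the link, and the stand-by CPU takes over when its watchdog expires without one. The class names and the tick-based timing below are hypothetical.

```python
class WatchdogTimer:
    """Counts down every tick; expires unless it is kicked in time."""

    def __init__(self, timeout_ticks: int) -> None:
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks

    def kick(self) -> None:
        """Restart the countdown."""
        self.remaining = self.timeout_ticks

    def tick(self) -> bool:
        """Advance one tick; return True once the timer has expired."""
        if self.remaining > 0:
            self.remaining -= 1
        return self.remaining == 0


class StandbyCpu:
    """Hypothetical stand-by CPU that watches heartbeats from the active CPU."""

    def __init__(self, timeout_ticks: int) -> None:
        self.watchdog = WatchdogTimer(timeout_ticks)
        self.is_active = False

    def on_tick(self, heartbeat_received: bool) -> None:
        if self.is_active:
            return
        if heartbeat_received:
            self.watchdog.kick()       # peer is alive
        elif self.watchdog.tick():
            self.is_active = True      # peer went silent: take over


standby = StandbyCpu(timeout_ticks=3)
# Heartbeats seen on the inter-CPU link; the last one arrives at tick 3.
link = [True, True, True, True, False, False, False, False]
takeover_tick = None
for t, beat in enumerate(link):
    standby.on_tick(beat)
    if standby.is_active and takeover_tick is None:
        takeover_tick = t
print("stand-by took over at tick", takeover_tick)
```

In this model the gap between the last heartbeat and the takeover equals the watchdog timeout, so a shorter timeout detects a failure sooner but leaves less margin for a heartbeat that is merely late.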

When an FT system based on duplicated CPUs is implemented, the following four aspects need to be understood well:
1. Time taken for the good CPU to take over from the faulty one (known as switch-over time)
2. Consistency of system data and user data between the two CPUs (data integrity)
3. Interface to the common control element by the two CPUs (redundant CPU bus interface)
4. Built-in diagnostics to identify and isolate problems in the system periodically (built-in self-test)

As we can see, duplicated CPUs work as a combination of hardware and software elements. Let us see how the system is implemented to help us understand the FT operation well.
