## TACKLING OF SINGLE-EVENT UPSET IN MICROPROCESSOR-BASED ARCHITECTURE

## Abhishek Jadhav

Department of Electronics & Telecommunication, Fr. C. Rodrigues Institute of Technology, Navi Mumbai, India.

Abstract: Over the last five decades, considerable research and development has gone into fault-tolerant microprocessors. Reliability issues are increasing for embedded systems in aerospace applications, and telephone switching systems and industrial embedded systems face the same problem. This paper discusses how various processors tackle Single-Event Upsets using mitigation techniques such as SEC/DED ECC, dual and triple modular redundancy, parity, and fault-tolerant functional units.

IndexTerms - Single-Event Effect, Single-Event Upset, Modular redundancy, Error correcting code, Arithmetic Logic Unit, Reduced Instruction Set Computer.

Introduction: Faults can be introduced at any point in the design and manufacturing process, in hardware as well as in software. They can be specification errors, implementation errors, component failures or external disturbances, and they can originate at the architectural level, from algorithmic mistakes, or from external factors such as electromagnetic interference. In aerospace applications, high-energy particles from the sun or from cosmic rays can penetrate the IC packaging and cause electrical disturbances known as Single-Event Effects (SEEs). An SEE can be a soft error (no permanent damage) or a hard error (permanent damage to the device or circuit). An SEE forms in three steps: charge deposition, charge collection and circuit response. A Single-Event Upset (SEU) is one type of SEE and occurs in storage elements such as SRAMs. An SEU is a change in the state of a storage element inside a device or system. Since it is a soft error, it can be corrected and does not damage the hardware permanently. The first observation of SEUs, in flip-flops on a communications satellite, was reported in 1975. According to the researchers, galactic cosmic rays charged the base-emitter capacitance of sensitive transistors to the turn-on voltage [1]. Since that first observation, research into fault-tolerant microprocessor-based architectures has increased.

Microprocessor based on SPARC V8 Architecture: LEON-FT is a fault-tolerant 32-bit microprocessor based on the RISC SPARC V8 ISA (Instruction Set Architecture). The processor tackles transient Single-Event Upset errors using techniques such as TMR registers, on-chip EDAC and parity. A single SEU has been observed to cause errors in nearby cells [2], so in dense RAM blocks two parity bits per data word are used to detect double errors: one parity bit covers the odd data bits and the other covers the even data bits. Because adjacent bits of the same data word fall under different parity bits, this scheme also detects the adjacent-cell errors described in [2] [3].
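The effect of splitting parity across odd and even bit positions can be illustrated with a minimal Python sketch. The function names and the 32-bit word width are assumptions made for illustration and are not taken from [3].

```python
from typing import Tuple


def interleaved_parity(word: int, width: int = 32) -> Tuple[int, int]:
    """Return (parity of even-indexed bits, parity of odd-indexed bits)."""
    even = odd = 0
    for i in range(width):
        bit = (word >> i) & 1
        if i % 2 == 0:
            even ^= bit
        else:
            odd ^= bit
    return even, odd


def parity_ok(word: int, stored: Tuple[int, int], width: int = 32) -> bool:
    """True when the recomputed parity pair matches the stored pair."""
    return interleaved_parity(word, width) == stored


word = 0x12345678
stored = interleaved_parity(word)

single_flip = word ^ (1 << 5)              # one upset bit
adjacent_flip = word ^ (0b11 << 5)         # bits 5 and 6 upset together
same_group_flip = word ^ (1 << 5) ^ (1 << 7)  # two odd-indexed bits

print(parity_ok(single_flip, stored))      # False: detected
print(parity_ok(adjacent_flip, stored))    # False: detected
print(parity_ok(same_group_flip, stored))  # True: not detected
```

A double flip in two adjacent cells changes one bit in each parity group, so both parity bits mismatch. A single parity bit over the whole word would miss it, because the total number of ones changes by an even amount. Two flips inside the same group still go undetected, which is the limit of this scheme.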

This processor has around 2,500 flip-flops, which serve as temporary storage or hold state-machine state. To protect against SEU errors, each on-chip register is implemented using triple modular redundancy (TMR) [3].
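As an illustration of the principle, a TMR register can be modelled as three copies of the same value with a bitwise majority voter on the output. This is a behavioural sketch only; the class and method names are hypothetical and do not describe the actual LEON-FT circuit.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 vote across three copies of a value."""
    return (a & b) | (a & c) | (b & c)


class TmrRegister:
    """Behavioural model of a register stored three times."""

    def __init__(self, value: int = 0):
        self.copies = [value, value, value]

    def write(self, value: int) -> None:
        # A new clock edge overwrites all three copies, flushing any upset.
        self.copies = [value, value, value]

    def read(self) -> int:
        return majority(*self.copies)

    def upset(self, copy: int, bit: int) -> None:
        """Simulate an SEU by flipping one bit in one copy."""
        self.copies[copy] ^= 1 << bit


reg = TmrRegister(0xCAFE)
reg.upset(copy=1, bit=3)
print(hex(reg.read()))  # 0xcafe: the voter masks the single upset
```

The voter masks an upset as long as at most one copy of any given bit is wrong. If the same bit is hit in two copies before the register is rewritten, the vote returns the corrupted value.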



Figure 1: Separate clock trees [3]

To increase effectiveness, separate clock trees (figure 1) are implemented, so that an SEU hit in one clock tree can be tolerated even if it corrupts the data of a complete set of 2,500 registers. On the next clock edge, new data is clocked in and the corrupted data is overwritten, which removes all errors. An SEU hit in the clock pad cannot be tolerated, since it propagates to all three clock trees, as shown in figure 1. Such a hit is considered very unlikely [3].

External memory is protected against SEUs by an on-chip error detection and correction (EDAC) unit. It implements a standard (32,7) BCH code, which corrects one error and detects two errors per 32-bit word [3].
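The single-error-correcting, double-error-detecting behaviour of such a unit can be sketched in Python. The sketch below uses a textbook extended Hamming code with 7 check bits per 32-bit word, which is an assumption chosen for clarity; it is not the BCH code of [3], and the bit layout and function names are hypothetical.

```python
CODE_LEN = 39                        # position 0 holds the overall parity bit
CHECK_POSITIONS = (1, 2, 4, 8, 16, 32)
DATA_POSITIONS = [p for p in range(1, CODE_LEN) if p not in CHECK_POSITIONS]


def encode(data):
    """Encode a 32-bit word into a list of 39 bits."""
    bits = [0] * CODE_LEN
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (data >> i) & 1
    # Each Hamming check bit makes the parity of its covered positions even.
    for c in CHECK_POSITIONS:
        bits[c] = sum(bits[p] for p in range(1, CODE_LEN) if p & c) % 2
    # The overall parity bit makes the whole codeword even.
    bits[0] = sum(bits[1:]) % 2
    return bits


def decode(bits):
    """Return (data, status); data is None when the word cannot be trusted."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, CODE_LEN):
        if bits[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    if sum(bits) % 2:                # odd overall parity: odd number of flips
        if syndrome >= CODE_LEN:
            return None, "uncorrectable"
        bits[syndrome] ^= 1          # syndrome 0 means the parity bit itself
        status = "corrected"
    elif syndrome:                   # even parity but non-zero syndrome
        return None, "double error"
    else:
        status = "ok"
    data = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return data, status


codeword = encode(0xDEADBEEF)
codeword[17] ^= 1                    # inject a single upset
print(decode(codeword))              # (3735928559, 'corrected')
```

With two flipped bits the overall parity stays even while the syndrome is non-zero, so the decoder reports the error instead of miscorrecting it.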

Fault Tolerant LEON 3 Processor: This processor is designed for the harsh space environment and achieves fault tolerance by detecting and correcting single-event upsets (SEUs). Like LEON-FT, LEON 3 is a 32-bit processor based on the RISC SPARC V8 ISA (Instruction Set Architecture). The memory tolerates SEUs through error correction of up to 4 errors per tag or 32-bit word [4]. It has a fault-tolerant 32-bit PROM/SRAM memory controller that also provides error detection and correction (EDAC), correcting one error and detecting two. The caches have a tag array and a data array. To protect against SEU errors, the register file is implemented in flip-flops, with triple modular redundancy (TMR) on all flip-flops [4]. Error detection and correction have no timing impact.

LEON 3 was tested for fault tolerance using a Californium (Cf-252) source. The test ran for 3 hours with a flux of 25 particles/s/cm² at the device surface. The processor reported 281 effective SEU errors, of which 99% were corrected [4].

IBM z990 Processor: The IBM z990 is designed to detect and recover from several kinds of soft and permanent errors. It detects SEU errors and recovers from almost all upsets. Error correction codes are used to protect against hard errors as well as soft errors. In this processor, each 32-bit word in each cache/data chip (SCD) [5] is protected by a (39/32) single error correction/double error detection (SEC/DED) code [6]. Without such mitigation techniques, the soft error rate in DRAMs, SRAMs, register files, etc. would increase. When data is stored to or fetched from the wrong address, the ECC still checks as correct. To avoid such silent data errors, extra ECC code space is used to detect address errors in memory [7]. When there are correctable errors in the cache, the line is changed. If the line is not changed, the data is discarded and fetched again. The register files are protected by either ECC or parity. In some cases errors do not affect the system, so no protection is needed.



Figure 2: Multichip module with 8 cores, 4 cache/chip (SCD), 2 main storage controller (MSC) and clock [5]

Buses are not protected against soft or hard errors by ECC. Instead, the fetch data bus and the I/O bus of the processor are parity protected. When a parity error is detected, the central processor enters checkpoint recovery and the cache can be cleared. The data is then refetched as needed from the other cache and memory. If the refetch is unsuccessful, the processor sets a threshold, which causes the central processor to checkstop and triggers a central processor sparing event. The command and address controls at the interfaces are ECC protected.
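The general pattern of a parity-checked fetch with bounded refetching can be sketched as follows. This is a simplified illustration and not the z990 recovery mechanism of [5]: the retry-count threshold, the `Checkstop` exception and the simulated bus are all assumptions introduced here.

```python
class Checkstop(Exception):
    """Stands in for the checkstop raised when refetching keeps failing."""


def parity(word):
    """Even parity bit: 1 when the word holds an odd number of ones."""
    return bin(word).count("1") & 1


def fetch_with_recovery(read_bus, threshold=3):
    """Fetch a word, refetching on parity errors up to `threshold` attempts.

    `read_bus` is a callable returning (word, parity_bit).
    Returns (word, number_of_attempts_used).
    """
    for attempt in range(1, threshold + 1):
        word, parity_bit = read_bus()
        if parity(word) == parity_bit:
            return word, attempt
        # Parity mismatch: discard the word and refetch.
    raise Checkstop("parity errors persisted for %d fetches" % threshold)


def make_bus(word, corrupt_first):
    """Simulated bus that flips one bit on the first `corrupt_first` reads."""
    state = {"reads": 0}

    def read_bus():
        state["reads"] += 1
        if state["reads"] <= corrupt_first:
            return word ^ 1, parity(word)
        return word, parity(word)

    return read_bus


print(fetch_with_recovery(make_bus(0xABCD, corrupt_first=1)))  # (43981, 2)
```

A transient upset on the bus is absorbed by the refetch, while a fault that persists past the threshold escalates instead of looping indefinitely.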

SHAKTI F-Class RISC Microprocessor: This is an in-order, 5-stage processor based on the RISC-V Instruction Set Architecture, which is completely open source and royalty free [8]. It is the fault-tolerant version of the C-class processor, a RISC-based microprocessor. It uses error correcting codes (ECC) to protect registers and memories, and a combination of space and time redundancy techniques to protect against errors in the Arithmetic Logic Unit (ALU) [9]. It also proposes new ideas for SEU mitigation, namely re-computation techniques for detecting errors in the addition/subtraction and multiplication modules. Single-Error-Correction and Double-Error-Detection (SEC-DED) codes protect the instruction memory, data memory, program counter and register files. Here, 7 check bits are added to each 32-bit instruction, data word and program counter value, giving a 39-bit word that is fetched. Since the ALU is large, it is more prone to SEEs, so it is divided into five functional units. Dual modular redundancy (DMR) is used to protect the ALU from errors.



Figure 3: Block diagram of fault tolerant functional unit [9]

As seen in figure 3, each fault-tolerant functional unit has a primary $(FU_P)$ and a redundant $(FU_R)$ functional unit. These units can operate in two modes, normal and re-computational. In normal mode, the inputs are fed to both units and the outputs generated by $FU_P$ and $FU_R$ are compared to classify the result as no error, transient error or permanent error. When the outputs differ for fewer than three consecutive cycles, the error is transient; when they differ for more than three consecutive cycles, it is permanent. On a transient error, the inputs are fed again until the outputs agree; if the mismatch persists for more than three cycles, the error is classified as permanent. Re-computational mode is used for permanent errors.
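The compare-and-retry behaviour in normal mode can be sketched in Python. This is a behavioural illustration under stated assumptions, not the design of [9]: the class name, the `limit` parameter and the simulated faulty adder are hypothetical, and the sketch treats a mismatch as permanent once it has lasted more than three consecutive cycles.

```python
class FaultTolerantUnit:
    """Primary and redundant units whose outputs are compared every cycle."""

    def __init__(self, primary, redundant, limit=3):
        self.primary = primary
        self.redundant = redundant
        self.limit = limit  # consecutive mismatches tolerated as transient

    def execute(self, a, b):
        """Return (result, classification) for one operation."""
        mismatches = 0
        while True:
            out_p = self.primary(a, b)
            out_r = self.redundant(a, b)
            if out_p == out_r:
                kind = "no error" if mismatches == 0 else "transient error"
                return out_p, kind
            mismatches += 1
            if mismatches > self.limit:
                # Re-computation mode would take over here; not modelled.
                return None, "permanent error"
            # Otherwise feed the same inputs again on the next cycle.


def add(a, b):
    return a + b


def make_faulty_adder(bad_cycles):
    """Adder that produces a wrong sum for its first `bad_cycles` calls."""
    state = {"calls": 0}

    def adder(a, b):
        state["calls"] += 1
        return a + b + (1 if state["calls"] <= bad_cycles else 0)

    return adder


unit = FaultTolerantUnit(add, make_faulty_adder(bad_cycles=2))
print(unit.execute(2, 3))  # (5, 'transient error')
```

Comparison alone shows that the two units disagree but not which one is wrong, which is why the sketch retries with the same inputs instead of picking one of the outputs.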

Conclusion: This paper has discussed the ways microprocessor-based architectures have tackled Single-Event Upsets, which have become a reliability issue for aerospace applications. Various mitigation techniques can be seen in these processors, including the combination of space and time redundancy [9] used in the SHAKTI F-Class processor. There are always tradeoffs: the penalty is higher for DMR and TMR than for the ideas proposed in the SHAKTI F-Class processor. In future, we may see fault-tolerant superscalar processors using more open ISAs such as RISC-V and OpenPOWER.

## References:

- 1. D. Binder, E. C. Smith and A. B. Holman, "Satellite Anomalies from Galactic Cosmic Rays," in IEEE Transactions on Nuclear Science, vol. 22, no. 6, pp. 2675-2680, Dec. 1975, doi: 10.1109/TNS.1975.4328188.
- 2. J. A. Zoutendyk, L. D. Edmonds and L. S. Smith, "Characterization of multiple-bit errors from single-ion tracks in integrated circuits," in IEEE Transactions on Nuclear Science, vol. 36, no. 6, pp. 2267-2274, Dec. 1989, doi: 10.1109/23.45434.
- 3. J. Gaisler, "A portable and fault-tolerant microprocessor based on the SPARC v8 architecture," Proceedings International Conference on Dependable Systems and Networks, Washington, DC, USA, 2002, pp. 409-415, doi: 10.1109/DSN.2002.1028926.
- 4. Z. Stamenkovic, C. Wolf, G. Schoof and J. Gaisler, "An Implementation Study on Fault Tolerant LEON-3 Processor System," 2006.
- 5. P. J. Meaney, S. B. Swaney, P. N. Sanda and L. Spainhower, "IBM z990 soft error detection and recovery," in IEEE Transactions on Device and Materials Reliability, vol. 5, no. 3, pp. 419-427, Sept. 2005, doi: 10.1109/TDMR.2005.859577.
- 6. P. K. Lala, P. Thenappan and M. T. Anwar, "Single error correcting and double error detecting coding scheme," in Electronics Letters, vol. 41, pp. 758-760, 2005, doi: 10.1049/el:20050614.
- 7. -, "Detecting address faults in an ECC-protected memory," U.S. Patent 6 457 154, Sep. 24, 2002.
- 8. The RISC-V Instruction Set Manual, Volume I: Base User-Level ISA
- 9. S. Gupta, N. Gala, G. S. Madhusudan and V. Kamakoti, "SHAKTI-F: A Fault Tolerant Microprocessor Architecture," 2015 IEEE 24th Asian Test Symposium (ATS), Mumbai, 2015, pp. 163-168, doi: 10.1109/ATS.2015.35.

