Triple-Modular Redundancy Deployment Optimization in the Sensor Readout System of the CBM Micro Vertex Detector

3 Sept 2019, 17:20
20m
Poster Radiation Tolerant Components and Systems Posters

Speaker

Yue ZHAO (IPHC)

Description

This paper describes the deployment and optimization of triple-modular redundancy (TMR) under tight design constraints to protect against single-event upsets (SEU) and single-event transients (SET). It covers the modeling of single-event effect (SEE) pulses with a TCAD mesh model, TMR deployment strategies, and verification methods. Simulation results show that the prototype with optimized TMR deployment meets the reliability requirements of the design: the system can run for more than 5 years without crucial errors, and the equivalent error rate in the working environment is lower than $10^{-9}$.

Summary

MIMOSIS-1 is a CMOS pixel sensor being designed for the Micro Vertex Detector (MVD) of the Compressed Baryonic Matter (CBM) experiment. CBM will record data from gold-gold and proton-gold collision systems. Highly ionizing particles generated in the collisions, such as gold and carbon ions and protons, may induce single-event effects (SEE): temporary or permanent functional errors in the circuit, such as single-event upset, single-event transient and single-event latch-up.
In logic circuits, triple-modular redundancy (TMR) provides high reliability against SEE. However, circuits hardened by TMR require at least three times the power and area of the original circuits. Therefore, balancing reliability against the other design parameters is crucial.
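As an illustration of the TMR principle mentioned above, the following minimal Python sketch models a bitwise majority voter over three redundant copies of a register. The function name `majority_vote` and the example values are hypothetical and are not taken from the MIMOSIS-1 design; the sketch only shows why a single upset in one copy is masked.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three redundant register copies."""
    return (a & b) | (b & c) | (a & c)


# Hypothetical 4-bit register value stored in three redundant copies.
golden = 0b1010
copy_a, copy_b, copy_c = golden, golden, golden

# A single-event upset flips bit 2 of one copy only.
copy_c ^= 0b0100

# The voter output still equals the uncorrupted value.
voted = majority_vote(copy_a, copy_b, copy_c)
```

The voter masks any upset confined to one copy. Two simultaneous upsets of the same bit in different copies would defeat it, which is one reason why the placement of TMR has to be verified rather than assumed.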
To cope with the high hit rate, the MIMOSIS-1 readout architecture implements a three-layer buffer structure. This architecture contains two types of digital circuits: control logic and the data buffer path. Calculation shows that the control logic of the readout system is vulnerable to SEE pulses, while the data path is not.
The control logic is described as a finite state machine (FSM) whose internal status counter is driven by an assignment loop. If an SEE occurs in the assignment loop, the state is not restored automatically until the FSM is reinitialized, and it may induce a crucial error in the system. If an SEE pulse occurs in the sequential processing, the corrupted status is flushed during processing.
The time needed to recover from an SEE pulse guides the decision whether to deploy TMR. An FSM controlled by a pair of Enable and Disable signals periodically returns to the idle state. If it is protected by TMR, the recovery time is only one clock period, well before the Disable signal arrives. This shorter recovery time helps to decrease the error rate.
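The difference in recovery time can be sketched with a toy cycle-based model. The following Python example is only an illustration under stated assumptions: a two-state FSM (`IDLE`, `BUSY`) that holds its state through a feedback loop, Enable asserted at cycle 0, and Disable asserted at the last cycle of the period. The names `next_state`, `vote` and `corrupted_cycles` and the timing values are hypothetical and do not describe the actual MIMOSIS-1 logic.

```python
IDLE, BUSY = 0, 1


def next_state(state: int, enable: bool, disable: bool) -> int:
    """Hold-type FSM: the state feeds back to itself unless Enable or Disable fires."""
    if disable:
        return IDLE
    if enable:
        return BUSY
    return state


def vote(a: int, b: int, c: int) -> int:
    """2-out-of-3 majority of three state copies."""
    return (a & b) | (b & c) | (a & c)


def corrupted_cycles(period: int, upset_cycle: int, use_tmr: bool) -> int:
    """Count the clock cycles in which the upset register holds a wrong value.

    Assumed timing: Enable at cycle 0, Disable at cycle period - 1.
    """
    golden = IDLE                # fault-free reference
    copies = [IDLE, IDLE, IDLE]  # only copies[0] is used without TMR
    wrong = 0
    for cycle in range(period):
        enable = cycle == 0
        disable = cycle == period - 1
        golden = next_state(golden, enable, disable)
        # With TMR, every copy reloads from the voted state on each clock edge.
        feedback = vote(*copies) if use_tmr else copies[0]
        copies = [next_state(feedback, enable, disable)] * 3
        if cycle == upset_cycle:
            copies[0] ^= 1       # single-event upset in one register copy
        if copies[0] != golden:
            wrong += 1
    return wrong


unprotected = corrupted_cycles(period=100, upset_cycle=10, use_tmr=False)
protected = corrupted_cycles(period=100, upset_cycle=10, use_tmr=True)
```

In this toy model the unprotected register stays wrong from the upset until Disable returns the FSM to idle, whereas the TMR-protected copy is overwritten with the voted state on the next clock edge.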
The reliability of the TMR deployment is evaluated by digital post-simulation. SEEs are modeled in the TCAD tool as transient pulses. The pulse amplitude is randomly generated according to the LET distribution of the incident particles, and the pulse width depends on the driver load ratio of the impacted node. Design and verification are carried out iteratively, module by module. At the end of the design, the system reliability is verified on the netlist with the parasitic parameters extracted after layout and routing. Both fanout and clock trees are taken into account.
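The pulse-generation step described above could be organized along the following lines. This Python sketch is not the authors' tool flow: the LET spectrum values are placeholders, the linear dependence of the pulse width on a load-to-drive ratio is an assumed stand-in for the TCAD-derived model, and all names (`LET_SPECTRUM`, `Pulse`, `injection_list`, the node names) are hypothetical.

```python
import random
from dataclasses import dataclass

# Placeholder LET spectrum: (LET value, relative weight). Not measured data.
LET_SPECTRUM = [(1.0, 0.70), (10.0, 0.25), (40.0, 0.05)]

# Assumed pulse width for a load-to-drive ratio of 1 (placeholder value).
BASE_WIDTH_NS = 0.1


@dataclass
class Pulse:
    node: str         # name of the impacted net
    start_ns: float   # injection time within the simulation window
    amplitude: float  # arbitrary units, taken as proportional to the sampled LET
    width_ns: float   # assumed to scale linearly with the load-to-drive ratio


def injection_list(nodes: dict, count: int, sim_time_ns: float, seed: int = 0) -> list:
    """Draw `count` random transient pulses for a fault-injection simulation.

    `nodes` maps a net name to its (hypothetical) load-to-drive ratio.
    """
    rng = random.Random(seed)
    names = sorted(nodes)
    lets = [let for let, _ in LET_SPECTRUM]
    weights = [weight for _, weight in LET_SPECTRUM]
    pulses = []
    for _ in range(count):
        node = rng.choice(names)
        let = rng.choices(lets, weights=weights, k=1)[0]
        pulses.append(Pulse(
            node=node,
            start_ns=rng.uniform(0.0, sim_time_ns),
            amplitude=let,
            width_ns=BASE_WIDTH_NS * nodes[node],
        ))
    return pulses


# Hypothetical nets with their load-to-drive ratios.
example_nodes = {"fsm_state": 2.0, "buf_ptr": 4.0}
pulses = injection_list(example_nodes, count=50, sim_time_ns=1000.0)
```

Each entry of such a list would then be applied as a forced transient on the corresponding net during simulation, and the run would be compared against a fault-free reference.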
We found that without TMR, the system is susceptible to SEE at an unacceptable level. After TMR optimization, the system can run for 5 years without crucial errors induced by SEEs. The equivalent error rate in the operational environment is lower than $10^{-9}$. Meanwhile, the area and power costs are only 20% and 50% higher, respectively, than those of the original design.
