## Single Event Effect Mitigation Techniques, Testing and Verification of the RD53 chips

**EP-ESE Electronics Seminar** 

Jelena Lalic, on behalf of the RD53 Collaboration

May 2, 2023

## **RD53 Collaboration and the Final Chips**



### RD53 collaboration

#### ATLAS and CMS

- 24 institutes
- next-generation pixel chips for phase 2 LHC upgrade
- common design framework for ATLAS and CMS
- 65 nm CMOS technology

Design requirements:

- High hit rate: 3 GHz/cm<sup>2</sup>
- High trigger rate: 1 MHz (ATLAS), 750 kHz (CMS)
- Trigger latency: 12.5 us
- Hostile radiation environment: 1 Grad over 10 years, 10<sup>16</sup> hadrons/cm<sup>2</sup>
- High SEE tolerance (200 Hz/chip SEU rate in the inner layer)



Figure: The ATLAS and CMS chips differ in the analog front-end and the pixel array size, but 99% of the functionality is the same.

## **RD53 Timeline**



What are we covering today?

- RD53 SEE-related design challenges and SEE mitigation approach
- Problems during beam testing (preproduction chips)
- Identifying design issues (preproduction chips)
  - Two-Photon-Absorption (TPA) testing of critical analog IP blocks
  - Single Event Upset (SEU) verification
- Final ATLAS chip estimates based on the SEE verification

## Part I

## SEE Mitigation Approach in RD53 and Design Challenges

## **Complex chip architecture**





- ~150k pixels
- multi-level processing, buffering, and event building
- time-tag-based latency buffering
- high-density logic and data buffers (500 million transistors!)

- 12 million FFs and latches
- 55 mil standard cells
- PLL and other critical IPs optimized against SETs
- handshaking between control and data path



## **TMR protection**





TMR 1: only part of the pixel conf. bits

TMR 2: in the digital chip bottom

#### TMR 1

- critical pixel conf. bits
- implemented @ RTL
- SEU has a limited effect
- 100 times more SEE tolerant than a simple latch (based on the proton beam tests)
- continuous reconfiguration (can be as high as ~10 Hz; 0.1 Hz seems sufficient)

#### TMR 2

- global conf. bits and critical data (state machines, look-up tables, handshaking signals, etc.)
- SET protection (triplicated clock and time skew)
- logic and voters are not triplicated
- 400 times more SEE tolerant than a simple latch (based on the proton beam tests)

\*\* Measured on the preproduction chips; the time skew has improved since then (from an average of 250 ps to an average of 350 ps for the final ATLAS chip).
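The SET filtering provided by TMR with time skew can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not the RD53 implementation: the function names, the three-sample model, and the representation of a SET as a fixed-width inversion of one non-triplicated data line are all hypothetical. Three register copies sample the same line at instants separated by the skew, and a 2-out-of-3 voter decides the stored value.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def sample_with_skew(true_value, set_start_ps, set_width_ps, skew_ps, t0_ps=0.0):
    """Sample one data line three times, at t0, t0 + skew and t0 + 2*skew.

    Assumption: a SET inverts the line during
    [set_start_ps, set_start_ps + set_width_ps).
    """
    samples = []
    for k in range(3):
        t = t0_ps + k * skew_ps
        hit = set_start_ps <= t < set_start_ps + set_width_ps
        samples.append(true_value ^ 1 if hit else true_value)
    return samples


def voted_value(true_value, set_start_ps, set_width_ps, skew_ps):
    """Value seen after the voter for one SET on the shared data line."""
    return majority(*sample_with_skew(true_value, set_start_ps, set_width_ps, skew_ps))


# 350 ps skew: a 100 ps transient corrupts one sample and is voted out,
# while a 500 ps transient can corrupt two samples in this model.
short_set = voted_value(1, -50.0, 100.0, 350.0)
long_set = voted_value(1, -50.0, 500.0, 350.0)
```

In this model a transient shorter than the skew can corrupt at most one of the three samples, which is why a larger skew filters wider SETs.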

## How this compares to LpGBT

### Why LpGBT?

- very strict SEE tolerance requirements (used by many systems)
- ~90% of the design is fully redundant (the rest is protected with FEC or temporal redundancy, or consists of unprotected test features)
- Extensively verified and tested against SEEs

| LpGBT            | Count |
|------------------|-------|
| Standard cells   | 455k  |
| Sequential cells | 34k   |

91% of seq. cells used in the full TMR scheme

Many thanks to Szymon K. for providing LpGBT data.

Figure: TMR diagram (combinational logic, D flip-flops, majority voters).



RD53: TMR with time skew

| RD53             | Count  |
|------------------|--------|
| Standard cells   | 56 mil |
| Sequential cells | 15 mil |

15% of seq. cells used in TMR schemes (TMR 1 and TMR 2)





## Why Not Full TMR?



A bit of history:

- The TMRG tool was released when the RD53 design was already well underway
- The TMRG tool did not support SystemVerilog interfaces, which were already heavily used in RD53

Technical reasons:

- The RD53 control path relies on feedback from the data path (many handshaking signals)
- TMR with skew provides SET filtering for non-triplicated data path signals going to the control path

## Part II

## SEE Verification of Digital Design

## **SEE** verification



#### Requirements



## **RD53 SEE Testing and Verification**



- SEE testing of the preproduction chips was carried out with limited SEE verification results (due to lack/loss of people during the project)
- SEE failure conditions were debugged during beam tests (not time-efficient; the chip is a black box)
- New SEE UVC integrated into the RD53 verification framework



It is very hard to do SEE tests under realistic hit/trigger conditions.

## Stuck Hit Readout in the Beam Tests (Preproduction Chips)





2 independent output channels. Hit and service data are time-multiplexed on the output serial link.



Different pixel array configurations during beam tests.

#### Hit readout issue:

- The hit data readout channel can get stuck: the chip does not respond to sent triggers
- The CLEAR cmd always recovers the hit readout link
- Service data readout link and input CMD link always work reliably

Irradiation tests @PS-CERN (24 GeV protons).

#### From requirements perspective:

- Stuck hit readout is not a surprise (time-tag-based latency buffering and trigger table; the TMR cross-section is not zero)
- The failure rate should be as low as possible
- Extensive SEE verification is needed to ensure that this is the case

## **RD53 SEE Tolerance Requirements**



| SEE Failures                            | Status       | Comment                                                                                                |
|-----------------------------------------|--------------|--------------------------------------------------------------------------------------------------------|
| Lost or ghost hit                       | Accepted     | up to ~0.1%                                                                                            |
| Missing/corrupted event                 | Accepted     | up to ~0.01%                                                                                           |
| Stuck hit readout                       | Tolerated    | must recover by CLEAR<br>(sending a global CLEAR at up to 10-100 Hz is tolerable, but should be avoided) |
| Anything that requires<br>power cycling | Not accepted | serial powering                                                                                        |

## **RTL** versus gate level simulations



| RD53 SEE Verification Tasks                                                  | can be done<br>@RTL | can be done<br>@GL | Comment                                                    |
|------------------------------------------------------------------------------|---------------------|--------------------|------------------------------------------------------------|
| SEU fault injections on the<br>outputs of unprotected<br>sequential elements | yes                 | yes                | naming convention to<br>determine if<br>triplicated or not |
| SET fault injection<br>simulations                                           | no                  | yes                | supported only at GL                                       |
| Implementation of<br>pixel config. bits TMR is OK                            | yes                 | yes                | triplication at RTL                                        |
| Implementation of<br>TMR with skew is OK                                     | no                  | yes                | triplication at GL                                         |
| Disabling one of<br>the triplicated clocks is OK                             | no                  | yes                | triplication at GL                                         |

Simulation level used for a specific verification task

Strict SEE checkers

Relaxed SEE checkers

Tasks with relaxed SEE checkers are most difficult to debug.

Everything feasible to do at RTL should be done at RTL.

Gate-level simulations require 10-100 times more time and resources.


## Nodes for Fault Injections And Simulated Chip Size





- SEE simulations are done separately for the pixel array and the digital chip bottom to facilitate and speed up debugging
- The majority of the simulations are done on the *BABY* chip; 25 times more time/resources are needed for the *Full* chip (efficient management of simulation resources is important)

## **RD53 Fault Injection Simulations**







Double time/resources for SEE simulations: a fault injection simulation and a reference simulation always run together.

Reference simulation: fault-injection-free simulation

Events are compared between the fault injection simulation and the reference simulation for the same SEED.
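The event comparison between a fault injection run and its reference run can be sketched as follows. This is a minimal illustration, not the RD53 verification code: the event representation (a hashable `(trigger_id, hits)` tuple) and the definitions of "lost" and "ghost" as set differences are assumptions made for the example.

```python
from collections import Counter


def compare_events(reference, faultsim):
    """Compare event lists from a reference and a fault injection run.

    Assumption: each event is a hashable record, here (trigger_id, hits).
    """
    ref, flt = Counter(reference), Counter(faultsim)
    lost = sum((ref - flt).values())     # in the reference, missing from the fault run
    ghost = sum((flt - ref).values())    # in the fault run, absent from the reference
    matched = sum((ref & flt).values())  # identical in both runs
    return {"lost": lost, "ghost": ghost, "matched": matched}


# Both runs use the same SEED, so without faults the lists would be identical.
reference = [(1, (3, 4)), (2, (5,)), (3, ())]
faultsim = [(1, (3, 4)), (2, (5, 6)), (4, ())]
summary = compare_events(reference, faultsim)
```

Because the two runs share the stimulus, any difference in the event lists can be attributed to the injected faults.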

## **SEU Fault Injection Simulations**





High hit/trigger rate (conditions in the inner layer):

- 3.5 GHz/cm<sup>2</sup> hit rate
- 1 MHz trigger rate

2 main tests:

- Random Test: randomization across all chip conf.
- Standard Test: default chip configuration in the inner layer

SEU acceleration factor per FF: 70 million (pixel array), 100 million (digital chip bottom, DCB)

## Hit Readout Stuck in Simulations



#### SEU fault injections in non-TMR nodes in pixel array at RTL



If Token gets asserted during a 9-clock-cycle window, it causes the chip to get stuck.

| Trig state | Start state | Latency mem. |
|------------|-------------|--------------|
| 0          | 0           | idle         |
| 0          | 1           | counting     |
| 1          | 1           | triggered    |
| 1          | 0           | toRead       |

2-bit state register for each latency memory (8 latency memories per pixel region)
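The state encoding in the table can be used to enumerate which single-bit upsets move a latency memory into `toRead`. The sketch below is only an illustration of the table: the function names are hypothetical, and it says nothing about the surrounding readout logic.

```python
# (Trig state, Start state) -> latency memory state, as in the table above
STATES = {(0, 0): "idle", (0, 1): "counting", (1, 1): "triggered", (1, 0): "toRead"}


def flip(state_bits, bit):
    """Return (trig, start) after a single-bit upset of 'trig' or 'start'."""
    trig, start = state_bits
    return (trig ^ 1, start) if bit == "trig" else (trig, start ^ 1)


def upsets_into(target="toRead"):
    """List (flipped bit, original state) pairs that land in the target state."""
    found = []
    for bits, name in STATES.items():
        for bit in ("trig", "start"):
            if STATES[flip(bits, bit)] == target:
                found.append((bit, name))
    return found


spurious_to_read = upsets_into("toRead")
```

The two transitions found this way (a Trig flip from `idle` and a Start flip from `triggered`) are the ones scaled in the backup slide.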

Rate of this failure: about one per second per chip. (Check backup for scaling details.)

Cannot be triplicated.

## Hit readout stuck



#### **PROBLEM:**

```
// Cleared only if Token is low when the last data word arrives
if (DataLast & ~Token)
  TriggerProcessing <= '0;
```

Time between hit readout stuck states: ~1 s per chip.

#### FIX:

```
// Cleared whenever the last data word arrives, regardless of Token
if (DataLast)
  TriggerProcessing <= '0;
```

Time between hit readout stuck states: ~5-10 days per chip.
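The difference between the two clear conditions can be seen in a toy cycle-by-cycle model. Only the two conditions are taken from the RTL lines above; everything else (the function name, the per-cycle sequences, and the assumption that an upset asserts Token in the cycle where DataLast arrives) is hypothetical and far simpler than the real readout logic.

```python
def run(data_last, token, fixed):
    """Toy model of the TriggerProcessing flag.

    data_last, token: per-cycle 0/1 sequences of equal length.
    fixed: False uses (DataLast & ~Token), True uses (DataLast) as clear condition.
    Returns the flag after the last cycle (1 = still processing, i.e. stuck).
    """
    trigger_processing = 1  # a trigger is being processed at the start
    for dl, tk in zip(data_last, token):
        clear = bool(dl) if fixed else bool(dl and not tk)
        if clear:
            trigger_processing = 0
    return trigger_processing


data_last = [0, 0, 1, 0]
token_ok = [0, 0, 0, 0]
token_upset = [0, 0, 1, 0]  # assumed SEU: Token spuriously high with DataLast

original_nominal = run(data_last, token_ok, fixed=False)
original_upset = run(data_last, token_upset, fixed=False)
fixed_upset = run(data_last, token_upset, fixed=True)
```

In this model the original condition misses its only chance to clear the flag when Token is spuriously high, while the fixed condition clears it in every case.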

There is no *small* RTL change.

One RTL line can cause severe SEE failures.

RTL needs to be verified against SEEs after every change or new feature.

No way to get this right while writing RTL.

## SEU injections in non-triplicated nodes





Figure: Random Test SEU regression (@ 100 mil FF acceleration factor).

#### Digital Chip Bottom - Final Chip Estimates

Scaling for the inner layer:

- Simulations: chip gets stuck every ~55k events (at 1 MHz trigger rate)
- SEU injection rate per FF: 500 MHz / 300k ≈ 1.5E3 Hz
- FF SEU rate in the inner layer: 1E9 × 1.5E-14 = 1.5E-5 Hz (HEH rate in the inner layer: 1 GHz/cm<sup>2</sup>; FF HEH cross-section: 1.5E-14 cm<sup>2</sup>)
- Acceleration factor per FF: 1.5E3 / 1.5E-5 = 1E8
- Events per stuck in the inner layer: 55k × 1E8 = 55E11
- At 1 MHz trigger rate: 55E11 / 1E6 = 55E5 seconds, i.e. ~50-60 days
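The scaling chain above can be checked with a few lines of Python. The numbers are the ones quoted on the slide (including the rounded injection rate of 1.5E3 Hz per FF); the variable and function names are illustrative only.

```python
HEH_RATE = 1e9            # HEH rate in the inner layer [1/(cm^2 s)]
FF_HEH_XSEC = 1.5e-14     # FF HEH cross-section [cm^2]
INJ_RATE_PER_FF = 1.5e3   # simulated SEU rate per FF [Hz], ~500 MHz / 300k FFs
TRIGGER_RATE = 1e6        # trigger rate [Hz]
SECONDS_PER_DAY = 86400.0


def days_between_stuck(events_per_stuck_sim, acceleration):
    """Scale simulated events-per-stuck to days between stuck states."""
    events_real = events_per_stuck_sim * acceleration
    return events_real / TRIGGER_RATE / SECONDS_PER_DAY


ff_seu_rate = HEH_RATE * FF_HEH_XSEC           # ~1.5e-5 Hz per FF
acceleration = INJ_RATE_PER_FF / ff_seu_rate   # ~1e8
days = days_between_stuck(55e3, acceleration)  # ~64 days
```

With these inputs the result is about 64 days, the same order of magnitude as the estimate quoted above.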

|                | Read Events<br>Faultsim | Read Events<br>Reference | Ghost Events | Lost Events |
|----------------|-------------------------|--------------------------|--------------|-------------|
| Random conf.   | 160k                    | 160.5k                   | 3%           | <0.5%       |
| Standard conf. | 210k                    | 250k                     | 21%          | <16%        |

@ 100 mil FF acceleration factor

|                | Avg. events between hit stuck |
|----------------|-------------------------------|
| Random conf.   | 55k                           |
| Standard conf. | / (no stuck states)           |

@ 100 mil FF acceleration factor


## SEU fault injections in non-triplicated nodes







Figure: Standard Test SEU regression, readout events [%] (@ 70 mil FF acceleration factor).

#### **Pixel Array - Final Chip Estimates**

Scaling for the inner layer:

- Simulations: chip gets stuck every ~9k events
- Inner layer, at 1 MHz trigger rate: ~8 days

|                | Read Events<br>Faultsim | Read Events<br>Reference | Ghost Events | Lost Events |
|----------------|-------------------------|--------------------------|--------------|-------------|
| Random conf.   | 165k                    | 167k                     | 0            | 1%          |
| Standard conf. | 250k                    | 251k                     | 0            | 0.3%        |

@ 70 mil FF acceleration factor

|                | Avg. events between hit stuck |
|----------------|-------------------------------|
| Random conf.   | 9k                            |
| Standard conf. | 30k                           |

@ 70 mil FF acceleration factor
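As a cross-check of the pixel array numbers, the average events between stuck states in the table can be converted to days in the inner layer, assuming the 70 million acceleration factor and a 1 MHz trigger rate. The helper name is hypothetical.

```python
ACCELERATION = 70e6   # SEU acceleration factor per FF (pixel array)
TRIGGER_RATE = 1e6    # trigger rate [Hz]


def days_between_stuck(events_per_stuck_sim):
    """Convert simulated events between stuck states to days in the inner layer."""
    seconds = events_per_stuck_sim * ACCELERATION / TRIGGER_RATE
    return seconds / 86400.0


random_conf_days = days_between_stuck(9e3)     # ~7.3 days
standard_conf_days = days_between_stuck(30e3)  # ~24.3 days
```

The random configuration gives roughly 7 days, in line with the ~8 days quoted for the inner layer.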

## SEU fault injections in non-triplicated nodes



- Failure-rate uncertainty factors:
  - SEU faults are not evenly injected across all FFs (and latches): bias in the random generator
- Regressions with regular CLEAR cmd sending (confirms that chip always sends hit data after CLEAR)
- SEU fault coverage
  - Pixel array: >240 SEU/node (effectively 50 times higher, same node injected in all core columns)
  - Digital Bottom: >300 SEU/node
  - 2 weeks of running 30 parallel simulations are needed for the above coverage
- Resources management and simulation planning
  - several hours to simulate 1 hit readout stuck on BABY chip
  - for Full chip it would take several days to get 1 failure!
- Chip does not get stuck if hits are not injected in simulations (very important to have a high hit rate during beam tests)

## SET fault injections at gate level





RD53 TMR Time Skew

| SET width | Events to get stuck<br>(voter outputs) | Events to get stuck<br>(FF inputs) |
|-----------|----------------------------------------|------------------------------------|
| 100 ps    | 0                                      | 200k                               |
| 250 ps    | 19k                                    | 100k                               |
| 500 ps    | 4k                                     | 14k                                |

@ 5 mil acceleration

- SET: ideally injecting in all nets/pins
- Not feasible for RD53 (millions of gates)
- Fixed SET width (100 ps, 250 ps, 500 ps)
- Injections in voter outputs
- Injections in FF inputs
- Injections in FF outputs
- SET simulations are still running for the final chip

## **SEE Verification Key Takeaways**



- The SEE verification approach is defined by the SEE tolerance requirements of the DUT. Verification of a huge, complex design with tolerated and accepted SEE failures completely differs from verification of a relatively small design with zero fault tolerance
- The more the SEE tolerance requirements are relaxed, the more complicated debugging of SEE failures gets (several days of waveform debugging to understand the conditions of one hit readout stuck failure)
- For tolerated failures, the failure rate in a real system needs to be estimated based on verification results
- Understanding the conditions under which SEE failures occur and communicating these conditions to the testing team is essential (hit rates, trigger rate, a specific feature enabled, ...)

## Part III

## Analog Chip Bottom: SET-Induced Link Dropouts in Preproduction Chips and TPA Laser Testing

## Link dropouts in the test beams





- Monitoring 640 MHz clock (PLL output clock is divided by 2 and routed to the chip output)
- Readout link dropouts during ion and proton beam tests caused frequent DAQ-chip de-sync and event readout loss.
- Estimated time between link dropouts in the inner layer (based on the ion beam measurements): 0.2 s

## **Two-Photon Absorption laser**

- A single node injection
- Near-infrared imaging
- Spatial and temporal resolution
- Beam focus through the substrate
- Charge collection only at the beam focus





- Pulse duration 430 fs
- Pulse energy up to 2.2 nJ
- 2 TPA systems were used for the RD53 testing.







RD53 test card preparation and chip bonding for the TPA tests.

## Link Dropouts: Root Cause Analysis


#### Briefly on the analog chip bottom and powering:



The core bandgap generates the main reference current Iref. Iref is further used for generating analog and digital voltage references (VrefA and VrefD).



The Clock and Data Recovery (CDR) circuit is powered by analog voltage (2\*VrefA).



Core bandgap reference circuit; the 3 marked transistors were found to be SET sensitive (chip thickness 250 um).



VrefA voltage drop induced by the TPA laser beam shooting into any of the 3 sensitive transistors: 200 mV voltage drop, 20 us transient. A much bigger effect than we had expected.

## **Analog SET Simulations**



#### SET analog simulations were done with a simulation tool from Sevilla.



- Voltage drop issues seen in testing were reproduced in simulations
- SET-compensating transistors were added (VGATE1 to GNDA, VGATE2 to GNDA)
- SET simulations after design hardening confirm that the SET sensitivity is mitigated
- RD53C is now expected to have a much lower link-dropout cross-section

- Tool used for characterization of analog IPs
- SET sensitivity in the LVDS circuit was discovered and fixed
- This was later reproduced in TPA testing
- TPA testing confirmed that the hardened LVDS design is SET robust



## SET-sensitivity map of analog circuits



- Identify an output signal for monitoring and a reference signal used for comparison
  - VCO example: SER\_CLK/2 and a 640 MHz reference clock from an FPGA
- Set optimal values for pulse energy, repetition rate, and spatial resolution
  - VCO example: (1.2 nJ, 5 Hz, 0.5 um)
- Record the circuit response to each laser pulse
  - VCO example: 2 us of monitored clocks
- Analyze the recorded data and build the SEE map
  - VCO example: find the maximal phase difference deviation between the 2 monitored clocks for all saved laser injections and assign each value to a scanned circuit point
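The last step of the procedure can be sketched in Python. This is a minimal illustration under stated assumptions, not the analysis code used for RD53: the clocks are represented by lists of edge timestamps, the undisturbed phase difference is taken as the median of the edge-to-edge differences, and the data layout (`shots_by_point`) and function names are hypothetical.

```python
def max_phase_deviation(monitored_edges, reference_edges):
    """Largest deviation of the edge-to-edge phase difference from its median."""
    diffs = sorted(m - r for m, r in zip(monitored_edges, reference_edges))
    median = diffs[len(diffs) // 2]  # assumed undisturbed phase difference
    return max(abs(d - median) for d in diffs)


def build_see_map(shots_by_point):
    """Map each scanned point to its worst phase deviation.

    shots_by_point: {(x_um, y_um): [(monitored_edges, reference_edges), ...]}
    """
    return {
        point: max(max_phase_deviation(m, r) for m, r in shots)
        for point, shots in shots_by_point.items()
    }


# Edge times in arbitrary integer units for a small example.
reference = [0, 10, 20, 30, 40]
quiet = [1, 11, 21, 31, 41]       # constant phase difference
disturbed = [1, 11, 23, 32, 41]   # phase jumps after a laser pulse, then recovers
see_map = build_see_map({
    (0.0, 0.0): [(quiet, reference)],
    (0.5, 0.0): [(quiet, reference), (disturbed, reference)],
})
```

Plotting the resulting values against the scan coordinates gives the sensitivity map of the block.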







Example of the VCO output clock phase deviation



## **TPA** Testing of the VCO





A deviated phase difference is always corrected by the VCO circuit.

SET sensitivity of the VCO circuit can cause one/two-bit transmission errors. DAQ needs to be capable of correcting this.

## **TPA Study Key Takeaways**



- TPA laser testing should be used to determine the root cause of a problem that has already been identified during a test beam
- TPA was successfully used for SET hardening of the Core Bandgap circuit in the final RD53 design.
- TPA study of critical analog blocks to be used by DAQ developers to optimize a DAQ receiver for expected bit-transmission errors.
- TPA study has enabled hardening of RD53 final chip and confirmed a solid SET-robustness of analog chip bottom.

## Conclusions



- SEE design hardening, verification, and testing played a crucial role in shaping the development timeline of RD53 chips
- Making a design robust when only a small part of it can be protected is a big challenge
- Understanding the design hardening and verification approach used for the RD53 chips requires an understanding of the chip's complex architecture and system requirements
- Very reassuring SEE estimates for the final RD53C chips
- Currently
  - Awaiting RD53C-ATLAS production wafers
  - RD53C-CMS: in the final development stage

# THANK YOU

## More about RD53



- RD53B manual, CERN-RD53-PUB-19-002 (2019), http://cds.cern.ch/record/2665301
- RD53B users guide, https://cds.cern.ch/record/2754251

## BACKUP

## TMR mitigation schemes and their effective gain



config. memory) are randomly distributed (no global signal effects)

#### Heavy ion testing:






## Voltage references and clock generation





- CDR circuit (PLL based): recovers the 160 MHz clock from the 160 Mb/s CMD input and generates from it all clocks needed inside the chip
  - Voltage Controlled Oscillator (VCO): essential part of the CDR; generates the output clock
- Other SET-critical blocks: CMD receiver (LVDS), CML driver(s), serializer(s)

## **TPA Testing of the CDR-CML Bias DACs**





## Differential Receiver in the Final Chip: SET Robust





#### Test:

- 2 us of the receiver input (sent by DAQ) and 2 us of the serial CMD input straight from the differential receiver (routed to GP\_LVDS)
- Input CMD stream: sync and trigger commands

*New receiver with 600 mV common mode: much more SEE robust!* No voltage drops or other effects of the laser pulse are found in the signal waveforms.

## TPA Testing of the RD53 preregulator



The preregulator is divided into 2 sub-blocks and each is TPA-scanned separately. SET robust.




## **TPA** Testing of the Clock and Data Recovery (CDR) circuit



| Block                          | Lower left corner | Upper right corner | Resolution, pulse energy |
|--------------------------------|-------------------|--------------------|--------------------------|
| CDR CORE                       | 10320, 440        | 10520, 700         |                          |
| PD                             | 10330, 590        | 10415, 625         | 1 um, 1.2 nJ             |
| CP                             | 10415, 585        | 10455, 620         | 1 um, 1.2 nJ             |
| CP FD                          | 10415, 623        | 10455, 660         | 1 um, 1.2 nJ             |
| LPF                            | 10520, 445        | 10550, 690         | 1 um, 1.2 nJ             |
| VCO (the only sensitive block) | 10415, 470        | 10480, 580         | 0.9 um, 1.2 nJ           |
| VCO DIG BUF                    | 10415, 440        | 10505, 470         | 0.5 um, 1.2 nJ           |
| DIV                            | 10330, 430        | 10415, 590         | 1 um, 1.2 nJ             |
| CNT                            | 10330, 430        | 10370, 525         | 1 um, 1.2 nJ             |
| ov                             | 10490, 550        | 10510, 580         | 0.5 um, 1.2 nJ           |





#### Without a major circuit redesign, the sensitivity of the VCO can't be eliminated.

The phase error will always be corrected by the circuit, and its SEE sensitivity can cause only one- or two-bit transmission errors (for a 1.2 nJ laser pulse, which is already extreme).

No bigger impact of these effects on the chip behavior is expected. The DAQ needs to be capable of handling this bit loss.

Other parts of the CDR core didn't show SEE sensitivity.

| Scaling of the H | High Rate | Hit Readout | Stuck |
|------------------|-----------|-------------|-------|
|------------------|-----------|-------------|-------|

| Trig_state                    | <b>300 K FFs</b><br>300k (FFs) * 1.5E-14 cm2 (HEH crosssection) * 1GHz/cm2 (HEH rate) = 4.5Hz                             |
|-------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| (from idle to toRead )        | Sensitive time per event: 9*25ns = 225 ns<br>Fraction of time column readout active at 1MHz trigger: 225ns/1000ns = 0.225 |
|                               | 4.5 Hz * 0.225 * 0.998 (assuming 10 pixels per cc not in idle state ) $ 1HZ$                                              |
| Start_state                   | <b>300 K FFs</b><br>300k (FFs) * 1.5E-14 cm2 (HEH crosssection) * 1GHz/cm2 (HEH rate) = 4.5Hz                             |
| (from triggered to<br>toRead) | Sensitive time per event: 9*25ns = 225 ns<br>Fraction of time column readout active at 1MHz trigger: 225ns/1000ns = 0.225 |
|                               | 4.5 Hz * 0.225 * 0.002 (assuming 10 pixels per cc in the triggered state ) 0.002 HZ                                       |
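The two rows of this scaling follow the same arithmetic, which can be reproduced with a short Python sketch. All numeric inputs are the ones from the table; the names and the `state_fraction` parameter (0.998 for the Trig_state row, 0.002 for the Start_state row) are illustrative.

```python
N_FF = 300e3               # number of state FFs
FF_HEH_XSEC = 1.5e-14      # FF HEH cross-section [cm^2]
HEH_RATE = 1e9             # HEH rate in the inner layer [1/(cm^2 s)]
SENSITIVE_CYCLES = 9       # sensitive clock cycles per event
CYCLE_NS = 25.0            # clock period [ns]
TRIGGER_PERIOD_NS = 1000.0 # 1 MHz trigger rate


def stuck_rate_hz(state_fraction):
    """Rate of upsets that hit a sensitive FF inside the sensitive window."""
    upset_rate = N_FF * FF_HEH_XSEC * HEH_RATE                 # 4.5 Hz
    window = SENSITIVE_CYCLES * CYCLE_NS / TRIGGER_PERIOD_NS   # 0.225
    return upset_rate * window * state_fraction


trig_state_rate = stuck_rate_hz(0.998)   # ~1 Hz
start_state_rate = stuck_rate_hz(0.002)  # ~0.002 Hz
```

The Trig_state contribution dominates and corresponds to roughly one stuck state per second per chip.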


## SEUs in the Digital Chip Bottom



One handshaking signal was left unprotected in the preproduction chips. This is fixed for the final chip. This example demonstrates the importance of good SEE coverage per node. This problem was found after having >40 SEU/node.



SEUs in the periphery: 1 unprotected FF (per core column) was identified in the encoder that can cause the hit readout stuck state if flipped at a specific time. Low cross-section. Triplicated now.



## SET fault injections at gate level



#### SET injections in voter outputs:

