# SEU effects in FPGA: How to deal with them?

Csaba Soos, PH-ESE-BE

# Outline

- Introduction
  - Radiation environment (LHC), definitions
- SEE in FPGA devices
  - Impact on device resources
- SEU testing
- Mitigation techniques
  - SM encoding, memory protection, reconfiguration, TMR etc.
- Commercial FPGAs
  - SRAM-based FPGAs, flash-based FPGAs, antifuse FPGAs
- Applications

## **Radiation environment**



- Beam-beam interactions (near IPs)
- Beam residual gas interactions
- Beam losses

## **Radiation environment**



Comparison between Space environment and the CMS at the LHC *Source: F. Guistino's PhD thesis* 

# Single Event Effects (SEE)

Heavy ion striking a transistor and creating charge along its path



# Single Event Effects (SEE)

- Single Event Upset (SEU)
  - State change caused by the charge collected at a sensitive circuit node, when it exceeds the critical charge (Q<sub>ct</sub>)
  - For each device there is a critical LET
- Single Event Functional Interrupt (SEFI)
  - Special SEU, which affects one specific part of the device and causes the whole device to malfunction
- Single Event Latch-up (SEL)
  - A parasitic PNPN structure (thyristor) gets triggered and creates a short between the power lines
- Single Event Gate Rupture (SEGR)
  - Destruction of the gate oxide in the presence of a high electric field during irradiation (e.g. during EEPROM write)

# **Definitions and Units**

- Flux: rate at which particles impinge upon a unit surface area, given in particles/cm<sup>2</sup>/s
- Fluence: total number of particles that impinge upon a unit surface area for a given time interval, given in particles/cm<sup>2</sup>
- Total dose, or radiation absorbed dose (rad): amount of energy deposited per unit mass of the material (1 Gy = 1 J/kg = 100 rad)

# **Definitions and Units**

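- Linear Energy Transfer (LET): the mass stopping power of the particle, i.e. the energy lost per unit path length divided by the material density, given in MeV·cm<sup>2</sup>/mg (often written MeV/(mg/cm<sup>2</sup>))
- Cross-section (σ): the effective sensitive area, i.e. the number of upsets divided by the particle fluence; it expresses the probability that a particle flips a bit and is given in cm<sup>2</sup>/bit or cm<sup>2</sup>/device
- Failure in time (FIT) rate, i.e. the number of failures in 1 billion hours:
  FIT/Mbit = Cross-section [cm<sup>2</sup>/bit] \* Particle flux [particles/cm<sup>2</sup>/hour] \* 10<sup>6</sup> \* 10<sup>9</sup>
- Mean Time Between Functional Failure: MTBFF = SEUPI \* [1/(Bits \* Cross-section \* Particle flux)], where SEUPI is the SEU Probability Impact derating factor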
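The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample cross-section are assumptions, and the flux is taken per cm<sup>2</sup> per hour so that the result comes out in failures per 10<sup>9</sup> hours.

```python
def fit_per_mbit(sigma_cm2_per_bit, flux_per_cm2_per_hour):
    """Failures per 1e9 hours for 1e6 bits (FIT/Mbit)."""
    return sigma_cm2_per_bit * flux_per_cm2_per_hour * 1e6 * 1e9


def mtbff_hours(bits, sigma_cm2_per_bit, flux_per_cm2_per_hour, seupi=10.0):
    """Mean time between functional failures, in hours.

    seupi is the derating factor: only about 1 in `seupi` configuration
    upsets is assumed to disturb the design.
    """
    upsets_per_hour = bits * sigma_cm2_per_bit * flux_per_cm2_per_hour
    return seupi / upsets_per_hour


# Illustrative numbers (assumed, not taken from a datasheet)
sigma = 1e-14          # cm^2/bit
flux = 14.0            # n/cm^2/hour
fit = fit_per_mbit(sigma, flux)
mtbff = mtbff_hours(20e6, sigma, flux, seupi=10.0)
print(f"{fit:.0f} FIT/Mbit, MTBFF = {mtbff:.3g} hours")
```

Keeping the flux unit explicit matters: feeding a flux in particles/cm<sup>2</sup>/s into the same formula would understate the FIT by a factor of 3600.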

# Failure rate calculation

### Example:

- FIT/Mb = 100
- Configuration size = 20 Mb
- FIT = FIT/Mb \* Size = 2000,

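i.e. 2000 errors are expected in 1 billion hours. (Note: the flux assumed above is 14 n/cm<sup>2</sup>/hour, which corresponds to a fluence of $14 \times 10^{9}$ n/cm<sup>2</sup> in 1 billion hours.)

Expected fluence: $3 \times 10^{10}$ n/cm<sup>2</sup> in 10 years

Number of errors in 10 years = $2000 \times (3 \times 10^{10}) / (14 \times 10^{9}) \approx 4286$

Taking into account the SEUPI factor of 10:

Number of errors in 10 years = 4286 / 10 ≈ 429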
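The scaling used in this example can be written out as a short Python sketch. It assumes that the FIT figure refers to a reference flux of 14 n/cm<sup>2</sup>/hour, so that 1 billion hours correspond to a fluence of 14 × 10<sup>9</sup> n/cm<sup>2</sup>; the function and constant names are illustrative.

```python
REFERENCE_FLUX = 14.0   # n/cm^2/hour, flux at which the FIT value is quoted
FIT_HOURS = 1e9         # FIT counts failures in 1e9 hours


def expected_errors(fit_per_mb, size_mb, fluence, seupi=1.0):
    """Scale a FIT figure to an expected fluence (n/cm^2)."""
    device_fit = fit_per_mb * size_mb                 # errors per 1e9 hours
    reference_fluence = REFERENCE_FLUX * FIT_HOURS    # fluence seen in 1e9 hours
    return device_fit * fluence / reference_fluence / seupi


raw = expected_errors(100, 20, 3e10)                  # all configuration upsets
derated = expected_errors(100, 20, 3e10, seupi=10)    # upsets that affect the design
print(round(raw), round(derated, 1))
```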

## Failure rate calculation

### ALICE Detector Data Link:

- Fluence (10 years):  $F = 3.9 \times 10^{11} \text{ n/cm}^2$
- Cross-section:  $\sigma = 8.2 \times 10^{-13} \text{ cm}^2/\text{LC}$  (i.e. per logic cell)
- # of configuration errors per LC:  $F \times \sigma = 0.32$  error/LC
- # of LCs in the design : 2500
- # of configuration errors per device: 2500 x 0.32 = 800

In other words, ~1 error per hour in one of the 400 link cards
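The chain of numbers above can be reproduced as follows. The slide does not say how the per-hour figure was obtained; the derating factor and the operating time used below are assumptions, chosen only to show that the quoted order of magnitude is plausible.

```python
fluence_10y = 3.9e11      # n/cm^2 over 10 years (from the slide)
sigma_per_lc = 8.2e-13    # cm^2 per logic cell (from the slide)
logic_cells = 2500        # logic cells used by the design
cards = 400               # number of link cards

errors_per_lc = fluence_10y * sigma_per_lc          # about 0.32
errors_per_device = logic_cells * errors_per_lc     # about 800
errors_all_cards = cards * errors_per_device        # about 320 000

# Assumptions, NOT from the slide:
ASSUMED_SEUPI = 10                         # 1 in 10 upsets affects the design
ASSUMED_HOURS = 10 * 1e7 / 3600.0          # ten years of 1e7 s of beam each

errors_per_hour = errors_all_cards / ASSUMED_SEUPI / ASSUMED_HOURS
print(f"{errors_per_lc:.2f} per LC, {errors_per_device:.0f} per device, "
      f"{errors_per_hour:.1f} per hour over all cards")
```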

#### SEE in FPGA devices

## Sample FPGA architecture



# **FPGA logic cell and routing**



## **Sensitive FPGA resources**

### Configuration memory

- It defines the logic functions (LUT) and the routing
- Large devices contain several megabits of configuration memory
- Large fraction of this memory is not used by a design (SEU Probability Impact, SEUPI)

### User logic

- User RAM, flip-flops
- Additional FPGA resources (JTAG, POR etc.)
  - Single-event Functional Interrupt (SEFI)

# **Configuration memory vs. SRAM**

- Configuration memory is more robust
  - Size constraints are not the same; SRAM cells must be smaller, hence more sensitive
  - Configuration memory is based on a static latch
- Configuration memory has higher critical charge
  - Configuration memory does not have to be fast
  - Manufacturers can improve the design (e.g. by maximizing the capacitive load)
- However, there are many more configuration memory cells in the device, so the overall chance of an upset is higher
- Embedded RAMs follow the standard manufacturing trends, but they can be protected by ECC (or other techniques)

# SEU in configuration memory

- May change the programmed combinatorial logic by rewriting the LUT
  - e.g. A & B => A & !B
- May create an internal open or short circuit (this will not damage the device)
  - e.g. Q = GND or 'floating'
- May have no impact on the device operation (don't care configuration cell)
  - A derating factor of 10 is a good (pessimistic) choice (it can be as high as 100!)
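To make the first point concrete, here is a minimal model of a 2-input LUT in Python, assuming the inputs simply address a 4-bit truth table held in configuration memory. The helper names are hypothetical. In this simplified model a single flipped bit turns A & B into a different function (here plain A); the A & !B case mentioned above would correspond to two truth-table bits changing.

```python
from itertools import product


def lut_eval(truth_table, a, b):
    """2-input LUT: the inputs select one configuration bit."""
    return truth_table[(a << 1) | b]


def flip(truth_table, index):
    """Return a copy of the truth table with one bit upset."""
    upset = list(truth_table)
    upset[index] ^= 1
    return upset


and_lut = [0, 0, 0, 1]          # programmed function: A & B
upset_lut = flip(and_lut, 2)    # SEU in the cell addressed by A=1, B=0

for a, b in product((0, 1), repeat=2):
    print(a, b, lut_eval(and_lut, a, b), lut_eval(upset_lut, a, b))
```

The upset table no longer depends on B at all, and the logic stays wrong until the configuration bit is rewritten.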

# SEU in user logic



User RAM (static)


### SEU Testing

### Rosetta experiment

- Real-time experiment with atmospheric neutrons
  - Link between accelerated testing (proton or neutron) and the real effects of atmospheric neutrons
- Experimental sites at different locations and at different altitudes
  - Sets of 100 devices are monitored constantly
  - Altitudes from -488 m to 4023 m
- Verification carried out using simulation and by tests done at the Los Alamos Neutron Science Center

# Rosetta experiment

| Family, process | CRAM, neutron @ 10 MeV (cm<sup>2</sup>) | BRAM, neutron @ 10 MeV (cm<sup>2</sup>) | CRAM, Rosetta atmospheric (FIT/Mb) | BRAM, Rosetta atmospheric (FIT/Mb) |
|-----------------|------------------------------------------|------------------------------------------|------------------------------------|------------------------------------|
| V2, 150 nm      | 2.50E-14                                 | 2.64E-14                                 | 401                                | 397                                |
| V2P, 130 nm     | 2.74E-14                                 | 3.91E-14                                 | 384                                | 614                                |
| S3, 90 nm       | 2.40E-14                                 | 3.48E-14                                 | 199                                | 390                                |
| V4, 90 nm       | 1.55E-14                                 | 2.74E-14                                 | 246                                | 352                                |
| S3E/A, 90 nm    | 1.31E-14                                 | 2.63E-14                                 | 108                                | 306                                |
| V5, 65 nm       | 6.67E-15                                 | 3.96E-14                                 | 151                                | 635                                |

Note: the configuration FIT/Mb does not include the SEUPI = 10 derating factor. Reference flux at NYC = 14 n/cm<sup>2</sup>/hour. Reminder: FIT = number of errors in 1 billion hours. Source: Xilinx

# **Accelerated testing**

- High-energy proton or neutron beam
  - proton: package shadowing and TID dependence
- Heavy-ion irradiation
- Static or dynamic testing
  - Configuration or application memory read back
  - Large shift-registers
- See for example: ATLAS policy
- Or consult the JEDEC JESD89 standards
  - JESD89A, JESD89-1A, JESD89-3A



### Mitigation techniques

# **Configuration management**



# **Reconfiguration: Altera**

- Built-in CRC error detection reports bit flips in the configuration memory
- Location information can help to filter out the 'don't care' changes and to act upon critical errors only
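The idea of detecting upsets by CRC and then filtering on location can be sketched in software. This is not the vendor's mechanism (the device computes its CRC in hardware); the frame layout, the `used_frames` set and the function names are all hypothetical.

```python
import zlib


def frame_crcs(frames):
    """Reference CRC of every configuration frame."""
    return [zlib.crc32(frame) for frame in frames]


def find_upset_frames(frames, golden_crcs):
    """Indices of frames whose CRC no longer matches the reference."""
    return [i for i, frame in enumerate(frames)
            if zlib.crc32(frame) != golden_crcs[i]]


def critical_upsets(upset_frames, used_frames):
    """Keep only upsets in frames that the design actually uses."""
    return [i for i in upset_frames if i in used_frames]


golden = [bytes([i] * 8) for i in range(4)]    # toy configuration image
golden_crcs = frame_crcs(golden)
used_frames = {0, 3}                           # assumed: frames 1 and 2 are "don't care"

live = [bytearray(frame) for frame in golden]
live[1][0] ^= 0x01                             # upset in an unused frame
live[3][5] ^= 0x10                             # upset in a used frame

upsets = find_upset_frames(live, golden_crcs)
critical = critical_upsets(upsets, used_frames)
print(upsets, critical)
```

Only the upset in frame 3 would trigger corrective action in this toy setup; the one in frame 1 is detected but can be ignored.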



# **Reconfiguration: Xilinx**

- Partial reconfiguration (scrubbing)
- The system remains fully operational
- Some parts of the device cannot be refreshed
  - e.g. half-latches
  - A full reconfiguration refreshes everything
- Combine with TMR to reduce the error rate
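A scrubbing pass can be pictured as rewriting the refreshable part of the configuration from a golden copy. The sketch below is a software analogy under assumed names (`scrub`, `refreshable`), not the Xilinx configuration interface; frame 3 stands in for state that partial reconfiguration cannot reach.

```python
def scrub(live_frames, golden_frames, refreshable):
    """Rewrite every refreshable frame from the golden copy.

    Returns the number of frames that actually differed (repaired upsets).
    Frames outside `refreshable` are left untouched.
    """
    repaired = 0
    for i, golden in enumerate(golden_frames):
        if i not in refreshable:
            continue
        if live_frames[i] != golden:
            repaired += 1
        live_frames[i] = bytearray(golden)
    return repaired


golden = [bytes([0xA0 + i] * 4) for i in range(4)]
live = [bytearray(frame) for frame in golden]
refreshable = {0, 1, 2}       # assumed: frame 3 cannot be refreshed

live[1][2] ^= 0x04            # upset in a refreshable frame
live[3][0] ^= 0x80            # upset in non-refreshable state

repaired = scrub(live, golden, refreshable)
print(repaired, live[1] == golden[1], live[3] == golden[3])
```

The error left in frame 3 is the kind that only a full reconfiguration clears, which is one reason scrubbing is combined with TMR.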



# **Triple modular redundancy (TMR)**

- It works if the SEU is confined to one of the triplicated modules or to the data path
- It fails if errors accumulate so that two of the three modules fail, or if the SEU hits the voter
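Both cases can be shown with a bitwise majority voter. This is a minimal sketch of the voting logic only; the values are arbitrary and the voter itself is assumed to be error-free.

```python
def majority(a, b, c):
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


correct = 0b1011
upset = correct ^ 0b0100                 # same word with one bit flipped

# One module upset: the two healthy copies outvote it.
masked = majority(correct, upset, correct)

# Errors accumulated in two modules (same bit): the vote is wrong.
failed = majority(upset, upset, correct)

print(bin(masked), bin(failed))
```

The second case is why TMR is usually paired with scrubbing or feedback of the voted value: the first upset has to be repaired before a second one arrives.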



# Functional TMR (FTMR)

- VHDL approach for automatic TMR insertion
- Configurable redundancy in combinatorial and sequential logic
- Resource increase factor: 4.5 to 7.5
- Performance decrease

Ref.: Sandi Habinc, http://microelectronics.esa.int/techno/fpga\_003\_01-0-2.pdf

# Improved TMR by Xilinx



Supported by the XTMR Tool from Xilinx

# **Multiple-Bit Upsets**



Ref.: H. Quinn et al., "Domain Crossing Errors: Limitations on Single Device Triple-Modular Redundancy Circuits in Xilinx FPGAs"

### **State-machines**

- Used to control sequential logic
- SEU may alter/halt the execution
- Encoding can be changed to improve SEU immunity (be careful with optimization)

| SM type   | Speed   | Resources | Protection |
|-----------|---------|-----------|------------|
| Binary    | Fast    | Smallest  | None       |
| One-hot   | Slow    | Large     | Poor       |
| Hamming 2 | Good    | Moderate  | Fair       |
| Hamming 3 | Slowest | Largest   | Good       |

Ref.: G. Burke and S. Taft, "Fault Tolerant State Machines", JPL
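The protection offered by an encoding is tied to the minimum Hamming distance between its state codes. The sketch below assumes that "Hamming 2" and "Hamming 3" in the table refer to encodings with minimum distance 2 and 3; the example code words are illustrative and not taken from the reference.

```python
from itertools import combinations


def min_hamming_distance(codes):
    """Smallest number of differing bits between any two state codes."""
    return min(bin(x ^ y).count("1") for x, y in combinations(codes, 2))


# Four-state examples (illustrative code words)
binary = [0b00, 0b01, 0b10, 0b11]
one_hot = [0b0001, 0b0010, 0b0100, 0b1000]
hamming2 = [0b000, 0b011, 0b101, 0b110]            # binary plus a parity bit
hamming3 = [0b00000, 0b00111, 0b11001, 0b11110]

for name, codes in [("binary", binary), ("one-hot", one_hot),
                    ("Hamming 2", hamming2), ("Hamming 3", hamming3)]:
    print(name, min_hamming_distance(codes))
```

With distance 1 a single upset lands in another valid state and goes unnoticed; with distance 2 it produces an invalid code that can be detected; with distance 3 the nearest valid code is still the original state, so the upset can be corrected. These properties hold only if synthesis does not optimize the redundant bits away.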

### User memory

- Very sensitive resource
  - Optimized for speed/area -> Low Q<sub>ct</sub>
- Errors can easily accumulate
- Mitigation
  - Parity, ECC, EDAC, TMR, scrubbing
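As an illustration of ECC protection, here is a Hamming(7,4) single-error-correcting code in Python. It is a textbook sketch rather than the scheme of any particular FPGA memory block; real EDAC cores typically use wider SECDED codes.

```python
def hamming74_encode(data):
    """Encode 4 data bits into 7 bits (parity at positions 1, 2 and 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(word):
    """Return (data bits, syndrome); a non-zero syndrome is the 1-based
    position of the single bit that was corrected."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 | (s2 << 1) | (s4 << 2)
    if syndrome:
        w[syndrome - 1] ^= 1
    return [w[2], w[4], w[5], w[6]], syndrome


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1                                  # SEU in one memory bit
data, syndrome = hamming74_correct(stored)
print(data, syndrome)
```

Correction alone does not stop errors from accumulating: the corrected word has to be written back (scrubbing) before a second upset hits the same word.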





### Commercial FPGAs

# Altera HardCopy devices

- An SRAM-based FPGA is used as the prototype
  - Using a HardCopy-compatible FPGA ensures that the ASIC always works
- Design is seamlessly converted to an ASIC
  - No extra tool/effort/time needed
- Increased SEU immunity and lower power ☺
- Expensive ☹ and not reprogrammable ☹
  - We lose the biggest advantage of the FPGA

## **Xilinx Aerospace Products**



- Virtex-4 QPro V-grade
  - Total-dose tolerance at least 250 krad
  - SEL immunity up to LET > 100 MeV·cm<sup>2</sup>/mg
  - Characterization report (SEU, SEL, SEFI): http://parts.jpl.nasa.gov/docs/NEPP07/NEPP07FPGAv4Static.pdf
- Expensive ☹, but reprogrammable ☺

# Xilinx's SIRF products

- SIRF: Single-event Immune Reconfigurable FPGA
- Radiation hardened by design (RHBD)
- Design goals:
  - Total dose > 300 krad
  - SEL immune > 100 MeV·cm<sup>2</sup>/mg
  - SEU rate < 1E-10 errors/bit-day
  - SEFI rate < 1E-10 errors/bit-day
- It will certainly be expensive ☹

# Actel ProASIC3 FPGA

- Flash-memory based configuration
- 0.13 micron process
- SEL free<sup>1</sup>
- SEU immune configuration<sup>1</sup>
- Heavy-ion cross-sections (saturation)
  - 2E-7 cm<sup>2</sup>/flip-flop
  - 4E-8 cm<sup>2</sup>/SRAM bit
- Total dose
  - Up to 15 krad (some issues above)
- Not expensive ☺ and reprogrammable ☺

Note 1: Tested at LET = 96 MeV·cm<sup>2</sup>/mg





# **Actel Antifuse FPGA**

- Non-volatile antifuse technology (OTP)
- 0.15 micron process
- SEU immune configuration
- SEU hardened (TMR) flip-flop
- Heavy-ion cross-section (saturation)
  - 9E-10 cm<sup>2</sup>/flip-flop
  - 3.5E-8 cm<sup>2</sup>/SRAM bit (w/o EDAC)
- Total dose
  - Up to 300 krad
- Expensive ☹ and not reprogrammable ☹





### Applications

## **ALICE TPC Readout Control Unit**

- Measured cross-section (Xilinx FPGA): 2.8E-9 cm<sup>2</sup>/device
- Expected flux: 100 to 400 p/cm<sup>2</sup>/s
- Number of boards (i.e. FPGA devices): 216
- Expected SEFI in 4 hours: 3.5 failures
- It is at the limit of what can be tolerated
- Active Partial Reconfiguration has been implemented

Ref.: K. Røed et al., "Irradiation tests of the complete ALICE TPC Front-End Electronics chain"
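The expected number of failures follows directly from the figures above: cross-section × flux × number of devices × time. A short Python check (the function name is illustrative) shows that the quoted 3.5 failures corresponds to the upper end of the flux range.

```python
def expected_failures(sigma_cm2_per_device, flux_per_cm2_per_s, devices, hours):
    """Expected number of failures over `hours` for all devices together."""
    seconds = hours * 3600.0
    return sigma_cm2_per_device * flux_per_cm2_per_s * devices * seconds


SIGMA = 2.8e-9     # cm^2/device, measured cross-section
BOARDS = 216       # one FPGA per board

low = expected_failures(SIGMA, 100.0, BOARDS, 4)    # lower flux estimate
high = expected_failures(SIGMA, 400.0, BOARDS, 4)   # upper flux estimate
print(f"{low:.2f} to {high:.2f} failures in 4 hours")
```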

### ALICE TPC RCU Active reconfiguration

- Both the DCS and the RCU boards can malfunction because of radiation effects in their FPGAs
- Simply reloading the configuration data causes downtime (interruption of the data flow) and is therefore not applicable to the RCU board
  - Instead, an active error detection and reconfiguration scheme is used, based on an FPGA that can refresh the firmware without interrupting operation

#### Active Partial Reconfiguration "scrubbing"



Figure labels: Altera FPGA with ARM CPU, Xilinx XC2VP4 FPGA, FLASH memory, Linux

### ALICE TPC RCU Test results



Test carried out by G. Tröger, KIP

# **ALICE DDL Source Interface Unit**

- Prototype design (Altera FPGA)
  - Expected failure rate: ~1 failure per hour across the 400 SIU cards
- This was not accepted
  - Every time there is a failure, the run needs to be restarted
- Several mitigation techniques were discussed
  - Reconfiguration => complex board design, size constraints
- Design has been migrated to flash-based FPGA
  - No configuration loss
  - TID tolerance meets the requirements

*Read more at: http://cern.ch/ddl/radtol*

## Summary

- Make sure you understand the requirements
  - Simulation of the environment is essential
- Try to select the components/technologies
  - Pay attention to the requirements
- Test your components
  - Look around, you may find some information about the selected components
- Try to assess the risk
  - SEU may not be critical, or it can be catastrophic
- Mitigate
- Verify

## **Additional documentation**

### Radiation hardness assurance

Link: http://lhcb-elec.web.cern.ch/lhcb-elec/html/radiation\_hardness.htm

### Report on "Suitability of reprogrammable FPGAs in space applications" by Sandi Habinc, Gaisler Research

Link: http://microelectronics.esa.int/techno/fpga\_002\_01-0-4.pdf

# Thank you!

2/6/2009

R2E Radiation School: SEU effects in FPGA

## Spare slides

## **TID trends**



\*See "CMOS Scaling, Design Principles and Hardening-by-Design Methodologies" by Ron Lacoe, Aerospace Corp., 2003 IEEE NSREC Short Course

## **Typical cross-section curve**



# Half-latches (Xilinx)



- Half-latches are used across the device to drive constants
- Upset in the pull-up can change the state of the inverter
- Partial configuration cannot restore the original state
  - Latch can recover, after several seconds, due to the leakage of the pull-up transistor
- Mitigation requires the removal of the half-latches

# **Typical workflow**



# **CMS mitigation example**



#### Radiation Test Results (63.3 MeV Protons)



CSC Trigger Motherboard (TMB)

CMS CSC ESR at CERN November 6, 2003 by J. Hauser
