#### Soft Error Rate Estimations of the Kintex-7 FPGA within the ATLAS Liquid Argon (LAr) Calorimeter

TWEPP 2013, 23-27 Sept, 2013, Perugia, Italy

<u>Helio Takai</u>, Brookhaven National Laboratory, Upton, New York, USA

Michael Wirthlin and Alex Harding, Brigham Young University, CHREC, Provo, Utah, USA







## Thanks to collaborators:

- Mauro Citterio, INFN Milan
- Austin Lesea, Xilinx
- Luis Hervas, CERN
- Ketil Røed, University of Oslo













#### **Question: Are FPGAs Suitable for LAr?**

- Liquid Argon Calorimeter High Luminosity Upgrade/Phase I
  - Currently using custom ASIC
    - Collect data from ADCs
    - Transfer data with optical links
  - Additional processing and flexibility needed
    - Higher data throughput (higher luminosity)
    - Simple, high throughput data processing
- FPGAs considered as replacement for some ASIC logic
  - Flexibility (through reconfiguration)
  - Perform low-level processing



[Figure: ATLAS detector cutaway. Legible labels: Toroid Magnets, Solenoid Magnet, SCT Tracker, Pixel Detector, TRT Tracker]







## Xilinx Kintex7

- Commercially available FPGA
  - 28 nm, low power programmable logic
  - High-speed serial transceivers (MGT)
  - High density (logic and memory)
- Built-In Configuration Scrubbing
  - Support for Configuration Read back and Self-Repair
  - Auto detect and repair single-bit upsets within a frame
  - SEU Mitigation IP for correcting multiple-bit upsets
- Proven mitigation techniques
  - Single-Event Upset Mitigation (SEM) IP
  - Configuration scrubbing
  - Triple Modular Redundancy (TMR)
  - Fault tolerant Serial I/O State machines
  - BRAM ECC Protection
- Demonstrated success with previous FPGA generations in space
  - Virtex, Virtex-II, Virtex-4, Virtex-5QV



<u>Kintex7 325T</u>

- 407,600 user FFs
- 326,080 logic cells
- 840 DSP slices
- 445 block RAMs (16.4 Mb)
- 16 transceivers at 12.5 Gb/s





# **Challenge: Radiation Background**

- Total Ionizing Dose (TID)
- Single Event Upsets (SEU)
  - Configuration memory: determines the logic/routing of the design
  - Block Memory: used by circuits for temporary storage and buffering
- Single Event Transient (SET)
  - Impact user FF state
- Single Event Latch-up (SEL)
  - High-current state caused by parasitic bipolar short from power to ground
- Single Event Functional Interrupt (SEFI)
  - Single event that interrupts normal operation of the FPGA
    - Power-On Reset (POR): requires reconfiguring the device
    - SelectMAP, FAR Global Signal, Readback, and Scrub SEFIs
- Can the FPGA operate with high availability in the presence of single event upsets (SEUs)?







# **Kintex7 Testing Goals and Plans**

- Estimate FPGA upset rate within ATLAS LAr
  - Determine static cross section of configuration (neutrons and protons)
  - Understand LAr environment (energy spectrum)
  - Model upset rate (used to direct appropriate mitigation methods)
- Estimate device lifetime within LAr
  - Measure/Estimate Total Ionizing Dose of Kintex7
  - Look for unknown/unexpected failure mechanisms
- Single Event Functional Interrupts (SEFI)
  - Observe and measure Kintex7 SEFI modes
  - Verify SEFI detection and response methods
- Validate Mitigation Methods
  - Self Scrubbing and neighborhood watchdog
  - TMR and other detection methods
  - I/O: High Speed MGTs and Conventional I/O





# **Kintex7 Radiation Testing**

- LANSCE, Los Alamos, NM, USA
  - October 9-16, 2012 (ICE house)
    - White spectrum neutrons
    - 12 hours of testing (5.7E10 neutrons)
  - Estimate neutron BRAM/CRAM cross section
    - 10446 Configuration upsets (6.89E-15)
    - 2252 BRAM upsets (6.15E-15)

- H4IRRAD, Geneva, Switzerland
  - November 15-19, 2012 (West Area beam stop)
    - 50 hours of testing (1.8E9 hadrons)
  - Estimate "environment" cross section
    - 1857 Configuration upsets (1.5E-14)
    - 432 BRAM upsets (1.4E-14)
- TSL, Uppsala, Sweden
  - May 15-18, 2013
    - High Energy Protons (180 MeV), White Spectrum Neutrons
  - Estimate proton cross section
    - Correlate cross section estimates with LANSCE
  - Validate scrubber and TMR











## **Kintex7 Radiation Testing**

- Texas A&M, College Station, TX, USA
  - September 6, 2013
    - Heavy Ion Testing (Nitrogen, Xenon, Argon)
    - 16 hours of testing
  - Single Event Latchup (SEL) Testing
  - Wide range LET testing
    - Space Rate Upset estimation
  - Results being evaluated

- LANSCE, Los Alamos, NM, USA
  - September 17-24, 2013 (ICE House)
    - Over 30 hours of neutron testing
  - Mitigation Validation
    - Enhanced scrubber testing
    - Multi-Gigabit Transceiver Testing
    - TMR validation
  - Results being evaluated









## **Static Cross Section Results**

- Large amount of data on CRAM/BRAM static cross section
  - Neutrons: LANSCE and TSL (>5.7E10 neutrons)
  - Protons: TSL (1.3E13 protons)
  - Mixed field: H4IRRAD (1.8E9 hadrons)

| Facility             | CRAM (cm<sup>2</sup>/bit) | BRAM (cm<sup>2</sup>/bit) |
|----------------------|----------|----------|
| LANSCE (WS Neutron)  | 6.89E-15 | 6.15E-15 |
| CERN H4 (HE Hadron)  | 1.50E-14 | 1.40E-14 |
| TSL (180 MeV Proton) | 8.29E-15 | 8.19E-15 |
| TSL (WS Neutron)     | 6.55E-15 | N/A      |

**Cross Section Measurements** 





### LAr Static Cross Section Estimation

The experimentally determined cross section is defined as:

 $\sigma_{SEU} = \sigma_0 \frac{\int_{E_0}^{\infty} w(E) \frac{dn}{dE} dE}{\int_{E_{thresh}}^{\infty} \frac{dn}{dE} dE} \quad \text{where} \quad w(E) = 1 - \exp\left(-\left(\frac{E - E_0}{W}\right)^{\alpha}\right)$ 

Here dn/dE is the particle energy spectrum and E<sub>thresh</sub> is the energy above which the fluence is counted. σ<sub>0</sub> can be determined from the experimental measurement if w(E) is known. We use three different parameterizations for w(E) and calculate the range of results.

| $E_0$           | $W$              | $\alpha$            | $\sigma_0$ (BRAM)      | $\sigma_0$ (CRAM)      |
|-----------------|------------------|---------------------|------------------------|------------------------|
| $(0.5 \pm 0.4)$ | $(63.6 \pm 4.6)$ | $(0.986 \pm 0.038)$ | $7.98 \times 10^{-15}$ | $7.13 \times 10^{-15}$ |
| 4               | 80               | 0.586               | $8.74 \times 10^{-15}$ | $7.80 \times 10^{-15}$ |
| 1               | 20               | 1.546               | $6.24 \times 10^{-15}$ | $5.57 \times 10^{-15}$ |

B. Bergmann et al., 2013, "Time of flight measurement of fast neutron interactions in silicon by means of Timepix detectors", IWORID 2013, France
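
A minimal sketch of how the folding could be evaluated numerically, using the standard Weibull form w(E) = 1 − exp(−((E − E0)/W)^α) and the three parameter sets from the table. The 1/E spectrum, the finite upper limit `E_max`, the 20 MeV threshold default, and the assumption that energies are in MeV are all placeholders for illustration; the real LAr spectrum is not reproduced here.

```python
import math

def weibull(E, E0, W, alpha):
    """Weibull response w(E); taken as zero at or below the onset energy E0."""
    if E <= E0:
        return 0.0
    return 1.0 - math.exp(-(((E - E0) / W) ** alpha))

def integrate_log(f, a, b, n=20000):
    """Trapezoid rule on a logarithmic energy grid from a to b (both > 0)."""
    h = math.log(b / a) / n
    total = 0.0
    for i in range(n + 1):
        E = a * math.exp(i * h)
        edge = 0.5 if i in (0, n) else 1.0
        total += edge * f(E) * E  # dE = E d(ln E)
    return total * h

def response_fraction(spectrum, E0, W, alpha, E_thresh=20.0, E_max=800.0):
    """Ratio sigma_SEU / sigma_0 for a given spectrum dn/dE."""
    num = integrate_log(lambda E: weibull(E, E0, W, alpha) * spectrum(E), E0, E_max)
    den = integrate_log(spectrum, E_thresh, E_max)
    return num / den

def toy_spectrum(E):
    """Hypothetical 1/E spectrum, for illustration only."""
    return 1.0 / E

for E0, W, alpha in [(0.5, 63.6, 0.986), (4, 80, 0.586), (1, 20, 1.546)]:
    frac = response_fraction(toy_spectrum, E0, W, alpha)
    print(f"E0={E0}, W={W}, alpha={alpha}: sigma_SEU/sigma_0 = {frac:.3f}")
```

Given a measured σ<sub>SEU</sub> in a beam with known spectrum, σ<sub>0</sub> follows by dividing by this fraction; the same σ<sub>0</sub> can then be folded with the LAr spectrum.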





### LAr Static Cross Section Estimation

| Upsets (bit<sup>-1</sup> fb<sup>-1</sup>) | Timepix               | V-4VQ(1)              | V-4VQ(2)              | Simple<sup>1</sup>    |
|------|-----------------------|-----------------------|-----------------------|-----------------------|
| CRAM | $1.87 \times 10^{-6}$ | $2.04 \times 10^{-6}$ | $1.82 \times 10^{-6}$ | $1.96 \times 10^{-6}$ |
| BRAM | $1.67 \times 10^{-6}$ | $1.82 \times 10^{-6}$ | $1.63 \times 10^{-6}$ | $1.75 \times 10^{-6}$ |

<sup>1</sup> Obtained by multiplying the measured cross section by the fluence of particles above 20 MeV ($2.84 \times 10^8$ cm<sup>-2</sup> fb<sup>-1</sup>)

- Phase 2 will integrate 2 fb<sup>-1</sup> in 10 h (5.56E-5 fb<sup>-1</sup>/s)
  - CRAM: 1.01E-10 upsets/bit/s
  - BRAM: 9.06E-11 upsets/bit/s
- Estimate accuracy: ± 50%
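
The per-bit rates follow from multiplying the upsets per bit per fb⁻¹ by the luminosity rate. The sketch below uses the V-4VQ(2) column (1.82E-6 CRAM, 1.63E-6 BRAM), since those values appear to reproduce the quoted rates; that pairing is inferred from the arithmetic rather than stated on the slide.

```python
SECONDS_PER_HOUR = 3600

def per_bit_upset_rate(upsets_per_bit_per_fb, lumi_fb, hours):
    """Upsets per bit per second for `lumi_fb` (fb^-1) delivered over `hours`."""
    lumi_rate = lumi_fb / (hours * SECONDS_PER_HOUR)  # fb^-1 per second
    return upsets_per_bit_per_fb * lumi_rate

cram_rate = per_bit_upset_rate(1.82e-6, lumi_fb=2.0, hours=10)
bram_rate = per_bit_upset_rate(1.63e-6, lumi_fb=2.0, hours=10)

print(f"CRAM: {cram_rate:.2E} upsets/bit/s")
print(f"BRAM: {bram_rate:.2E} upsets/bit/s")
```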





## **Kintex7 Device Upset Estimates**

|                             | 7K70     | 7K160    | 7K325    | 7K355    | 7K410    | 7K420    | 7K480    |
|-----------------------------|----------|----------|----------|----------|----------|----------|----------|
| CRAM bits                   | 1.7E+07  | 3.7E+07  | 6.8E+07  | 7.7E+07  | 8.7E+07  | 1.0E+08  | 1.0E+08  |
| BRAM bits                   | 5.0E+06  | 1.2E+07  | 1.6E+07  | 2.6E+07  | 2.9E+07  | 3.1E+07  | 3.5E+07  |
| CRAM Upset Rate (upsets/s)  | 1.72E-03 | 3.75E-03 | 6.85E-03 | 7.77E-03 | 8.83E-03 | 1.04E-02 | 1.04E-02 |
| CRAM MTTU (s)               | 5.8E+02  | 2.7E+02  | 1.5E+02  | 1.3E+02  | 1.1E+02  | 9.7E+01  | 9.7E+01  |
| BRAM Upset Rate (upsets/s)  | 4.51E-04 | 1.08E-03 | 1.49E-03 | 2.39E-03 | 2.65E-03 | 2.79E-03 | 3.19E-03 |
| BRAM MTTU (s)               | 2.2E+03  | 9.2E+02  | 6.7E+02  | 4.2E+02  | 3.8E+02  | 3.6E+02  | 3.1E+02  |

**Upset Rate**: Device upsets / s

**MTTU**: Mean Time to Upset (seconds) = 1/Upset Rate

7K325 (Device Under Test):

- 6.85E-3 CRAM upsets per second (every 150 s)
- 1.49E-3 BRAM upsets per second (every 670 s)
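
The device-level numbers can be approximated as bit count times per-bit rate, with MTTU as the reciprocal. This sketch uses the rounded 7K325 bit counts from the table and the per-bit rates 1.01E-10 (CRAM) and 9.06E-11 (BRAM) upsets/bit/s, so it reproduces the table only to within a few percent.

```python
def device_upset_rate(n_bits, per_bit_rate):
    """Device upsets per second: number of bits times per-bit upset rate."""
    return n_bits * per_bit_rate

def mttu_seconds(upset_rate):
    """Mean time to upset in seconds, the reciprocal of the upset rate."""
    return 1.0 / upset_rate

# Rounded 7K325 values; the table was presumably built from unrounded bit counts.
cram = device_upset_rate(6.8e7, 1.01e-10)
bram = device_upset_rate(1.6e7, 9.06e-11)

print(f"CRAM: {cram:.2e} upsets/s, MTTU {mttu_seconds(cram):.0f} s")
print(f"BRAM: {bram:.2e} upsets/s, MTTU {mttu_seconds(bram):.0f} s")
```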





# **Implications of Upset Estimations**

- Configuration RAM (CRAM) : 1 upset/150 s
  - Continuous configuration scrubbing is required
    - Prevent build-up of configuration errors
    - Scrub rate > 10x upset rate ( > 1/15 s)
  - Active hardware redundancy required
    - Mitigate effects of single configuration upset
    - Example: Triple-Modular Redundancy (TMR)
- BRAM : 1 upset/670 s
  - Exploit BRAM ECC (SEC/DED)
  - Employ BRAM scrubbing
    - Prevent build-up of errors to "break" SEC/DED code
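
To illustrate why the scrub rate matters, the sketch below derives the longest scrub period allowed by the 10x rule and, assuming upsets arrive as a Poisson process (an assumption, not a measured property), the chance that two or more upsets accumulate within one scrub period.

```python
import math

def max_scrub_period(mttu_s, margin=10.0):
    """Longest scrub period (s) keeping the scrub rate `margin` times the upset rate."""
    return mttu_s / margin

def prob_multiple_upsets(mttu_s, scrub_period_s):
    """Poisson probability of two or more upsets within one scrub period."""
    mu = scrub_period_s / mttu_s  # expected upsets per scrub period
    return 1.0 - math.exp(-mu) * (1.0 + mu)

period = max_scrub_period(150.0)         # CRAM: 1 upset / 150 s
p = prob_multiple_upsets(150.0, period)
print(f"scrub period <= {period:.0f} s, P(>=2 upsets per period) = {p:.4f}")
```

With a 15 s period, fewer than 1 in 200 scrub periods would see more than one upset under this model; slower scrubbing raises that fraction quickly.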







# Multi-Bit Upsets

- Single event may upset more than one cell
  - Charge sharing by adjacent circuit nodes
  - More common with smaller process technology (28 nm)
- MBUs may "break" mitigation methods
  - Error Correction Codes (FrameECC, SEC/DED)
  - Triple Modular Redundancy (assumes single fault)
- Multi-bit upset analysis
  - Multi-bit CRAM and BRAM events were observed
  - Mitigation methods must anticipate some MBU events





# **Configuration Frame Interleaving**

CRAM bits interleaved to avoid intra-frame MBUs










[Figure: physical layout of CRAM bits, with bits of Frame #0 and Frame #1 interleaved]

MBU is split between two frames: FrameECC still operates

Example MBU:

- two "intra-frame" upsets
- four "inter-frame" upsets
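
A toy model of the idea, assuming a simple two-way round-robin interleaving in which physically adjacent bits alternate between frames. The actual Kintex7 layout is not given here, so `frame_of` and the interleaving factor are hypothetical.

```python
from collections import Counter

def frame_of(physical_index, n_frames=2):
    """Frame owning a physical bit position under round-robin interleaving."""
    return physical_index % n_frames

def upsets_per_frame(first_bit, cluster_size, n_frames=2):
    """Count how many bits of a contiguous physical upset cluster land in each frame."""
    return Counter(frame_of(first_bit + k, n_frames) for k in range(cluster_size))

def frame_ecc_can_repair(first_bit, cluster_size, n_frames=2):
    """True if no frame receives more than one upset (single-bit repair per frame)."""
    return max(upsets_per_frame(first_bit, cluster_size, n_frames).values()) <= 1

print(upsets_per_frame(10, 2))        # one upset in each frame
print(frame_ecc_can_repair(10, 2))    # two adjacent bits: still repairable
print(frame_ecc_can_repair(10, 3))    # three adjacent bits: one frame gets two upsets
```

In this model a two-bit physical cluster becomes two single-bit errors in different frames, while larger clusters still produce intra-frame MBUs that need the additional scrubber support.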





# **CRAM MBU Testing Results**

| Intra-Frame upsets/event | Frequency | Inter-Frame upsets/event | Frequency |
|--------------------------|-----------|--------------------------|-----------|
| 1                        | 90.1%     | 1                        | 65.0%     |
| 2                        | 7.5%      | 2                        | 26.8%     |
| 3                        | 1.4%      | 3                        | 2.9%      |
| 4                        | 0.60%     | 4                        | 3.5%      |
| 5                        | 0.26%     | 5                        | 0.61%     |
| 6+                       | 0.16%     | 6+                       | 1.3%      |
\*results based on 2012 LANSCE neutron test

- 9.9% of events cause multiple upsets within a frame
  - Estimated CRAM MBU rate: 1.02E-11 events/bit/s
  - 7K325: one intra-frame MBU every 1515 seconds (~25 min)
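
The 1515 s figure matches the 150 s CRAM mean time to upset divided by the 9.9% intra-frame MBU fraction. A short sketch of that arithmetic (the function name is illustrative):

```python
def mbu_interval_s(cram_mttu_s, mbu_fraction):
    """Mean time between intra-frame MBU events, given the mean time between all events."""
    return cram_mttu_s / mbu_fraction

single_upset_fraction = 0.901             # 90.1% of events upset one bit per frame
mbu_fraction = 1.0 - single_upset_fraction

interval = mbu_interval_s(150.0, mbu_fraction)
print(f"MBU fraction: {mbu_fraction:.3f}")
print(f"Mean time between intra-frame MBUs: {interval:.0f} s ({interval / 60:.0f} min)")
```

The same fraction applied to the per-bit rate of 1.01E-10 gives about 1.0E-11, close to the quoted 1.02E-11.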





## **Configuration Scrubbing**

- Configuration Scrubbing Constraints
  - Must repair single and multiple-bit upsets quickly
    - Accumulation of upsets will break mitigation (such as TMR)
    - Accumulation of upsets will increase static power
  - Minimize external circuitry (avoid radiation hardened scrubbing HW)
- Kintex7 FPGA contains internal "Frame" Scrubber
  - Continuously monitors state of configuration memory (FrameECC)
  - Automatically repairs single-bit errors within a frame
  - Identifies multi-bit errors and configuration CRC failures
- Additional scrubber support needed to repair MBUs
  - JTAG connection to host controller (slow, limited hardware)
  - Configuration controller and on-board memory (fast, complex hardware)
- Several Configuration Scrubbing approaches currently being validated







# **TID and SEFI**

- Total Ionizing Dose (TID)
  - High energy protons (180 MeV)
  - Two FPGAs tested
    - FPGA #1: failed at 340 krad (cause of failure not yet determined)
    - FPGA #2: 446 krad (still operational at end of test)
  - Initial results are very positive
    - More TID testing needed for sufficient statistics (TID testing is expensive: parts, beam time)
- Single-Event Functional Interrupt (SEFI)
  - No SEFIs observed during neutron testing
  - 3 likely SEFIs observed during proton testing
    - Loss of configuration (Power On Reset?)
    - Limited capability to identify and characterize SEFI (scrubber needed)
  - SEFI cross section appears very small
    - Additional SEFI cross section testing needed (better SEFI characterization)





# **Future Testing Plans**

- SEFI testing
  - Obtain more SEFI events to improve statistics
  - Observe more SEFI event types
  - Understand how to characterize/respond to SEFIs
- TID testing
  - Obtain more data on TID failure (more FPGAs, much more beam time)
- Mitigation
  - Test TMR for larger circuits and more data
    - High flux / scrub ratio needed to break TMR
  - Test BRAM ECC
- High-Speed Serial I/O (GTX)
  - Understand failure mechanisms of serializer
  - Test serial I/O mitigation methods (AuroraFT)
  - Estimate bit-error rate, availability, and overall throughput





# Summary

- Significant radiation testing completed on Kintex7
  - Static cross-section well understood (CRAM, BRAM)
  - Multi-bit upset behavior identified
- Upset estimates completed for ATLAS LAr environment
- CRAM Scrubber approaches developed and currently being tested
  - Inner Kintex7 self-scrubber (Single-bit upsets)
  - Outer low-resource JTAG scrubber (Multi-bit upsets)

All results suggest that, with proper SEU mitigation, the Kintex7 may be used within the ATLAS Liquid Argon Calorimeter





## "Earworm of the day"

- The best place to test an FPGA is actually inside the detector itself, near the place your electronics will be located. Therefore having a test board that is accessible remotely could be the best development platform for mitigation techniques.
- Collaborative work could speed up development of mitigation techniques. Could start with a workshop.





## **Contact Information**

- Mike Wirthlin <u>wirthlin@byu.edu</u>
- Helio Takai <u>takai@bnl.gov</u>






