# On The Exploration of SRAM-based FPGAs

G. Tsiligiannis, <u>R. Ferraro</u>, S. Danzeca



#### Outline

- Electronics Development – FPGA families
- Candidate for different CERN applications
- CRAM Problem Workaround
- Artix7 Arty board
- First setup
- First Results
- Second setup
- Second Results
- Next steps to consider



#### Electronics Development – FPGA families

- The digital part of the electronics operating at CERN is usually controlled by microcontrollers, processors, FPGAs, or PLCs.
- FPGAs are preferred because they offer:
  - high speeds of operation
  - high capacity for logic designs
  - numerous I/Os compatible with different protocols
- Three families of FPGAs:
  - SRAM-based ± (TID, speed, size, cost)
  - FLASH-based ± (configuration memory, speed, security)
  - Anti-Fuse ± (configuration memory, TID, not reprogrammable)







#### Candidate for various CERN applications

- We need a component that:
  - Can withstand high TID levels (>1kGy)
  - Does not lose its configuration memory (one time programmable)
  - Is reprogrammable
  - Can operate at high speeds (>100MHz)
  - Has a high capacity (LUTs, FFs, DSPs, etc.)
  - Does not latch up
  - Is cheap!
- In other words: the perfect FPGA!
- Mainly for low-intensity radiation environments and non-critical applications
- FLASH-based FPGAs have been used for a long time (and are still used)
  - TID problems on several parts of the FPGA
- Xilinx FPGAs are sensitive mainly due to the SRAM-based configuration memory.
- What is the solution?



# Working around the CRAM problem

- The main drawback for SRAM-based FPGAs is the vulnerability of the configuration memory.
- Xilinx proposes a standalone solution to the problem: Soft Error Mitigation IP (SEM)
- SEM makes use of built-in primitives for the detection and correction of SEUs on the configuration memory
- It is a controller that brings together the frame ECC and ICAP primitives and provides an external interface.
  - ECC/CRC error detection and correction
  - 13-bit Hamming code
  - UART interface for status/error reporting
- SEM can correct up to two bit errors per CRAM frame
  - It cannot correct upsets in BRAM or flip-flops
- Three options for correction:
  - Single error correction: ECC
  - Double error correction: ECC/CRC
  - Correction by replace: external storage
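
To illustrate the principle behind syndrome-based correction (one error corrected, two errors detected), here is a minimal Python sketch of an extended Hamming code on a 4-bit word. It is a toy model only: the word size, bit layout and the function names `hamming_encode` and `hamming_check` are assumptions made for illustration and do not reproduce the frame ECC layout used by the SEM IP.

```python
def hamming_encode(data_bits):
    """Encode a list of 0/1 data bits into an extended Hamming codeword.

    Positions 1..n hold the classic Hamming code (powers of two are parity
    bits); position 0 holds an overall parity bit used to tell single
    errors from double errors.
    """
    n_parity = 0
    while (1 << n_parity) < len(data_bits) + n_parity + 1:
        n_parity += 1
    total = len(data_bits) + n_parity
    code = [0] * (total + 1)
    it = iter(data_bits)
    for pos in range(1, total + 1):
        if pos & (pos - 1):            # not a power of two: data position
            code[pos] = next(it)
    for p in range(n_parity):
        mask = 1 << p
        parity = 0
        for pos in range(1, total + 1):
            if pos & mask:
                parity ^= code[pos]
        code[mask] = parity            # makes each parity group even
    code[0] = sum(code[1:]) % 2        # overall parity bit
    return code


def hamming_check(code):
    """Return (status, repaired_copy) with status 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos            # XOR of the positions of all set bits
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        return "ok", code
    if overall == 1 and syndrome < len(code):
        # Odd parity: treat as a single flip; the syndrome is its position
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        return "corrected", code
    return "uncorrectable", code       # even parity, non-zero syndrome


if __name__ == "__main__":
    word = hamming_encode([1, 0, 1, 1])
    hit = list(word)
    hit[5] ^= 1                        # emulate one SEU
    print(hamming_check(hit)[0])
    hit[2] ^= 1                        # emulate a second SEU in the same word
    print(hamming_check(hit)[0])
```

A double error is only detected here, not corrected; correcting it needs extra information, which is the role the slides attribute to the CRC in the ECC/CRC option.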



# Arty board

- Xilinx 7-Series Artix-7 xc7a35ticsg324
  - 33,280 logic cells in 5200 slices (each slice contains four 6-input LUTs and 8 flip-flops)
  - 1,800 Kbits of fast block RAM, 90 DSP slices, on-chip analog-to-digital converter (XADC).
  - Approximately 14,953,046 bits of CRAM
- Arty Board:
  - Many peripheral interfaces:
    - 16MB Quad-SPI Flash
    - 10/100 Mbps Ethernet PHY
    - and many others
- Why?
  - Large enough to host Microblaze + custom design/IPs
  - High TID endurance: up to 400 krad (4 kGy) according to the BYU study
  - Availability: reduced development cost/time
  - Low-cost development board
  - Several peripherals to "play with" at the system level => very good vehicle to drive system level testing and qualification of this component (and why not the board itself!!!)





#### Artix7 – How do we qualify it?

- Starting from the weakest link => CRAM
  - SEM IP
  - Comparison with the readback method for cross-section (XS) measurement
- Application level
  - Simple applications (e.g. counter)
  - More sophisticated (e.g. FSMs)
  - Full application (e.g. µBlaze)
- System level
  - Board level usage of peripherals (PHY, FLASH, power converter, ADC, memories etc)



# First Setup

- Simple design with the SEM IP operating with the monitoring interface provided by Xilinx (MonShim)
  - Enhanced repair used (CRC/ECC)
  - ICAP/ECC FRAME
  - MONITOR SHIM (UART interface)
- A counter triggering a separate UART serves as our "application", pinging to show whether the board is still functional
- The UART and counter are instantiated using the Distributed TMR directive of the Synopsys Synplify Premier synthesizer for Xilinx FPGAs.
  - Communication via USB repeater, and two UARTs.
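
As a behavioural illustration of what triplication buys for the counter "application", the following Python sketch models three register copies with a 2-of-3 majority vote. It is not what the Synplify directive generates (distributed TMR also triplicates the voters themselves); the class `TmrCounter` and its methods are hypothetical names used only for this sketch.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote over three integer copies."""
    return (a & b) | (a & c) | (b & c)


class TmrCounter:
    """Counter stored in three copies; every tick re-votes and rewrites all copies."""

    def __init__(self, width=8):
        self.mask = (1 << width) - 1
        self.copies = [0, 0, 0]

    def value(self):
        return majority(*self.copies)

    def tick(self):
        nxt = (self.value() + 1) & self.mask
        self.copies = [nxt, nxt, nxt]   # voted feedback scrubs a corrupted copy
        return nxt

    def upset(self, copy, bit):
        """Emulate an SEU flipping one bit in one of the three copies."""
        self.copies[copy] ^= 1 << bit


if __name__ == "__main__":
    c = TmrCounter()
    for _ in range(5):
        c.tick()
    c.upset(copy=1, bit=3)              # single upset in one copy
    print(c.value())                    # still the voted value
    c.tick()
    print(c.copies)                     # all three copies agree again
```

A single upset in one copy is masked by the vote and removed at the next tick; two upsets hitting the same bit in different copies before a tick would defeat the voter.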





## Irradiation at CHARM





CHARM configuration: copper target without shielding

CHARM rates in position 0: HEH ≈ 10<sup>5</sup> HEH·s<sup>-1</sup>·cm<sup>-2</sup>



# **First Results**

- 1 week of irradiation
- 16 Gy total dose
- **1.6×10<sup>10</sup> HEH** total fluence (corresponding to the logs we recorded)
- During the runs, the UART/counter application never failed
- SEM IP Failure modes:
  - Stuck-at failures: the SEM keeps reporting the same corrected bit
  - Stuck-at failures: the SEM keeps sending the same character
  - No data from the SEM IP: no response to the Status or the Reset command
  - **Corrupted data** coming from the SEM UART interface (UART garbage)
- Same bit corrected twice in a row
- Configuration Memory Cross Section: 2.34E-14 cm<sup>2</sup>/bit
- Non Correctable Bit Cross Section: 2.71E-16 cm<sup>2</sup>/bit
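
For readers unfamiliar with the quantity, a per-bit cross section is the number of observed events divided by the fluence and by the number of exposed bits. The Python sketch below states that relation and inverts it to get the order of magnitude of events behind the quoted figures. Two assumptions are made here and are not stated in the slides: the fluence is taken as HEH per cm<sup>2</sup>, and the normalisation is taken over the full CRAM bit count quoted for the device. The function names are hypothetical.

```python
CRAM_BITS = 14_953_046      # approximate CRAM size quoted for the xc7a35t
FLUENCE_RUN1 = 1.6e10       # first run; assumed to be in HEH/cm^2


def cross_section_per_bit(events, fluence, bits):
    """sigma = N / (fluence * bits), in cm^2/bit."""
    return events / (fluence * bits)


def implied_events(sigma_bit, fluence, bits):
    """Invert the relation: how many events a given sigma corresponds to."""
    return sigma_bit * fluence * bits


if __name__ == "__main__":
    # Back-of-envelope only: valid if the quoted sigma was normalised
    # by the full CRAM bit count (an assumption of this sketch).
    upsets = implied_events(2.34e-14, FLUENCE_RUN1, CRAM_BITS)
    uncorrectable = implied_events(2.71e-16, FLUENCE_RUN1, CRAM_BITS)
    print(f"implied CRAM upsets:        {upsets:.0f}")
    print(f"implied uncorrectable bits: {uncorrectable:.0f}")
```

Under those assumptions the first run corresponds to a few thousand CRAM upsets, of which a few tens were not correctable.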



### **Second Setup**

- Upgrade of the SEM monitoring interface using our own custom monitoring module
  - Triplicated FIFOs, simple UART module (dTMRed), simple UART operating FSMs (dTMRed)
- More sophisticated approach: a realistic case study based on Finite State Machines (FSMs)
- Clusters of FSMs, each one with a different sensitivity
  - Each FSM uses one-hot encoding
  - Three FSM types identical in functionality (code), differing only in the synthesis directive:
    - none
    - dTMR
    - Hamming 3
  - The syn\_safe\_case directive is used to keep the default state in case of an SEU
  - 6 consecutive states in a cyclic path. Transitions occur with an enable signal (fed by a counter)
  - Internal counter verifying the number of states that have been accessed.
  - Error signal triggered in case of an erroneous state, or of a mismatch with the expected number of states (XOR between the bits of the state register)
- CHARM configuration: copper target with shielding
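
The test FSM described above can be sketched behaviourally in Python: a 6-state one-hot ring, a counter of visited states, and an error flag raised when the XOR of the state bits is not 1 or when the state disagrees with the counter. This is a software model built from the bullet points, not the actual HDL; the recovery policy (return to the default state and resynchronise the counter) and all names are assumptions of this sketch.

```python
N_STATES = 6
STATE_MASK = (1 << N_STATES) - 1


def xor_of_bits(state):
    """XOR of all state-register bits; 1 for any valid one-hot state."""
    return bin(state & STATE_MASK).count("1") % 2


def is_one_hot(state):
    return 0 < state <= STATE_MASK and state & (state - 1) == 0


def next_state(state, enable):
    """Rotate the hot bit around the 6-state cycle when enabled."""
    if not enable:
        return state
    if not is_one_hot(state):
        return 1                       # default branch: go back to the first state
    return ((state << 1) | (state >> (N_STATES - 1))) & STATE_MASK


def run(cycles, upsets=None):
    """Step the FSM; `upsets` maps a cycle number to the bit flipped in that cycle."""
    upsets = upsets or {}
    state, count, errors = 1, 0, 0
    for cycle in range(cycles):
        if cycle in upsets:
            state ^= 1 << upsets[cycle]            # emulate an SEU
        expected = 1 << (count % N_STATES)         # state implied by the counter
        if xor_of_bits(state) != 1 or state != expected:
            errors += 1
            state, count = 1, 0                    # assumed recovery policy
            continue
        state = next_state(state, enable=True)
        count += 1
    return state, count, errors


if __name__ == "__main__":
    print(run(12))                     # clean run
    print(run(12, upsets={4: 0}))      # one bit flipped during cycle 4
```

A single bit flip in a one-hot register always produces zero or two hot bits, so the XOR check catches it; flips of two bits can keep the XOR at 1, which is where the counter comparison helps.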








### Second Results

- 1 week of irradiation
- 1.86E+09 HEH total fluence
- 1.6 Gy total dose
- Improvement of the SEM IP behavior
- Failure modes:
  - No-Data from SEM IP interface
  - Garbage from the UART/SEM-IP
- Same bit corrected twice in a row
- 5 failures of the non-TMR FSM. No reset was needed. Major failures involving a large number of upsets.
- One Hamming 3 major failure requiring reset.
- Configuration Memory Cross Section: 6.67E-14 cm<sup>2</sup>/bit
- Non Correctable Bit Cross Section: **5.03E-16 cm<sup>2</sup>/bit**
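
One way to compare the two campaigns is the share of upsets the SEM could not correct, i.e. the ratio of the two quoted cross sections. The short Python sketch below computes it from the numbers on the results slides. No uncertainties are quoted in the slides, so the comparison is indicative only; the dictionary layout and the function name are illustrative.

```python
RUNS = {
    "first run (no shielding)": {"sigma_cram": 2.34e-14, "sigma_uncorr": 2.71e-16},
    "second run (shielding)": {"sigma_cram": 6.67e-14, "sigma_uncorr": 5.03e-16},
}


def uncorrectable_fraction(run):
    """Fraction of CRAM upsets reported as non-correctable."""
    return run["sigma_uncorr"] / run["sigma_cram"]


if __name__ == "__main__":
    for name, run in RUNS.items():
        print(f"{name}: {100 * uncorrectable_fraction(run):.2f} % non-correctable")
```

In both runs the non-correctable share is around one percent of the configuration upsets.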



# **Ongoing and Future work**

- This is just the beginning! Results are interesting, but many things remain to be tested
- SEM IP with simple repair
- SEM IP with repair by replace (most promising)
- Exploration of the POST\_CRC directive with repair and continue operation
  - Xilinx suggests going directly with the SEM IP
- Readback Cross Section Calculation => waiting for the calibration!
- Microblaze setup using 10/100 Mb/s Ethernet link
  - Preliminary results are not so encouraging
- FSM setups using a simple structure without the Synopsys directives:
  - Experts from the field have pointed out that they prefer not to use such tools, but rather to do things by hand
- Harsh conditions testing: move towards a position inside CHARM with higher flux/dose
- Use applications to explore Mean Time Between Failures
  - CERN applications: nanoFIP?
  - Generic applications: AXI based architecture using the available peripherals of the board?



#### Thank you for your time!



#### **Questions?**



R. Ferraro, TWEPP 2016, Karlsruhe, Germany