## SEU Risk Analysis of the LHC Collimators low level control rack in UJ14, UJ16 and UJ56

A. Masi

## **Contents**

- 1. The LHC Collimator Low level Control system control architecture
- 2. The electronics location
- **3. SEU effects and probability according to the CNRAD results**
- 4. SEU effects: Risks Analysis
- 5. How SEU can be detected
- 6. Improvements of the control system reliability against SEU
- 7. Conclusions





The FESA server constantly monitors the communication with the low level systems and receives errors if the CPU or the FPGA is stuck.

**MDC** is responsible for motor drive and control.

- Control is open loop.
- Resolvers are used to detect lost steps.
- The jaw movement is blocked if the difference between resolver and controller positions exceeds 50 µm.
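The step-loss check above can be sketched as follows. This is a minimal illustration, assuming positions are available in micrometres; the function and constant names are hypothetical and do not come from the MDC firmware.

```python
# Hypothetical sketch of the MDC step-loss check (names are illustrative only).
STEP_LOSS_LIMIT_UM = 50.0  # from the slide: block if the difference exceeds 50 um


def movement_allowed(resolver_um: float, controller_um: float,
                     limit_um: float = STEP_LOSS_LIMIT_UM) -> bool:
    """Return True while the resolver reading agrees with the open-loop
    controller position to within the limit; False means the jaw is blocked."""
    return abs(resolver_um - controller_um) <= limit_um


if __name__ == "__main__":
    print(movement_allowed(1000.0, 1040.0))  # 40 um difference: allowed
    print(movement_allowed(1000.0, 1060.0))  # 60 um difference: blocked
```

Because the control is open loop, the resolver acts only as an independent cross-check: it does not correct the position, it only vetoes further movement.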

**PRS** is responsible for the Position Readout and Survey.

# Only the PRS is connected to the BIC system



1st remark:

Only the PRS is part of machine protection. Failure of the MDC does not compromise machine safety, and operation can generally continue without interruption if the collimators do not need to be moved.

- We have experienced failures of the motor drivers during stable beams:
  - TCP.D6L7.B1
    - Event Timestamp: 26/08/11 06:30:10.776, Fill Number: 2056
    - Event Timestamp: 19/09/11 23:59:01.024, Fill Number: 2127
    - Event Timestamp: 20/09/11 01:22:17.718, Fill Number: 2127
- The jaws retract because of the auto-retraction system (the MDC cannot stop the jaws!)
- The PRS <u>must</u> dump the beam if the position limit thresholds are violated (i.e. the ones that are functions of time and/or energy and/or β\*)

## The failure analysis focuses only on the PRS system



2nd remark:

The LHC Collimator control system is a "living" system. Every year, improvements and new functionalities are added on the basis of the previous year's operational experience.

**The engineering specifications are no longer kept up to date**

We have started a complete review of the engineering specifications of the low level control system "as built". This should help to classify anomalous behaviours correctly and dispel unjustified worries.





## **PRS: the protection features for the machine**



Two regimes: discrete ("actual") thresholds and time functions (100 Hz survey frequency).

Inner and outer thresholds as a function of time for each motor axis and gap (24 per collimator), triggered by a timing event (e.g. start of ramp).

- "Double protection" $\rightarrow$ BIC loop broken AND jaw stopped
- Redundancy: maximum allowed gap versus energy (2 per collimator). The interpolation is performed on the FESA gateway and the values are sent on the fly via the network
- Redundancy: beta-squeeze factor for TCT interlocking. The interpolation is performed on the FESA gateway and the gap values are sent on the fly via the network
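As an illustration of the survey logic described above, the sketch below interpolates time-dependent inner and outer thresholds, applies the "double protection" reaction, and checks a gap against an energy-dependent limit. All function names, threshold tables and numeric values are hypothetical; the real checks run on the PRS and the FESA gateway and are not specified in these slides.

```python
from bisect import bisect_right


def interpolate(points, x):
    """Piecewise-linear interpolation over (x, y) points sorted by x,
    clamped to the first/last value outside the table."""
    xs = [p[0] for p in points]
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    i = bisect_right(xs, x)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def survey(position_mm, t_s, inner_fn, outer_fn):
    """One survey cycle for one axis: the position must stay between the
    inner and outer thresholds evaluated at time t_s (assumed semantics)."""
    inner = interpolate(inner_fn, t_s)
    outer = interpolate(outer_fn, t_s)
    violated = not (inner <= position_mm <= outer)
    # "Double protection": break the BIC loop AND stop the jaw.
    return {"break_bic_loop": violated, "stop_jaw": violated}


def gap_vs_energy_ok(gap_mm, energy_gev, max_gap_fn):
    """Redundant check: gap must not exceed the energy-dependent maximum."""
    return gap_mm <= interpolate(max_gap_fn, energy_gev)


if __name__ == "__main__":
    inner_fn = [(0.0, 1.0), (10.0, 2.0)]   # hypothetical (time s, mm)
    outer_fn = [(0.0, 3.0), (10.0, 4.0)]   # hypothetical (time s, mm)
    print(survey(2.0, 5.0, inner_fn, outer_fn))
    print(survey(1.2, 5.0, inner_fn, outer_fn))
```

In this sketch both reactions are driven by the same flag, which mirrors the "BIC loop broken AND jaw stopped" statement; a real implementation would run one such check per axis and gap at the survey frequency.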












#### 1-The LHC Collimator Low level Control system control architecture

















#### 2-The electronics location






# The Collimators Control Electronics racks exposed to radiation effects are located in the sensitive areas UJ14, UJ16, UJ56





| LHC point | Rack name | Place | PXI name (MDC) | PXI name (PRS) | Collimators                 |
|-----------|-----------|-------|----------------|----------------|-----------------------------|
| Point 1 L | TYCFL01   | UJ14  | MDC-1-NT-001   | PRS-1-NT-001   | TCL.5L1.B2<br>TCLP.4L1.B2   |
|           |           |       | MDC-2-WT-003   | PRS-2-NT-003   | TCTH.4L1.B1<br>TCTVA.4L1.B1 |
| Point 1 R | TYCFL01   | UJ16  | MDC-2-WT-004   | PRS-2-NT-004   | TCTVA.4R1.B2<br>TCTH.4R1.B2 |
|           |           |       | MDC-1-NT-002   | PRS-1-NT-002   | TCLP.4R1.B1<br>TCL.5R1.B1   |
| Point 5 R | TYCFL01   | UJ56  | MDC-2-WT-012   | PRS-2-NT-012   | TCTVA.4R5.B2<br>TCTH.4R5.B2 |
|           |           |       | MDC-1-NT-010   | PRS-1-NT-010   | TCLP.4R5.B1<br>TCL.5R5.B1   |








**Collimators Control rack in UJ14**

**Collimators Control rack in UJ16**



#### 3-SEU effects and probability according to the CNRAD results

- According to the radiation tests performed at CNRAD in April-May 2010, several types of failure were observed on a PXI control system, starting from a fluence of some 10^6 p/cm^2 and up to 3.92 p/cm^2:
  - Operational errors (e.g. corrupted register or memory cell values)
  - CPU stuck
  - FPGA errors (e.g. a stuck or flipped bit)
  - PXI rebooted by itself
  - Network communication temporarily lost
- On an MDC, these failures can be quickly fixed by a remote intervention, but they can cause:
  - Collimator operation not possible
- On a PRS, these failures can be quickly fixed by a remote intervention, but they can cause:
  - Collimator operation not possible
  - False dumps
  - Collimator survey out of order (machine protection impact)



## LHC Collimators control in IP1 risk analysis

Legend

**Risk probability:**

- High: more than 100 events experienced during the CNGS test
- Medium: between 10 and 100 events experienced during the CNGS test
- Low: fewer than 10 events experienced during the CNGS test
- Really low: only 1 event experienced during the CNGS test

**Severity:**

- Low: Collimator operation not possible
- Medium: False dumps
- High: Collimator survey out of order (machine protection impact)

**Corrective action:** *action to take to restore correct control system operation*

**System downtime:** *time during which the control system is not operational / out of order*
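The probability classes of the legend can be expressed as a small lookup. This is a sketch only: the function name is hypothetical, and the handling of the boundaries (exactly 10 or 100 events counted as Medium, exactly 1 event counted as Really low rather than Low, 0 events reported as not observed) is an assumption where the legend is ambiguous.

```python
def risk_probability(events: int) -> str:
    """Map the number of events seen during the test to a legend class.

    Boundary handling is an assumption: 1 event -> "Really low",
    10 and 100 events -> "Medium", 0 events -> "Not observed".
    """
    if events < 0:
        raise ValueError("event count cannot be negative")
    if events == 0:
        return "Not observed"  # not part of the legend; added for completeness
    if events == 1:
        return "Really low"
    if events < 10:
        return "Low"
    if events <= 100:
        return "Medium"
    return "High"


# Severity classes from the legend, keyed by their consequence.
SEVERITY = {
    "Collimator operation not possible": "Low",
    "False dumps": "Medium",
    "Collimator survey out of order": "High",
}

if __name__ == "__main__":
    for n in (1, 5, 50, 150):
        print(n, risk_probability(n))
```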







#### 4-The worst case (PRS): SEU Critical effect for the machine safety





### 4-SEU effects: what we experienced so far

| Date       | IP   | SEE type | Beam<br>dumped | PXI system   | Collimator<br>names         | Fill number | Problem                                                                          |
|------------|------|----------|----------------|--------------|-----------------------------|-------------|----------------------------------------------------------------------------------|
| 27/04/2011 | UJ14 | Soft SEE | NO             | MDC-2-WT-003 | TCTH.4L1.B1<br>TCTVA.4L1.B1 | 1740        | Communication<br>lost with the<br>MDC                                            |
| 01/05/2011 | UJ14 | Soft SEE | YES            | PRS-2-NT-003 | TCTH.4L1.B1<br>TCTVA.4L1.B1 | 1753        | PRS rebooted by itself                                                           |
| 01/06/2011 | UJ56 | Hard SEE | YES            | PRS-2-NT-010 | TCL.5R5.B1<br>TCLP.4R5.B1   | 1835        | PRS power supply failed                                                          |
| 13/06/2011 | UJ14 | Hard SEE | YES            | MDC-2-WT-003 | TCTH.4L1.B1<br>TCTVA.4L1.B1 | 1865        | MDC power<br>supply failed/<br>rack circuit<br>breaker off                       |
| 30/07/2011 | UJ16 | Hard SEE | YES            | MDC-2-WT-004 | TCTH.4R1.B2<br>TCTVA.4R1.B2 | 1992        | MDC power<br>supply failed/<br>rack circuit<br>breaker off                       |
| 12/09/2011 | UJ16 | Soft SEE | NO             | MDC-2-WT-004 | TCTVA.4R1.B2                | 2102        | Bit stuck in the<br>counter register<br>likely on the<br>FPGA output<br>register |



### 4-SEU effects: Risk analysis

| Risk probability | Failure scenario                                          | Severity | Coll. not operational | False beam dump | Machine unprotected | Corrective action                 | System downtime |
|------------------|-----------------------------------------------------------|----------|-----------------------|-----------------|---------------------|-----------------------------------|-----------------|
| High             | MDC FPGA error                                            | L        | X                     |                 |                     | FPGA remote reset                 | 15 min          |
| High             | MDC CPU error                                             | L        | X                     |                 |                     | MDC remote reboot                 | 15 min          |
| Medium           | MDC rebooted itself                                       | L        | X                     |                 |                     |                                   | 2 min           |
| Medium/high      | <i>MDC</i> power supply failure                           | L        |                       | X               |                     | Power supply replacement          | 2 h             |
| Really low       | Stepping motor driver failure                             | M        |                       | X               |                     | Stepping motor driver replacement | 2 h             |
| Low              | PRS CPU communication lost but survey loops still running | M        |                       | X               |                     | Remote PRS reboot                 | 15 min          |
| Medium/high      | <i>PRS</i> power supply failure                           | M        |                       | X               |                     | Power supply replacement          | 2 h             |
| Medium/low       | PRS rebooted by itself                                    | M        |                       | X               |                     |                                   | 2 min           |



### 4-SEU effects: Risk analysis

| Risk probability | Failure scenario                          | Severity | Coll. not operational | False beam dump | Machine unprotected | Corrective action                    | System downtime |
|------------------|-------------------------------------------|----------|-----------------------|-----------------|---------------------|--------------------------------------|-----------------|
| Really low       | <i>PRS</i> memory values corrupted        | M        |                       | X               |                     | Remote PRS reboot                    | 15 min          |
| Really low       | <i>PRS</i> FPGA interlock logic corrupted | H        |                       | X               | X                   | Remote PRS reboot                    | 15 min          |
| Really low       | PRS CPU stuck                             | H        |                       |                 | X                   | Remote PRS reboot                    | 15 min          |
| Really low       | PRS FPGA errors/ communication lost       | H        |                       |                 | X                   | FPGA remote reset/ Remote PRS reboot | 15 min          |

Even if really rare, these effects must be detected and mitigated.



#### 5-How the SEU critical effects can be prevented: PXI power supplies failure



The 12 PXI chassis in the sensitive areas will likely be replaced with new high-reliability PXI chassis during the next Christmas break.



Two prototypes of the new PXI chassis have been successfully tested in a lab collimator rack.



#### 5-How the SEU critical effects can be prevented: PRS unsafe states





**Assumptions:**

- The probability of a CPU and an FPGA being stuck at the same time on the same system is negligible
- On the PRS system, if the CPU is stuck, the collimator position survey is compromised
- On the PRS system, if the FPGA is stuck or affected by errors, the beam dump functionality can be compromised

**Proposal:**

- The CPU watchdog timer on the PRS FPGA should be able to trigger an interlock
- A software interlock can be added at the FESA level on the PRS FPGA stuck error and on communication time-out
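The two proposed mechanisms can be sketched as follows: a watchdog that the CPU must refresh periodically and that requests an interlock when it expires, plus a FESA-level software interlock on an FPGA stuck error or a communication time-out. The class, function and parameter names and all time-out values are hypothetical; time is passed in explicitly so the logic can be exercised without real hardware.

```python
class Watchdog:
    """Hypothetical model of a CPU watchdog hosted on the PRS FPGA."""

    def __init__(self, timeout_s: float, now_s: float):
        self.timeout_s = timeout_s
        self.last_kick_s = now_s

    def kick(self, now_s: float) -> None:
        """Called by the CPU while it is alive."""
        self.last_kick_s = now_s

    def interlock_requested(self, now_s: float) -> bool:
        """True if the CPU has not refreshed the watchdog in time."""
        return now_s - self.last_kick_s > self.timeout_s


def fesa_software_interlock(now_s: float, last_reply_s: float,
                            fpga_stuck_error: bool,
                            comm_timeout_s: float) -> bool:
    """FESA-level check: interlock on an FPGA stuck error or when the
    low level system has not replied within the communication time-out."""
    return fpga_stuck_error or (now_s - last_reply_s > comm_timeout_s)


if __name__ == "__main__":
    wd = Watchdog(timeout_s=0.5, now_s=0.0)   # 0.5 s is an illustrative value
    wd.kick(1.0)
    print(wd.interlock_requested(1.4))  # CPU still considered alive
    print(wd.interlock_requested(1.6))  # watchdog expired
    print(fesa_software_interlock(10.0, 9.0, False, comm_timeout_s=2.0))
```

The split follows the assumptions above: since a simultaneous CPU and FPGA failure is taken as negligible, the FPGA can supervise the CPU and the FESA server can supervise the FPGA.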



#### 6-Improvements of the control system reliability against SEU





#### 7-Conclusions

- The failure analysis of the SEU effects on the collimator electronics has taken the worst cases into account (<u>PRS in unsafe state</u>)
- The sensitivity of the PXI power supply to hard SEU has been demonstrated in a radiation test at PSI. During the next Christmas break we will install some new high-availability PXI chassis. If the test is successful we will replace all of them during LS1.
- Improvements to the PRS software on the collimator control systems in UJ14, UJ16 and UJ56 have been proposed to mitigate the SEU effects and reduce their impact on machine protection
- The proposed improvements are being tested, and preliminary results confirm that the chosen time-out values are a good compromise between detecting situations dangerous for machine protection and avoiding false dumps
- The software will be ready to be deployed in the tunnel for the next technical stop
- An update of the engineering specifications of the low level control system, taking into account the latest upgrades, is in progress
- We propose a detailed specification and software review, carried out with the help of an external company, to dispel any doubt about possible weaknesses in the low level control system.











## PRS watchdog timer setup

When does the autoret

The auto-retraction a power cut in the t

## Which is the probabilit

From January 2010 in UJ14/UJ16 and 9

## Which is the maximum

Depend on the coll case is represented the effect of the gra We can assume as



TCTVA.4L1.B1 autoretraction following the power cut on 28-05-2010

V max = 4 mm/s
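As a rough worked estimate, assuming the watchdog time-out bounds the interval during which a retracting jaw is not surveyed, the largest unobserved jaw displacement is the product of the maximum speed quoted above and the time-out. The symbol $T_{\mathrm{wd}}$ and the 100 ms value are hypothetical and are used only to illustrate the arithmetic:

$$
\Delta x_{\max} = V_{\max}\, T_{\mathrm{wd}}, \qquad
V_{\max} = 4\ \mathrm{mm/s},\quad T_{\mathrm{wd}} = 0.1\ \mathrm{s}
\;\Rightarrow\; \Delta x_{\max} = 0.4\ \mathrm{mm}
$$

A shorter time-out reduces this displacement proportionally, at the price of a higher chance of false dumps.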




