

### Validation and Reliability Tests for the new VFC-HD

V. Schramm

on behalf of

S. Eitelbuß, M. Gonzalez Berges, J. O. Robinson, M. Saccani, M. A. Stachon, W. Viganò, C. Zamantzas and all others involved

BI Seminar 11.01.2019

Volker Schramm BE-BI-BL volker.schramm@cern.ch

# Agenda

- Introduction
- Methodology & Objective
- Test Setup
- Test Results
  - Validation Tests
  - Burn-In (ongoing)
  - Run-In (ongoing)
- Summary & Outlook



## Introduction

The VFC-HD board is a new multi-purpose VME carrier board developed in BI

Main challenges:

- Complexity
  - 1200 mounted components, 12 layers
  - VME, Arria V FPGA, DDR3, DCDC, FMC, SFPs, ...
- Serves a variety of different users & applications
- Dependability and performance requirements
- Production of 1200 units



➔ Validation and reliability testing are put in place to ensure good, dependable performance upon installation.



# **Methodology & Objective**

<u>Dependability methodology</u> applied from planning and design through production, testing and installation to system operation

Life-cycle stages:

- Design
  - Specifications
  - Predecessor analysis
  - MTTF prediction
  - Failure Modes, Effects & Criticality Analysis
  - Design Review
- Production (& Tests)
- Reception Tests
- Installation/Commissioning
- Operation

- Addressing dependability during the full life cycle
- Validation (functional, environmental, quality)
- Reliability assessment & improvement
- Risk reduction & risk assessment
- Mitigation of (late) design changes
- Methodological approach (re-useable; universal; evolving)



# Test Setup



#### Hardware

Visual Inspection:

- Digital microscope *Tagarno FHD UNO* (92x magnification)

Validation & Burn-In:

- Climatic chamber *Binder MKF240* for temperature and humidity
- Other procurements & modifications: crate, FMC/SFP loopback modules, custom boards, cables, fibres, ...

Run-In:

- 8 VME crates in 2 racks + PSUs, CPUs, cabling, ...





#### Firmware

Test firmware constantly checks all board functionalities in parallel & saves the results:

- Sends/writes & reads back
- Checks various values: f<sub>clocks</sub>, temperatures, voltages, ...
- Loopback for SFPs, FMCs & LEMOs
- FPGA self checking
- → Results register is read by the CPU card





#### Software

CPU card communicates the test results to the database:

- Cyclic results reading depending on selected test
- Data logged on NXCALS with Spark for queries
- Asset management integrated to register and follow up devices
- → Expert application provides a GUI
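The read-out chain described above (firmware fills a results register, the CPU card reads it cyclically) can be sketched as follows. This is only an illustration: the slides do not give the register map, so the bit layout in `RESULT_BITS`, the function names and the simulated readings are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical bit layout: the real VFC-HD results register map is not
# given in the slides. One bit per check, 1 = error detected.
RESULT_BITS: Dict[int, str] = {
    0: "write_readback",
    1: "clock_frequencies",
    2: "temperatures",
    3: "voltages",
    4: "sfp_loopback",
    5: "fmc_loopback",
    6: "lemo_loopback",
    7: "fpga_self_check",
}


def decode_results(register: int) -> List[str]:
    """Return the names of all checks whose error bit is set."""
    return [name for bit, name in RESULT_BITS.items() if (register >> bit) & 1]


def poll(read_register: Callable[[], int], cycles: int) -> List[List[str]]:
    """Read the results register `cycles` times and decode each reading."""
    return [decode_results(read_register()) for _ in range(cycles)]


if __name__ == "__main__":
    # Simulated readings stand in for the real VME access.
    readings = iter([0b00000000, 0b00010000, 0b00110000])
    for failed in poll(lambda: next(readings), cycles=3):
        print(failed or "all checks passed")
```

In a real setup `read_register` would wrap the VME access of the CPU card, and each decoded reading would be logged rather than printed.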









#### **Expert application**



[Screenshot: expert application GUI with the Burn-In graph displayed, Nov 14, 2018]

Many thanks to: M. Gonzalez Berges, J. O. Robinson & M. A. Stachon



# Validation Tests



#### 1) Temperature/Humidity Cycling:

Functional & environmental validation + production approval

- Sample size: 2x pre-series
- Tests between 5-55°C & 10-90% RH
- Various SFP transceiver (*FT3A05D*) + fibre configurations
- Cycle tests in pyramid-shape or at constant conditions
- In total 11 tests performed





#### 1) <u>Temperature/Humidity Cycling:</u>

- #1,2: Validation at temperature limits (low RH)
  - No errors with SFP loopbacks
- #3: Validation at 80% RH (25°C)
  - No errors with SFP+ transceivers with fibre loopback (LC)
- #4,5: 2h temperature cycling at 50/70% RH
  - FMC loopback & SFP+ errors at ≥ 50°C

| #  | T-range [°C] | RH-range [%] | Board A | Board B | Error A | Error B |
|----|--------------|--------------|---------|---------|---------|---------|
| 1  | 5            | <35          | LB      | LB      | N       | N       |
| 2  | 55           | <50          | LB      | LB      | N       | N       |
| 3  | 25           | 80           | LC      | LB      | N       | N       |
| 4  | 5-55         | 50           | LC      | LB      | Y       | N       |
| 5  | 5-55         | 70           | LB      | LC      | N       | Y       |
| 6  | 35-55        | 50           | LB      | LC      | N       | Y       |
| 7  | 35-55        | 50           | Mix     | LB      | Y       | N       |
| 8  | 35-55        | 50           | Mix     | LC      | Y       | Y       |
| 9  | 30           | 50-90        | Mix     | LC      | N       | N       |
| 10 | 30           | 80           | Mix     | LC      | N       | N       |
| 11 | 40           | 80           | Mix     | LC      | Y       | Y       |

LB: SFP loopback module; LC: SFP+ transceiver with fibre loopback (LC); Mix: mixed configuration; Error: Y = errors observed, N = no errors.
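The test matrix above can be tabulated by board configuration and peak temperature. The sketch below is not part of the original analysis: the 40°C split is an assumption borrowed from the validation summary, and the Y/N flags do not distinguish SFP+ errors from the FMC errors reported for tests #4 and #5.

```python
from collections import defaultdict

# (test no., T_min [°C], T_max [°C], config A, config B, error A, error B),
# transcribed from the test matrix above.
TESTS = [
    (1, 5, 5, "LB", "LB", False, False),
    (2, 55, 55, "LB", "LB", False, False),
    (3, 25, 25, "LC", "LB", False, False),
    (4, 5, 55, "LC", "LB", True, False),
    (5, 5, 55, "LB", "LC", False, True),
    (6, 35, 55, "LB", "LC", False, True),
    (7, 35, 55, "Mix", "LB", True, False),
    (8, 35, 55, "Mix", "LC", True, True),
    (9, 30, 30, "Mix", "LC", False, False),
    (10, 30, 30, "Mix", "LC", False, False),
    (11, 40, 40, "Mix", "LC", True, True),
]


def error_summary(tests, threshold_c=40):
    """Count board runs and runs with errors per (configuration, zone).

    A run is 'hot' if the test reached at least `threshold_c`, else 'cool'.
    """
    summary = defaultdict(lambda: {"runs": 0, "errors": 0})
    for _, _, t_max, cfg_a, cfg_b, err_a, err_b in tests:
        zone = "hot" if t_max >= threshold_c else "cool"
        for cfg, err in ((cfg_a, err_a), (cfg_b, err_b)):
            entry = summary[(cfg, zone)]
            entry["runs"] += 1
            entry["errors"] += int(err)
    return dict(summary)


if __name__ == "__main__":
    for (cfg, zone), counts in sorted(error_summary(TESTS).items()):
        print(f"{cfg:>3} {zone:>4}: {counts['errors']}/{counts['runs']} runs with errors")
```

With these inputs, boards fitted with loopback modules (LB) show no errors in any run, while every LC or Mix run that reached 40°C or more shows errors.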





#### 1) <u>Temperature/Humidity Cycling:</u>

- #6-8: Further temperature cycling at smaller steps with different fibre setups
  - Several SFP+ errors ≥ 40°C
  - Wide spread of SFP+ quality
  - Error rate ~ temperature

[Figure: values per SFP slot (slot 1 to slot 6); the numeric labels are not recoverable from the extraction]






| #               | T<sub>min</sub> [°C] |
|-----------------|----------------------|
| 4               | 57/54                |
| 5               | 53/45                |
| 6               | 46/45                |
| 7               | 44/40                |
| 8<sup>Mix</sup> | 42/40                |
| 8<sup>LC</sup>  | 46/45                |


#### 1) <u>Temperature/Humidity Cycling:</u>

- #9-11: Cycling and constant humidity tests
  - Limitations when cycling with crate inside the chamber
  - Not possible to control humidity reduction
  - Validation for 80% RH at 30°C
  - Errors at 80% RH and 40°C

#### Validation summary:

- No VFC-HD board failures
- FMC errors most likely a combined effect of temperature & firmware → solved with new FW
- → Further investigation needed for SFP+ errors above 40°C (possible RX saturation due to short fibre length)




#### 2) High Temperature:

Trigger possible early failure mechanisms + production feedback

- 4 tested boards (SFP + FMC loopbacks):

| Date       | Production batch                | T<sub>max\_chamber</sub> [°C] |
|------------|---------------------------------|-------------------------------|
| 26.07.2018 | Pre-series                      | 70                            |
| 26.07.2018 | Pre-series                      | 70                            |
| 31.08.2018 | Version 2                       | 100                           |
| 27.11.2018 | 1<sup>st</sup> production batch | 115                           |

- Error-free communication until 95°C
- VME communication loss at 115°C → Test terminated
- Crate current consumption increased by **5A**
- T<sub>max\_FPGA\_surface</sub> = 161°C
- T<sub>max\_PCBsensor</sub> = 128°C (in error)







#### 2) High Temperature:



[Figure: high-temperature test, 27.11.2018]

- Linear extrapolation: $T_{j\_max\_FPGA} = 196^{\circ}C$
- Datasheet: $T_{j\_max} = 125^{\circ}C$, $T_{j\_max\_recommended} = 85^{\circ}C$

Tests summary:

- After cooling down, full recovery of all functions
- Reliable board operation up to 95°C
- No hardware failures up to $115^{\circ}C$ → Successful

**Robust design & production**




100°C: Rising tension with rising temperature in front of the climatic chamber



## **Screening & Reliability**

#### Reliability engineering basics:

Reliability:

$$R(t) = 1 - F(t) = \frac{f(t)}{\lambda(t)}$$

Failure rate:

$$\lambda(t) = \frac{\text{No. of failures}}{\sum t_{Device}} = \frac{f(t)}{R(t)}$$

Constant failure rate for electronics during useful life (exponential distribution):

$$R(t)=e^{-\lambda t}$$

Mean Time To Failure:

$$MTTF = \frac{1}{\lambda}$$

Mean Time To Failure at a given confidence level (chi-square distribution):

$$MTTF_{CL} = \frac{2 \sum t_{Device}}{\chi^2 [CL; 2(r+1)]}$$

- F(t): Failure probability [%]
- f(t): Failure density function
- T: Characteristic lifetime
- CL: Confidence level [%]
- r: Number of failures
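The confidence-level formula above can be evaluated with the Python standard library alone, because the chi-square distribution has a closed-form CDF when its number of degrees of freedom is even, and 2(r+1) always is. The following is a sketch, not the tool used for the talk; the function names are illustrative, and the example inputs (150 000 device hours, zero failures, 95% confidence) are the censored run-in figures quoted elsewhere in this deck.

```python
import math


def chi2_cdf_even(x: float, dof: int) -> float:
    """Chi-square CDF for an even number of degrees of freedom (closed form)."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i          # (x/2)^i / i!
        total += term
    return 1.0 - math.exp(-half) * total


def chi2_quantile_even(p: float, dof: int) -> float:
    """Invert chi2_cdf_even by bisection (0 < p < 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, dof) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def mttf_lower_bound(device_hours: float, failures: int, cl: float) -> float:
    """MTTF_CL = 2 * sum(t_Device) / chi2[CL; 2(r+1)]."""
    return 2.0 * device_hours / chi2_quantile_even(cl, 2 * (failures + 1))


if __name__ == "__main__":
    print(f"MTTF_95% = {mttf_lower_bound(150_000, 0, 0.95):.0f} h")
```

For zero failures the quantile reduces to -2 ln(1 - CL), so the bound is simply the accumulated device hours divided by ln(20) at 95% confidence, roughly 50k hours for 150k device hours.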



# **Screening & Reliability**

#### The bathtub curve:



Weibull parameter b:

$$R(t) = e^{-\left(\frac{t}{T}\right)^{b}}$$
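The link between the Weibull shape parameter b and the three regions of the bathtub curve can be illustrated numerically. The sketch assumes the two-parameter Weibull form R(t) = exp(-(t/T)^b), for which λ(t) = f(t)/R(t) = (b/T)(t/T)^(b-1); the values of T and b below are illustrative only.

```python
import math


def weibull_reliability(t: float, T: float, b: float) -> float:
    """R(t) = exp(-(t/T)^b) with characteristic lifetime T and shape b."""
    return math.exp(-((t / T) ** b))


def weibull_failure_rate(t: float, T: float, b: float) -> float:
    """lambda(t) = f(t) / R(t) = (b/T) * (t/T)^(b-1), valid for t > 0."""
    return (b / T) * (t / T) ** (b - 1)


if __name__ == "__main__":
    T = 1000.0  # illustrative characteristic lifetime in hours
    for b, region in ((0.5, "early failures"), (1.0, "useful life"), (3.0, "wear-out")):
        early = weibull_failure_rate(100.0, T, b)
        late = weibull_failure_rate(2000.0, T, b)
        print(f"b = {b}: lambda(100 h) = {early:.2e}, lambda(2000 h) = {late:.2e} ({region})")
```

A shape b < 1 gives a falling failure rate (early failures), b = 1 the constant rate 1/T of the exponential distribution (useful life), and b > 1 a rising rate (wear-out).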



# **Screening & Reliability**

- Burn-In and Run-In to screen for early life failures and to assess the (minimum) reliability
- Strategy comprises two steps:
  - 1) Temperature cycling to raise stress level
  - 2) Constant conditions to accumulate time & gain confidence







### **Burn-In**

Temperature cycling to raise stress:

- No. of cycles: **30**
- Range: 5 to 50 °C
- Rate of change: **5°C/min**
- Dwell: **12 min**
- Rel. humidity: ~20% (short peaks of 90% RH, no condensation)
- Total time: 22 h
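As a plausibility check, the cycling parameters above can be turned into a profile duration. The sketch assumes that each cycle consists of one ramp up, one ramp down and one dwell at each temperature extreme; the slides do not spell out the cycle shape, so this structure is an assumption.

```python
def cycle_minutes(t_low_c: float, t_high_c: float,
                  rate_c_per_min: float, dwell_min: float) -> float:
    """Duration of one cycle: ramp up + dwell + ramp down + dwell."""
    ramp = (t_high_c - t_low_c) / rate_c_per_min
    return 2 * ramp + 2 * dwell_min


def profile_hours(cycles: int, t_low_c: float, t_high_c: float,
                  rate_c_per_min: float, dwell_min: float) -> float:
    """Total duration of the cycling profile in hours."""
    return cycles * cycle_minutes(t_low_c, t_high_c, rate_c_per_min, dwell_min) / 60.0


if __name__ == "__main__":
    hours = profile_hours(cycles=30, t_low_c=5, t_high_c=50,
                          rate_c_per_min=5, dwell_min=12)
    print(f"{hours:.1f} h")
```

Under this assumption one cycle takes 42 min and 30 cycles take 21 h, close to the stated total of 22 h; the remaining hour may cover start-up and the return to ambient conditions.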






## **Burn-In**

Tests started on 01.11.2018:

- 25% of boards already tested; 304 out of 1200
- 12 boards failed the test
  - 4 different circuits (components) affected
  - 3 different causes of failure
  - Main cause of failure: <u>Cleanliness of production</u>
  - Main circuit affected: Pushbutton with R1, C1



Results are preliminary! Full analysis to follow.

| Batch | No. of tested boards | Sorted out during inspection* | Sorted out after Burn-In* | Failure/Error |
|-------|----------------------|-------------------------------|---------------------------|---------------|
| 1     | 132                  | 1 (+1 high-T test)            | 10                        | 11            |
| 2     | 137                  | 0                             | 1                         | 1             |
| 3     | 35                   | 1                             | 0                         | 0             |
| TOTAL | 304                  | 2 (+1)                        | 11                        | 12            |

\* pending

| Failures/Errors    | 12 |
|--------------------|----|
| Cleanliness        | 9  |
| Manufacturing      | 1  |
| Other              | 1  |
| To be investigated | 1  |

| Affected circuit     | No. of failures |
|----------------------|-----------------|
| Pushbutton           | 6               |
| Pushbutton + R1 + C1 | 4               |
| IC2                  | 1               |
| IC35                 |                 |
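The per-batch figures in the table above can be expressed as failure fractions, which makes the difference between the batches easier to compare. This is a simple illustration using the preliminary "Failure/Error" column; it is not an analysis from the talk.

```python
# batch -> (tested boards, failures/errors), transcribed from the table above.
BATCHES = {1: (132, 11), 2: (137, 1), 3: (35, 0)}


def failure_fraction(tested: int, failed: int) -> float:
    """Fraction of tested boards that failed."""
    return failed / tested


def totals(batches):
    """Sum tested boards and failures over all batches."""
    tested = sum(t for t, _ in batches.values())
    failed = sum(f for _, f in batches.values())
    return tested, failed


if __name__ == "__main__":
    for batch, (tested, failed) in BATCHES.items():
        print(f"Batch {batch}: {failed}/{tested} = {failure_fraction(tested, failed):.1%}")
    tested, failed = totals(BATCHES)
    print(f"Total:   {failed}/{tested} = {failure_fraction(tested, failed):.1%}")
```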

- Most failures in batch 1 → major efforts made to improve the situation:
  - Intensive collaboration with the manufacturer
  - Production process changes (handling, cleaning, inspection, documentation ...)



## **Run-In**

- Installation of 8 crates in building 6546
- Tests without FMC, SFP & LEMO loopbacks
- Room temperature ≥ 30°C
- First 8 crates finished an extended Run-In before Christmas
  - No failures observed
- Status as of 09.01.2019:

|                | Lot 1      | Lot 2      | Total   |
|----------------|------------|------------|---------|
| Start date     | 22.11.2018 | 22.12.2018 |         |
| End date       | 20.12.2018 | 09.01.2019 |         |
| Number of days | 29         | 19         | 48      |
| Number of DUTs | 128        | 128        | 256     |
| Failures       | 0          | ?          | 0       |
| Device hrs     | 89 000     | 58 000     | 147 000 |
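The device hours in the table follow from the number of days, 24 h per day and the number of DUTs. The sketch below reproduces them, counting start and end dates inclusively (an assumption that matches the 29 and 19 days in the table) and taking Lot 2's end date as 09.01.2019, the status date.

```python
from datetime import date


def device_hours(start: date, end: date, duts: int):
    """Return (days, device hours), counting both start and end date."""
    days = (end - start).days + 1
    return days, days * 24 * duts


LOTS = {
    "Lot 1": (date(2018, 11, 22), date(2018, 12, 20), 128),
    "Lot 2": (date(2018, 12, 22), date(2019, 1, 9), 128),
}

if __name__ == "__main__":
    total = 0
    for name, (start, end, duts) in LOTS.items():
        days, hours = device_hours(start, end, duts)
        total += hours
        print(f"{name}: {days} days, {hours} device hours")
    print(f"Total: {total} device hours")
```

The exact products are 89 088 and 58 368 device hours, which the table rounds to 89 000 and 58 000.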








MTTF assessment from Burn-In and Run-In (all failures): status of 09.01.2019



- No stress acceleration considered; $T_{Run-In} \ge 30^{\circ}C$
- The high number of failures from one single cause results in a poor MTTF (no actual HW failures!)
- Failure cause was solved (mitigated) and did not appear during run-in
  - → <u>Censoring</u> of the failures

**Preliminary**!

MTTF assessment from Burn-In and Run-In (censored): status of 09.01.2019

- 150k accumulated device hours; $T_{Run-In} \ge 30^{\circ}C$



**Preliminary**!

Possible MTTF prediction for future tests (censored):



- Assuming testing until 31.05.2019



# Summary & Outlook

- Successful board validation at different humidity/temperature levels and at high temperature
  - SFP+ transceiver as bottleneck → further investigation
- Screening (incl. inspection) turned out to be necessary
  - Until now: good, <u>reliable design</u> → no PCB/component failures, but:
  - <u>Production issues</u>, mainly cleanliness, revealed during burn-in
    - Efforts made to improve the situation for 2<sup>nd</sup> & 3<sup>rd</sup> batch
    - Important to check production & to collaborate with manufacturer
  - Run-In very important to accumulate failure-free (low FR) time and to obtain MTTF<sub>min</sub>
    - 09.01.2019: MTTF<sub>95%</sub> = 50k hrs (HW failures; no stress acceleration)
    - Final analysis and conclusions to be done after May
- Dependable design methodology proved to be of use
  - Can be applied flexibly, but proper <u>documentation</u> is necessary



#### Thank you very much for your attention



And many thanks to all colleagues in BI-BP, BI-SW and BI-BL who contributed to this large project!

Density function of the load/strength



Load/Load capacity



Power of 10 rule: Fault costs during the product life cycle



[B. Bertsche, Reliability in Automotive and Mechanical Engineering]









Electronic Systems' Dependability Methodology applied to the VFC-HD

#### Performance comparison



William Viganò (william.vigano@cern.ch)

[W. Viganò, ATS – KT innovation day]



[Figure: voltages vs. temperature during the high-temperature test, 11/27/2018, 14:08 to 16:36; nine data series (Series1 to Series9)]









RASWG meeting, 02.03.2017



- Additional microscope inspection at CERN before burn-in
- Many findings for 1<sup>st</sup> batch (~50%), <u>9</u> boards failed:
  - Cleanliness
  - Handling (scratches, packaging, ...)
  - Tolerances
  - Documentation





After consultation with Norcott, situation improved for 2<sup>nd</sup> batch, but still not perfect:





Current Status of the VFC-HD Burn-In and Run-In Tests