The lonely perpetrator in the DCDC FEAST2 case
How an individual transistor threatened the operation of the CMS pixel detector

F. Faccio, CERN, EP-ESE

S. Michelis, G. Ripamonti, D. Porret
CERN, EP-ESE

F. Szoncso
CERN, HSE-DI

D. Valuch
CERN, BE-RF-FB

N. Bacchetta, S. Cuadrado Calzada, A. Karneyeu, T. Prousalidi, M. Hansen, A. Kaminski, S. Lusin + many others
CMS team
These slides have been prepared solely for the purpose of supporting an oral presentation and are not suitable to convey a clear message outside this context.

A full report as well as an executive report are available at the following link:

http://project-dcdc.web.cern.ch/project-DCDC/public/Reports.html
Investigation = the act or process of examining a crime, problem, statement, etc. carefully, especially to discover the truth
The truth about the FEAST2 affair

The crime scene
Witnesses
The autopsy
The motive
The crime reconstruction
The perpetrator
The crime scene
FEAST2 in the CMS pixel detector

Bpix DC-DC: radius 240mm, z range 2067-2430mm from the interaction point (i/p).
Fpix DC-DC: radius 140mm, z range 1315-1530mm from the interaction point (i/p).

384 in total

832 in total

Cooling at -20°C
We are the FEAST2 ASIC designers, but this is not our DCDC module. We do not know the system where the module is used. There is a limit to the reach of our investigation.

Aachen module

This fuse prevents the lowering of Vin

Vin 12V

FEAST2 module

Vout 2.5-3.3V

CERN module (FEASTMP)

Module in the CMS pixels

Module everywhere else
Witnesses
Failure of FEAST DCDCs in the CMS pixel detector

No correlation with:
- output voltage
- output current
- position in the detector
- anything other than the beam

DCDCs fail during disable/enable cycles

Increase in luminosity, change in beam structure

Automatic power cycles during fills

1st DCDC lost OCT 5th

Number of inactive ROCs is reset at power cycle

Power cycle

Manual power cycles between fills

Slope is proportional to luminosity (SEU on TBM)

Accumulate permanently lost ROCs, due to broken DCDC converters
### Plan around November 2017

<table>
<tbody>
<tr>
<td><strong>Rest of 2017 Physics Run</strong></td>
<td>Nothing can be done. Accept loss of modules</td>
</tr>
<tr>
<td><strong>YETS 17-18</strong></td>
<td>Request longer Year-End Stop</td>
</tr>
<tr>
<td></td>
<td>Open the detector</td>
</tr>
<tr>
<td></td>
<td>Extract all DCDC modules</td>
</tr>
<tr>
<td></td>
<td>Replace (all?) modules (fuse changed)</td>
</tr>
<tr>
<td><strong>2018 Physics Run</strong></td>
<td>Find a patch ensuring data taking</td>
</tr>
<tr>
<td><strong>LS2</strong></td>
<td>Solve the problem for the long term</td>
</tr>
</tbody>
</table>

“At this pace, game over for CMS around May 2018”
The autopsy
YETS 17-18: Merry Christmas!

Photo memories from the 2017 Christmas Break
FEAST2 modules in the CMS experiment were found to present 2 distinct types of damage

- “Broken” samples failed to provide any output voltage
- “High-current” samples were perfectly functional, but were found to have an excessive current below UVLO
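
The two categories can be written as a simple classification rule. This is a minimal sketch, not the procedure used in the actual tests: the function name, the `Status` enum and the 5 mA limit are hypothetical placeholders chosen only for illustration.

```python
from enum import Enum


class Status(Enum):
    BROKEN = "broken"
    HIGH_CURRENT = "high-current"
    NORMAL = "normal"


def classify_converter(has_output: bool, i_below_uvlo_ma: float,
                       i_limit_ma: float = 5.0) -> Status:
    """Classify one module from two bench measurements.

    has_output      -- True if the module provides its output voltage
    i_below_uvlo_ma -- input current (mA) measured with Vin below the UVLO thresholds
    i_limit_ma      -- hypothetical limit separating normal from excessive current
    """
    if not has_output:
        return Status.BROKEN
    if i_below_uvlo_ma > i_limit_ma:
        return Status.HIGH_CURRENT
    return Status.NORMAL


# A functional module drawing an excessive current below UVLO
print(classify_converter(True, 10.0))
```

The point of the sketch is that a "High-current" sample passes any purely functional check; it is only identified by the current measurement below the UVLO thresholds.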

### Symptoms of damaged converters:

<table>
<thead>
<tr>
<th>$V_{in}$ (V)</th>
<th>$I_{in}$ (mA)</th>
</tr>
</thead>
<tbody>
<tr>
<td>4.2</td>
<td>3</td>
</tr>
<tr>
<td>4.6</td>
<td>1</td>
</tr>
</tbody>
</table>

**UVLO$_{Th1}$:** Regulators on

**UVLO$_{Th2}$:** DCDC enabled

<table>
<thead>
<tr>
<th>Pixel Names</th>
<th>Number of Converters</th>
<th>Tested Broken</th>
<th>Tested working with High Current</th>
<th>Tested working with normal current</th>
<th>BROKEN % with respect to total</th>
<th>HIGH CURRENT % with respect to total</th>
</tr>
</thead>
<tbody>
<tr>
<td>BPIX (+Z, Near)</td>
<td>208</td>
<td>4</td>
<td>48</td>
<td>156</td>
<td>1.9</td>
<td>23.5</td>
</tr>
<tr>
<td>BPIX (-Z, Near)</td>
<td>208</td>
<td>10</td>
<td>48</td>
<td>150</td>
<td>4.8</td>
<td>24.2</td>
</tr>
<tr>
<td>BPIX (+Z, Far)</td>
<td>208</td>
<td>13</td>
<td>70</td>
<td>125</td>
<td>6.3</td>
<td>35.9</td>
</tr>
<tr>
<td>BPIX (-Z, Far)</td>
<td>208</td>
<td>11</td>
<td>70</td>
<td>127</td>
<td>5.3</td>
<td>35.5</td>
</tr>
<tr>
<td>FPIX (+Z, Near)</td>
<td>96</td>
<td>7</td>
<td>41</td>
<td>48</td>
<td>7.3</td>
<td>48.1</td>
</tr>
<tr>
<td>FPIX (-Z, Near)</td>
<td>96</td>
<td>7</td>
<td>34</td>
<td>55</td>
<td>7.3</td>
<td>38.2</td>
</tr>
<tr>
<td>FPIX (+Z, Far)</td>
<td>96</td>
<td>9</td>
<td>23</td>
<td>64</td>
<td>9.4</td>
<td>28.4</td>
</tr>
<tr>
<td>FPIX (-Z, Far)</td>
<td>96</td>
<td>6</td>
<td>22</td>
<td>68</td>
<td>6.3</td>
<td>24.4</td>
</tr>
<tr>
<td>BPIX - not connected to modules</td>
<td>32</td>
<td>2</td>
<td>8</td>
<td>22</td>
<td>6.3</td>
<td>28.7</td>
</tr>
</tbody>
</table>
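
The percentage columns can be recomputed from the counts. The sketch below does this with the counts copied from the table above. It computes the high-current fraction relative to the converters that still work (total minus broken). That denominator is an assumption: it reproduces most of the tabulated values, although a few rows differ by about two percentage points.

```python
# (section, total, broken, high_current, normal) copied from the table above
ROWS = [
    ("BPIX (+Z, Near)", 208, 4, 48, 156),
    ("BPIX (-Z, Near)", 208, 10, 48, 150),
    ("BPIX (+Z, Far)", 208, 13, 70, 125),
    ("BPIX (-Z, Far)", 208, 11, 70, 127),
    ("FPIX (+Z, Near)", 96, 7, 41, 48),
    ("FPIX (-Z, Near)", 96, 7, 34, 55),
    ("FPIX (+Z, Far)", 96, 9, 23, 64),
    ("FPIX (-Z, Far)", 96, 6, 22, 68),
    ("BPIX - not connected to modules", 32, 2, 8, 22),
]


def damage_fractions(total: int, broken: int, high_current: int):
    """Return (broken %, high-current %) for one detector section."""
    working = total - broken  # assumed denominator for the high-current column
    return 100.0 * broken / total, 100.0 * high_current / working


for name, total, broken, high, normal in ROWS:
    # Sanity check: the three test outcomes must add up to the total
    assert broken + high + normal == total
    b, h = damage_fractions(total, broken, high)
    print(f"{name:32s} broken {b:4.1f}%  high-current {h:4.1f}%")
```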
Operation of a switching converter

"high-voltage" LDMOS

\[
\frac{V_{out}}{V_{in}} = \text{Duty cycle}
\]

Graph showing \( I_L \) vs \( t \)
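
As a worked example of the relation above, the sketch below computes the duty cycle for the voltages quoted earlier for the CMS pixel modules (Vin 12V, Vout 2.5-3.3V). It assumes an ideal, lossless converter, and the function name is illustrative.

```python
def ideal_duty_cycle(v_out: float, v_in: float) -> float:
    """Duty cycle of an ideal step-down converter: D = Vout / Vin."""
    if v_in <= 0 or not 0 <= v_out <= v_in:
        raise ValueError("expected v_in > 0 and 0 <= v_out <= v_in")
    return v_out / v_in


for v_out in (2.5, 3.3):
    d = ideal_duty_cycle(v_out, 12.0)
    print(f"Vout = {v_out} V, Vin = 12 V -> D = {d:.3f}")
```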
Typically stuck at 0.9-1.4V

Insufficient BootS-phase

No turn-on of HS transistor
On-chip V33Dr regulator with clamps

Green elements are only used under UVLO Thresholds

 Clamp transistors for soft-start procedure
(transistors rated to 3.3V)
What happens if the clamp transistors are damaged?

Gate current: µA levels can already create problems, but mA levels lead to a definite stuck condition

This also disrupts the correct value of the UVLORegs voltage, hence preventing the UVLO from working correctly
Failure Analysis (FA) with emission microscopy and Optical Beam Induced Resistance Change (OBIRCH) at MASER (NL)

During OBIRCH a laser beam selectively illuminates the metal lines, altering their resistance. The consequent input current change is measured, allowing the mapping of current paths. In broken FEAST2 samples, current flows to the clamp transistors.

Emission images are based on the detection of photons generated from hot carriers (therefore only conducting NMOS transistors are well visible). “Broken” or “High-current” FEAST2 samples showed different current paths.
The motive
Why are the clamp transistor(s) damaged?

- Flawed ASICs?
- Flawed PCBs?
- Radiation in ASIC?
- Radiation in package?
- EM noise?
- Environmental conditions?
- Electrical stress?
- Combination of any of the above??
Environmental conditions?

Test in a 3T magnetic field revealed no problem

FEASTMP and CMS modules inside the M1 facility in the H2 beam line in Prévessin (Building 887)
Flawed PCBs?

A faulty capacitor, or intermittent contact between the capacitor and the PCB, generated a stress that produced somewhat similar damage.

3D X-ray imaging of the module to inspect the quality of the soldering of the capacitor to the PCB.

Waveform of the V33Dr node when the capacitance has intermittent contacts to the PCB.
**Electrical stress?**

Long-term ageing tests on 124 converters did not reveal problems with FEAST2 ASICs.

3 drawers with 8x4 modules

“Crate96” system with 32x3 modules

Electrical stress?

The parasitic inductance along the current path induces over-voltages during the commutations.
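
The size of such an over-voltage follows from V = L·dI/dt. The sketch below evaluates it for hypothetical numbers (10 nH of parasitic inductance, a 2 A current step in 5 ns). None of these values are taken from the FEAST2 measurements; they only show the order of magnitude involved.

```python
def inductive_overvoltage(l_henry: float, delta_i_amp: float, delta_t_s: float) -> float:
    """Magnitude of the voltage across a parasitic inductance: V = L * dI/dt."""
    if delta_t_s <= 0:
        raise ValueError("delta_t_s must be positive")
    return l_henry * delta_i_amp / delta_t_s


# Hypothetical example values, for illustration only
v_peak = inductive_overvoltage(10e-9, 2.0, 5e-9)
print(f"Over-voltage: {v_peak:.1f} V")
```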
EM noise? Environmental conditions?

Injection of EM noise by capacitive coupling to the input/output and signal lines. The converter appears to be very resilient

Led by F. Szoncso and D. Valuch

Capacitive high-frequency coupler used on the input bus line

AC-observation of the effect of a 3kV (!) pulse with 50ns duration on Vin and V33Dr. The peak is several V above the DC (V33Dr reaches 8V)
High-frequency, high-power RF noise injection could produce damage, but with a different signature

Led by F. Szoncso and D. Valuch

High frequency (GHz) and high power transient pulses are injected via an antenna in a special chamber in CERN Prévessin. At very large power, the ASIC can be damaged but the signature is different than in samples failing in CMS. Coupling is through the long enable line.
EM noise? Environmental conditions?

Stress tests with an ESD gun (1.2kV pulses) could produce somewhat similar damage - but the energy injected needs to be really large.

An ESD gun is used to inject a discharge to the different pins of the FEAST2 package. To produce any damage, a visible spark has to be produced.
Radiation in ASIC?

SEE Heavy Ion irradiation on samples pre-exposed to X-rays, Protons and Neutrons did not show any sign of damage

4 modules prepared on a motherboard are placed inside the irradiation chamber where they will be exposed to a Heavy Ion beam. The FEAST2 chips were previously irradiated with X-rays, 230MeV protons or neutrons from a reactor.
Exposure of 32 FEAST2 in the CMS Castor Table was meant to reproduce some of the environmental conditions (proximity to the beam line, EM environment, radiation environment)

32 sample DCDC modules, both FEASTMP and CMS modules, are exposed and constantly monitored in the CMS Castor Table during the 2018 run.
Radiation in ASIC? Environmental conditions? EM noise?

Two irradiation runs at the CERN IRRAD facility were instrumental in understanding the origin of the damage.
Exposure of 32 FEAST2 at -25C at the CERN IRRAD facility

The facility was run in a purposely modified configuration to expose the converters in a mixed field.

*MANY THANKS to the IRRAD TEAM!*

32 samples, both FEASTMP and CMS modules, are exposed and constantly monitored in a cold box (-25C) in the CERN IRRAD facility (May 2018).
A specific bias and control sequence was used during the exposure

The full sequence lasts about 2 hours, with 97% of the time in “monitoring”
The results powerfully revealed some important correlations:

- **Between the damage and the integrated flux**
  - The first damage occurs after 9 days, then several samples per day
  - Only samples closer to the beam are damaged
  - Samples are damaged also after the end of the exposure

- **Between the occurrence of the damage and the disable-enable sequence**
  - Also true for the “High-Current”

---

**Red** = High-Current damage occurred during exposure  
**Blue** = High-Current damage occurred after exposure  
**Black** = failure
With X-ray irradiation using the same enable/disable cycle as in IRRAD, and by monitoring the current below the UVLO thresholds, it was eventually possible to reproduce the same damage!

Now we had a tool to study the mechanism in detail!

*X-ray machine of the EP-ESE group*
The crime reconstruction
Radiation tests on the I3T80 nLDMOS transistor in 2008 revealed the impossibility to use Enclosed Layout Transistors (ELT): TID-induced leakage could not be avoided!

Displacement Damage

TID

Excerpt from the 2008 test report on the LNNDM14 transistor with standard layout: evolution of the leakage current (drain-to-source current for Vgs=0V and Vds=14V) as a function of TID for different test chips, and output characteristics after irradiation.
The leakage current in the nLDMOS transistors, used for the power train, induces an acceptable decrease in efficiency.

Elsewhere in the circuit, the leakage path is “cut” by adding core ELT transistors in series.
The qualification testing strategy of FEAST was based on:
- the irradiation of the device to the maximum radiation levels
- the verification on-line that the device was always functional (except for neutron tests, when the verification was done after irradiation)
- periodic and/or final full characterisation of the main electrical parameters: efficiency, line and load regulation, thresholds (UVLO, enable, OCP, OTP)

✓ X-rays for TID up to 700Mrad
✓ Heavy Ions for SEEs
✓ Pulsed laser test for SEEs
✓ Neutrons from a reactor for displacement damage
✓ 230MeV protons for SEEs (+ TID + DD)
Somewhere else, in another continent, the TBM chip for the CMS pixel system was being designed.

During a late and quick addition of functionality, logic unprotected from SEUs was integrated in the final chip.

A reset command was not implemented.

In a severe radiation environment the correct functionality is frequently corrupted, and a power cycle is needed to re-initialise the chip.
Potentially “leaking” nLDMOS in FEAST

Unprotected logic and no reset in TBM

Failure of FEAST2 in the CMS pixel system during the 2017 run
The problem is localised in the linear regulator (V33Dr) that provides the required current to the drivers of the power transistors (HS and LS).
These are the conditions when FEAST2 is “enabled” or “disabled”
In the presence of a large TID-induced leakage in the nLDMOS, consequences appear ONLY when FEAST2 is disabled (no load for the V33Dr regulator)

Current integration on the 220nF capacitor: V33Dr increases !!
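
The rise of V33Dr can be estimated from the capacitor equation dV/dt = I/C. The sketch below is a rough estimate under stated assumptions: a constant leakage current (the 10 µA value is hypothetical), no other discharge path while the converter is disabled, and a ceiling at Vin. The default capacitance is a single 220nF capacitor.

```python
def v33dr_after(t_s: float, i_leak_a: float, c_farad: float = 220e-9,
                v_start: float = 3.3, v_in: float = 12.0) -> float:
    """Estimate V33Dr after t_s seconds of constant leakage charging the capacitor.

    Linear ramp V = v_start + I*t/C, capped at v_in (assumed ceiling).
    """
    return min(v_start + i_leak_a * t_s / c_farad, v_in)


# Hypothetical 10 uA leakage, 100 ms spent in the disabled state
print(f"V33Dr = {v33dr_after(0.1, 10e-6):.2f} V")
```

Even a leakage of a few µA is enough to push the node well above its nominal value within a fraction of a second, because nothing draws current from the regulator while the converter is disabled.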
The integration of the current on the two 220nF capacitors has eventually been observed experimentally in June 2018 on samples exposed to TID at our X-ray facility.

"disabled"
We observed a “voltage peak” on the V33Dr node when FEAST2 is disabled during X-ray exposures. The voltage peak increases with TID.
The voltage at the V33Dr node goes well beyond the nominal maximum of 3.3V+10%. This voltage stress might end up damaging a device. We have observed 2 damage mechanisms, and Failure Analysis with emission microscopy and OBIRCH has confirmed the current paths.

If a device along this path is damaged => increase in regulator current => FEAST2 continues to operate

If a clamp transistor is damaged => linear regulator stuck => FEAST2 failure
A graphical representation of the narrative

Luminosity

TID

TBM power cycle = disable of FEAST2

nLDMOS Leakage current

October

Time (in 2017)

TBM requires a power cycle
=> FEAST2 is disabled while the leakage is large
=> peak voltage at V33Dr
=> damage
This model for the damage explains why the problem was not observed during the radiation qualification of the FEAST2 ASIC

1. To produce the damage, it is necessary to perform disable/enable cycles at the time when the TID-induced leakage is large
2. To observe the signature of the damage, in the vast majority of the cases (High-current), it is necessary to measure the current consumption below UVLO thresholds
The perpetrator
Patches and long-term solutions
Patch 1 for FEAST2.1 (only for environments with TID > 500krad):

The voltage peak can get close to $V_{in}$, but not higher

Disable the converter at lower $V_{in}$

"disabled"

$V_{in} = 5V$

In case of a power cycle ($V_{in}$ down to 0V), the converter is disabled by the UVLO at about 4.4V

This strategy was successfully used during the full 2018 run of the CMS pixel detector, and its effectiveness was demonstrated in the second IRRAD run and in X-ray testing
Patch 2 for FEAST2.1 (only for environments with TID > 500krad):

Provide a path for the mirrored leakage current

A sufficiently small external resistor will do the job
In our tests, a 3 kOhm resistor is OK

“disabled”

The effectiveness of this strategy was demonstrated in the second IRRAD run and in X-ray testing
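
A rough way to see why a small resistor helps: if the whole leakage current flows through the external resistor, the node settles at V = I·R. The sketch below computes the largest leakage that keeps V33Dr within the nominal maximum of 3.3V+10%, for the two resistor values used in the tests. Treating the resistor as the only path for the leakage is a simplifying assumption, and the function name is illustrative.

```python
V33DR_MAX = 3.3 * 1.10  # nominal maximum of the V33Dr node, in volts


def max_leakage_ma(r_ohm: float, v_max: float = V33DR_MAX) -> float:
    """Largest leakage current (mA) for which I*R stays at or below v_max."""
    if r_ohm <= 0:
        raise ValueError("r_ohm must be positive")
    return 1e3 * v_max / r_ohm


for r_ohm in (3e3, 15e3):
    print(f"R = {r_ohm / 1e3:.0f} kOhm -> leakage up to {max_leakage_ma(r_ohm):.2f} mA")
```

The smaller the resistor, the larger the leakage it can absorb, at the cost of a permanent extra load on the regulator.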
Long-term solutions:

**FEAST2.2**

Used in FEASTMP modules produced in the first half of 2019. During X-ray testing, no voltage peak on V33Dr even at -30°C if the dose rate is below 180 krad/hour.

**FEAST2.3**

Present default version for all modules
Final thoughts
This is a very complex system
- Assembly of different sub-systems
- Unique “prototype”
- Tested in the real environment only
List of expert EMC recommendations after the observation of the system:

- Improve filters for incoming transients on the DC supply lines
- Improve capability of system to route high frequency common mode (including connection of cable screens)
- Improve system equipotentiality (including cable screens, glued screens of detector and metallic parts of the cooling circuits)
- Avoid using twisted pairs for asymmetric signal or power connection
- Avoid separated grounds leading to severe lack of immunity against E-fields and transient H-fields
How about qualification practices?
We are not NASA in the 1970s...
Some reasons to be grateful for our luck

- The designers of the failing component were still around, and at CERN
- The problem happened to a detector that is amongst the easiest to open
- Once understood, patches and fixes were easy to implement
- The problem appeared in 2017 and not in HL-LHC trackers
Personally I'm always ready to learn, although I do not always like being taught

W. Churchill
Experimental confirmation:
Second run at the IRRAD facility
A second run in IRRAD used 64 converters

Samples inside the cold box

Samples at room T on top of the cold box
- Samples at room T and at -25°C
- New samples and samples that survived the first run
- Samples protected with the addition of a 15kOhm resistor
- Samples protected with the addition of a 3kOhm resistor
- Samples biased with a Vin of 8V (versus 12V for all others)
- Samples without disable/enable sequence (turned off by decreasing Vin)
- Half of the samples had the enable input protected by an RC filter

**Sequence used during the first run**

**Sequence avoiding the disable/enable at high Vin**

*Applied ONLY to 8 converters during the second run*
Figure 17: Summary of the results in the cold box for run 2. Dark grey squares represent fresh modules with FEAST2.1 (Aachen or FEASTMP designs); clear grey represents modules with FEAST2.1 already exposed during run 1 (but unharmed); yellow squares represent bPOL12V.V3 modules. If the modules were protected by a resistor on V33Dr, the value of the resistor in Ohm is indicated on the module. The temporal sequence of the observed damages is illustrated by the red numbers, and the values close to each damage indicate the estimated TID to failure. Results from the dosimetry are reported, in the appropriate location, in the blue and green squares.

Inside the box, $T = -25^\circ C$

Figure 18: Summary of the results for samples at room temperature during run 2. The representation is the same as for Fig. 17 above. The dosimeters report the same levels inside and outside the box: because of the geographical arrangement and of the actual readings we believe that the environment was very comparable in the two locations.

As in the previous run, all damage appears during or after a disable sequence. The currents after the event range between about 8 and 13 mA. Also in this case the current increase in the off state is considerably larger than the one in the on state, which is limited to 1-2 mA. All damage characteristics are hence very comparable to those observed in the first run. This result is incompatible with the hypothesis that the damage is due to noise pickup on the enable line, because all modules in the board, regardless of the presence of the RC filter, have been identically damaged.

In order to define a safe area of operation for the FEAST2.1 converters, it is important to summarise all results obtained in the two irradiation runs. This is done in Fig. 19, that reports the best estimate for the TID to failure for all modules. The image has to be taken with caution, since the dose levels are, as already pointed out twice before, given by a combination of passive dosimeter results and extrapolation. Also, the size of each data point is proportional to the

Outside the box, room $T$
## Summary of the observations during the second run

- Samples at room T and at -25°C
- New samples and samples that survived the first run
- Samples protected with the addition of a 15kOhm resistor
- Samples protected with the addition of a 3kOhm resistor
- Samples biased with a Vin of 8V (versus 12V for all others)
- Samples without disable/enable sequence (turned off by decreasing Vin)
- Half of the samples had the enable input protected (RC filter)

- Worse damage when cold
- Damage earlier if pre-exposed
- No damage
- No damage
- No damage
- No damage
- No damage
Summary of the best estimate for the dose to failure for the two IRRAD runs

Damage only occurs above 1Mrad in these high-dose-rate (and low-T) experiments
A revised version of the ASIC (FEAST2.2) to remove the damaged transistors

In February 2018 we modified the design of the ASIC to remove the “weak” low-voltage transistors from the V33Dr node, as well as to add a dedicated ESD protection device to the pad.
- this was meant to eliminate the “broken” type of damage (only “High-current” possible)
- additionally, this could have helped the node to resist a hypothetical aggression (discharge?)