BIS2 and SMPv2 reliability studies - progress meetings 2024

Europe/Zurich
30/3-023 (CERN)

    • 1
      BISv2 progress meeting - 1 CIBF 30/3-023

      Results of the CIBF FMECA; statistics of the project, failure rate apportionment to global end-effects; next steps.

      Speaker: Milosz Robert Blaszkiewicz (CERN)

      BISv2 Reliability Study Progress Meeting - 1

      Present: M. Blaszkiewicz, L. Felsberger, C. Martin, I. Romera Ramirez

      Minutes

      The presentation highlighted the main outcomes of the quantitative analysis of the FMECA table.

      Reliability targets

      The BISv2 study is reaching a mature stage in which the reliability targets based on assumptions become less relevant and the focus moves towards the global reliability model. The model needs failure rate estimates "plugged in" from the studies of all individual boards, which is why it could not be used earlier.

      Nonetheless, the presentation features a quick recap of the targets calculated for the CIBU board - which can be considered equivalent in many ways. 

      Preliminary results

      These are based on the first iteration. They clearly show that the most dangerous end effect, i.e., blind failures, does not have a high failure rate. The remaining categories are also within the range of values established for other boards.

      Blind failures

      The following section of the presentation deals with the blind failures in more detail. Discussion of their locations in both CIBU and CIBF led to many interesting takeaways:

      • FITS number for the inverter: https://assets.nexperia.com/documents/quality-document/74LVT14D_Nexperia_Product_Reliability.pdf
      • RL1C stuck in position 5: this position is used for testing whether the board is stuck in one state only. The failure cannot occur unless a number of things go wrong.
      • Optocoupler stuck at low signal: this occurs because there is another inverter that is not shown in the picture. Whether it is a common problem is another question.
      • Inverters' role:
        • Isolation (preventing fault propagation),
        • Replenishing the power of the physical signal and reducing the power that is pulled from the output of the upstream devices (so that they are less stressed).

      Statistics

      The remaining part of the slides was devoted to statistics showing where each type of end effect most often originates, and other similar measures. They can be useful for diagnosing problems with the study or the assumptions made.

      Actions

      1. Whenever presenting FITS, provide an additional, more "palpable" metric so that the number is more transparent to people without a reliability background.
      2. The next step in the study is to analyse the SFP, then CIBDS and then CIBFX.
      3. When ready, use the (pessimistic) outcomes of the FMECA in the global model to predict overall reliability of BISv2.
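
      As an illustration of action 1, a FIT value can be restated as a mean time to failure or as a yearly failure probability. The sketch below is a minimal example and not the method used in the study: it assumes a constant failure rate (exponential model) and continuous operation of 8760 hours per year. The function names and the 100 FIT input are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760.0  # assumption: continuous operation, non-leap year


def fit_to_rate_per_hour(fit: float) -> float:
    """Convert FIT to failures per hour (1 FIT = 1 failure per 1e9 hours)."""
    return fit * 1e-9


def mttf_years(fit: float) -> float:
    """Mean time to failure in years, assuming a constant failure rate."""
    return 1.0 / (fit_to_rate_per_hour(fit) * HOURS_PER_YEAR)


def failure_probability(fit: float, years: float = 1.0) -> float:
    """Probability of at least one failure within `years` (exponential model)."""
    exposure_hours = HOURS_PER_YEAR * years
    return 1.0 - math.exp(-fit_to_rate_per_hour(fit) * exposure_hours)


def expected_failures(fit: float, units: int, years: float = 1.0) -> float:
    """Expected number of failures in a fleet of identical units."""
    return fit_to_rate_per_hour(fit) * HOURS_PER_YEAR * years * units


if __name__ == "__main__":
    example_fit = 100.0  # hypothetical value, not taken from the FMECA
    print(f"MTTF: {mttf_years(example_fit):.0f} years")
    print(f"P(failure in 1 year): {failure_probability(example_fit):.2e}")
    print(f"Expected failures/year in 50 units: {expected_failures(example_fit, 50):.3f}")
```

      For a fleet of boards, the expected number of failures per year is often the easiest of these figures to communicate.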
    • 2
      BISv2 progress meeting - 2 CIBDS 30/3-023

      Recap of the progress made so far in the study of the CIBDS board: top-level functions, functional block diagram, failure rate prediction & FMECA.

      Speaker: Milosz Robert Blaszkiewicz (CERN)

      BISv2 Reliability Study Progress Meeting - CIBDS

      Present: M. Blaszkiewicz, A. Collinet, L. Felsberger, I. Romera Ramirez

      Minutes

      The meeting was a recap of the early advancements completed some time ago for the CIBDS board and next steps. 

      Top-level functions FMECA

      We agreed that the main functions of the CIBDS are: to generate an asynchronous dump request via the TDU, to pass on the BIS beam permit loop signal, to interpret the Link Mode and to trigger asynchronous (via TDU) and synchronous (via beam permit and TSU) beam dump requests.

      Functional Block Diagram

      The main signal flows are as shown in the slide. There are some modifications: there are links between the Monitor FPGA and the Critical FPGAs, and there is a software link mode signal which is received in the Monitor FPGA and passed on to the Critical FPGAs.

      FMECA

      A further discussion of the top-level functions, possible failures, their effects and criticality was one of the highlights of the meeting. The list of identified failures is as follows: 

      • Blind failure (single path).
        • Blind sync - not generating asynchronous dump request.
        • Blind async - not generating a synchronous dump request.
        • Blind - not generating asynchronous nor synchronous dump requests.
        • Link mode.
          • Blind sent ext - not sending external dump request.
          • Blind generate ext - not generating dump request upon receiving an external dump request when link mode enabled.
      • False dump.
        • False dump async - spuriously generate ONLY asynchronous dump request.
        • False dump sync - spuriously generate ONLY synchronous dump request.
        • False dump - spuriously generate asynchronous and synchronous dump requests.
      • Maintenance
      • No effect. 

      Firmware verification methods

      It was agreed that firmware verification would be an interesting development for the FPGA-based projects. AC noted that equivalence checking is of particular interest to him, while formal verification methods can prove to be time-consuming and effort-intensive. We will follow up on those points further.

      Questions

      • Q1. Transistors' applied voltage is generally 3.3V.
      • Q2 and Q4. Assignment of the 217Plus categories to the actual components requires further scrutiny. 
      • Q3. Generally 50% derating is acceptable, but we will look further at capacitors with highest stress value.
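
      The screening mentioned under Q3 can be sketched as follows. This is a minimal illustration that assumes "50% derating" means an operating-to-rated voltage ratio of at most 0.5; the designators and rated voltages are hypothetical and are not taken from the CIBDS bill of materials.

```python
def stress_ratio(operating_voltage: float, rated_voltage: float) -> float:
    """Voltage stress ratio: operating voltage divided by rated voltage."""
    if rated_voltage <= 0:
        raise ValueError("rated voltage must be positive")
    return operating_voltage / rated_voltage


def exceeding_limit(parts: dict, limit: float = 0.5) -> list:
    """Return (designator, ratio) pairs above `limit`, highest stress first."""
    flagged = []
    for name, (v_operating, v_rated) in parts.items():
        ratio = stress_ratio(v_operating, v_rated)
        if ratio > limit:
            flagged.append((name, ratio))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical capacitors: designator -> (operating voltage, rated voltage) in volts.
EXAMPLE_CAPACITORS = {
    "C1": (3.3, 10.0),
    "C2": (3.3, 6.3),
    "C3": (12.0, 16.0),
}

if __name__ == "__main__":
    for name, ratio in exceeding_limit(EXAMPLE_CAPACITORS):
        print(f"{name}: stress ratio {ratio:.2f}")
```

      Sorting by ratio puts the capacitors with the highest stress value first, which are the ones singled out for a closer look.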

      Actions

      1. Update and share FMECA tables for CIBDS and CIBF SFP.
      2. Take a look into past CIBDS studies of BISv1.
    • 3
      BISv2 progress meeting - 3 CIBFX 30/2-005

      Summary of the CIBFX FMECA study.

      Speaker: Milosz Robert Blaszkiewicz (CERN)

      BISv2 Reliability Study Progress Meeting - CIBFX

      Present: M. Blaszkiewicz, L. Felsberger, C. Martin, I. Romera Ramirez

      Minutes

      The meeting on May 28, 2024, focused on the reliability progress of the BISv2 CIBFX project. The FMECA was presented, highlighting the failure rate predictions and reliability assessments for various components on the CIBFX motherboard. The failure rates were estimated in the first step, with capacitors and resistors being major contributors, each adding significant Failure in Time (FIT) rates to the overall reliability concerns. The "Metal Film" 217Plus category was accepted as the category for resistors.

      In the discussion, it was mentioned that manual testing during HW commissioning is sometimes not repeated for long periods, such as ten years or so. PIC, BLMs and WIC are tested at least once a year, but this is not the case for others. There is also a yearly test of the CIBU to BIS connection (up to the input connection of the users, except for the Burndy).

      Specific failure rates were discussed for various components, such as the tantalum capacitors, which showed a maximum failure rate of 4 FIT for ceramic capacitors under a 0.33 stress factor. Resistors contributed a total of 94 FIT, with each resistor having a small deviation in failure contribution. Externals, such as the IGLOO2 FPGA, were evaluated manually, with the IGLOO2 FPGA showing a failure rate of 8 FIT.

      In the discussion of these blind failures, it was noted that an IGLOO2 input stuck high would also be a blind failure in CIBF; IRR only considered output stuck high. It was also mentioned that IC failures in CIBF are due to the current loop, which is why they do not exist in CIBFX.

      The end-effect analysis provided insights into the impact of these failures on the overall system, breaking down the failure rates per page of the CIBFX motherboard. Blind failures and false dumps were key concerns, with blind failures resulting from RS-422/RS-485 receivers and the IGLOO2 FPGA contributing to a combined failure rate of 8.7 FIT. False dumps on the motherboard accounted for 137 FIT, emphasizing the need for improved reliability in these areas.
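
      The apportionment of failure rates to end-effects can be sketched as below. This is a minimal illustration under the usual FMECA convention that each failure mode contributes the component failure rate multiplied by its mode ratio; the table layout, field names and all numbers are hypothetical and do not reproduce the CIBFX FMECA.

```python
from collections import defaultdict


def apportion_to_end_effects(rows: list) -> dict:
    """Sum failure rate x mode ratio (in FIT) for each end effect."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["end_effect"]] += row["fit"] * row["mode_ratio"]
    return dict(totals)


# Hypothetical FMECA rows: one row per (component, failure mode).
EXAMPLE_ROWS = [
    {"component": "R1", "mode": "open", "fit": 10.0, "mode_ratio": 0.6,
     "end_effect": "false dump"},
    {"component": "R1", "mode": "short", "fit": 10.0, "mode_ratio": 0.4,
     "end_effect": "maintenance"},
    {"component": "U1", "mode": "stuck high", "fit": 8.0, "mode_ratio": 0.25,
     "end_effect": "blind failure"},
    {"component": "U1", "mode": "stuck low", "fit": 8.0, "mode_ratio": 0.75,
     "end_effect": "false dump"},
]

if __name__ == "__main__":
    for effect, fit in sorted(apportion_to_end_effects(EXAMPLE_ROWS).items()):
        print(f"{effect}: {fit:.1f} FIT")
```

      Because the mode ratios of each component sum to one, the totals per end effect add up to the total predicted failure rate of the listed components.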

      Maintenance issues were also addressed, highlighting failure modes such as shorts and opens in transistors and relays, with specific FIT rates assigned to each failure mode. The meeting concluded with a discussion on remaining questions and next steps, focusing on potential testing procedures and further reliability improvements.

      There was a short discussion of the PDSU concentrator at the end. The RS485 input is less reliable - providing a voltage in blind is less critical. Testing for the PDSU will require individual triggering nowadays. The post-mortem resolution on the BIS side is 10 µs, which is too coarse to see the effects of redundancy. A CIBU could be used in parallel. It was also stressed that we should keep in mind the patch panel in between.

      Next steps

      1. CIBFX Report
      2. FMECA of CIBAB board.
      3. IRR will share a global overview/inventory when ready for the BIS2 Global Model.
    • 4
      BISv2 progress meeting - 4 CIBDS 30/3-023

      Summary of the CIBDS FMECA

      Speaker: Milosz Robert Blaszkiewicz (CERN)

      BISv2 Reliability Study Progress Meeting - CIBDS

      Present: M. Blaszkiewicz, A. Collinet, L. Felsberger  

      Minutes

      The presentation featured statistics of the CIBDS FMECA.

      A brownout condition was mentioned by LF; AC replied that it is not relevant as the BPL can always be stopped.

      Project statistics

      The numbers are similar to other BIS projects. There was a short discussion of capacitors - their operating voltage is almost always 3.3V, aside from some 12V around TDU triggering.

      Blind failures

      The first blind failure, the one connected to the FPGA, requires another failure to take place before it becomes a blind failure in both the asynchronous and synchronous paths.

      The second blind failure, that of an oscillator, remains in this category as the counters and similar elements may no longer work. Predicting the likely end-effects is difficult, but the worst-case scenario is a blind failure. Mitigation:

      • All FPGAs in the BIS have a phase-locked loop (this also considers natural change, see the datasheet).
      • Auto-correction based on PPS (for the monitoring FPGA).

      An additional matter was the FPGA junction temperature, which is an important factor in predicting its failure rate. AC confirmed that 55 °C is a reasonable assumption. In the lab, only the ambient temperature is monitored, and it is 30 °C for this card.

      Blind sync indication in the FMECA table means NOT requesting an asynchronous dump (and the opposite for blind async).

      Link Mode

      • Hardware link goes via backplane connection to CISU.
      • Logic of generating STOP is done in the receiving CIBDS.

      False dumps

      "False dump async local" is essentially no effect: it signifies sending a spurious dump request in the local mode (i.e., during testing).

      Next steps

      1. Check IC19 component.
      2. Double check what TSU does with feedback (whether it is in the remote or local mode).
      3. CIBDS report to be created.
    • 5
      BISv2 progress meeting - 5 CIBAB 30/3-023

      Kick-off meeting of the CIBAB reliability study. Initial results of the failure rate prediction step, general statistics, definitions of end-effects, top contributors to the failure rate.

      Speaker: Milosz Robert Blaszkiewicz (CERN)

      BISv2 Reliability Study Progress Meeting - CIBAB

      Present: M. Blaszkiewicz, A. Collinet, L. Felsberger, T. Podzorny, I. Romera Ramirez  

      Minutes

      The presentation showed general statistics from the failure rate prediction step, proposed end-effects definitions, top contributors to the overall failure rate and remaining questions to conclude. 

      In the discussion, the following points were made:

      • The remaining 49 capacitors are probably filters where the nets cannot easily be read out.
      • There are 3 beam permit output channels because of different actuators:
        • E.g. in SPS Injection there is a need for more than one channel.
        • In the LHC, it will be only one channel used.
        • In the FMECA table, the effects will be filled for all, but in the global model only a single connection should be considered.
      • End-effects definitions to be updated:
        • maintenance - no immediate action
        • false dump - can also be availability problem
      • Adjustments to components' failure rate prediction:
        • Transistors
          • FITS are probably from power transistors rather than the ones used in the CIBAB context.
        • TVS
          • Various parameters specified in the datasheets (clamping/standing)
          • Parameters in the 217Plus don't make sense in this case (operating and rated voltage)
          • These ones are used for RS485; however, operating voltage could be higher as sometimes it may be an aggregation of other voltages.
      • Designators naming conventions
        • The problems come from the default way Altium replicates channels, mostly on the LED pages.
        • FPGA - there is only one component, but it is referred to using different letters at the end.
          • This results in designator names similar to those used for different channels.
          • There is no straightforward way to establish which of the two is the case for a specific designator name.
      • Channel configurations
        • If disabled by accident, they send the fail-safe value.

      Next steps

      1. Share the FMECA table with Antoine.
      2. Follow up on the failure rate prediction of transistors and TVS diodes in the 217Plus standard.
    • 6
      BISv2 progress meeting - 6 CIBAB 30/3-023

      CIBAB FMECA summary

      Speaker: Milosz Robert Blaszkiewicz (CERN)

      BISv2 Reliability Study Progress Meeting - CIBAB

      Present: M. Blaszkiewicz, A. Collinet, L. Felsberger, T. Podzorny, I. Romera Ramirez  

      Minutes

      The meeting started with a comparison of the various failure rates assigned to a specific transistor selected as a case study. Details are shown on slide 8.

      The next block was a discussion of all comments made in the FMECA by AC:

      • J3, J4 and J5 handle differential connections, therefore a blind failure would happen there only if there is a double failure: in both the positive and negative signals. When only one of those is wrong, there will be a beam dump.
      • When Post Mortem fails, the "false dump" is indicated there as a loss of availability, even though there might be no immediate dump.
      • An Artix 7 open will lead to a false dump immediately.
      • CIBU cards will be tested only once per year, as it is only one client that triggers the dump every time. It means that some of them will be tested only during Technical Stops.

      Next steps

      1. The first step is to create a report, as for all other BISv2 boards in the study.
      2. The quantitative results are to be used in the global model of the BISv2.
      3. For research interest: compare prediction standards for buffers (logic transistors) with transistors (suspected power applications).
    • 7
      BISv2 progress meeting - 7 Global Reliability Model Online

      Discussion of the global reliability model draft: functional block diagram, fault tree, board numbers, etc.

      Speaker: Milosz Robert Blaszkiewicz (CERN)

      BISv2 Reliability Study Progress Meeting - Global Model 

      Present: M. Blaszkiewicz, L. Felsberger, I. Romera Ramirez  

      Minutes

      The functional block diagram of the entire model was generally accepted as is, with the following adjustments: 

      1. CIBFi-Rx cards to be added in BIC 1R, BIC 1L, BIC 5L and 5R.
      2. CIBFi-Tx cards to be added to BIC 2R and BIC 8R.
      3. All CIBFX cards are supposed to be unmaskable.  

      The fault tree discussion spanned several items. The following adjustments are to be added to the model (attached slides are already in the corrected version):

      1. The number of CIBUS and CIBUD cards should assume the potential maximal number, which is 17 (20 channels, 3 of them for optical ones).
      2. CIBF and CIBFX cards can take the three channels, which means there can be 3 of those cards.
      3. Only one path of one of the CIBDS OR CIBAB cards has to work for the system to not fail. 
      4. CIBG card should also be represented by its two independent paths (assigned 0 FITS).
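
      The gate logic behind these adjustments can be illustrated with a minimal sketch. It assumes independent events, models redundant paths as an AND gate (the function is lost only if every path fails) and, as an additional assumption made here for illustration, models an event that any single card can cause as an OR gate. All probabilities are hypothetical.

```python
import math


def and_gate(probabilities) -> float:
    """Top event requires ALL input events (independent events assumed)."""
    return math.prod(probabilities)


def or_gate(probabilities) -> float:
    """Top event occurs if ANY input event occurs (independent events assumed)."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)


# Hypothetical probability that one path is blind when a demand arrives.
P_PATH_BLIND = 1e-4

# Two redundant paths: the system is blind only if both paths are blind.
p_redundant_blind = and_gate([P_PATH_BLIND, P_PATH_BLIND])

# Hypothetical probability of a spurious trigger per card, for 17 cards.
P_CARD_FALSE = 1e-3
p_any_false = or_gate([P_CARD_FALSE] * 17)

if __name__ == "__main__":
    print(f"Both redundant paths blind: {p_redundant_blind:.1e}")
    print(f"At least one of 17 cards triggers spuriously: {p_any_false:.2e}")
```

      A path assigned 0 FITS, as for the CIBG, contributes a probability of zero to whichever gate it feeds.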

      Next steps

      1. Adjustments to the model as discussed.
    • 8
      BISv2 Global Model Discussion 30/3-023

      Presentation of the tentative results of the models of BISv2 reliability.

      Speaker: Milosz Robert Blaszkiewicz (CERN)

      Present: M. Blaszkiewicz, L. Felsberger, I. Romera, D. Westermann

      The meeting centered around the results of the iteration from last time. There were three models presented:

      • pessimistic analytical model; assuming yearly checks only and continuous demand,
      • exact analytical model; assuming yearly checks only and continuous demand,
      • subset-based model; assuming a 12 h inspection interval for some components and 1 year for the rest.
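
      The minutes do not define these models, so the sketch below is not a reconstruction of them. It only illustrates, with textbook formulas, why the inspection interval matters: for a failure that stays hidden until the next check, the mean unavailability over an interval T at constant rate λ is 1 - (1 - exp(-λT)) / (λT), and λT/2 is a simple upper bound on it. The failure rate used is hypothetical.

```python
import math

HOURS_PER_YEAR = 8760.0  # assumption: one check per non-leap year


def mean_unavailability_exact(rate_per_hour: float, interval_hours: float) -> float:
    """Time-averaged probability of a hidden failure between two checks."""
    x = rate_per_hour * interval_hours
    if x == 0.0:
        return 0.0
    # 1 - (1 - exp(-x)) / x, written with expm1 for small-x accuracy.
    return 1.0 + math.expm1(-x) / x


def mean_unavailability_bound(rate_per_hour: float, interval_hours: float) -> float:
    """First-order upper bound: rate * interval / 2."""
    return rate_per_hour * interval_hours / 2.0


if __name__ == "__main__":
    rate = 100e-9  # hypothetical: 100 FIT expressed in failures per hour
    for label, interval in (("1 year", HOURS_PER_YEAR), ("12 h", 12.0)):
        exact = mean_unavailability_exact(rate, interval)
        bound = mean_unavailability_bound(rate, interval)
        print(f"{label}: exact {exact:.3e}, bound {bound:.3e}")
```

      In this simplified picture, shortening the interval from one year to 12 h reduces the first-order unavailability by the ratio of the intervals, a factor of 730.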

       

      The second part of the meeting switched to hardware failures and the possibilities for gathering information about them in the existing BIS infrastructure in the LHC and SPS.

      Actions:

      • We will share with Ivan the blind failure modes of CIBM, to be checked and confirmed.
        • Particularly, whether they are indeed critical.
      • We will send Ivan the list of failures to be checked and ask for the system inventory.
        • May include SPS.