TSU CONS reliability studies - progress meetings 2024

Europe/Zurich
30/3-023 (CERN)

    • 1
      TSU CONS Reliability Meeting - 1 30/3-023

      Overview of the current and future TSU system architectures, LBDS Power distribution, TSU card detection and synchronization, DRBRF, TSU User Permit to BIS

      Speaker: Nicolas Voumard (CERN)

      TSU CONS Reliability Meeting - 1

      Present: M. Blaszkiewicz, L. Felsberger, J. Uythoven, N. Voumard

      Minutes

      N. Voumard presented a comprehensive overview of the current and future TSU systems to explain the general concept and the differences between the two versions. Several further aspects were also discussed; they are briefly outlined below.

      Layout

      • Powering for TSU is provided via UPS units from separate power sources.
      • Voltage surveillance is present on the line between the TSU and the Trigger Delay unit: if, e.g., a cable is disconnected, a synchronous dump is triggered via Slow Control (PLC) and via the TSU. The feature is tested on a regular basis by cutting the UPS.
      • The TDU shown in slide 10 of 2022 Dec BISv2 Status and Actuator Board Presentation (cern.ch) is different from the Trigger Delay unit in the schematics of the presentation shown.
      • There are 48 outputs of the Trigger Fan Out unit, some of which go to the LBDS Generator and some to other destinations.

      TSU Card Detection and Synchronization

      • The logic will be entirely moved to the FPGA.
      • An external watchdog will trigger an asynchronous beam dump in case of loss of configuration in the FPGA. In addition to the asynchronous dump on the faulty card, a synchronous dump is issued by the other (non-faulty) TSU. An asynchronous-only dump can therefore occur only if both cards are faulty at the same time.
      • Communication between TSUs will be performed on the same FPGA. 

      TSU User Permit to BIS

      • To enable the arming procedure, the idea is to force the permit to true at first, so that it can be used to allow arming.

      Actions

      1. Schedule a bi-weekly meeting to keep everyone updated in terms of the study's progress.
      2. Begin the top-down analysis, which at first will focus on architectural aspects of the new system; specifically addressing major changes between the old and the new versions. 
      3. Consider options for formal verification methods to be applied to the FPGA firmware testing.
    • 2
      TSU CONS Reliability Meeting - 2 30/3-023

      Progress in the top-down architectural analysis.

      Speaker: Milosz Robert Blaszkiewicz (CERN)

      TSU CONS Reliability Meeting - 2

      Present: M. Blaszkiewicz, L. Felsberger, N. Voumard

      Minutes

      In the meeting we went through the FMECA of the top-level functionalities, which is meant to identify the main functions of the system, their potential failures, and the effects and criticality of those failures. In selected cases, we also discussed other potential protection layers (such as two redundant TSUs having to fail in the same way, or redundancy via the CIBDS asynchronous dump).

      FMECA Table

      • The table contains the following situations relevant to the TSU:
        • Normal operation
        • TSU in LOCAL mode
        • Dump request
        • Injection
        • Arming
        • Power outage
      • For each of those, we discussed corresponding elements of the table:
        • NV mentioned that there is a possibility of a bi-directional link between the CIBAB and the TSU; this requires further discussion with the MI section.
        • DRT is used to have the timestamp of the dump; used for Post Mortem, IPOC, etc.; not critical.
        • LF observed that diagnostic and non-diagnostic functions are handled by the same FPGA - which can be problematic in case of a failure like a clock failure.
        • In the LOCAL mode, the requests are still transferred to the LBDS. 

      Actions

      1. Review the table from the presentation.
      2. Build a Fault Tree for failure modes which are identified as the most critical and likely to occur. 
      3. (NV) Check what happens when one of the UPS units fails; whether it is registered or noticed somehow.  
      4. Other matters to check:
        • Whether Ring BIS is an input to the Injection BIS
        • What the procedure is to dump the beam when it is possible neither via the TSU nor via the CIBDS.
    • 3
      TSU CONS Reliability Meeting - 3 Online

      Reliability block diagrams for selected top-level functions (synchronous beam dump triggered by BIS, beam dump via CIBDS and TSU triggered by BIS, beam dump not triggered by BIS); potential failure mode of internal dump after a discrepancy detection; next steps.

      Speaker: Milosz Robert Blaszkiewicz (CERN)

      TSU CONS Reliability Meeting - 3

      Present: M. Blaszkiewicz, L. Felsberger, N. Voumard, P. Van Trappen, J. Uythoven

      Minutes

      The main focus of the meeting was the reliability block diagrams of the three beam dumping scenarios.

      Recap - Top-level FMECA table (excerpt)

      This slide was a recap of the table discussed in the last meeting. It highlighted only the most critical failures. In the discussion, Jan confirmed that the "acceptable once in 10 years" criterion should be applied to an asynchronous beam dump, even though such a beam dump should occur no more often than once per year. The reason is that multiple systems may trigger that effect. It was suggested to mark this in the table.
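
      The relation between the two figures can be illustrated with simple rate arithmetic: the rates of independent causes of the same event add up. This is only a sketch; the number of contributing systems is an assumption chosen for illustration and is not stated in the minutes.

```python
# Illustrative arithmetic only: the number of contributing systems is an
# assumption, not a figure from the minutes.

def combined_rate_per_year(per_system_rate_per_year: float, n_systems: int) -> float:
    """Rates of independent causes of the same event add up."""
    return per_system_rate_per_year * n_systems

PER_SYSTEM_TARGET = 1 / 10   # "acceptable once in 10 years", applied per system
N_SYSTEMS = 10               # hypothetical number of systems able to cause an asynchronous dump

total = combined_rate_per_year(PER_SYSTEM_TARGET, N_SYSTEMS)
print(f"Combined rate: {total:.2f} asynchronous dumps per year")
```

      With ten such systems, a per-system target of once in 10 years would be consistent with an overall rate of about once per year.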

      Nicolas noted that no such event has been triggered by the TSU in the last 10 years.

      Pieter asked about 3.1.1.1 "no damage" after an asynchronous beam dump. Both Jan and Nicolas agreed that it should remain as such.

      Nicolas remarked that in situation 4 (Injection), an entire chain of circumstances would be needed to lead to the critical error. Jan remarked that it is partially covered by scenario 3.

      The same applies to scenario 5.

      Reliability Block Diagram

      The slide explains the RBD methodology.

      Synchronous Beam Dump

      The slide presents an RBD for a synchronous beam dump triggered by the BIS.

      The discussion featured several points:

      • BETS tracks all the generators to make sure that they are within a certain tolerance; it surveys all the generators, Q4 and the septa magnets. It is connected directly to the TSU.

      • All of the BLMs are connected to the BIS; however, the one BLM shown here might not be. It may be connected directly to the TSU only. Jan suggested assuming that the directly connected one does not exist. Lukas remarked that we are more interested in critical systems connected to the TSU but not to the BIS.

      • SCSS (Slow Control PLC) is connected directly to the TSU and may also be connected to the BIS ring. In the SPS, the TSU is connected directly to the Ring BIS (forcing the BIS during the arming procedure), where it is only used for ARMING. There is no direct link to the Ring BIS from the SCSS. In the LBDS we can add it; however, Pieter says that it would be preferable to keep the architecture the same, and that it is not only the SCSS, as there are also other input channels (such as the BETS).

      • The external TRIG LBDS is used for the “inject and dump” procedure, and for the early dump in the SPS (dump issued via the timing). The standard dump in the SPS is issued by the BIS at the end of the flat top.

      Generally, it was concluded that what matters is whether the upstream trigger is connected to the BIS or not, as only then does the CIBDS get triggered. Hence, for asynchronous dumps the two variants are considered.
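
      The two variants can be compared with basic reliability-block-diagram arithmetic. The sketch below assumes independent blocks, and all probabilities are hypothetical placeholders rather than results of the study.

```python
# Minimal reliability-block-diagram arithmetic. All probabilities below are
# hypothetical placeholders, not results of the study.

def parallel(*p_fail: float) -> float:
    """Failure probability of redundant blocks: fails only if all blocks fail."""
    p_all = 1.0
    for p in p_fail:
        p_all *= p
    return p_all

P_TSU = 1e-4     # hypothetical: one TSU misses the trigger
P_CIBDS = 1e-3   # hypothetical: the CIBDS path fails

tsu_pair = parallel(P_TSU, P_TSU)        # two redundant TSUs
via_bis = parallel(tsu_pair, P_CIBDS)    # trigger passes through the BIS: CIBDS acts as well
not_via_bis = tsu_pair                   # trigger bypasses the BIS: TSUs only

print(f"missed dump, trigger via BIS:     {via_bis:.1e}")
print(f"missed dump, trigger not via BIS: {not_via_bis:.1e}")
```

      Under these assumptions the variant without the BIS loses the additional CIBDS layer, which is why the two cases are modeled separately.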

      TSU Interconnect

      Nicolas explained that it is mostly done to check the synchronization between the two cards. If there is a problem, it will issue a trigger immediately.

      Lukas asked about comparing the client status. Nicolas explained that it is not done now, but may be in the future for additional security. Pieter suggested that it should be done only if it’s necessary and could be one of the results of the analysis here. Nicolas agreed, adding that it is only an additional client. Lukas concluded that we will compare the options.

      New top-level failure mode

      Nicolas explained the idea of an internal fault: if one of the TSUs detects an internal fault, it sends only an asynchronous BDT, while the other one sends both a synchronous BDT and an asynchronous BDT. Normally, if only one TSU has an internal fault, there is still a synchronous beam dump trigger. If both TSUs have an internal fault, there is a chance of an asynchronous trigger.
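
      The behaviour described above can be written down as a small truth-table function. This is one reading of the minutes, not the firmware logic; the names are hypothetical and the relative timing of the synchronous and asynchronous triggers is not modeled.

```python
from typing import NamedTuple

class Triggers(NamedTuple):
    synchronous: bool   # at least one synchronous BDT is issued
    asynchronous: bool  # at least one asynchronous BDT is issued

def triggers_on_internal_fault(fault_a: bool, fault_b: bool) -> Triggers:
    """Triggers issued by the TSU pair, following the description in the minutes.

    A TSU with an internal fault sends only an asynchronous BDT; its healthy
    partner sends both a synchronous and an asynchronous BDT.
    """
    if not (fault_a or fault_b):
        return Triggers(False, False)     # no internal fault, nothing issued
    sync = not (fault_a and fault_b)      # a healthy partner still exists
    return Triggers(sync, True)

def asynchronous_only(fault_a: bool, fault_b: bool) -> bool:
    """True when only asynchronous triggers remain, i.e. both TSUs are faulty."""
    t = triggers_on_internal_fault(fault_a, fault_b)
    return t.asynchronous and not t.synchronous

for a in (False, True):
    for b in (False, True):
        print(a, b, triggers_on_internal_fault(a, b))
```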

      Next steps

      Jan remarked that the TSU has its own IPOC, the TSU IPOC, which is the most important one as it looks at the redundancy of the TSU. Nicolas added that it is the Triggering Synchronization IPOC, which covers all signals of the TSU output. Jan highlighted the importance of checking with Nicolas Magnian whether it covers the entire redundancy.

      Pieter added that XPOC checks all kinds of sources. The check here is done by IPOC. However, he would like to follow up on exact actions which may be triggered from that check. He also stated that the LBDS IPOC samples the signal. When something goes bad, it will be seen there.

      Lukas suggested starting the modeling by assuming that everything is checked after every dump request.

      Regarding a possibly closer investigation of the LBDS power distribution, Nicolas said that the LBDS Power distribution reliability study has already been done. Jan confirmed that it has been looked at and that we should look at it as “if we lose power, then we dump”. Nicolas added that if we lose power for the TSU crate, then we have voltage surveillance via Slow Control and we will act. To be followed up. Lukas concluded that we focus on the beam dump function. Then we look into power distribution - with slightly lower priority. 

      Actions

      1. Nicolas V. will discuss with NM what kinds of checks we do in the LHC and SPS.
      2. Prepare initial simulations with crude assumptions and study the results. 
      3. Prepare model of LBDS power distribution
    • 4
      TSU CONS Reliability Meeting - 4 30/3-023

      Results of the top-level simulations

      Speaker: Milosz Robert Blaszkiewicz (CERN)

      TSU CONS Reliability Meeting - 4

      Present: M. Blaszkiewicz, L. Felsberger, N. Voumard 

      Minutes

      This meeting focused on the reliability simulations of a top-level component representation of the system. They were carried out to establish a reliability target for the TSU board in the context of the other systems.

      SPS

      The SPS has been largely overlooked so far, as we tend to focus on the LHC. There are significant differences in how the system is deployed there.

      • The dumps occur every 3 - 15 seconds.
      • It is unclear how often a check can be performed there; there is IPOC, might not be XPOC.
        • "Rough XPOC" could be implemented; to be seen, also through the results of this study.
      • A more comprehensive check is carried out every 2-3 months, worst case - 6 months.
      • Criticality is lower though, as repairs are less resource-intensive.
      • CRC configuration of the FPGA.
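
      To give a feeling for the numbers above, the following sketch counts how many dumps are executed between two comprehensive checks. It assumes a 30-day month and continuous cycling, neither of which is stated in the minutes.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumption: 30-day month, continuous cycling

def dumps_between_checks(check_interval_months: float, dump_period_s: float) -> int:
    """Number of dumps executed between two comprehensive checks."""
    return int(check_interval_months * SECONDS_PER_MONTH // dump_period_s)

fewest = dumps_between_checks(2, 15)   # shortest check interval, slowest cycle
most = dumps_between_checks(6, 3)      # worst-case check interval, fastest cycle
print(f"{fewest:,} to {most:,} dumps between comprehensive checks")
```

      Under these assumptions, hundreds of thousands to millions of dumps rely on the per-dump checks alone, which is the motivation for considering a "rough XPOC".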

      TSU Interconnection

      The interconnection is already present: it checks whether the BRF signals of both TSUs are synchronized. Whenever one TSU is experiencing problems, it issues only an asynchronous dump, while the other one triggers both synchronous and asynchronous dumps.

      Other connections in the system

      The new TSU will get a watchdog, e.g., for when the clock stops or there is a power supply problem. 

      Connections between TSU and BIS, BETS, BLM and all others except for the External Trigger (used for a procedure of injecting and dumping directly - in the SPS, it's done via "Early Dump") as well. 

      The connection between the TSU and the TDB is fail-safe (6V; on a dump it goes to 12V; if the plug is removed, the TDB/TDU triggers itself). The connections to the TFO are pulsed, active high (IPOC checks that both are always sent; the same applies to TFO -> generators). On the LBDS (not the SBDS), there is feedback from the TFO to the TSU which ensures that each TFO received a trigger from both TSUs (after every pulse, there is a check in XPOC).
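
      The feedback check described above amounts to verifying, for every TFO, that a pulse from both TSUs was seen. A minimal sketch follows; the labels and the data layout are hypothetical and do not describe the actual XPOC implementation.

```python
from typing import Dict, List, Set

TSUS: Set[str] = {"TSU-A", "TSU-B"}   # hypothetical labels for the two TSUs

def tfos_missing_trigger(received: Dict[str, Set[str]]) -> List[str]:
    """Return the TFOs that did not report a trigger from both TSUs.

    `received` maps a TFO name to the set of TSUs whose pulse it saw.
    """
    return sorted(tfo for tfo, seen in received.items() if not TSUS <= seen)

feedback = {
    "TFO-1": {"TSU-A", "TSU-B"},
    "TFO-2": {"TSU-A"},              # redundancy lost: pulse from TSU-B missing
}
print(tfos_missing_trigger(feedback))
```

      A non-empty result would indicate a loss of redundancy even though the dump itself was executed correctly.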

      Actions

      1. Next meetings will be online, via Zoom - the recurring invitation needs to be updated.
      2. Whenever the schematics of the design are ready, we can start the bottom-up FMECA analysis.
    • 5
      TSU CONS Reliability Meeting - 5 Online

      Component-level FMECA; failure rate prediction - statistics and summary; next steps: failure mode apportionment and end-effects assignment.

      Speaker: Milosz Robert Blaszkiewicz (CERN)

      TSU CONS Reliability Meeting - 5

      Present: M. Blaszkiewicz, L. Felsberger, N. Voumard, P. Van Trappen

      Minutes

      The purpose of the meeting was to review progress and discuss findings of the reliability study after the failure rate prediction and failure mode apportionment steps and before the end-effects assignment.

      Key Discussion Points

      • Detailed methodology and assumptions for failure rate prediction
      • Statistical analysis and findings for TSU and TSU RTM boards
      • Importance of accurate failure mode apportionment and end-effects assignment
      • Strategies to improve reliability and reduce failure rates through ongoing analysis

      Study Workflow

      Steps followed in BISv2 reliability analyses

      1. Estimation of failure likelihoods for components and subsystems
      2. Identification of failure modes with assigned probabilities
      3. Establishing failure end-effect probabilities for risk matrix comparison
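
      The three steps can be sketched as a small calculation: each component failure rate is split over its failure modes, and the resulting shares are summed per end-effect. The components, rates and shares below are invented for illustration and are not taken from the FMECA file.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (component, failure rate in FIT, [(failure mode, share of the rate, end-effect)])
# All entries are invented for illustration; they are not taken from the FMECA file.
Row = Tuple[str, float, List[Tuple[str, float, str]]]

ROWS: List[Row] = [
    ("C1", 2.0, [("short", 0.5, "downtime"), ("open", 0.5, "no effect")]),
    ("R1", 1.0, [("open", 0.6, "no diagnostics"), ("drift", 0.4, "no effect")]),
]

def end_effect_rates(rows: List[Row]) -> Dict[str, float]:
    """Split each component rate over its failure modes and sum per end-effect."""
    totals: Dict[str, float] = defaultdict(float)
    for _component, fit, modes in rows:
        shares = sum(share for _mode, share, _effect in modes)
        if abs(shares - 1.0) > 1e-9:
            raise ValueError("failure mode shares must sum to 1")
        for _mode, share, effect in modes:
            totals[effect] += fit * share
    return dict(totals)

print(end_effect_rates(ROWS))
```

      The per-end-effect totals are the quantities that would then be compared with a risk matrix.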

      Failure Rate Prediction

      A recap of the assumptions and parameters used in the first step of the study, followed by the resulting statistics and estimates.

      The TSU board: 2618 FIT (excluding rotary switches); the TSU RTM board: 299 FIT. For 30 capacitors the script reads no applied voltage, so a default value is used for them instead (to be followed up). The global assumptions were confirmed, with modifications: it was suggested to set the duty cycle to 3-4, the operating temperature to 50 for the PSU and FPGA, and the humidity to low (as is the case in the LHC).
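
      For orientation, the FIT values can be converted into more familiar quantities. The sketch below uses the standard definition of 1 FIT as one failure per 1e9 device-hours and assumes a constant failure rate; the one-year horizon is an arbitrary choice for illustration.

```python
import math

HOURS_PER_YEAR = 8760.0

def fit_to_mtbf_hours(fit: float) -> float:
    """1 FIT is one failure per 1e9 device-hours."""
    return 1e9 / fit

def p_fail(fit: float, hours: float) -> float:
    """Failure probability over `hours`, assuming a constant failure rate."""
    return -math.expm1(-fit * 1e-9 * hours)

TSU_FIT = 2618.0       # TSU board, excluding rotary switches (from the minutes)
TSU_RTM_FIT = 299.0    # TSU RTM board (from the minutes)

for name, fit in (("TSU", TSU_FIT), ("TSU RTM", TSU_RTM_FIT)):
    years = fit_to_mtbf_hours(fit) / HOURS_PER_YEAR
    print(f"{name}: MTBF ~ {years:.1f} years, "
          f"P(fail in 1 year) ~ {p_fail(fit, HOURS_PER_YEAR):.3f}")
```

      Under these assumptions the TSU board figure corresponds to an MTBF of roughly 44 years per board, counting all failure modes regardless of their end-effect.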

      The high failure rate indicated by the 217Plus standard for rotary switches is somewhat confirmed by experience (although not in operation); this subject can therefore be explored further if the end-effects show that their failures are sufficiently critical. The switches are used to set the delay between the BRF and the Delayed BRF, and the abort gap keeper. The TSU in the SBDS reads them at power-on and keeps the values during operation.

      Conclusions and Recommendations

      The FMECA end-effects assignment should consider effects on the level of two TSU boards, excluding the CIBDS.

      The next steps are largely to establish the criticality of rotary switch failures (as both theory and practice show potential problems). The immediate next step is the end-effects assignment, for which we will send an FMECA table to NV as soon as possible, extended with columns specifying the pins connected to the capacitors.

      Actions

      1. Expand the FMECA tables to contain information on capacitors' pins.
      2. Share the FMECA table with NV.
      3. Provide support for the end-effects assignment step.
    • 6
      TSU CONS Reliability Meeting - 6

      FMECA Tables catch up, design review summary, timeline of the project

      Speaker: Milosz Robert Blaszkiewicz (CERN)

      TSU CONS Reliability Meeting - 6

      Present: M. Blaszkiewicz, L. Felsberger, J. Uythoven, N. Voumard, P. Van Trappen

      Minutes

      FMECA

      The FMECA tables have been started but are still very much ongoing, as it is a lengthy process. They will be ready in one or two weeks. Short circuits are filled in first, then the other modes.

      NV: Rotary switches - cross-check between the two TSUs. They are read only at start-up, not continuously. They can be compared to a database and validated before arming the TSU. During operation, even if they change, they are not relevant for reliability.
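
      The cross-check and the comparison with a database could look roughly as follows. This is a sketch of the idea only; the switch names, values and data layout are hypothetical.

```python
from typing import Dict, List

def switch_mismatches(
    tsu_a: Dict[str, int], tsu_b: Dict[str, int], reference: Dict[str, int]
) -> List[str]:
    """List switches whose start-up readings disagree with each other or with
    the reference settings; arming would be allowed only if the list is empty."""
    problems = []
    for name, expected in reference.items():
        a, b = tsu_a.get(name), tsu_b.get(name)
        if a != b:
            problems.append(f"{name}: TSUs disagree ({a} vs {b})")
        elif a != expected:
            problems.append(f"{name}: read {a}, expected {expected}")
    return problems

# Hypothetical switch names and values, for illustration only.
reference = {"brf_delay": 3, "agk_delay": 7}
print(switch_mismatches({"brf_delay": 3, "agk_delay": 7},
                        {"brf_delay": 3, "agk_delay": 5}, reference))
```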

      PVT: the study results are not necessary before the Design Office's routing. It will take 3-4 weeks, but it doesn’t need to be finalised. It would be good to have the analysis before the holidays.

      Design review

      Not many changes overall - some in the BIS interface reported by the MPE.

      Project timeline

      Prototype by the end of the year. Changes can still be made afterwards. Validation next year. The study should therefore be concluded at the beginning of September. The files are being submitted to the Design Office already now.

      Other

      Establishing end-effects with a simulation model is possible for smaller subsets of components. If there are parts where it is necessary, we will proceed.

      Actions

      1. Re-run the failure rate prediction for the latest project changes from Gitlab. (Changes to be pushed to the repository later today).
      2. Less urgent: look into functional specifications as well as the power distribution analysis.
    • 7
      TSU CONS Reliability Meeting - 7 Online

      Summary of the end-effects FMECA assignment step; assumptions for the hybrid MC model.

      Speaker: Milosz Robert Blaszkiewicz (CERN)

      TSU CONS Reliability Meeting - 7

      Present: M. Blaszkiewicz, L. Felsberger, N. Voumard, P. Van Trappen

      Minutes

      FMECA

      The meeting focused on discussing the FMECA table after the detailed end-effects assignment step performed by Nicolas. The slides present the total failure rate numbers as well as highlights for each individual category.

      Additional clarifications on the meaning of end-effects assigned by Nicolas:

      • no-effect failures may sometimes have an impact on other systems,
      • downtime - the TSU needs to be replaced and cannot be rearmed,
      • no diagnostics - almost the same as above,
      • loss of injection permit - almost the same, but only during injection,
      • unpredictable:
        • for the FPGA, the worst case is an asynchronous dump,
        • D2, IC18, IC41 should not lead to an asynchronous dump.

      During the FMECA, a connection to the FPGA was added to trigger a synchronous dump from the other TSU.

      It was reiterated that the rotary switches are read only at start-up.

      Additionally, the following changes were proposed:

      • IC31, IC39, IC29, IC43 - may also trigger an asynchronous dump; marked as asynchronous dump and no effect.
      • Analyse the following possibility: fuse short --> short on VME --> asynchronous dump.
      • Review F3, F4, F5.

      Actions

      1. A smaller meeting in the following week to conclude the assumptions for the hybrid MC model. (Milosz)
      2. Review of the FMECA file and upload of the updated version (additional asynchronous dumps, fuses, unpredictable end-effect). (Nicolas)
    • 8
      TSU CONS Reliability Meeting - 8 865/1-B03

      Confirming the model assumptions

      TSU CONS Reliability Meeting - 8

      Present: M. Blaszkiewicz, L. Felsberger, N. Voumard

      Minutes

      The meeting's goal was to interpret the FMECA table in a way which provides estimations to use in the hybrid MC model.

      Actions

      1. Update of the files with the changes made during the meeting (Nicolas).
      2. Preparing and starting the simulations (Milosz).
    • 9
      TSU CONS Reliability Meeting - 9 Online

      TSU Hybrid MC model of missing a triggering: description and results

      Speaker: Milosz Robert Blaszkiewicz (CERN)

      TSU CONS Reliability Meeting - 9

      Present: M. Blaszkiewicz, L. Felsberger, P. Van Trappen, N. Voumard

      Minutes

      The objective of the meeting was to discuss the reliability model created for the case of the TSU missing a trigger. The presentation covered the transition from the model used in the simulations for establishing the reliability requirements to one allowing analytical calculations, the failure rates taken from the FMECA file for individual parts of the TSU design and, finally, the simulation results.

      During the discussion of the simplified model, it was mentioned that the crossing between TSUs and TFOs will exist, but a connection between TFOs will not.

      The failure rates for various systems communicating with the TSU are summed together for simplicity. It was also highlighted that the pessimistic assumptions extend to the lack of asynchronous triggering from BIS via the CIBDS card.
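
      The structure of the study's hybrid MC model is not described in these minutes. As a generic illustration of checking a sampled estimate against an analytic one, the sketch below estimates the probability that two redundant, independent channels both fail within a mission. The rate and mission length are deliberately large, hypothetical values so that plain sampling converges quickly.

```python
import math
import random

def mc_both_fail(rate_per_hour: float, mission_hours: float,
                 n_trials: int, seed: int = 1) -> float:
    """Estimate P(both redundant channels fail within the mission) by sampling
    independent exponential failure times for each channel."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        t_a = rng.expovariate(rate_per_hour)
        t_b = rng.expovariate(rate_per_hour)
        if max(t_a, t_b) < mission_hours:
            hits += 1
    return hits / n_trials

def analytic_both_fail(rate_per_hour: float, mission_hours: float) -> float:
    """Closed form for the same model: (1 - exp(-lambda * T)) ** 2."""
    return (-math.expm1(-rate_per_hour * mission_hours)) ** 2

RATE, MISSION = 1e-3, 500.0   # hypothetical values, not from the study
estimate = mc_both_fail(RATE, MISSION, n_trials=100_000)
exact = analytic_both_fail(RATE, MISSION)
print(f"Monte Carlo: {estimate:.4f}, analytic: {exact:.4f}")
```

      For realistic failure rates in the FIT range, such an event is far too rare for plain sampling, which is one reason to rely on analytic expressions where they are available.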

      The fuses used in the Dump Request Triggers page caused temporary confusion. In the end, fuse F3 is used for SBDT path A, F4 for SBDT path B and F5 for the ABDT. The high failure rate of the fuse was explained by the lack of a corresponding estimate in the 217Plus standard.

      Actions

      1. Follow up on the VPSOK signal stuck at TRUE, as the answer regarding its criticality is not obvious. (NV)
      2. Continue with the creation of another hybrid MC model for probability of experiencing an asynchronous dump (MB & LF). 
    • 10
      TSU CONS Reliability Meeting - 10 Online

      Asynchronous dump probability estimations; summary of end-effects likelihoods based on the FMECA results.

      Speaker: Milosz Robert Blaszkiewicz (CERN)

      TSU CONS Reliability Meeting - 10

      Present: M. Blaszkiewicz, L. Felsberger, N. Voumard

      Minutes

      The meeting focused on the results of the probability estimations for asynchronous dump triggered by the TSU. 

      Before starting, the participants revisited the issue of the VPSOK signal stuck at TRUE. It was observed that only the 3.3V line could be problematic. Nonetheless, this also does not lead to specific end-effects, as there is a connection to the other TSU. On top of that, a missing 3.3V line would cause many other elements to fail, eventually making the failure easily identifiable.

      Asynchronous dump triggered by TSU

      There are two potential causes identified in the FMECA:

      1. Failure of both SBDT paths, leaving an asynchronous dump as the only possibility.
      2. Failure causing an asynchronous dump triggered via synchronous lines.

      A third possibility, an asynchronous dump triggered via the asynchronous path without triggering the synchronous dump, was ruled out, as no relevant failure modes were found in the system. The model for estimating the likelihood of the 1st option and the data found for the 2nd option were discussed further.

      The high failure rate of the fuses (20 FIT each) received some attention; the high number is due to the lack of more concrete estimates from sources newer than another study completed some time ago at CERN.
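
      To put the fuse figure in perspective, the sketch below computes the probability that the fuses of both SBDT paths (F3 and F4) fail within one mission. It assumes independent failures at a constant rate, no detection or repair during the mission, and mission lengths expressed in hours; the unit is an assumption, and the fuses are only one contribution to the first cause listed above.

```python
import math

FUSE_FIT = 20.0   # per fuse, as quoted in the minutes

def p_fail(fit: float, hours: float) -> float:
    """Failure probability over `hours` for a constant rate given in FIT."""
    return -math.expm1(-fit * 1e-9 * hours)

def p_both_sbdt_fuses_fail(hours: float) -> float:
    """F3 (SBDT path A) and F4 (SBDT path B) both fail, assumed independent."""
    return p_fail(FUSE_FIT, hours) ** 2

for hours in (12.0, 7200.0):   # mission lengths from the summary table; hours are assumed
    print(f"T = {hours:g} h: single fuse {p_fail(FUSE_FIT, hours):.2e}, "
          f"both {p_both_sbdt_fuses_fail(hours):.2e}")
```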

      Summary of results

      The table summarizing the probability of occurrence for each end-effect details the estimates for mission lengths of 12, 7200 or undefined, compared with the initial reliability requirement set for each.

      For availability-related end-effects, NV mentioned that other systems should also be included to obtain a full picture.

      Actions

      1. Proceed to prepare the report for the study.
      2. Find out with which target groups the results of the study should be discussed (NV and LF).