

## Out-of-Sync FECs - SAMPA OoS

- OoS means that the order of the ADC values is changed and the wrong pedestal values are applied. The FEC itself stays fully operational
- There are two ways this can happen:
  - The SAMPAs receive an unexpected reset
  - GBT frames on the uplink to the CRU are lost

- In most cases all 5 SAMPAs of a FEC (2 GBT links) go OoS, so an unwanted reset of all 5 SAMPAs is the most likely scenario
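
To illustrate why a changed order of ADC values leads to wrong pedestals being applied, here is a minimal Python sketch. The channel count, the pedestal values, and the `subtract_pedestals` helper are made up for illustration and do not reflect the actual SAMPA data format or the UL firmware.

```python
# Hypothetical example: 4 channels with distinct pedestals (values are made up).
PEDESTALS = [50, 60, 70, 80]


def subtract_pedestals(adc_stream, pedestals):
    """Subtract pedestals, assuming value i belongs to channel i % n_channels."""
    n = len(pedestals)
    return [adc - pedestals[i % n] for i, adc in enumerate(adc_stream)]


# In-sync stream: two time bins without signal, each channel sits at its pedestal.
in_sync = PEDESTALS * 2
# Out-of-sync stream: same values, but the order is shifted by one position.
out_of_sync = in_sync[1:] + in_sync[:1]

print(subtract_pedestals(in_sync, PEDESTALS))      # all zeros
print(subtract_pedestals(out_of_sync, PEDESTALS))  # non-zero "fake" signal
```

In the shifted case every value is corrected with a neighbouring channel's pedestal, so an empty event no longer comes out as zeros even though the hardware keeps delivering data.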

- The OoS of a FEC seems to be strongly correlated with an increase in the Forward-Error-Correction (FEC) error counter inside GBTx0
- The Forward-Error-Correction code protects the data stream on the GBT downlink. It can correct up to 16 consecutive(!) bit errors in a GBT frame (80 bit).
  If it can no longer correct the errors on the GBT downlink, the SAMPAs may see unwanted resets
- Neither a loss of PLL lock inside the GBTx0 chip nor SEU errors are observed: the stability of the downlink doesn't seem to be affected

The mitigation strategy is simply to re-sync the SAMPAs periodically. This is already implemented, working, and in use at P2. At the moment, all SAMPAs are re-synced at the start of every time-frame. In order not to lose any data, the re-sync will be moved from the start to the end of the time-frame, since the data at the end is not processed anyway.





## Out-of-Sync FECs - Lost GBT frames

- Despite running with Re-Sync of the SAMPAs, the TPC still experienced OoS FECs -> suspicion that GBT frames on the uplink are "lost"
- "Lost" GBT frames can have two causes:
  - the GBT link is o.k. and the clock can be recovered, but the data cannot be decoded -> data corruption
  - the GBT link is not o.k. and fails to recover the clock (and, as a consequence, cannot decode the data)
- Why did we not see this before?
  - CRU Common-Logic monitors the two cases and sets a "sticky" bit when this happens. So once set, it will remain set until it is reset
  - The process which reads out the CRU CL registers periodically (1s) to provide monitoring information to Grafana resets this sticky bit
  - If one does not look at the link status page of the CRU in Grafana in exactly this one second period, one will not see it
- Monitoring of the links was provisionally improved by adding counters (Pippo) for both data corruption and link loss
  - Out-of-Sync coincides with an increase of both these counters
  - More detailed monitoring to characterise the link loss will be added to the TPC UL. The CRU CL will forward the two required signals (Clock-Loss, Data-Loss) to the UL
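
The difference between a sticky bit that is cleared by a periodic readout and a counter can be sketched in a few lines of Python. The `LinkMonitor` class is a toy model with made-up names; it is not the CRU CL register layout.

```python
class LinkMonitor:
    """Toy model of a link status register (hypothetical, for illustration)."""

    def __init__(self):
        self.sticky_error = False  # set on error, stays set until cleared
        self.error_count = 0       # incremented on error, never cleared here

    def record_error(self):
        self.sticky_error = True
        self.error_count += 1

    def poll(self):
        """Periodic readout: return the sticky flag and clear it."""
        seen = self.sticky_error
        self.sticky_error = False
        return seen


mon = LinkMonitor()
mon.record_error()                        # one link error occurs
history = [mon.poll() for _ in range(5)]  # five consecutive 1 s readouts

print(history)          # the error shows up in the first readout only
print(mon.error_count)  # the counter still records it afterwards
```

Only the first readout reports the error, so anyone looking at the status later misses it, while the counter keeps the information available.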

The mitigation strategy is "relatively" simple but depends on the behaviour of the link during re-sync.

If GBT frames are simply "lost", one can insert "dummy" frames containing only 0s while the link is down. Decoding then simply continues, except that instead of wrong ADC values all ADC values will be 0. They will therefore not affect any processing in the UL and will also be suppressed by the ZS.
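
A minimal Python sketch of the dummy-frame idea, under the assumption that the decoder assigns values to channels purely by their position in the stream. The frame size, channel count, and the helpers `decode` and `fill_lost` are hypothetical and much simpler than the real GBT/SAMPA format.

```python
N_CHANNELS = 4        # hypothetical length of the channel cycle
VALUES_PER_FRAME = 2  # hypothetical number of ADC values per frame


def decode(frames):
    """Assign each value to a channel by its position in the stream."""
    values = [v for frame in frames for v in frame]
    return [(i % N_CHANNELS, v) for i, v in enumerate(values)]


def fill_lost(frames):
    """Replace lost frames (None) with all-zero dummy frames."""
    return [f if f is not None else [0] * VALUES_PER_FRAME for f in frames]


def consistent(decoded):
    """Each test value encodes its true channel as value // 100 - 1."""
    return all(v == 0 or v // 100 - 1 == ch for ch, v in decoded)


sent = [[100, 200], [300, 400], [100, 200], [300, 400]]
received = [sent[0], None, sent[2], sent[3]]  # second frame is lost

dropped = decode([f for f in received if f is not None])
padded = decode(fill_lost(received))

print(consistent(dropped))  # False: values after the gap land on wrong channels
print(consistent(padded))   # True: zeros fill the gap, alignment is kept
```

With the gap filled, the affected channels read 0 for the duration of the link loss and all later values stay on their correct channels.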

If GBT frames are "inserted" while the link is coming back up, a re-lock on the SYNC-pattern is required.
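
A re-lock can be pictured as searching the incoming stream for a known pattern and restarting the position counting right after it. The sketch below uses a made-up `SYNC` pattern and a hypothetical `find_sync` helper; the real SAMPA SYNC-pattern and the firmware logic differ.

```python
SYNC = [0x2AA, 0x155, 0x2AA, 0x155]  # made-up pattern, for illustration only


def find_sync(stream, pattern=SYNC):
    """Return the index just after the first occurrence of pattern, or None."""
    n = len(pattern)
    for i in range(len(stream) - n + 1):
        if stream[i:i + n] == pattern:
            return i + n
    return None


# Three spurious values "inserted" while the link comes up, then SYNC, then data.
stream = [7, 9, 3] + SYNC + [100, 200, 300]

start = find_sync(stream)
payload = stream[start:] if start is not None else []

print(start)    # position where decoding restarts
print(payload)  # data after the SYNC pattern
```

Everything before the pattern is discarded, which is why this option costs more data and logic than padding with dummy frames.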

Both options are feasible; the first one is simply more elegant, easier to implement, and has less impact.

https://gitlab.cern.ch/alice-cru/cru-fw/-/issues/350

