Description
This paper addresses the challenge of mitigating the effects of radiation on the electronic systems of the Large Hadron Collider (LHC) by introducing BatMon, a battery-powered, MCU-based wireless radiation monitoring system. The paper proposes software mitigation schemes that can be used alongside an external watchdog to guarantee higher availability of the application without impacting the system performance. Tests conducted at the CHARM facility show that the proposed schemes enable BatMon to achieve 99.9996% availability in the harsh environment of the LHC. The manuscript highlights that the result obtained will also allow the system to be used for critical tasks.
Summary (500 words)
The LHC employs numerous electronic systems for both high-criticality and low-criticality applications, with the latter increasingly relying on Commercial Off-The-Shelf (COTS) MicroController Units (MCUs) for their advantages in cost, power, size, and flexibility. In this radiation environment, single-event effects (SEE) can strongly affect the reliability of these systems, so effective mitigation schemes are required to ensure high system availability during operation.
This paper introduces BatMon, an MCU-based, battery-powered wireless radiation monitoring system for the LHC. It is designed with qualified low-power COTS components, uses LoRa wireless transmission technology, and can tolerate up to 275 Gy. Its modular design makes the platform application-independent, so it can be used for different purposes. It embeds an external watchdog as a hardware mitigation scheme to cover possible failures due to SEE. However, the external watchdog alone cannot detect all possible Single-Event Functional Interrupts (SEFIs) and consequently cannot always restore the system functionality. One example is an MCU stuck in a while loop due to a SEE: the MCU continues to serve the external interrupt toggled by the external watchdog. As a consequence, the failure remains undetected and the device stays non-operational.
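The failure mode described above can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions, not BatMon's firmware: the names `sim_t`, `wdt_isr`, `watchdog_detects_hang`, and the timeout `WDT_TIMEOUT_TICKS` are hypothetical, and the heartbeat gating shown is one common way to tie the watchdog kick to main-loop progress.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: the external watchdog expires (and would reset the
 * MCU) if it is not kicked for WDT_TIMEOUT_TICKS consecutive ticks. */
#define WDT_TIMEOUT_TICKS 3u

typedef struct {
    uint32_t ticks_since_kick; /* external watchdog timer            */
    uint32_t heartbeat;        /* incremented by a healthy main loop */
    uint32_t last_seen;        /* heartbeat value at the previous ISR */
} sim_t;

/* Interrupt handler triggered by the external watchdog.
 * Ungated: the watchdog is kicked unconditionally.
 * Gated:   the watchdog is kicked only if the main loop made progress. */
static void wdt_isr(sim_t *s, bool gate_on_heartbeat)
{
    bool progressed = (s->heartbeat != s->last_seen);
    s->last_seen = s->heartbeat;
    if (!gate_on_heartbeat || progressed) {
        s->ticks_since_kick = 0;
    }
}

/* Simulate `ticks` watchdog periods. The main loop hangs from tick
 * `stuck_at` onwards, while interrupts keep being served.
 * Returns true if the external watchdog expires, i.e. the hang is detected. */
bool watchdog_detects_hang(uint32_t ticks, uint32_t stuck_at,
                           bool gate_on_heartbeat)
{
    sim_t s = {0u, 0u, 0u};
    for (uint32_t t = 0; t < ticks; t++) {
        if (t < stuck_at) {
            s.heartbeat++;              /* main loop still running */
        }
        s.ticks_since_kick++;
        wdt_isr(&s, gate_on_heartbeat); /* ISR runs even during a hang */
        if (s.ticks_since_kick >= WDT_TIMEOUT_TICKS) {
            return true;                /* watchdog would reset the MCU */
        }
    }
    return false;
}
```

In this model, `watchdog_detects_hang(20, 5, false)` returns `false` because the interrupt handler keeps the watchdog alive although the main loop stopped at tick 5, whereas `watchdog_detects_hang(20, 5, true)` returns `true` once the heartbeat stalls.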
The primary aim of this work is to define software mitigation strategies that can be employed alongside an external watchdog while maintaining high performance and compatibility with any MCU-based design. These schemes include triplication of the counters, storage of the configuration and previous measurements in internal flash memory, an internal software watchdog, an automatic dummy-handler recovery scheme, and control of the wireless communication link. Triplicating the counters and storing the configuration and previous measurements in internal flash memory help to identify and correct SEE-induced errors in the MCU SRAM. The internal software watchdog supervises software execution so that execution errors do not go undetected. Finally, the control of the wireless communication link guarantees reliable communication. These strategies can be classified into three groups based on their usage requirements: C-0, which does not require any peripheral; C-1, which requires internal or external hardware peripherals; and C-2, which requires an acknowledgment from an external source, such as a network. This paper explains this classification in greater detail and outlines the prerequisites a system must meet to implement these mitigation strategies.
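As an illustration of counter triplication, the following is a minimal sketch of a triplicated counter with bitwise majority voting and write-back of the corrected value. It is not taken from the BatMon firmware: the type `tmr_counter_t` and the functions `tmr_write`, `tmr_read`, and `tmr_increment` are hypothetical names, and a 32-bit counter width is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Three redundant copies of one counter, all held in SRAM. */
typedef struct {
    volatile uint32_t copy[3];
} tmr_counter_t;

/* Store the same value in all three copies. */
void tmr_write(tmr_counter_t *c, uint32_t value)
{
    c->copy[0] = value;
    c->copy[1] = value;
    c->copy[2] = value;
}

/* Read the counter through a bitwise majority vote. Each output bit takes
 * the value held by at least two of the three copies, so any upset confined
 * to a single copy is masked. If a copy disagrees with the voted value, all
 * copies are rewritten (scrubbed) and the function returns true. */
bool tmr_read(tmr_counter_t *c, uint32_t *out)
{
    uint32_t a = c->copy[0];
    uint32_t b = c->copy[1];
    uint32_t d = c->copy[2];
    uint32_t voted = (a & b) | (a & d) | (b & d);
    bool corrected = (a != voted) || (b != voted) || (d != voted);

    if (corrected) {
        tmr_write(c, voted);
    }
    *out = voted;
    return corrected;
}

/* Increment via vote-then-write so a latent upset is not propagated.
 * Returns true if a correction was needed before incrementing. */
bool tmr_increment(tmr_counter_t *c)
{
    uint32_t value;
    bool corrected = tmr_read(c, &value);
    tmr_write(c, value + 1u);
    return corrected;
}
```

Voting on every read and increment keeps a single upset from accumulating with a later one in another copy. The cost is three words of SRAM per counter and a few extra instructions per access.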
The effectiveness of the proposed software mitigation schemes is demonstrated through a comparison of system availability under radiation. Tests conducted at the CHARM facility compare the performance of BatMon with and without the implemented software mitigation techniques. The results show that the proposed schemes enable BatMon to achieve an availability of 99.9996% in the harsh LHC environment, with the residual downtime attributable to the self-recovery time. This availability exceeds the 99.537% required by CERN for a critical LHC system. Radiation monitoring is not a critical task for the accelerator and does not have to fulfil this requirement. However, because BatMon is application-independent, meeting it allows the platform to be used in the future for critical tasks such as equipment control and reset.
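To put the two availability figures in perspective, they can be converted into downtime over a reference window. The sketch below is illustrative arithmetic only: the function name `downtime_seconds` is hypothetical, and the one-year and one-day windows are assumptions, since the reference period is not stated here.

```c
#include <stdio.h>

/* Downtime in seconds implied by an availability (in percent) over a
 * reference window (in seconds): downtime = (1 - A/100) * window. */
double downtime_seconds(double availability_percent, double window_seconds)
{
    return (100.0 - availability_percent) / 100.0 * window_seconds;
}

/* Print the downtime for both figures; window lengths are assumptions. */
void print_downtime_examples(void)
{
    const double year_s = 365.0 * 24.0 * 3600.0; /* 31,536,000 s */
    const double day_s  = 24.0 * 3600.0;         /* 86,400 s     */

    printf("99.9996%% over one year: %.1f s\n",
           downtime_seconds(99.9996, year_s));
    printf("99.537%%  over one day:  %.1f s\n",
           downtime_seconds(99.537, day_s));
}
```

Under these assumed windows, 99.9996% availability corresponds to roughly 126 s of downtime per year, while 99.537% corresponds to roughly 400 s of downtime per day.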