Description
Summary
In high energy physics experiments such as the Compact Muon Solenoid (CMS), the electronics located near the interaction region are prone to soft (i.e., recoverable) errors caused by radiation from the collisions. To collect high quality data efficiently, it is important to fix soft errors as quickly as possible while data collection is in progress. This presentation describes a generalized mechanism that pauses the full CMS trigger and data collection machinery so that subsystems can recover from soft errors. The mechanism has been automated to maximize the time CMS spends taking high quality data.
The CMS data collection machinery is triggered by a synchronous signal sent to all subsystems from the output of a complex, multi-stage selection algorithm. When a subsystem receives this signal, its data are labeled with a number that uniquely identifies the collision and are sent to a central collection point, where they are combined into a single event containing data from all subsystems of the experiment. We describe a way to pause this machinery so that subsystems can fix soft errors, and then to restart it automatically and coherently.
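The trigger, event-building, and pause-and-resume steps described above can be illustrated with a toy model. This is only a minimal sketch: the names `Subsystem`, `RunControl`, `trigger`, and `recover_soft_errors` are hypothetical and do not correspond to the actual CMS software.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class RunState(Enum):
    RUNNING = auto()
    PAUSED = auto()


@dataclass
class Subsystem:
    """Hypothetical stand-in for one detector subsystem."""
    name: str
    has_soft_error: bool = False

    def read_out(self, event_number):
        # Label the data fragment with the number identifying the collision.
        return (event_number, self.name)

    def recover(self):
        # Placeholder for the real recovery action (e.g. reloading an FPGA).
        self.has_soft_error = False


@dataclass
class RunControl:
    """Hypothetical central controller: distributes triggers, builds events."""
    subsystems: list
    state: RunState = RunState.RUNNING
    next_event: int = 1
    events: list = field(default_factory=list)

    def trigger(self):
        """Send one trigger to all subsystems; return None while paused."""
        if self.state is not RunState.RUNNING:
            return None
        number = self.next_event
        self.next_event += 1
        fragments = [s.read_out(number) for s in self.subsystems]
        # Event building: every fragment must carry the same event number.
        assert all(n == number for n, _ in fragments)
        event = {"number": number, "fragments": [name for _, name in fragments]}
        self.events.append(event)
        return event

    def recover_soft_errors(self):
        """Pause triggers, let affected subsystems recover, then resume."""
        affected = [s for s in self.subsystems if s.has_soft_error]
        if not affected:
            return []
        self.state = RunState.PAUSED
        for s in affected:
            s.recover()
        self.state = RunState.RUNNING
        return [s.name for s in affected]
```

In this sketch the event numbering continues across the pause, so events built after recovery remain consistent with those built before it.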
The term "soft error" is often used to describe a specific failure mode of FPGAs. However, the CMS detector contains thousands of electronic devices of many different types, all of which can be affected by radiation. Depending on the type of soft error and the role of the device within the machinery, the impact of an error can range from hardly noticeable to completely debilitating. Recovery can range from something as quick as reloading an FPGA to reconfiguring a device from a computer through VME. The recovery procedure has been streamlined to minimize the downtime according to the action needed.
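One way to picture how downtime depends on the action needed is the sketch below. It assumes, purely for illustration, that subsystems recover in parallel, so the pause lasts as long as the slowest requested action. The action names and durations are placeholders, not measured CMS values, and `plan_pause` is a hypothetical helper.

```python
# Placeholder durations in seconds; illustrative only, not measurements.
RECOVERY_DURATION_S = {
    "reload_fpga": 1.0,
    "reconfigure_via_vme": 10.0,
}


def plan_pause(requests):
    """Return the pause length for a set of recovery requests.

    `requests` maps a subsystem name to the recovery action it needs.
    Assumes recoveries run in parallel, so the slowest action sets the pause.
    """
    if not requests:
        return 0.0
    unknown = sorted(set(requests.values()) - set(RECOVERY_DURATION_S))
    if unknown:
        raise ValueError(f"unknown recovery action(s): {unknown}")
    return max(RECOVERY_DURATION_S[action] for action in requests.values())
```

Under these assumptions, a pause in which only FPGA reloads are requested is shorter than one that also includes a reconfiguration over VME.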
Results demonstrating the effectiveness of this mechanism in maximizing the uptime of CMS will be shown. This experience can be applied to any complex system, and it implies that generalized mechanisms for recovering from soft errors should be incorporated into the design of any experiment that aims to maximize operating efficiency in a radiation environment.