17–21 Sept 2012
Oxford University, UK
Europe/Zurich timezone

Soft Error Recovery during Operation of the CMS Experiment

20 Sept 2012, 17:53
1m
Oxford University, UK

Clarendon Laboratory, Parks Road, OX1 3PU, Oxford, United Kingdom
Poster POSTERS

Speaker

Gregory Rakness (Univ. of California Los Angeles (US))

Description

In high energy physics experiments such as the Compact Muon Solenoid (CMS), electronics located near the interaction region are prone to soft (i.e., recoverable) errors as a result of radiation coming from the collisions. Depending on the type of error, its impact on data collection can range from hardly noticeable to completely debilitating. Here, we present evidence of soft errors in CMS and describe a mechanism that allows subsystems to recover from them in an automated way. Results will be shown on the effectiveness of this scheme in maximizing the uptime of CMS.

Summary

In high energy physics experiments such as the Compact Muon Solenoid (CMS), the electronics located near the interaction region are prone to soft (i.e., recoverable) errors as a result of the radiation coming from the collisions. To collect high quality data and to do so efficiently, it is important to be able to fix soft errors as quickly as possible while data collection is under way. This presentation describes a generalized mechanism to initiate a pause of the full CMS trigger and data collection machine in order to give subsystems the possibility to recover from soft errors. This mechanism has been automated to maximize the time spent by CMS taking high quality data.

The CMS data collection machinery is triggered by a synchronous signal sent to all subsystems from the output of a complex, multi-stage selection algorithm. When a subsystem receives this signal, the data are labeled with a number to uniquely identify the collision and sent to a central collection point to be combined into a single event containing data from all subsystems for the entire experiment. We describe a way to pause this machine in order to allow subsystems to fix soft errors, and then start it up automatically in a coherent way.
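The pause, recover, and resume cycle described above can be illustrated with a minimal Python sketch. It is not the CMS implementation: the class names (`Subsystem`, `RunControl`), the `soft_error` flag, and the rule of checking for errors before each trigger are all assumptions made for illustration.

```python
from enum import Enum, auto


class RunState(Enum):
    RUNNING = auto()
    PAUSED = auto()


class Subsystem:
    """Hypothetical subsystem that can flag and fix a soft error."""

    def __init__(self, name):
        self.name = name
        self.soft_error = False

    def read_out(self, event_number):
        # Label the data fragment with the number identifying the collision.
        return (self.name, event_number)

    def recover(self):
        # Stand-in for the subsystem-specific fix (assumption).
        self.soft_error = False


class RunControl:
    """Hypothetical central control that distributes triggers."""

    def __init__(self, subsystems):
        self.subsystems = subsystems
        self.state = RunState.RUNNING
        self.event_number = 0
        self.events = []
        self.recoveries = 0

    def trigger(self):
        """Send one synchronous trigger and build one combined event."""
        if any(s.soft_error for s in self.subsystems):
            self.pause_and_recover()
        self.event_number += 1
        fragments = [s.read_out(self.event_number) for s in self.subsystems]
        self.events.append(fragments)

    def pause_and_recover(self):
        """Pause all triggers, let affected subsystems recover, then resume."""
        self.state = RunState.PAUSED
        for s in self.subsystems:
            if s.soft_error:
                s.recover()
        self.recoveries += 1
        self.state = RunState.RUNNING
```

Because every subsystem sees the same pause and the same next event number, the fragments stay aligned after the restart, which is the sense of "coherent" in this sketch.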

The term "soft error" is often used to describe a specific failure mode of FPGAs. However, the CMS detector contains thousands of electronic devices of many different types, which can also be affected by radiation. Depending on the type of soft error and the role of the device within the machinery, the scope of the error’s impact can range from hardly noticeable to completely debilitating. Recovery can range from something as quick as reloading an FPGA to reconfiguring a device from a computer through VME. The recovery procedure has been streamlined to minimize the downtime depending on the action needed.
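As a rough illustration of how downtime can depend on the action needed, the sketch below computes the length of a pause from the recovery actions requested. The action names, the durations, and the assumption that subsystems recover in parallel are all hypothetical placeholders, not measured CMS values.

```python
# Placeholder durations in seconds; illustrative only, not CMS measurements.
RECOVERY_ACTIONS = {
    "fpga_reload": 1.0,
    "vme_reconfigure": 10.0,
}


def pause_duration(requested_actions):
    """Return the pause length, assuming subsystems recover in parallel."""
    if not requested_actions:
        return 0.0
    # With parallel recovery, the slowest requested action sets the pause.
    return max(RECOVERY_ACTIONS[action] for action in requested_actions)
```

Under these assumptions, a pause requested only for a quick action stays short, and a slow action lengthens the pause only when it is actually needed.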

Results will be shown on the effectiveness of this mechanism in maximizing the uptime of CMS. This experience can be applied to any complex system, and it implies that generalized mechanisms to recover from soft errors should be incorporated into the design of any experiment that aims to maximize operating efficiency in a radiation environment.

Primary author

Gregory Rakness (Univ. of California Los Angeles (US))
