Speaker
Hegoi Garitaonandia
(NIKHEF)
Description
The ATLAS experiment at CERN will require about 4000 CPUs for its online data acquisition (DAQ) system. When the DAQ system experiences software errors, such as event selection algorithm problems, crashes or timeouts, the fault tolerance mechanism routes the corresponding event data to the so-called debug stream. During first beam commissioning and early data taking, a large fraction of events is expected to end up in this stream. To identify problems with the DAQ as soon as possible and to reduce the turn-around time for fixing them, it is of prime importance to process the debug stream promptly. We have therefore adopted a quasi real-time approach: we have developed an automated system that analyzes the contents of the debug stream and provides fine-grained error classification. A high percentage of error events is related to transient online problems. Many of those events are recovered by feeding them to an independent system that reruns the trigger software. To be flexible in terms of computing power requirements, we added a layer of abstraction over the computing backend, which makes it possible to use the Grid as well as dedicated resources. Using cosmic ray runs, we validated the automatic error analysis and recovery procedure.
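The flow described above (classify each debug-stream event, then hand events to a backend that reruns the trigger) can be illustrated with a minimal Python sketch. It is not the ATLAS implementation: every name in it (DebugEvent, ErrorClass, classify, Backend, LocalBackend, process_debug_stream) is hypothetical, the keyword-based classification is a placeholder for the real fine-grained analysis, and the stand-in backend simply assumes that timeouts are the transient, recoverable errors.

```python
from abc import ABC, abstractmethod
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class ErrorClass(Enum):
    """Error categories named in the abstract, plus a catch-all."""
    ALGORITHM = "algorithm"
    CRASH = "crash"
    TIMEOUT = "timeout"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class DebugEvent:
    event_id: int
    error_message: str  # hypothetical free-text failure reason


def classify(event: DebugEvent) -> ErrorClass:
    """Placeholder classification: keyword match on the error message."""
    text = event.error_message.lower()
    if "timeout" in text:
        return ErrorClass.TIMEOUT
    if "crash" in text:
        return ErrorClass.CRASH
    if "algorithm" in text:
        return ErrorClass.ALGORITHM
    return ErrorClass.UNKNOWN


class Backend(ABC):
    """Abstraction layer: callers do not care where the rerun executes."""

    @abstractmethod
    def rerun_trigger(self, events: List[DebugEvent]) -> List[int]:
        """Rerun the trigger on the events; return ids of recovered ones."""


class LocalBackend(Backend):
    """Stand-in for dedicated resources.

    Assumption for illustration only: timeouts are transient, so a rerun
    recovers them; everything else stays unrecovered.
    """

    def rerun_trigger(self, events: List[DebugEvent]) -> List[int]:
        return [e.event_id for e in events
                if classify(e) is ErrorClass.TIMEOUT]


def process_debug_stream(
    events: List[DebugEvent], backend: Backend
) -> Tuple[Counter, List[int]]:
    """Classify all events, then attempt recovery on the chosen backend."""
    summary = Counter(classify(e) for e in events)
    recovered = backend.rerun_trigger(events)
    return summary, recovered
```

A Grid-based backend would be another subclass of Backend that submits the same events as jobs; process_debug_stream would not need to change, which is the point of the abstraction layer.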