Speaker
Mr
Serguei Kolos
(University of California Irvine (US))
Description
The ATLAS Error Reporting Service (ERS), used in the TDAQ environment, allows experts and the shift
crew to track and address errors in the data-taking components and applications. ERS lets software
applications collect comprehensive information about errors that occur at run time and send it to a
place where any other system component can intercept it in real time. Other ATLAS online control
and monitoring tools use ERS as one of their main inputs to address system problems in a timely
manner and to improve the quality of the acquired data.
The actual destination of the error messages depends solely on the run-time environment in which
the online applications operate. Depending on the configuration, information that applications send
to ERS may end up in a local file, in a database, or in distributed middleware that can transport it
to an expert system or display it to users, who can then work around the problem. Thanks to the open
framework design of ERS, new information destinations can be added at any moment without touching
the reporting and receiving applications.
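
The idea of destinations selected purely by the run-time environment can be illustrated with a
minimal Python sketch. It is not the ERS API: the environment variable DEMO_ERROR_STREAMS, the
functions register_stream, configured_streams and report, and the three example destinations are
all hypothetical names chosen for this illustration.

```python
import os
import sys

# Hypothetical registry of destination factories, keyed by name (not the ERS API).
STREAM_FACTORIES = {}
# In-memory destination used only to make the sketch easy to inspect.
CAPTURED = []


def register_stream(name):
    """Register a destination factory; reporting code never refers to it directly."""
    def decorator(factory):
        STREAM_FACTORIES[name] = factory
        return factory
    return decorator


@register_stream("stderr")
def stderr_stream(arg):
    return lambda message: print(message, file=sys.stderr)


@register_stream("file")
def file_stream(path):
    def write(message):
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(message + "\n")
    return write


@register_stream("memory")
def memory_stream(arg):
    return CAPTURED.append


def configured_streams(variable="DEMO_ERROR_STREAMS"):
    """Build the destination list from the environment, e.g. 'file:/tmp/err.log,stderr'."""
    spec = os.environ.get(variable, "stderr")
    streams = []
    for item in spec.split(","):
        name, _, arg = item.strip().partition(":")
        streams.append(STREAM_FACTORIES[name](arg))
    return streams


def report(message):
    """Send a message to every destination configured for this process."""
    for send in configured_streams():
        send(message)
```

In this sketch, an application only ever calls report(). Changing DEMO_ERROR_STREAMS, or registering
a new factory with register_stream, redirects the messages without any change to the reporting code.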
The ERS API is provided in three programming languages used in the ATLAS online environment:
C++, Java and Python. All APIs use exceptions for error reporting, but each of them exploits
advanced features of its language to simplify the writing of programs. For example, as C++ offers
no concise way to declare exception class hierarchies, a special macro has been designed to
generate hierarchies of C++ exception classes at compile time. Using this approach, a software
developer can write a single line of code to generate the boilerplate code for a fully qualified
C++ exception class declaration with an arbitrary number of parameters and multiple constructors,
which encapsulates all relevant static information about the given type of issue. When the
corresponding error occurs at run time, a program just needs to create an instance of that class,
passing the relevant values to one of the available class constructors, and send this instance to ERS.
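
The one-line declaration idea can be sketched in Python with a class factory. This is only an
analogy to the C++ macro, which is not reproduced here: the sketch builds classes when the module
is loaded rather than at compile time, and the names Issue, declare_issue, FileIssue and
CannotOpenFile are hypothetical and do not belong to the ERS API. Sending the instance to ERS is
not shown.

```python
class Issue(Exception):
    """Hypothetical root of the issue hierarchy (not the ERS base class)."""

    template = "unspecified issue"  # static message text for this issue type
    fields = ()                     # names of the parameters the issue carries

    def __init__(self, *values, **named):
        if len(values) > len(self.fields):
            raise TypeError("too many positional parameters")
        params = dict(zip(self.fields, values))
        params.update(named)
        missing = [name for name in self.fields if name not in params]
        if missing:
            raise TypeError(f"missing parameters: {missing}")
        self.params = params
        super().__init__(self.template.format(**params))


def declare_issue(name, template, fields, base=Issue):
    """Create an exception class from a one-line declaration."""
    return type(name, (base,), {"template": template, "fields": tuple(fields)})


# One line per issue type; the second class derives from the first.
FileIssue = declare_issue("FileIssue", "problem with file '{path}'", ["path"])
CannotOpenFile = declare_issue(
    "CannotOpenFile", "cannot open '{path}': {reason}", ["path", "reason"], base=FileIssue
)
```

A program would then raise CannotOpenFile("/tmp/run.dat", reason="permission denied"), and a handler
written for FileIssue would catch it, because the declared classes form an ordinary exception
hierarchy.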
This paper presents the original design solutions used in the ERS implementation and describes the
experience of using ERS during the first ATLAS run period, in which the cross-system standardization
of error reporting introduced by ERS was one of the key factors in the successful launch and use of
automated problem-solving tools in the TDAQ online environment.
Primary author
Denis Oliveira Damazio
(Brookhaven National Laboratory (US))