Speaker
Artur Szostak
(University of Bergen (NO))
Description
The ALICE High Level Trigger (HLT) is a dedicated real-time system for on-line event reconstruction and triggering. Its main goal is to reduce the large volume of raw data read out from the detector systems, up to 25 GB/s, by an order of magnitude so that it fits within the available data acquisition bandwidth. This is accomplished by a combination of data compression and triggering. When the HLT trigger algorithms select a reconstructed event as interesting for physics, it is recorded; otherwise the raw data for that event is discarded. Combining the two approaches allows for flexible data reduction strategies.
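To illustrate how triggering and compression combine, the following minimal sketch computes the rate written to storage under a simple assumed model: a fraction of events is accepted by the trigger, and the accepted data is compressed by a constant factor. The function names, the model itself, and the example values (an acceptance fraction of 0.5 and a compression factor of 5) are illustrative assumptions, not figures from the HLT.

```python
def output_rate(input_rate_gb_s, accept_fraction, compression_factor):
    """Rate written to storage (GB/s) under an assumed simple model.

    accept_fraction: hypothetical fraction of events kept by the trigger (0 < f <= 1).
    compression_factor: hypothetical ratio of raw size to compressed size (>= 1).
    """
    if not 0.0 < accept_fraction <= 1.0:
        raise ValueError("accept_fraction must be in (0, 1]")
    if compression_factor < 1.0:
        raise ValueError("compression_factor must be >= 1")
    return input_rate_gb_s * accept_fraction / compression_factor


def reduction_factor(accept_fraction, compression_factor):
    """Overall reduction: the two mechanisms multiply in this model."""
    if not 0.0 < accept_fraction <= 1.0:
        raise ValueError("accept_fraction must be in (0, 1]")
    if compression_factor < 1.0:
        raise ValueError("compression_factor must be >= 1")
    return compression_factor / accept_fraction


if __name__ == "__main__":
    # 25 GB/s is the peak input rate quoted above; 0.5 and 5 are made-up values.
    print(output_rate(25.0, accept_fraction=0.5, compression_factor=5.0))
    print(reduction_factor(accept_fraction=0.5, compression_factor=5.0))
```

In this model an order-of-magnitude reduction can be reached by many different pairs of acceptance fraction and compression factor, which is one way to read the flexibility mentioned above.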
A second but equally vital function of the HLT is on-line monitoring. The HLT has access to all raw data and status information from the detectors during data taking. Combined with on-line event reconstruction, this makes the HLT a powerful monitoring tool for ensuring data quality. Many problems are easy to spot only in the high-level, physics-level information. In addition, on-line compression and triggering must be monitored live during data taking to ensure the stability of the system and the quality of the recorded data.
The HLT's tasks place a very high computational load on the system, in particular during event reconstruction and compression. A large dedicated computing cluster is used for on-line operations. It comprises 206 individual machines with 2744 CPU cores, 64 GPUs and 5.24 TB of distributed memory, all interconnected with an InfiniBand network, plus Gigabit Ethernet for management. An additional 43 machines provide a development and testing environment, infrastructure support and storage.
Running a large, complex system like the HLT in production data-taking mode proved to be a challenge. During the 2010 pp and Pb-Pb running period, many problems were experienced that led to sub-optimal operational efficiency. Lessons were learned, and certain crucial changes were made early in 2011 to prepare for the 2011 Pb-Pb run, in which the HLT would have a vital role performing data compression for the largest detector in ALICE, the Time Projection Chamber (TPC). Key changes, such as separating the production part of the system from the supporting infrastructure and upgrading to a mass storage system better suited to the HLT performance requirements, have led to higher stability, improved operational efficiency and reduced startup latency of the system during runs.
An overview of the status of the HLT, experience from the 2010 and 2011 production runs, and important lessons learned are presented. Emphasis is given to overall performance, which shows a reduction in failure rates between 2010 and 2011, attributed to the significant improvements made to the system. Finally, further opportunities for improvement are identified and discussed, based on the experience gained in the 2011 Pb-Pb run.
Primary author
Artur Szostak
(University of Bergen (NO))