ATLAS TDAQ System Administration: evolution and re-design

14 Apr 2015, 14:15
15m
Village Center (Village Center)

oral presentation Track1: Online computing Track 1 Session

Speaker

Christopher Jon Lee (University of Johannesburg (ZA))

Description

The ATLAS Trigger and Data Acquisition (TDAQ) system is responsible for the online processing of live data streaming from the ATLAS experiment at the Large Hadron Collider (LHC) at CERN. The online farm is composed of ~3000 servers, processing the data read out from ~100 million detector channels through multiple trigger levels. During the two years of the first Long Shutdown (LS1), the ATLAS TDAQ System Administrators did a tremendous amount of work: implementing numerous new software applications, upgrading the OS and the hardware, changing some design philosophies and exploiting the High Level Trigger (HLT) farm for different purposes. During data taking, only critical security updates are applied and broken hardware is replaced, to ensure a stable operational environment. LS1 therefore provided an excellent opportunity to look into new technologies and applications that would help to improve and streamline the daily tasks not only of the System Administrators, but also of the scientists who will be working during the upcoming data taking period (Run-II). The OS version has been upgraded to SLC6; for the largest part of the farm, which is composed of netbooted nodes, this required a completely new design of the netbooting system. In parallel, the migration of the Configuration Management systems to Puppet has been completed for both netbooted and localbooted hosts; the Post-Boot Scripts system and Quattor have consequently been retired. Various new ATCA-based readout systems, with specific network requirements, have also been integrated into the overall system. Virtual Machine (VM) usage has been investigated and tested, and many of our core servers are now running on VMs. This allows us to replace them rapidly in case of failure and to increase the number of servers when needed.
Virtualization has also been used to adapt the HLT farm as a batch system, which has been used for running Monte Carlo production jobs that are mostly CPU-bound rather than I/O-bound. In Run-II this feature could be exploited during LHC downtimes. A new Satellite Control Room (SCR) has also been commissioned, and in the ATLAS Control Room (ACR) the PC-over-IP network connections have been upgraded to a fully redundant network. The migration to SLC6 has also had an impact on the Control Room Desktop (CRD), the in-house KDE-based desktop environment designed to enforce access policies while fulfilling the needs of the people working in the ACR and the SCR. Finally, monitoring the health and status of ~3000 machines in the experimental area is of the utmost importance, so the obsolete Nagios v2 has been replaced with Icinga, complemented by Ganglia for performance data. This paper reports what we did, why, and how, in order to produce an improved system capable of performing for the next 3 years of ATLAS data taking.

Primary author

Christopher Jon Lee (University of Johannesburg (ZA))

Co-authors

Aleksandr Korol (Budker Institute of Nuclear Physics (RU)) Alexander Bogdanchikov (Budker Institute of Nuclear Physics (RU)) Artem Voronkov (Budker Institute of Nuclear Physics (RU)) Cristian Contescu (Polytechnic University of Bucharest (RO)) Daniel Fazio (CERN) Diana Scannicchio (University of California Irvine (US)) Franco Brasolin (Universita e INFN (IT)) Matthew Shaun Twomey (University of Washington (US)) Sergei Dubrov (Budker Institute of Nuclear Physics (RU)) Sergio Ballestrero (University of Johannesburg (ZA))
