The ATLAS experiment at the LHC relies on a complex and distributed Trigger and Data Acquisition (TDAQ) system to gather and select particle collision data. The High Level Trigger (HLT) component of the TDAQ system is responsible for executing advanced selection algorithms, reducing the data rate to a level suitable for recording to permanent storage. The HLT functionality is provided by a computing farm made up of thousands of commodity servers, each executing one or more processes.
Moving the HLT farm management towards a containerized solution is one of the main themes of the ATLAS TDAQ Phase-II upgrades in the area of online software; it would open new possibilities for fault tolerance, reliability and scalability.
This paper presents the results of an evaluation of Kubernetes as a possible orchestrator of the ATLAS TDAQ HLT computing farm. Kubernetes is a system for advanced management of containerized applications in large clusters.
We will first highlight some of the technical solutions adopted to run the offline version of today’s HLT software in a Docker container. Then we will focus on scaling performance measurements performed on a cluster of 1000 CPU cores. In particular, we will:
- Show how Kubernetes scales in deploying containers as a function of the cluster size;
- Demonstrate how proper tuning of the Kubernetes Queries Per Second (QPS) parameter set can improve the scaling of applications.
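To give an intuition for why QPS settings can matter for deployment scaling, the following is a minimal sketch of a toy model, not the measurement procedure used in this work. It assumes that a client talking to the Kubernetes API server is throttled by a token-bucket limiter with a sustained rate `qps` and an initial allowance `burst`, and that deploying each container costs exactly one API request. The function name `min_issue_time` and all numbers are hypothetical and purely illustrative.

```python
def min_issue_time(n_requests: int, qps: float, burst: int) -> float:
    """Lower bound (seconds) on the time a token-bucket limiter needs
    to let n_requests through.

    Assumptions (illustrative only): the bucket starts full with
    `burst` tokens, refills at `qps` tokens per second, and each
    request consumes one token.
    """
    if qps <= 0:
        raise ValueError("qps must be positive")
    if n_requests < 0 or burst < 0:
        raise ValueError("n_requests and burst must be non-negative")
    if n_requests <= burst:
        # Everything fits in the initial burst: no waiting required.
        return 0.0
    # Requests beyond the burst must wait for tokens to be refilled.
    return (n_requests - burst) / qps


if __name__ == "__main__":
    # Hypothetical settings: (qps, burst) pairs chosen for illustration.
    for qps, burst in [(20, 30), (100, 100)]:
        t = min_issue_time(1000, qps, burst)
        print(f"qps={qps:>4} burst={burst:>4} -> at least {t:.1f} s for 1000 requests")
```

In this simplified picture the bound grows linearly with the number of requests once the burst is exhausted, which is why raising the rate limits may shorten deployment times for large numbers of containers. Real deployments involve several components and more than one request per container, so the model only indicates a trend.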
Finally, we will conclude with an assessment of the feasibility of using Kubernetes as an orchestrator of the HLT computing farm in LHC’s Run IV.