Description
The ALICE Event Processing Node (EPN) farm, a high-density GPU HPC system, serves as the backbone for real-time data reconstruction during the LHC Run 3 period (2022–2026) and is the largest computing farm at CERN in terms of compute capacity. Comprising 350 nodes and 2800 GPUs, with a peak performance of 48 PFLOP/s, the EPN infrastructure has been operated throughout Run 3 by a dedicated team of two to three people at a time.
This contribution presents the experience gained during detector operations throughout Run 3 and the architectural choices that enabled a 24/7-supported, high-reliability, low-maintenance operational model. It gives an overview of the provisioning, configuration, and observability frameworks that govern this specialized GPU-accelerated HPC facility. Management spans the physical layer (including infrastructure), the software stack, and the experiment-specific software. Key features of the architecture are its integration with the central detector-control systems and the logical separation of synchronous and asynchronous processing modes. The contribution concludes with a retrospective on several years of continuous operation, offering a blueprint for how small teams can maintain mission-critical scientific infrastructure through robust automation and sustainable practices.