23–28 Oct 2022
Villa Romanazzi Carducci, Bari, Italy
Europe/Rome timezone

Stability of the CMS Submission Infrastructure for the LHC Run 3

24 Oct 2022, 11:00
30m
Area Poster (Floor -1) (Villa Romanazzi)

Poster
Track 1: Computing Technology for Physics Research
Poster session with coffee break

Speaker

Antonio Perez-Calero Yzquierdo (Centro de Investigaciones Energéticas Medioambientales y Tecnológicas)

Description

The CMS Submission Infrastructure is the main computing resource provisioning system for CMS workflows, including data processing, simulation and analysis. It currently aggregates nearly 400k CPU cores distributed worldwide from Grid, HPC and cloud providers. CMS Tier-0 tasks, such as data repacking and prompt reconstruction, critical for data-taking operations, are executed on a collection of computing resources at CERN, also managed by the CMS Submission Infrastructure.

All this computing power is harnessed via a number of federated resource pools, supervised by HTCondor and GlideinWMS services. Elements such as pilot factories, job schedulers and connection brokers are deployed in high-availability (HA) mode across several “availability zones”, providing stability to our services via hardware redundancy and numerous failover mechanisms.
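
As a purely illustrative sketch of the failover idea described above, the snippet below models redundant service instances spread over availability zones and picks a healthy one, preferring a given zone. It does not use or represent any HTCondor or GlideinWMS API; the `Instance` class, the `pick_instance` function, and the instance and zone names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    """A hypothetical redundant service instance (e.g. a connection broker)."""
    name: str
    zone: str
    healthy: bool = True


def pick_instance(instances: List[Instance], preferred_zone: str) -> Optional[Instance]:
    """Return a healthy instance, preferring the given zone.

    Falls back to a healthy instance in any other zone, or None if
    every instance is down.
    """
    healthy = [inst for inst in instances if inst.healthy]
    for inst in healthy:
        if inst.zone == preferred_zone:
            return inst
    return healthy[0] if healthy else None


# Two hypothetical brokers, one per availability zone.
brokers = [Instance("broker-a", "zone-1"), Instance("broker-b", "zone-2")]

primary = pick_instance(brokers, "zone-1")

# Simulate an outage in zone-1 and select again.
brokers[0].healthy = False
fallback = pick_instance(brokers, "zone-1")
```

In this toy model, the client keeps working as long as at least one zone hosts a healthy instance, which is the property that hardware redundancy across zones is meant to provide.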

With the start of the LHC Run 3 approaching, the stability of the Submission Infrastructure has recently been tested in a series of controlled exercises, performed without interrupting our services. These tests have demonstrated the resilience of our systems and provided useful information for further refining our monitoring and alarming system.

This contribution will describe the main elements of the CMS Submission Infrastructure design and deployment, along with the failover exercises performed, demonstrating that our systems are ready to serve their critical role in support of CMS activities.

Significance

This presentation will cover how the CMS Submission Infrastructure (SI) has been designed and set up to avoid single points of failure, along with the tests performed to verify its resilience and stability. These aspects matter because the SI plays a critical role in the ability of the CMS experiment's Tier-0 node to take and process collision data.

Experiment context, if any

The CMS experiment at the LHC at CERN

Primary author

Antonio Perez-Calero Yzquierdo (Centro de Investigaciones Energéticas Medioambientales y Tecnológicas)

Co-authors

Edita Kizinevic (CERN)
Farrukh Aftab Khan (Fermi National Accelerator Lab. (US))
Hyunwoo Kim (Fermi National Accelerator Lab. (US))
Marco Mascheroni (Univ. of California San Diego (US))
Maria Acosta Flechas (Fermi National Accelerator Lab. (US))
Saqib Haleem (National Centre for Physics (PK))
