Description
The CMS Submission Infrastructure (SI) provisions and orchestrates the compute resources used for CMS data processing, simulation, and analysis. While the SI has reliably supported Run-3 operations at scales of several hundred thousand concurrent jobs across Grid, HPC, and cloud sites, the computational demands of the HL-LHC era require a substantially more scalable and robust system. To prepare for this transition, this contribution will present a series of large-scale stress tests aimed at quantifying the scalability limits of key HTCondor components and at identifying bottlenecks that could constrain future CMS operations, including the mission-critical data processing at the CMS Tier-0.
A central focus is the performance of the HTCondor Central Manager, particularly the top-level collector daemon, which processes the incoming stream of ClassAd updates from the globally distributed execution points. Its performance and saturation behavior under increasing load are being studied, evaluating software improvements from the HTCondor development team and exploring potential configuration and hardware-based optimizations. In parallel, the vertical scalability of the job schedulers is examined, complementing existing horizontal scaling strategies with measurements of single-node throughput, shadow-process limits, and job-start rates. Finally, the ability of the CMS SI to sustain larger numbers of idle jobs by using the late-materialization feature of HTCondor, an essential capability in the context of the redesigned CMS workload-management model for Phase-2, will be investigated.
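As an illustrative sketch (not part of the CMS setup described above), late materialization is enabled per submission through the `max_materialize` and `max_idle` commands of an HTCondor submit description; the executable name and limits below are hypothetical:

```
# Hypothetical submit description sketch: with late materialization,
# the schedd holds a job factory and creates job ClassAds on demand,
# instead of loading all queued jobs into memory at submit time.
executable   = run_payload.sh
arguments    = $(ProcId)
output       = job_$(ProcId).out
error        = job_$(ProcId).err
log          = cluster.log

# Example limits: keep at most 2000 materialized jobs in the queue,
# with no more than 500 of them idle at any one time.
max_materialize = 2000
max_idle        = 500

queue 100000
```

Because only a bounded number of job ClassAds exist in the schedd at once, the per-job memory and queue-management cost stays roughly constant even as the total number of submitted jobs grows, which is what makes sustaining much larger idle-job backlogs feasible.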