4–8 Nov 2019
Adelaide Convention Centre
Australia/Adelaide timezone

Exploiting network restricted compute resources with HTCondor: a CMS experiment experience

7 Nov 2019, 11:00
15m
Riverbank R1 (Adelaide Convention Centre)

Oral Track 9 – Exascale Science

Speaker

Jose Flix Molina (Centro de Investigaciones Energéticas Medioambientales y Tecno)

Description

In view of the increasing computing needs of the HL-LHC era, the LHC experiments are exploring new ways to access, integrate and use non-Grid compute resources. Accessing and making efficient use of Cloud and supercomputer (HPC) resources presents a variety of challenges. In particular, network restrictions on the compute nodes at HPC centers prevent CMS pilot jobs from connecting to the experiment's central HTCondor pool to receive the payload jobs to be executed. To cope with this limitation, new features have been developed in both HTCondor and the CMS resource scheduling and workload management infrastructure. In this novel approach, a bridge is set up outside the HPC center and the communications between HTCondor daemons are relayed through a shared file system. We have used this strategy to exploit the resources of the Barcelona Supercomputing Center (BSC), the main Spanish HPC site. CMS payloads are claimed by HTCondor startd daemons running at the nearby PIC Tier-1 center and routed to BSC compute nodes through the bridge. This fully connects the CMS HTCondor-based central infrastructure to BSC resources via the PIC HTCondor pool. Other challenges have included building custom Singularity images with CMS software releases, bringing conditions data to payload jobs, and custom data handling between BSC and PIC. This contribution describes the technical prototype, its deployment, and the functionality and scalability tests performed, along with the results obtained when exploiting the BSC resources using these novel approaches. A key aspect of the technique is that it could be employed in similar HPC environments elsewhere.
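
As a rough illustration of the general idea of relaying messages through a shared file system when two sides cannot open a network connection to each other, the following Python sketch passes messages through spool directories. It is not the HTCondor mechanism described in this contribution: the function names (`post_message`, `drain_messages`), the directory names (`to_node`, `to_bridge`), and the JSON message format are all hypothetical, and a temporary directory stands in for the shared file system.

```python
import json
import os
import tempfile
import uuid


def post_message(spool_dir, payload):
    """Write one message into spool_dir atomically and return its file name."""
    name = f"{uuid.uuid4().hex}.msg"
    tmp_path = os.path.join(spool_dir, name + ".tmp")
    with open(tmp_path, "w") as fh:
        json.dump(payload, fh)
    # Rename only once the file is complete, so a reader never sees a partial message.
    os.replace(tmp_path, os.path.join(spool_dir, name))
    return name


def drain_messages(spool_dir):
    """Read and delete every complete message currently in spool_dir."""
    messages = []
    for name in sorted(os.listdir(spool_dir)):
        if not name.endswith(".msg"):
            continue  # skip files that are still being written
        path = os.path.join(spool_dir, name)
        with open(path) as fh:
            messages.append(json.load(fh))
        os.remove(path)
    return messages


if __name__ == "__main__":
    shared = tempfile.mkdtemp()  # hypothetical stand-in for the shared file system
    to_node = os.path.join(shared, "to_node")
    to_bridge = os.path.join(shared, "to_bridge")
    os.makedirs(to_node)
    os.makedirs(to_bridge)

    # "Bridge" side: hand a request to the network-restricted side.
    post_message(to_node, {"job": "payload-1"})

    # "Node" side: pick up requests and write replies back through the file system.
    for msg in drain_messages(to_node):
        post_message(to_bridge, {"job": msg["job"], "status": "done"})

    print(drain_messages(to_bridge))
```

Writing to a temporary name and renaming afterwards is the main design choice here: the reader only ever sees complete messages, which matters when both sides poll the same directory independently.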

Consider for promotion: Yes

Primary author

Jose Flix Molina (Centro de Investigaciones Energéticas Medioambientales y Tecno)
