# Experiments and HPC

David Cameron (University of Oslo) WLCG workshop, Lancaster, 8 Nov 2022



David Cameron, HPC at WLCG workshop, Lancaster 8.11.22



#### Disclaimer

These slides are an attempt to summarise the current state of HPC use by the four LHC experiments

They do not represent any official statement from any of the experiments

Any omissions or errors are purely my fault

Thanks to all who provided input and feedback



#### HTC vs HPC

#### Taken from EGI Glossary

| Term         | High Throughput Computing                                                                                                                                                                                                                                                                                                                                                                                             |
|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Abbreviation | HTC |
| Definition   | A computing paradigm that focuses on the efficient execution of a large number of loosely-coupled tasks. Given the<br>minimal parallel communication requirements, the tasks can be executed on clusters or physically distributed<br>resources using grid technologies. HTC systems are typically optimised to maximise the throughput over a long<br>period of time and a typical metric is jobs per month or year. |

- Long timescale
- Distributed worldwide
- Loosely connected
- Data intensive
- Designed for HEP needs



| Term         | High Performance Computing                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|--------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Abbreviation | HPC                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| Definition   | A computing paradigm that focuses on the efficient execution of compute intensive, tightly-coupled tasks. Given the<br>high parallel communication requirements, the tasks are typically executed on low latency interconnects which makes<br>it possible to share data very rapidly between a large number of processors working on the same problem. HPC<br>systems are delivered through low latency clusters and supercomputers and are typically optimised to maximise the<br>number of operations per second. The typical metrics are FLOPS, tasks/s, I/O rates. |

- Short timescale
- Single room
- Tightly connected
- Compute intensive
- Not designed for HEP



Argonne National Laboratory's Flickr page, CC BY-SA 2.0 https://creativecommons.org/licenses/by-sa/2.0, via Wikimedia Commons



## Barriers to exploiting HPC

- No network access from the worker nodes
- No CVMFS available
- No CE service to submit jobs
- No persistent storage
  - Data needs to be moved in and out for the job
- Jobs must use whole nodes or multiple nodes
- Sociological/political barriers
  - High level of security
    - Access only granted to certain individuals, credentials need to be renewed regularly
  - HEP software is viewed as low quality and untrusted
- Not all of these apply to all HPCs, and gradually these barriers are being lowered





#### So, why bother?



- A great, untapped and rapidly growing resource
  - More than 100x WLCG (WLCG estimated as 1 million 3 GHz cores x 10 Flops/cycle = 30 Pflop/s)
- A substantial part of national computing infrastructure investment now and in the future
- Potential for allocations or "free" opportunistic computing
- Interesting and motivating R&D and PR
- Improving flexibility of experiment workloads and services
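
The WLCG figure quoted above is simple arithmetic. A minimal check in Python, using only the numbers on the slide (the variable names are illustrative, and 10 Flops/cycle is the slide's own assumption):

```python
# Back-of-envelope WLCG peak performance, using the figures quoted on the slide.
cores = 1_000_000        # about 1 million cores
clock_hz = 3e9           # 3 GHz per core
flops_per_cycle = 10     # slide's assumed floating-point operations per cycle

wlcg_flops = cores * clock_hz * flops_per_cycle
wlcg_pflops = wlcg_flops / 1e15
print(f"WLCG estimate: {wlcg_pflops:.0f} Pflop/s")

# "More than 100x WLCG" then corresponds to more than this much HPC capacity.
hpc_pflops = 100 * wlcg_pflops
print(f"100x WLCG: {hpc_pflops:.0f} Pflop/s = {hpc_pflops / 1000:.0f} Eflop/s")
```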



#### HPCs used per experiment

- LHCb
  - Piz Daint in CSCS (Switzerland)
  - Marconi-A2 in CINECA (Italy) not used anymore
  - SDumont in LNCC (Brazil)
  - MareNostrum in BSC (Spain)
- ALICE
  - CINECA (Italy) used in 2020
  - LBNL (Berkeley, USA):
    - Lawrencium (in production)
    - Cori (to be decommissioned)
    - Perlmutter (being commissioned)

- ATLAS
  - Past:
    - Titan (Oak Ridge), Theta (Argonne), Edison (NERSC), Tianhe-1 (China), SuperMUC (Germany)
  - Currently:
    - Frontera (TACC), Cori (NERSC), Vega and Karolina (EuroHPC), Toubkal (Morocco), RIVR (Slovenia), Tokyo HPC (Japan)
    - As part of pledge: MareNostrum (BSC), CSCS (CH), HPC2N and NSC (NDGF)
  - Planned:
    - Perlmutter (NERSC) currently being commissioned



- CMS

| Machine            | Location         | Architecture*                | Status                    |
|--------------------|------------------|------------------------------|---------------------------|
| Piz Daint          | CH (CSCS)        | x86 + NVIDIA GPU             | Production                |
| CLAIX              | DE (RWTH Aachen) | x86 + NVIDIA GPU             | Production                |
| HoreKa             | DE (KIT)         | x86 + NVIDIA GPU             | Production                |
| Marconi 100        | IT (CINECA)      | POWER9 + NVIDIA GPU          | Validated, pre-production |
| MareNostrum 4      | ES (BSC)         | x86 (+ GPU)                  | Pre-production            |
| Cori               | US (NERSC)       | x86                          | Production                |
| Frontera           | US (TACC)        | x86                          | Production                |
| Stampede2          | US (TACC)        | x86                          | Production                |
| Bridges-2          | US (PSC)         | x86                          | Production                |
| Expanse            | US (SDSC)        | x86                          | Production                |
| Anvil              | US (Purdue)      | x86                          | Production                |
| Perlmutter         | US (NERSC)       | x86 + NVIDIA GPU             | Integration/commissioning |
| Summit             | US (OLCF)        | POWER9 + NVIDIA GPU          | Integration/commissioning |
| Frontier           | US (OLCF)        | x86 + AMD GPU                | Planned                   |
| Polaris            | US (ALCF)        | x86 + NVIDIA GPU             | Planned                   |
| Ookami HPC testbed | US               | ARM                          | Planned                   |
| Leonardo           | IT (CINECA)      | x86 + NVIDIA GPU             | Planned                   |
| MareNostrum 5      | ES (BSC)         | x86 / Arm + NVIDIA GPU (tbc) | Planned                   |
| Jureca             | DE (FZJ)         | x86 + NVIDIA GPU             | Planned                   |



#### CMS HPC use in M core hours per month Jan 2020 - Aug 2022






#### What runs there

- The main workflow is MC generation/simulation
  - Easy on I/O, small or no input data, low memory and few dependencies on external services
  - Stable software releases used for long periods
  - Not time-critical
- Where possible, other production workflows are run (eg CMS full chain on all DOE/NSF HPC)
- Some however run all workflows including analysis
  - Vega, CSCS and NDGF for ATLAS
  - HPCs at LBNL for ALICE (helped by co-located T2 site)
  - CSCS, CLAIX and HoreKa for CMS



ATLAS activities running at CSCS, last two years





#### Software availability

- CVMFS is the de facto standard throughout HEP for software distribution
- Some HPCs install CVMFS + squids natively, which makes things a lot easier
  - In fact it is mandatory for some experiments to use the resource
- Various solutions exist for HPC without CVMFS
  - Copy part of the tree (eg particular software release) to local file system and point jobs there
  - Install CVMFS as an unprivileged user with CVMFSExec
  - Develop tools such as subcvmfs-builder to deploy releases automatically
  - Pack software into a container in which the job runs
- Some of these solutions are generally only suitable for HPC running specific releases of a specific workflow
- To make use of HPC as a general purpose resource, CVMFS must be available natively, or at least a squid must be provided so that CVMFSExec can be used
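
The options above can be read as a decision on site capabilities. A minimal sketch, assuming a hypothetical `SiteCapabilities` record (not a real schema or tool) and taking the rule for general-purpose use directly from the last bullet:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SiteCapabilities:
    """Hypothetical description of what an HPC site offers."""
    native_cvmfs: bool = False
    squid_available: bool = False
    containers_allowed: bool = False
    local_filesystem: bool = False


def software_delivery_options(site: SiteCapabilities) -> List[str]:
    """List the delivery methods from the slide that the site could support."""
    options = []
    if site.native_cvmfs:
        options.append("native CVMFS")
    if site.squid_available:
        options.append("CVMFSExec via site squid")
    if site.containers_allowed:
        options.append("software packed into a container")
    if site.local_filesystem:
        options.append("copy of selected releases on local file system")
    return options


def usable_as_general_purpose(site: SiteCapabilities) -> bool:
    """Slide's rule: native CVMFS, or at least a squid for CVMFSExec."""
    return site.native_cvmfs or site.squid_available


restricted = SiteCapabilities(containers_allowed=True, local_filesystem=True)
print(software_delivery_options(restricted))
print(usable_as_general_purpose(restricted))
```

A site like `restricted` can still run specific releases of a specific workflow, but not arbitrary experiment workloads.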



## Data management and communication

- Two main issues to solve:
  - No local storage element
  - No network connectivity to access remote storage or experiment services
- Some sites provide proxies or gateways for network traffic
- Another method is to "tunnel" traffic through a login or edge node
- Otherwise typically solved by deploying an edge service which handles data transfers independently of jobs
- How does the pilot model work in the most heavily constrained environments?
  - Pilot runs on the edge node or even outside the HPC: pull job, do data transfer then submit payload to batch system
    - DIRAC PushJobAgent
  - Edge service which handles job communication, data transfers and submits to batch
    - ATLAS Harvester
  - Central service which handles job communication, and submits fully-formed jobs (payload and input/output) to edge service which handles data transfers
    - ATLAS ARC Control Tower (central) and ARC CE (at HPC)
  - Edge service which handles communication with jobs through shared file system
    - CMS at BSC
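
The edge-service variants above share one order of operations: pull the job, stage data in, run the payload without network access, stage data out. A toy sketch of that sequence follows. Every class and method name is hypothetical; this is not the Harvester, ARC or DIRAC API, and the batch submission is reduced to a direct function call.

```python
from collections import deque


class EdgeService:
    """Toy edge service running on a node with outbound connectivity."""

    def __init__(self, central_queue):
        self.central_queue = deque(central_queue)  # stand-in for the experiment job queue
        self.shared_fs = {}                        # stand-in for the HPC shared file system
        self.steps = []                            # record of operations, in order

    def pull_job(self):
        # Needs network access, so it happens on the edge node, not the worker.
        job = self.central_queue.popleft()
        self.steps.append(("pull", job["id"]))
        return job

    def stage_in(self, job):
        # Copy inputs to the shared file system before the payload starts.
        for name in job["inputs"]:
            self.shared_fs[name] = f"contents of {name}"
        self.steps.append(("stage_in", job["id"]))

    def run_on_worker(self, job):
        # The payload only touches the shared file system: no network calls.
        inputs = [self.shared_fs[name] for name in job["inputs"]]
        self.shared_fs[job["output"]] = f"{job['id']} processed {len(inputs)} input(s)"
        self.steps.append(("run", job["id"]))

    def stage_out(self, job):
        # Move the output off the HPC and clean up: no persistent storage on site.
        result = self.shared_fs.pop(job["output"])
        for name in job["inputs"]:
            self.shared_fs.pop(name, None)
        self.steps.append(("stage_out", job["id"]))
        return result

    def process_next(self):
        job = self.pull_job()
        self.stage_in(job)
        self.run_on_worker(job)  # in reality: submit to the batch system and wait
        return self.stage_out(job)


edge = EdgeService([{"id": "job-1", "inputs": ["evgen.root"], "output": "hits.root"}])
result = edge.process_next()
print(result)
print([step for step, _ in edge.steps])
```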





J.M.Hernandez et al, "Integration of BSC CPU resource in CMS"



A.Boyer, "Integrating DIRAC workflows in Supercomputers"



#### Other issues to solve

- Access to external services for eg conditions data
  - "Fat" container image with all software and conditions data
  - Mirror to local shared file system
  - Edge service proxy
- Batch system policies
  - Whole node scheduling required
    - Single pilot running many parallel single core jobs inside
    - Multi-threaded/multi-core jobs using whole node
  - Minimum nodes per job
    - "Fat pilots" running many jobs in a single batch job, ATLAS "Jumbo jobs"
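
Whole-node scheduling and minimum node counts turn job submission into a packing problem. A minimal sketch of sizing one "fat pilot", assuming a hypothetical helper `plan_fat_pilot` and an illustrative 128-core node (neither is from an experiment's tooling):

```python
import math


def plan_fat_pilot(n_payloads, cores_per_payload, cores_per_node, min_nodes=1):
    """Size a single batch job that holds n_payloads payloads.

    Whole nodes only, and never fewer than min_nodes nodes.
    """
    if cores_per_payload > cores_per_node:
        raise ValueError("payload does not fit on a single node")
    payloads_per_node = cores_per_node // cores_per_payload
    nodes_needed = math.ceil(n_payloads / payloads_per_node)
    nodes = max(nodes_needed, min_nodes)
    idle_cores = nodes * cores_per_node - n_payloads * cores_per_payload
    return {"nodes": nodes, "payloads_per_node": payloads_per_node, "idle_cores": idle_cores}


# 1000 single-core jobs on 128-core nodes, site requires at least 4 nodes per job.
big = plan_fat_pilot(1000, 1, 128, min_nodes=4)
print(big)

# Only 10 jobs available: the 4-node minimum leaves most cores idle.
small = plan_fat_pilot(10, 1, 128, min_nodes=4)
print(small)
```

The second case shows why a steady supply of payloads matters: with too few jobs, the minimum node count is paid for in idle cores.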



### Work involved from the experiment side

- Significant effort can be required from central experiment teams and site contacts (the people who have access)
  - Proportional to difficulty but not always to benefits
- Each HPC is a snowflake, with unique challenges
  - Although work invested into a single site can help for others
  - Specific technologies like edge services or container images can be used on multiple sites
- HPC lifespan is short (3-5 years) compared to WLCG sites
  - Continuous work to commission the next new one (although can be made easier by existing relationships with sites)



A.Boyer, "Integrating DIRAC workflows in Supercomputers"



## Evolution from the HPC side

- In the past, typically ssh access was granted to a login node
  - Could then use tricks like wrapping batch commands around ssh and using sshfs
- CSCS is a good example of a restrictive site that now looks like a (pledged) grid site
- Vega EuroHPC was designed to be HEP-friendly
  - Worker node connectivity, performant shared FS, access through ARC CE
  - ATLAS can use it when other users don't



Running ATLAS jobs on Vega, last 18 months, compared to entire WLCG pledge




M. Hostettler, "ATLAS computing on the HPC Piz Daint machine", CHEP 2015



Vega usage per user, last 3 months; ATLAS average 150k cores



#### Software architectures

- GPUs make up the majority of the FLOPS of the Top500
  - GPUs are becoming heavily used in the online world
  - ALICE has GPU-ready offline software today
  - CMS plans for opportunistic use of GPUs (offloading 10% of reconstruction) starting next year
  - For others not at significant scale before Run-4 if at all
    - Except small-scale ML (e.g. flavor tagging, fast simulation NNs, ...)
  - Experiments' views on GPUs are collected in a WLCG twiki page - time to update it?
- ARM builds are at various stages of validation for ATLAS, CMS and LHCb
- POWER is validated for production for CMS
- In general, a chicken and egg problem
  - If a resource is available work can be done to target it
  - We are not targeting specific resources because software is not available
- Also the thorny question of pledges



https://www.top500.org/statistics/treemaps/



#### Pledges

- Most HPC use is opportunistic, i.e. outside the WLCG pledge framework
  - Either dedicated allocations to groups/institutes or backfilling
- Could they be pledged instead of grid sites?
  - If they look like a grid site, yes
  - Already done for CSCS and parts of NDGF-T1
  - MareNostrum used as part of Spain T1 pledge for ATLAS and CMS (although it does not run all workflows)
  - INFN will pledge Leonardo (CINECA EuroHPC) as part of Italian T1 resources from 2023
- If not, difficult to accept as pledge
- In addition to running all workflows, the HPC must be constantly available at some level with a multi-year commitment
  - Not possible with allocation and fair-share models used by many HPCs
  - But in some cases (eg EuroHPC) multi-year allocations can be granted for long-term production
- How to pledge non-CPU resources? Plans to benchmark GPUs are still in the very early stages
  - But needs to be thought about years in advance of pledging these resources



#### How to organise better

- Perhaps it's useful to have a set of "WLCG requirements for HPC"?
  - Suggested requirements below are just to be able to use them, with probably stricter requirements to be pledged
  - Some experiments could run some workflows with even fewer requirements
- Minimum requirements to be used by all LHC experiments
  - CVMFS installed natively or infrastructure (squids and modern OS) which allows experiment to use CVMFSExec
  - Limited external network access from worker nodes (for communication, not data transfer)
  - A "normal" way to submit jobs (well-known batch system)
  - Allow running containers
  - Allow persistent edge service
  - x86 architecture (possibly soon other CPU architectures like ARM)
- Useful to have
  - Local persistent storage, accessible through our usual interfaces
  - Good enough hardware (memory, storage, network) to support all workflows
  - Experiment-friendly admin :)
- Table from CMS note looks like a nice format/starting point

| Category | Explanation | CMS standard solution | CMS preferred solution for HPC | CMS fallback workable solution (full utilizability) | CMS fallback solution (for a fraction of workflows) | CMS no-go scenario | Possible CMS developments to solve the no-go |
|----------|-------------|-----------------------|--------------------------------|------------------------------------------------------|------------------------------------------------------|--------------------|----------------------------------------------|
| Architecture | Base system architecture | x86_64 | x86_64 | x86_64 + accelerators (with partial utilization) | | Currently OpenPower and ARM; they could be used but at the price of physics validation | QEMU? Recompiling + physics validation? |
| Memory per thread/core | Memory available to each thread / process | 2 GB/thread | 2 GB/thread | Down to 0.5 GB/thread needs heavy multithreading, at the expense of CPU efficiency | GEN and SIM workflows need less than 2 GB/thread (0.5 GB/thread would be a limit) in order to run efficiently | Less than 0.5 GB/thread | |
| I/O | I/O demand per process | 5 MB/s/core | 5 MB/s/core | | GEN and SIM workflows are mostly CPU bound, still ok with 0.1 MB/s/core | Less than 0.1 MB/s/core | |
| Local scratch space | Local space per production job | 20 GB/thread | 20 GB/thread local | Less than 20 GB/thread ok if a shared high performance FS is available on all the machines. Large multithreading lowers the 20 GB/thread requirement to ~10 | Some CMS workflows run for hours without creating huge local disk areas (GEN, SIM) | No sizeable local space and no usable shared FS | |
| Outgoing networking | Needed on WNs in order to access remote data and conditions, and to speak to the CMS Global Pool | Full outgoing connectivity | Full outgoing connectivity | Connectivity to only a subset of the IP ranges (for example, to CERN, and to a close xrootd proxy cache). And to everywhere we have condor services? | NAT with a very limited bandwidth via an edge service | No outgoing connectivity from the compute nodes and no NAT available | Edge service running Harvester or HTCondor? Prepare a single container to be deployed at the edge and doing: NAT for Condor, Squid, Xroot proxy cache |

From CMS note 2020/002: "HPC resources integration at CMS", https://cds.cern.ch/record/2707936/files/NOTE2020\_002.pdf
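
The memory and I/O rows of the table can be turned into a quick site check. A rough sketch: the thresholds are the ones in the table, while the function names, the level labels and the idea of reporting the worst level as an overall verdict are assumptions made for illustration.

```python
# Levels ordered from best to worst; labels are illustrative, not CMS terminology.
LEVELS = ["standard", "fallback", "partial", "no-go"]


def classify_memory(gb_per_thread):
    """Memory row: 2 GB/thread standard, down to 0.5 with heavy multithreading."""
    if gb_per_thread >= 2.0:
        return "standard"
    if gb_per_thread >= 0.5:
        return "fallback"
    return "no-go"


def classify_io(mb_per_s_per_core):
    """I/O row: 5 MB/s/core standard, down to 0.1 for GEN and SIM only."""
    if mb_per_s_per_core >= 5.0:
        return "standard"
    if mb_per_s_per_core >= 0.1:
        return "partial"
    return "no-go"


def classify_site(gb_per_thread, mb_per_s_per_core):
    """Return the worst level across categories plus the per-category detail."""
    detail = {
        "memory": classify_memory(gb_per_thread),
        "io": classify_io(mb_per_s_per_core),
    }
    overall = max(detail.values(), key=LEVELS.index)
    return overall, detail


print(classify_site(2.0, 5.0))
print(classify_site(1.0, 0.5))
print(classify_site(0.4, 5.0))
```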

## Summary

- A lot of great work has been done to exploit difficult resources, with variable return on investment
  - From zero to equivalent of WLCG grid pledge
- Our HEP model has been bent to fit the HPC environment in a variety of ways
- But HPCs look like they are becoming more friendly to HEP
  - At least in terms of accessibility, if not architectures
- Current challenges may be more on the software side to exploit different architectures
- A lot of commonalities in our requirements, can we present a united front?







#### Acknowledgements

Thanks very much to the following people for providing input and material for this talk:

Federico Stagni, Maarten Litmaath, Latchezar Betev, Christoph Wissing, Tommaso Boccali, Daniele Spiga, James Letts, Danilo Piparo, Dirk Hufnagel, Andrej Filipcic, Lincoln Bryant

# Extras



## Use of Marconi A2 (CINECA)



Fig. 7. Left: total utilization of the Marconi A2 from April 2019 to February 2021. The brown area shows the amount of remaining grant (30 million core hours). Right: utilization by experiment as a fraction of the utilized 93 million core hours.

From https://pos.sissa.it/378/003/pdf