



# HPC integration in data intensive science

CERN, SKAO, GÉANT, PRACE: The European Consortium on Advances on HPC and applications to Fundamental Research

David Southwick (CERN)



### **Motivation**

The LHC expects more than an exabyte of new data for each year of the HL-LHC era, from 2029 to 2040.

This data must be exported in ~real time from CERN to compute sites.

SKAO expects similar requirements over a similar period.



ATLAS <u>https://indico.jlab.org/event/459/contributions/11470/</u> <u>https://cds.cern.ch/record/2815292</u> CMS <u>https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/UPGRADE/CERN-LHCC-2022-005/</u>



### Schedule





### Ramping up

A complex problem with many moving parts – All feasible methods to close the computing gap are being pursued

• Including HPC!

Astronomy and HEP see potentially large benefits in exploiting HPCs

Substantial technical investment in recent years has increased HPC usage





#### As we adapt

- Our consortium is ideally composed
  - HL-LHC and SKA have a burning physics need and in-depth knowledge of the algorithms employed
  - PRACE provides considerable experience in the system adaptation of software environments
  - GÉANT provides the infrastructure to take the computing to the many nodes that are needed to tackle the demand

#### PRACE | Tier-0 Systems in 2020



- MareNostrum (IBM), BSC, Barcelona, Spain: #38 Top 500
- Cluster at GAUSS @ LRZ, Garching, Germany: #13 Top 500
- Piz Daint (Cray XC50), CSCS, Lugano, Switzerland: #10 Top 500
- CINECA, Bologna, Italy: #9 Top 500
- GAUSS @ HLRS, Stuttgart, Germany
- **NEW ENTRY 2018**: JUWELS (Module 1), Atos/Bull Sequana, GAUSS @ FZJ, Jülich

**Close to 110 Petaflops** total peak performance

Source: The Partnership for Advanced Computing in Europe | PRACE, INFIERI 2021





## CERN, SKAO, GÉANT, PRACE Consortium

Maria Girone, CERN openlab CTO



- Consortium completed after 18 months (Dec. 2021)
- Four areas of work were identified as foundational and have continued to guide development since 2021:
  - Benchmarking
  - Data Access
  - Authentication and Authorization
  - Building a Common Center of Expertise

### The Four Pillars of the Collaboration




### Areas of work

- Benchmarking and Accounting
- Data Processing and Access
- Authentication and Authorization
- Software and Architectures
- Runtime Environments and Containers
- Provisioning
- Wide and Local Area Networking



# Benchmarking in HPC





### **Benchmarking and Accounting**

Adopting HPC compute resources presents several new challenges beyond traditional x86 workload development:

- Diverse compute architectures (ARM, POWER, x86, RISC-V)
- Heterogeneous accelerators (GPU, FPGA, Quantum\*)

We must understand and account for all combinations of the above in order to assess:

- Workload efficiency at runtime
- Efficiency of grant usage
- Mapping of users to resources

#### Benchmarking is used at CERN for:

- Efficiency
- Error detection
- Accounting
- Pledges
- Procurement



### **HPC Benchmarking**

HEP Benchmarking Suite: the next generation of benchmarking for the WLCG, replacing HEPspec06 (HS06), which has been in use for more than 15 years.

Historically benchmarking has been:

- Designed for WLCG compute environment
- Intended for procurement teams, site administrators
- Run first with VM containment, later with nested Docker images

#### None of these approaches are compatible with HPC!

- Refactor & re-tool for user execution at scale
- HEPscore now in transition phase to replace HS06
- <u>https://w3.hepix.org/benchmarking.html</u>

- **HEP workloads**
  - Reference HEP applications from multiple experiments
  - OCI containers
- **HEPscore**
  - Uses workloads from HEP experiments
  - Produces a single score (à la HS06)
- **HEP Benchmark Suite**
  - Orchestrator of multiple benchmarks (HS06, HEPscore, SPEC, etc.)
  - Central collector & reporter
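The idea of reducing several workload results to a single score can be sketched as follows. This is only an illustration: it assumes the combined score is a geometric mean of per-workload throughputs normalized to a reference machine, which the slides do not state, and the workload names and numbers are hypothetical.

```python
import math


def combined_score(workload_scores, reference_scores):
    """Geometric mean of per-workload scores normalized to a reference machine.

    Assumption: this aggregation rule is illustrative, not taken from the slides.
    """
    ratios = [workload_scores[name] / reference_scores[name]
              for name in reference_scores]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical throughputs in events/second (not measured values).
reference = {"gen_sim": 2.0, "digi_reco": 4.0, "analysis": 8.0}
measured = {"gen_sim": 4.0, "digi_reco": 4.0, "analysis": 4.0}

score = combined_score(measured, reference)
print(f"combined score relative to reference: {score:.3f}")
```

A geometric mean is a common choice for this kind of aggregate because a 2x gain on one workload and a 2x loss on another cancel out, regardless of the absolute throughput of each workload.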

### **HEP Benchmark Suite**



- **Minimal dependencies:** Python3 + container choice
- **Modular design:** snap-in workloads & modules
- **Repeatable & verifiable:** declarative YAML config
- **Designed for ease of use:** simple integration with any job scheduler
- **Variety of containment choices:** Singularity (incl. CVMFS Unpacked), Docker, Podman
- **Metadata + analytics:** automated reporting via AMQ



https://gitlab.cern.ch/hep-benchmarks/hep-benchmark-suite

### Automated HPC execution

Benchmarking Heterogeneous architectures

- Multi-arch as workloads become available (ARM, IBM Power ...)
- GPU accelerators (Madgraph5, MLPF)

#### Simple integration with SLURM, other job orchestrators
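As a rough sketch of what scheduler integration can look like, the snippet below generates a Slurm batch script that runs one benchmark instance per node. The `#SBATCH` options are standard Slurm directives; the job name, node count, wall time and the `./run_benchmark.sh` wrapper are hypothetical placeholders, not part of the HEP Benchmark Suite.

```python
def make_sbatch_script(job_name, nodes, walltime, command):
    """Return the text of a Slurm batch script running `command` once per node."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --ntasks-per-node=1",   # one benchmark instance per node
        f"#SBATCH --time={walltime}",
        "#SBATCH --exclusive",           # avoid sharing nodes while measuring
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical wrapper script; replace with the site's actual benchmark launcher.
script = make_sbatch_script("hep-bmk", 4, "02:00:00", "./run_benchmark.sh")
print(script)
```

The generated text would be written to a file and submitted with `sbatch`; requesting exclusive nodes keeps other users' jobs from perturbing the measurement.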



Hardware Samples



### Heterogeneous Benchmarking

- Combination of General-Purpose GPUs (GPGPU) and alternative architectures targeted by experiments for Run 4
- GPU benchmarks for production workloads that operate on GPGPU and CPU+GPGPU
- ARM workloads
- MadGraph event generation for GPU and Vector CPUs



| Process | Configuration | Madevent (262144 events): Total | Madevent: Momenta+unweight | Madevent: Matrix elements | Standalone CUDA: ME Throughput |
|---------|---------------|---------------------------------|----------------------------|---------------------------|--------------------------------|
| $e^+e^- \rightarrow \mu^+\mu^-$ | without CUDA | 17.9 s | 10.2 s | 7.8 s | $1.9 \times 10^6\ \mathrm{s}^{-1}$ |
| | +CUDA Tesla A100 | 10.0 s | 10.0 s | 0.02 s | $633.8 \times 10^6\ \mathrm{s}^{-1}$ |
| | ratio | 1.8 x | 1.0 x | 390 x | 334 x |
| $gg \rightarrow t\bar{t}gg$ | without CUDA | 209.3 s | 7.8 s | 201.5 s | $2.8 \times 10^3\ \mathrm{s}^{-1}$ |
| | +CUDA Tesla A100 | 8.4 s | 7.8 s | 0.6 s | $758.9 \times 10^3\ \mathrm{s}^{-1}$ |
| | ratio | 24.9 x | 1.0 x | 336 x | 271 x |
| $gg \rightarrow t\bar{t}ggg$ | without CUDA | 2507.6 s | 12.2 s | 2495.3 s | $1.1 \times 10^2\ \mathrm{s}^{-1}$ |
| | +CUDA Tesla A100 | 30.6 s | 14.1 s | 16.5 s | $170.7 \times 10^2\ \mathrm{s}^{-1}$ |
| | ratio | 82.0 x | 0.9 x | 151 x | 155 x |

#### Event generation speedup, Nvidia A100

https://indico.jlab.org/event/459/contributions/11829/
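The gap between the matrix-element speedup (hundreds of times) and the total speedup in the table is an instance of Amdahl's law: the momenta and unweighting step stays on the host and bounds the overall gain. The sketch below reproduces this with the gg → ttgg timings from the table; the function names are illustrative only.

```python
def overall_speedup(host_seconds, me_cpu_seconds, me_gpu_seconds):
    """Whole-job speedup when only the matrix-element (ME) part is offloaded."""
    return (host_seconds + me_cpu_seconds) / (host_seconds + me_gpu_seconds)


def speedup_ceiling(host_seconds, me_cpu_seconds):
    """Upper bound if the offloaded part took zero time (Amdahl's law)."""
    return (host_seconds + me_cpu_seconds) / host_seconds


# gg -> ttgg timings from the table: 7.8 s momenta+unweight on the host,
# matrix elements 201.5 s without CUDA and 0.6 s on the A100.
measured = overall_speedup(7.8, 201.5, 0.6)
ceiling = speedup_ceiling(7.8, 201.5)
print(f"overall {measured:.1f}x, ceiling {ceiling:.1f}x")  # overall 24.9x, ceiling 26.8x
```

For this process the GPU port is already close to the ceiling, so further gains would have to come from the host-side steps rather than from faster matrix elements.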



### ML/AI Benchmarking

#### Machine-learned particle-flow reconstruction algorithms (MLPF)

- Approach GPU workloads as repeatable benchmark
  - Containerized in similar manner to traditional CPU benchmarks.
  - Support (multi) GPU accelerators for training/tuning
  - Examine events/second processed (same metric as HEPiX CPU jobs)



#### CUDACPP vs SYCL on NVidia/AMD/Intel GPUs

- Nvidia GPUs: the performance of the SYCL implementation seems roughly comparable to direct CUDA for gg→ttgg. A more fine-grained analysis for different physics processes follows on the next slide of the source talk.
- Intel and AMD GPUs: the SYCL implementation runs out of the box.

Xe-HP is a software development vehicle for functional testing only; it is currently used at Argonne and other customer sites to prepare their code for future Intel data centre GPUs. Xe-HPC is an early implementation of the Aurora GPU.

Source: A. Valassi, "CPU vectorization and GPUs in Madgraph5\_aMC@NLO", CERN openlab workshop, 16 March 2023



#### Particleflow model training speed



### Understanding workload efficiency

Utilization at runtime is critical to benchmarking and production

- A PRmon plugin for the HEP Benchmark Suite enables profiling of CPU utilization
- Profile both native and containerized workloads
- Identify issues, acceptance testing, verification

PRmon source: https://github.com/HSF/prmon
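One common way to turn profiled counters into a utilization figure is CPU efficiency: consumed CPU time divided by the CPU time that was available. The sketch below assumes the profiler reports user time, system time and wall time in seconds; the argument names and sample values are hypothetical and are not PRmon's output format.

```python
def cpu_efficiency(user_seconds, system_seconds, wall_seconds, allocated_cores):
    """Fraction of the allocated CPU time that the workload actually used."""
    if wall_seconds <= 0 or allocated_cores <= 0:
        raise ValueError("wall time and core count must be positive")
    return (user_seconds + system_seconds) / (wall_seconds * allocated_cores)


# Hypothetical job: 4 cores for 1000 s of wall time, 3600 s of CPU time consumed.
eff = cpu_efficiency(user_seconds=3400.0, system_seconds=200.0,
                     wall_seconds=1000.0, allocated_cores=4)
print(f"CPU efficiency: {eff:.0%}")
```

A value well below 100% on a compute-bound workload usually points to I/O waits, serial phases, or a mismatch between requested and used cores, which is what acceptance testing aims to catch.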





### Energy efficiency

Energy efficiency is now considered a critical metric of performance

- Plugin to poll server power metrics (IPMI)
- Compare nvidia-smi, IPMI, and external metering
- Benchmark (BMK) reports include energy metrics from the CPU
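Polled power readings become an energy figure by integrating power over time. The following is a minimal sketch, assuming the plugin yields (timestamp, watts) samples; the sample values and event count are invented for illustration.

```python
def energy_joules(samples):
    """Trapezoidal integration of (timestamp_seconds, watts) samples."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


# Hypothetical power readings polled every 10 s during a benchmark run.
samples = [(0.0, 200.0), (10.0, 400.0), (20.0, 400.0), (30.0, 200.0)]
joules = energy_joules(samples)

events = 5000  # hypothetical number of events processed during the run
print(f"{joules:.0f} J consumed, {events / joules:.2f} events per joule")
```

Events per joule puts energy on the same footing as the events/second throughput metric, so the same benchmark run can report both.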









D. Southwick - INFIERI '23

### Some numbers

Initial models expect **1 Exabyte physics data processing in 100 days.** 

HEP experiments will no longer be able to store all the produced data at a single site – it must be streamed in **~realtime.** 

The goal is to stream and process 10 PB of physics data through an HPC site in a day: several hundreds of Gbps continuously.

- Challenge of increasing complexity: start with 10-20% goal (1PB), demonstrate management of hundreds of TBs data
- Maintain compute efficiency with high data rate in/out from/to storage & stream
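The volumes above translate into sustained network rates with simple arithmetic. The sketch below assumes decimal units (1 PB = 10^15 bytes, 1 EB = 10^18 bytes) and ignores protocol overhead; the function name is illustrative.

```python
def sustained_gbps(volume_bytes, seconds):
    """Average rate in gigabits per second needed to move a volume in a given time."""
    return volume_bytes * 8 / seconds / 1e9


PB = 10**15      # decimal petabyte, in bytes
EB = 10**18      # decimal exabyte, in bytes
DAY = 86_400     # seconds

exabyte_rate = sustained_gbps(1 * EB, 100 * DAY)   # 1 EB in 100 days
site_goal_rate = sustained_gbps(10 * PB, DAY)      # 10 PB through one site in a day
first_step_rate = sustained_gbps(1 * PB, DAY)      # initial 1 PB/day milestone

print(f"1 EB / 100 days : {exabyte_rate:.0f} Gbps")
print(f"10 PB / day     : {site_goal_rate:.0f} Gbps")
print(f"1 PB / day      : {first_step_rate:.0f} Gbps")
```

Under these assumptions, 1 EB in 100 days and 10 PB in a day are the same average rate, a little over 900 Gbps, while the 1 PB/day first step needs roughly a tenth of that.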



### Storage

HPC storage is typically built from a common set of commercial building blocks. Although standard, they are uniquely implemented at each site:

- Variable number of replications, metadata nodes, interconnect capabilities
- Little to no visibility into capabilities, usage, accounting, etc.

Lots of moving parts! Break it down into three general areas:

1. Data ingress/egress from the HPC center
2. Efficient usage of storage systems on site
3. Dynamic scaling interaction between (1) and (2)



### Shared filesystems

Traditional HPC workloads have low I/O demands, which makes running Big-Data workloads on the same systems highly problematic!

Compute-bound workloads dependent on shared file systems may be **effectively I/O bound** if scaled sufficiently

To avoid consuming a shared community resource, we need to understand what we can effectively scale to

- Workload throughput O(100 KB/s) to O(100 MB/s)
- Many workloads per host
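A back-of-the-envelope estimate shows how quickly per-workload throughput adds up on a shared filesystem. All figures below (node counts, workloads per node, filesystem bandwidth, fair share) are hypothetical and chosen only to illustrate the scaling.

```python
def aggregate_demand_gbs(nodes, workloads_per_node, per_workload_mbs):
    """Total I/O demand in GB/s from all workloads on all nodes."""
    return nodes * workloads_per_node * per_workload_mbs / 1000.0


def max_nodes(filesystem_gbs, share, workloads_per_node, per_workload_mbs):
    """Largest node count that stays within a given share of filesystem bandwidth."""
    per_node_gbs = workloads_per_node * per_workload_mbs / 1000.0
    return int(filesystem_gbs * share // per_node_gbs)


# Hypothetical campaign: 500 nodes, 64 workloads per node, 10 MB/s each.
demand = aggregate_demand_gbs(nodes=500, workloads_per_node=64, per_workload_mbs=10)

# Hypothetical site: 200 GB/s shared filesystem, of which we may use 10%.
limit = max_nodes(filesystem_gbs=200, share=0.1,
                  workloads_per_node=64, per_workload_mbs=10)

print(f"aggregate demand: {demand:.0f} GB/s, node limit at 10% share: {limit}")
```

With these numbers a modest per-workload rate already exceeds the whole filesystem, which is why a nominally compute-bound workload can become effectively I/O bound at scale.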





### Data formats

Data format drastically affects HPC storage efficiency:

- Writing data in storage format supporting parallel I/O
- Optimization: Tuning of parallel libraries to optimize the performance
- Adopting native object storage (HDF5) designed for parallel I/O
- Dramatically reduce random read during jobs







The separation of WLCG site responsibilities into a new "Data Lake" model for LHC data storage has introduced new standards and modernized capabilities. Leveraging better data access patterns to datasets, together with the latency-hiding advancements of XRootD/XCache, greatly reduces data transfer requirements:

- Rucio, a high-level data management layer, coordinates file transfers over several protocols (HTTP/WebDAV, XRootD, GridFTP, S3)
- FENIX: collaboration with HPC sites and ESCAPE to standardize data transfers







### **HPC Connectivity**

Successfully exploiting opportunistic HPC allocations demands high connectivity for data-driven workloads. CERN currently targets ~5 Tbps of connectivity from the CERN Tier-0 to compute sites by the time of HL-LHC. Without pre-placed data, the WAN connection of an HPC site may be the limiting factor for resource allocation.

The HPC data challenge brings together EU projects (CoE RAISE, InterTWIN), WLCG, and GÉANT to validate data-driven streaming and transfers:

- Leverage GÉANT Data Transfer Nodes (DTNs) around the EU for testing against the backbone network
- Testing UNICORE FTP (UFTP), FTS, and Rucio for open science with HPC
- Currently exercising 200 Gbps tests with the Jülich HPC Centre, DE



# Authentication & Authorization





### HPC and Authentication

HPC sites differ from traditional WLCG sites in their account creation and access policies:

- Varying levels of trust requirements
- Authentication methods (SSH, certificates, tokens, ...)
- Not reasonable to expect importation/trust of CERN computing accounts (16k+)



### **AAI Transformation**

The WLCG transition from certificate-based to token-based authorization carries through into HPC. Among several components of the ESCAPE project, the AAI work aims to bridge CERN AAI to HPC:

- Migration from X.509 certificates to OIDC token authentication: faster, and easier for establishing institutional trust
- Federated login (AuthN/AuthZ) for HPC via the eduGAIN federation and Puhuri
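To illustrate what a token carries compared with a certificate, the sketch below builds an unsigned demo JWT and decodes its claims using only the standard library. The issuer URL, subject and scope values are made up for the example, and the decoder deliberately skips signature verification, which a real service must never do.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def decode_claims(token: str) -> dict:
    """Decode the payload of a JWT WITHOUT verifying its signature (inspection only)."""
    _header_b64, payload_b64, _signature = token.split(".")
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# Build an unsigned demo token; every value here is hypothetical.
header = b64url(json.dumps({"alg": "none", "typ": "JWT"}).encode())
payload = b64url(json.dumps({
    "iss": "https://iam.example.org/",
    "sub": "user-1234",
    "scope": "storage.read:/ compute.create",
    "exp": 1700000000,
}).encode())
token = f"{header}.{payload}."

claims = decode_claims(token)
scopes = claims["scope"].split()
print(claims["iss"], claims["sub"], scopes)
```

The scopes express what the bearer may do, so a site can authorize a request from the token alone without importing the user's home-institution account.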

ESCAPE IAM has been integrated into the EOSC AAI federation in collaboration with GÉANT.



ESCAPE project completed Summer 2022 after 42 months







### **Future Direction**

Much effort has been invested into HPC adoption in the past years, but challenges still remain:

- Integrating independent machines as single entities, each requiring specific integration
- Access and usage policies, available services, system architectures, and machine lifetime
- Software deployment, and edge services for data and workflow management

Moving towards a general-purpose HPC: addressing HPC as a common machine

- Enable flexible and elastic expansion of the resources available to big-data sciences



#### SPECTRUM: Computing Strategy for Data-Intensive Science Infrastructures in Europe

Objective:

Deliver a Strategic Research, Innovation and Deployment Agenda (SRIDA) which defines the vision, overall goals, main technical and non-technical priorities, investment areas and a research, innovation and deployment roadmap for data-intensive science and infrastructures during 2025-2035

Vision:

Data-intensive scientific collaborations have access to a European exabyte-scale research data federation and compute continuum

Duration:

From 2024, 30 Months

Members:

EGI, CERN, SKAO, INFN, LOFAR, CNRS/JPV, EuroHPC (FZJ, CINECA, SURF), Other partners being contacted





Approved for 2024!






### Portable frameworks

|               | CUDA | Kokkos | SYCL                               | HIP                          | OpenMP                                              | alpaka                | std::par    |
|---------------|------|--------|------------------------------------|------------------------------|-----------------------------------------------------|-----------------------|-------------|
| NVIDIA<br>GPU |      |        | intel/llvm<br>ComputeCpp           | hipcc                        | nvc++<br>LLVM, Cray<br>GCC, XL                      |                       | nvc++       |
| AMD GPU       |      |        | openSYCL<br>intel/llvm             | hipcc                        | AOMP<br>LLVM<br>Cray                                |                       |             |
| Intel GPU     |      |        | oneAPI<br>intel/llvm               | CHIP-SPV:<br>early prototype | Intel oneAPI<br>compiler                            | prototype             | oneapi::dpl |
| x86 CPU       |      |        | oneAPI<br>intel/llvm<br>ComputeCpp | via HIP-CPU<br>Runtime       | nvc++<br>LLVM, CCE,<br>GCC, XL                      |                       |             |
| FPGA          |      |        |                                    | via Xilinx<br>Runtime        | prototype<br>compilers<br>(OpenArc, Intel,<br>etc.) | prototype via<br>SYCL |             |

CHEP 2023 https://indico.jlab.org/event/459/contributions/11807

