GridPP Technical Meeting - GPUs in GridPP7

Europe/London
Virtual Only


Alastair Dewhurst (Science and Technology Facilities Council STFC (GB)), Andrew McNab (University of Manchester), David Colling (Imperial College (GB))
Description

Weekly meeting slot for technical topics. We will try to focus on one topic per meeting. We will announce at the Tuesday Ops meeting whether this meeting is going ahead and, if so, the topic to be discussed.

 

 

Videoconference
GridPP Technical Meeting
Zoom Meeting ID
95305495507
Host
Alastair Dewhurst
Alternative hosts
Samuel Cadellin Skipsey, Matthew Steven Doidge

The primary reason for deploying non-x86 architectures on the Grid is to run jobs more efficiently.  The Grid relies on a high-throughput computing paradigm, so the absolute speed of the jobs is less important than the cost (in terms of both money and carbon) of running them.  The Grid has yet to deploy non-x86 architectures at any scale, but there is a high likelihood that this will happen during GridPP7.  This is because:

- During LS2, CMS upgraded their HLT farm with GPUs, and these have been used since the start of Run 3.  Parts of the CMS software (e.g. Reco) are proven to run on GPUs.

- ATLAS have demonstrated that their code compiles and runs on ARM architectures, and a large-scale validation is planned before GridPP7 starts.

- The WLCG have agreed to adopt the HEPScore benchmark for official accounting.  This benchmark will support jobs run on different architectures.
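
The cost argument above can be made concrete with a small sketch. It compares two hypothetical worker nodes by the money and operational carbon needed to process a fixed amount of work rather than by raw speed. The `Node` fields, the function names and every number below are illustrative assumptions, not measurements from any site.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical description of a worker node; all numbers are illustrative."""
    name: str
    events_per_hour: float   # throughput of the workload on this node
    power_kw: float          # average power draw under load
    hourly_cost_gbp: float   # amortised hardware and hosting cost per hour


def cost_per_million_events(node: Node, gbp_per_kwh: float) -> float:
    """Money cost (GBP) to process one million events on this node."""
    hours = 1_000_000 / node.events_per_hour
    return hours * (node.hourly_cost_gbp + node.power_kw * gbp_per_kwh)


def carbon_per_million_events(node: Node, kg_co2_per_kwh: float) -> float:
    """Operational carbon (kg CO2e) to process one million events on this node."""
    hours = 1_000_000 / node.events_per_hour
    return hours * node.power_kw * kg_co2_per_kwh


# Made-up numbers: node-b is slower than node-a but draws less power.
fast = Node("node-a", events_per_hour=100_000, power_kw=0.5, hourly_cost_gbp=0.30)
slow = Node("node-b", events_per_hour=80_000, power_kw=0.25, hourly_cost_gbp=0.20)

for node in (fast, slow):
    print(node.name,
          round(cost_per_million_events(node, gbp_per_kwh=0.25), 2),
          round(carbon_per_million_events(node, kg_co2_per_kwh=0.2), 3))
```

With inputs like these the slower node can still be the cheaper one per unit of work, which is the comparison that matters for a high-throughput system.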

 

Currently, small GPU queues are deployed at several sites but see very little utilisation.  More sites are receiving investment from their universities to purchase GPUs.

 

 

Tasks for GridPP7

  1. Deployment of resources at sites (Several sites)

Several sites have already deployed GPUs or other non-x86 architectures into their batch farms.  While the setup will inevitably need some tweaking, this is not considered to be outside the realms of normal DevOps work.

 

  2. Benchmarking and accounting (RAL APEL, GridPP management)

We will need to benchmark and account for the non-x86 resources deployed.  The development work going into APEL will allow this to be recorded.  It is expected that the VOs will produce GPU benchmarking code that will allow us to compare hardware.
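
As a rough illustration of what benchmark-weighted accounting across architectures could look like, the sketch below sums wall time multiplied by allocated slots and a per-slot benchmark score, grouped by VO and architecture. The `JobRecord` fields and the scores are hypothetical; this is not the APEL record format or the HEPScore definition.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class JobRecord(NamedTuple):
    """Hypothetical accounting record; not the APEL message format."""
    vo: str
    architecture: str      # e.g. "x86_64", "aarch64", "gpu"
    wall_hours: float
    slots: int             # cores (or GPUs) allocated to the job
    score_per_slot: float  # benchmark score of one slot on that hardware


def normalised_usage(records: Iterable[JobRecord]) -> Dict[Tuple[str, str], float]:
    """Sum benchmark-weighted slot-hours per (VO, architecture)."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for r in records:
        totals[(r.vo, r.architecture)] += r.wall_hours * r.slots * r.score_per_slot
    return dict(totals)


# Made-up records and scores, purely to show the shape of the calculation.
records = [
    JobRecord("atlas", "x86_64", wall_hours=2.0, slots=8, score_per_slot=10.0),
    JobRecord("atlas", "aarch64", wall_hours=2.0, slots=8, score_per_slot=12.5),
    JobRecord("atlas", "x86_64", wall_hours=1.0, slots=4, score_per_slot=10.0),
]
usage = normalised_usage(records)
```

Keeping the architecture in the grouping key means usage on different hardware stays separable in reports even after normalisation.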

 

It should be noted that, should usage of non-x86 architectures increase significantly, it will be worth organising sites to specialise in specific hardware, thus avoiding a situation where every site has multiple queues for different resources but very little of each.

 

  3. VO job submission / onboarding (RAL VO Liaisons, Liverpool?, Glasgow?, IC)

We will need to work with the VOs to ensure that they can successfully submit jobs that target the required resources at sites.  There will also need to be a period of proactive support for those running on non-x86 architectures to ensure that their jobs don’t get stuck or break: a delay of several weeks in getting jobs to run on the Grid could set back adoption by months.  We will also need to ensure that the VOs submit the relevant test jobs.

 

Imperial will also need to develop and implement the relevant features in DIRAC to allow the VOs using it to submit jobs to different resources (note that we didn’t have an IC person at the meeting to confirm what was needed).

 

  4. WLCG Leadership (RAL VO Liaisons?, Manchester?, Glasgow?)

GridPP should ensure it has a leadership role in the WLCG Working Groups or Task Forces that will be required to roll out non-x86 architectures across the Grid.

 

  5. Utilising GPUs efficiently (QMUL, Glasgow?)

While it would be a significant step forward for VO code to run reliably on GPUs, the aim is for it to run more efficiently than on x86 hardware.  If a job does not utilise the GPU for the majority of its run time then it might not be beneficial.  Technologies such as Nvidia Multi-Instance GPU (MIG) allow a GPU to be “split” into several less powerful GPUs, which might allow batch farms to broker jobs better, giving higher occupancy of the GPU.  This would require proactive monitoring of GPU utilisation as well as working with the VO code developers.
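
A minimal sketch of the kind of utilisation monitoring this implies is shown below. It assumes `nvidia-smi` is installed on the worker node and uses its `--query-gpu=utilization.gpu` query; the thresholds, the function names and the "candidate for sharing" rule are hypothetical and would need tuning against real workloads.

```python
import subprocess
from statistics import mean
from typing import List


def parse_utilisation(csv_text: str) -> List[int]:
    """Parse one integer percentage per line, skipping non-numeric entries
    (some configurations report a placeholder instead of a number)."""
    values = []
    for line in csv_text.splitlines():
        field = line.strip()
        if field.isdigit():
            values.append(int(field))
    return values


def sample_gpu_utilisation() -> List[int]:
    """Return the current utilisation (%) of each GPU on this host.

    Requires the nvidia-smi tool shipped with the NVIDIA driver.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_utilisation(out)


def busy_fraction(samples: List[int], busy_threshold: int = 10) -> float:
    """Fraction of samples in which the GPU was at or above the threshold."""
    if not samples:
        return 0.0
    return sum(s >= busy_threshold for s in samples) / len(samples)


def suggest_partitioning(samples: List[int], max_mean: float = 30.0) -> bool:
    """Hypothetical rule of thumb: flag a GPU as a candidate for sharing
    (e.g. via MIG) when its mean utilisation stays below max_mean percent."""
    return bool(samples) and mean(samples) < max_mean
```

In practice `sample_gpu_utilisation()` would be called periodically over the lifetime of a job and the samples for each GPU passed to `busy_fraction` and `suggest_partitioning`.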

 

  6. Utilising ARM efficiently (Glasgow, RAL?)

Early studies by Glasgow have shown that ATLAS benchmark code runs significantly more efficiently on ARM CPUs than on x86.  A large-scale validation is expected to take place before GridPP7.  There are many different types of workflow, and optimisation is likely to be necessary to ensure that jobs continue to run efficiently; it may turn out that some jobs continue to require x86.  It should also be noted that the Ampere Altra and Ampere Altra Max are currently the only datacentre-focused ARM CPUs available.  This should change soon with the release of Nvidia’s Grace CPU (as used in the Grace Hopper Superchip), and potentially others in future.  We need to understand how the different architectures impact job performance.

 

  7. Testing different GPU hardware and APIs (RAL)

Most Tier-2 sites have up to 30 GPUs available for use.  At RAL, the SCD Cloud currently has 750 GPUs, a number that is expected to continue to grow.  RAL is in an excellent position to test the performance of a wide range of GPU hardware.  GPU performance can be heavily tied to the API used (e.g. which version of CUDA or OpenCL).

There are minutes attached to this event.
    • 11:00 - 11:20
      GPU Development (20m)
      Speakers: Alastair Dewhurst (Science and Technology Facilities Council STFC (GB)), David Britton (University of Glasgow (GB))
    • 11:20 - 11:45
      Discussion (25m)
      Speaker: Alastair Dewhurst (Science and Technology Facilities Council STFC (GB))
    • 11:45 - 12:00
      AoB (15m)