ATS GPU Requirements

Europe/Zurich
31/S-028 (CERN)

Zoom Meeting ID: 63938653081
Host: Ricardo Rocha

Present: Eric Bonfillou, Laurence Field, Nils Hoimyr, Giovanni Iadarola, Bernd Panzer-Steindel, Ricardo Rocha, Ulrich Schwickerath


Ricardo Rocha briefly gave an update of the initiative's milestone plan.

 

Bernd Panzer-Steindel suggested contacting ALICE regarding the AMD GPU cards, as they might have some available during the technical stop.


Giovanni Iadarola presented the GPU resource needs of the ATS sector for the coming years, for both batch and interactive workloads.

 

He started by presenting the batch needs for 2023, 2024 and 2025.

Bernd asked if multi-GPU jobs are a requirement. Nils Hoimyr also asked if low latency connectivity is a requirement. Giovanni replied that neither should be a general requirement, but that he would double-check whether some groups in ATS might need them.

Ulrich Schwickerath asked about the reported storage issues, and whether local disk space would help, as the new A100 nodes have large SSDs. Giovanni replied that shared filesystems help a lot for checkpoint / restart of jobs, and that a large amount of local disk is not a requirement as most data stays in RAM.

 

Giovanni went on to present the requirements for interactive workloads. He mentioned this is an area where activities are still taking shape, in contrast to batch usage, which is well established. Use cases include ML, development for simulation and interactive design. Options used to date include workstations, VMs, notebooks, and public cloud resources.

 

Laurence Field commented that an lxplus service with GPUs (mentioned in the doc) already exists. Giovanni replied that the requirement is "lxplus-like" with some guarantee in terms of performance, which is not the case for today's lxplus setup.

Nils asked if the existing interactive batch mode would be enough. Giovanni mentioned the goal is to avoid queuing, and that they need a way to guarantee immediate access to the resources, at least up to a certain quota. An additional requirement is to keep the session open for a long period.

 

Giovanni continued presenting the requirements for such an lxplus-like service, described in more detail in the document.

 

Nils asked about shared filesystem requirements for this use case. Giovanni said CephFS as it is in HPC is perfect. Ricardo asked about the kind of authentication available with CephFS. Nils mentioned this is currently unix auth, with no strong auth available. Giovanni mentioned the main requirement is that the working environment survives across sessions.

 

Giovanni asked the room if these numbers are realistic. Bernd mentioned it is important to decide things soon, as getting 120 GPUs ready for 2024 means ordering already next year given the existing delays due to supply chain. He added the cost should be on the order of 2.5 to 3 million and that a decision on who pays is also needed. Current resources have been bought on a split of 90% for WLCG and 10% for the rest of the workloads, which is not how they are assigned today, but that could change. He gave as an example the possible ramp-up from the team working on madgraph, where significant progress has been made. Bernd mentioned a decision before Christmas would be ideal to comply with the expected delivery dates. Giovanni suggested raising this urgent item in the ATS-IT committees.
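As a rough cross-check of the figures quoted above, the implied cost per GPU can be worked out as below. This is only a sketch: it assumes the 2.5 to 3 million estimate covers the full 120 GPUs, the currency is not stated in the minutes, and the names used are illustrative.

```python
# Figures quoted in the minutes. The currency is not stated there,
# and it is an assumption that the estimate covers all 120 GPUs.
GPU_COUNT = 120
COST_RANGE = (2.5e6, 3.0e6)  # lower and upper total cost estimate


def cost_per_gpu(total_cost: float, gpu_count: int) -> float:
    """Return the average cost per GPU for a given total cost."""
    return total_cost / gpu_count


low, high = (cost_per_gpu(cost, GPU_COUNT) for cost in COST_RANGE)
print(f"Implied cost per GPU: {low:,.0f} to {high:,.0f}")
```

Under these assumptions the estimate corresponds to roughly 21 to 25 thousand per GPU.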

 

Ricardo added this is the requirements gathering round from ATS, and that other sectors will soon come up with their numbers as well.

 

Bernd suggested also investigating Nvidia Hopper, in the same way as was done earlier for Ampere. Ricardo added Hopper should already be available at some of the public cloud providers.

 

Ricardo asked if there are numbers for the split between small, medium and large jobs in batch, as this was mentioned during the presentation but the table doesn't express it. He added that this is where MIG could be very beneficial. Giovanni suggested having a meeting with the whole ATS-IT GPU initiative group to better understand this split. Ricardo suggested also looking at the batch usage data to see if this view can be obtained from the real data as well. Bernd mentioned this will definitely help in the TCO for V100 vs A100.

 

Giovanni asked what the best solution for interactive access would be on a 6-month timescale. Laurence mentioned that teams like madgraph coordinate shared resources (VMs) among themselves. Giovanni said this would be hard to achieve in their teams. Ricardo suggested taking this back to ATS-IT as an item as well, as it could be a spin-off initiative for the future.

 

Giovanni asked about access to public cloud resources, given that some teams in ATS expressed the wish to have a central way to access public cloud resources for activities such as benchmarking. It is expected this would initially be a request for a small amount of resources. Bernd mentioned there is an ongoing discussion in IT that is not yet settled and suggested this shouldn't take a major role in these discussions for now.

 

ACTION (Giovanni/Ricardo): Raise the urgency of an early decision regarding future GPU resources for ATS in the ATS-IT committees. If a tender is foreseen, take into account current supply chain issues (delivery times of up to 14 months)

ACTION (Giovanni/Ricardo): Indicate to the ATS-IT committees a possible new initiative request regarding improvements for GPU interactive access

ACTION (Batch/Laurence): Check batch data to understand a possible split between small, medium and large ATS jobs

ACTION (Ricardo): Redo GPU benchmarks with Nvidia Hopper resources

ACTION (Ricardo): Check with ALICE regarding usage of AMD GPU cards for benchmarking
