WLCG Sustainability Forum Meeting #4: Detector Simulation
In the last meeting we looked at the energy-efficiency improvements achieved by the event generator community. The next step in the chain, after hadronisation, is detector simulation, which accounts for the largest individual fraction of an experiment's computing needs.
Introduction - M. Schulz
- Detector simulation is a major consumer of cycles and storage (ATLAS example).
- Sessions have covered power accounting, embodied carbon, and the use of generators on GPUs.
- Discussions aim to integrate technical aspects into a holistic picture of the CO2e footprint.
- Where we are / what happens next?
  - [ENV1: power measurements] Much progress ongoing in the HEPiX Benchmarking team.
  - [ENV2: efficient hardware, common software] A common strategy and benchmarking are needed for both hardware and software.
  - [ENV3: sustainability plan] Still in its infancy.
- Vision: the HL-LHC is near, but far enough away that there is still time to discuss what we can do better.
- Common motivation: higher efficiency is critical both for surviving the HL-LHC and for the environmental impact.
- Moving from a "pull" strategy (forum) to a "push" strategy (what can we do to improve?).
- For connections to the broader community: register for SC4RC (https://indico.cern.ch/event/1526482), a multidisciplinary Sustainability Conference 4 Responsible Research Computing (not related to the Forum).
AdePT - Juan Gonzalez
Project Overview and Goal
- AdePT (Accelerated Particle Transport): accelerating Geant4 simulations using GPUs.
- Offloads e-, e+, and gammas to the GPU: "GPU-friendly" physics plus a large fraction of compute time (50-80%) → potential speedup of 2x to 5x.
- Lightweight Geant4 plugin whose components include:
  - the AdePT core for kernel scheduling and particle tracking;
  - G4HepEm for physics functions on the GPU and specialized tracking on the CPU (giving a 20% speedup over Geant4 on CPU);
  - VecGeom for geometry and navigation.
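The quoted 2x-5x ceiling follows directly from Amdahl's law applied to the offloaded fraction. A minimal sketch (the 50-80% fraction is from the talk; the function name is ours, and we assume the offloaded work becomes effectively free on the GPU):

```python
def max_speedup(offloaded_fraction: float) -> float:
    """Amdahl's law upper bound on overall speedup when the
    offloaded fraction of the work takes (effectively) zero time."""
    return 1.0 / (1.0 - offloaded_fraction)

# 50-80% of compute time is spent on e-/e+/gamma transport
print(round(max_speedup(0.5), 3))  # 2.0
print(round(max_speedup(0.8), 3))  # 5.0
```

The remaining serial fraction (geometry for hadrons, MC truth, I/O) bounds what any accelerator can deliver, which is why the achievable speedup depends so strongly on how much of the workload is electromagnetic.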
Current Status and Challenges
- AdePT can run close-to-production simulations for LHCb, CMS, and ATLAS on GPU.
- "Close-to" means running the full production setup except for parts that cannot handle parallel tracking, primarily MC truth.
- Technical challenges for parallel tracking remain.
Performance and Physics Results
- AdePT on GPU shows excellent agreement with Geant4 on CPU in Athena (ATLAS simulation environment).
- Performance in Gauss (LHCb simulation environment):
  - Nvidia RTX 4090 GPU with a 16-core AMD Ryzen 9 CPU: max. speedup factor 2.7x; current speedup 1.6x at 128 threads. Oversubscription (4x) is needed to fill the GPU.
  - 4x A100 GPUs with a 64-core AMD EPYC 7763 CPU: max. speedup factor 2.7x; current speedup 1.8x at 96 threads. Oversubscription (3x) is needed to fill the GPU.
- Performance in Athena (ATLAS simulation environment):
  - 4x A100 GPUs with a 64-core AMD EPYC 7763 CPU: max. speedup factor 1.8x; achieved 1.4x.
  - NGT cluster (8x H100 GPUs with a 96-core AMD EPYC 9654 CPU): max. speedup factor 1.8x; achieved 1.4x.
Energy Efficiency
- Energy consumption is highly hardware-dependent.
- On Perlmutter nodes, GPUs idle at a relatively high power, which is likely intended to prevent large temperature fluctuations.
- Possible situations in which buying or using available GPUs is recommended for energy efficiency:
  1. Lower energy usage offsets the cost of the GPU node over its lifetime.
  2. Lower energy usage than a CPU node, but it won't offset the cost of the GPU node.
  3. Running on the GPU uses less energy than using only the CPU and keeping the GPU idle.
- On Perlmutter:
  - Athena: a speedup of 2.6x is required to beat a CPU-only run.
  - Gauss: a speedup of 2.4x is required to beat a CPU-only run.
- On the NGT cluster for Athena: case 3 is not fulfilled; a speedup of 1.5x would be needed.
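The break-even speedups quoted above follow from equating the energy of a GPU-assisted run with that of a CPU-only run on the same node. A minimal sketch, assuming constant power draws over the run; the wattages below are illustrative placeholders, not measurements from the talk:

```python
def breakeven_speedup(p_cpu_gpu_run: float, p_gpu_active: float,
                      p_cpu_only: float, p_gpu_idle: float) -> float:
    """Speedup s at which a GPU-assisted run uses the same energy as a
    CPU-only run on the same node (with the GPU left idle).

    CPU-only energy:     (p_cpu_only + p_gpu_idle) * T
    GPU-assisted energy: (p_cpu_gpu_run + p_gpu_active) * T / s
    Setting the two equal and solving for s gives the break-even point.
    """
    return (p_cpu_gpu_run + p_gpu_active) / (p_cpu_only + p_gpu_idle)

# Illustrative numbers: 400 W CPU during the GPU run, 300 W active GPU,
# 450 W CPU-only, 50 W idle GPU -> the GPU run must be >= 1.4x faster
# just to break even on energy.
print(round(breakeven_speedup(400, 300, 450, 50), 2))  # 1.4
```

This also shows why high idle power matters: a large `p_gpu_idle` lowers the break-even speedup only when the GPU sits idle anyway, i.e. it penalizes the CPU-only baseline rather than rewarding the GPU run.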
Future Work
- Understand and further improve the energy efficiency of AdePT, e.g. by studying the effect of power and frequency capping on the CPU and GPU.
- Reliably benchmarking different hardware is crucial.
- The HEPiX Benchmarking Working Group is exploring an ATLAS + AdePT workload, available in a preliminary version.
- The team is looking forward to exploring different hardware setups with help from WLCG.
- A preliminary Docker image for ATLAS-FullSim-GPU is available for testing.
Q&A
Q: I noticed you did not report anything regarding CMS except for the acknowledgments. Is that planned for the future?
A: We have performance data for CMS, but we have not yet measured the energy consumption. For this specific talk, we felt it did not make sense to show only the performance results. There is definitely a plan for the future to evaluate performance in the NGT cluster. That makes sense; you could likely use the CERN facility.
Are there any other comments or questions on Zoom?
Q: The power gains are relatively modest—certainly not a factor of five. If you project this over the typical hardware lifecycle and consider the embodied $CO_2$ of the GPUs, what would be the conclusion? We seem far from the point where we are actually beating a CPU run.
A: We have not looked into this extensively yet. For the moment, it does not make sense to purchase GPUs for this specific use case because we are not yet more efficient than a CPU run. This is why we want to investigate different types of hardware; it is currently difficult to tell with the equipment we have.
Comment: I was hoping for a more ambitious goal. The problem is that GPU lifetimes are shorter, embodied carbon is high, and idle power is significant. I understand that frequency scaling on GPUs offers better gains than on x86 CPUs. There may be mileage in reducing the frequency of the GPUs to make them more akin to ARM hardware. That might provide some benefit, but it is disappointing that we are not seeing factors of five when looking at the big picture.
Comment: Regarding frequency scaling and voltage, we had a summer student study this. The results depended very strongly on the generation of the GPUs; with newer models, the effect was very small. Looking at the breakdown of the CPU runs, the CPU already uses less energy during a GPU run. When we tried reducing the CPU frequency, we also saw gains because the CPU sometimes idles even when oversubscribing. These will be marginal gains on top of our current results. For Athena, I previously showed a maximum speedup of 1.8, but that is only with half of the detector uploaded. That margin will increase, but we cannot yet say by how much.
Q: Are there discussions regarding new concepts for detector Monte Carlo in addition to using GPUs? The fundamental structure of detector Monte Carlo is essentially an enormous Markov chain problem, and shortcuts are not easy. All Monte Carlo involves a step, a weighted random decision, and branching. It is an immense chain of if-then-else statements, jumps, and random number generation. Algorithmic shortcuts—with the exception of vector geometry—are not obvious. Everyone is using machine learning now; is the Geant group looking into using machine learning for part of this work?
A: There is a simulation team within the general group. While the experimenters themselves are involved, Geant is not currently trying to "learn" parts of the sub-detectors.
Celeritas - Julien Esseiva
- Update on Celeritas and preliminary numbers for measuring the energy efficiency of EM physics on GPUs.
- The code is optimized for GPUs, but the same code runs on CPUs to ensure reproducibility.
- Both NVIDIA and AMD GPUs are supported.
- Geometry navigation and multiple scattering are the main bottlenecks on GPUs, involving significant random access and branching; double precision is also required, contrasting with the GPU trend toward lower precision.
- As an EM-only standalone problem (not a production workload), it provides an idealized view; a factor-of-two speedup over 16 CPU cores is observed for the full CMS Run 2 detector.
- Simulation is a major CPU expense for experiments (e.g., 40% of ATLAS's CPU hours, with 70% of that spent on EM physics).
- Offloading EM physics to GPUs could free up 20% of the total CPU-hour budget for ATLAS, potentially yielding up to a 3.3x wall-time speedup for individual grid jobs if the EM part were instantaneous.
- Preliminary power measurements on Perlmutter show Celeritas uses about 120 watts per GPU (far below the 400-watt maximum), compared to 50 watts when idle.
- Celeritas is currently slightly slower than Athena because CPU and GPU processing is not parallelized (service processing occurs after the Geant4 event).
- A 2x speedup in EM physics on the GPU could lead to a 1.5x overall wall-time speedup, though overhead exists due to CPU calls for sensitive-detector data.
- A 1.9x speedup achieves parity in "events per watt," meaning GPUs can increase throughput without saving energy; a 1x speedup uses more energy for the same throughput.
- The primary benefit for ATLAS is hardware utilization (e.g., on HPCs), not necessarily faster offline processing; energy efficiency is key for determining cost-effective hardware usage.
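The relation between the EM-only speedup and the overall wall-time speedup quoted above is again Amdahl's law. A minimal sketch using the talk's ~70% EM fraction (the function name is ours; the model assumes the non-EM part is untouched by the offload):

```python
def overall_speedup(em_fraction: float, em_speedup: float) -> float:
    """Amdahl's law: overall wall-time speedup when a fraction
    `em_fraction` of the job is accelerated by `em_speedup`."""
    return 1.0 / ((1.0 - em_fraction) + em_fraction / em_speedup)

f = 0.7  # ~70% of simulation time spent on EM physics (from the talk)
print(round(overall_speedup(f, float("inf")), 1))  # 3.3 - EM instantaneous
print(round(overall_speedup(f, 2.0), 2))           # 1.54 - 2x EM speedup
```

This reproduces both figures in the notes: the 3.3x ceiling when EM transport is free, and roughly 1.5x overall for a 2x EM speedup.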
Q&A
Q: You mentioned that GPUs have a shorter lifetime than CPUs. Is that because they become obsolete or because they actually break?
A: It is mostly due to obsolescence. GPUs that are only a few years old are often considered too old to be useful, whereas a 10-year-old CPU can still be run.
Comment: We have a tendency to run very old hardware, perhaps because we have not prioritized power costs in the past. We are currently running many T4s on the grid. When you say it is not worth running them, what metric are you using? Is it the cost of power and cooling? A T4 would use less power, even if it is out of date.
A: I was referring to power efficiency. For full simulation, anything older than an A100 seems very inefficient compared to modern CPUs. It would be interesting to know quantitatively what that costs regarding embodied carbon and the crossover points. Practical funding also plays a role; agencies often do not provide budgets for new purchases every year.
With the T4, we do not see much speedup for full simulation. It is very dependent on the application; for example, IceCube has used T4s effectively for a long time. For full simulation, newer GPUs benefit from faster memory, which helps manage random number divergence.
Q: Was your measurement taken with DDR5 or HBM2E memory?
A: It is separate, not unified memory, but I would need to check the frequency. An A100 should be HBM2E.
Q: The GPU power draw is not at the maximum. Is this a highly memory-bound workload?
A: Yes, it is very memory-bound. We are not maxing out the available memory, but the transfer bandwidth from GPU global memory to the multiprocessors is the limiting factor. This is due to divergence; each track performs different tasks and loads different parts of the geometry. We also do not use hardware like Tensor Cores, so we would never reach maximum power usage.
Comment: If you add SIM and RecoSim, the total is close to what we spend on reconstruction. As we improve tracking, the reconstruction slice becomes smaller. Since we have GPU tracking for the HLT, some of this information will change.
Q: Why do we have two separate activities for moving detector simulation to GPUs?
A: That is a question for the project in general. There have been ongoing discussions about whether these projects will eventually converge. They share core components like VecGeom for geometry but differ in how they schedule kernels and implement physics. Since GPUs will certainly be a part of future computing, it is worth investigating. If forces were joined, it might lead to quicker progress.
Comment: Some manufacturers, like NVIDIA, impose a hard lifetime on their products by restricting their use or resale after a certain point. Power consumption depends largely on the silicon process; smaller features require moving less charge, resulting in better efficiency. The funding question also plays a significant role. Funding for operations is usually separate from hardware purchases. It is often easier to maintain funding for power costs than to secure a budget for new equipment.
The next meeting will focus on fast simulation, where we will examine the limitations of fast versus full simulations. See you then.