Alice Weekly Meeting: Software for Hardware Accelerators
-
-
10:00
→
10:20
Discussion 20m, Speaker: David Rohr (CERN)
Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check if it works.
- Will tune existing 16-core settings, and add a SITEARCH for 16-core CPU and for 16-core CPU + generic NVIDIA / AMD GPU, as done for 8 cores.
- Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.
GPU ROCm / compiler topics:
- Problem with building ONNXRuntime with MIGraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- Need to check ROCm 7.2 correctness.
- Need to understand and fix crash on RTX Pro 6000 reported by Oliver - Status?
- Need to understand deterministic mode issue on AMD Pro 9700 reported by Oliver - Status?
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix alignment of monte carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix the bogus TPC transformations for good, then we can revert the workaround. The new transformation map has a global 100cm cutoff for corrections; we will keep that.
- Final solution: merging transformation maps on the fly into a single flat object:
- PR is now green in the CI, POD version of the fasttransform and merging of the maps integrated (thanks to Ruben and Matthias and of course Sergey!)
- Now having a full version with map merging, and conversion to POD map, which can run on GPUs.
- The remaining performance regression is cache related, and completely disappears after the map merging (with only 1 map, the pressure on the cache is smaller).
- Nonetheless, together with Sergey trying to fix this. Discussed that we will make cutoff value of 100cm global constant, and remove per-sector grid structure (which was never used) but keep per row grid.
- Together with Matthias, did several additional optimizations of the map, to remove unused pointers, combine geometry constants with the GPUTPCGeometry class, reduce the data size, and make as many constants as possible constexpr.
- CI runs of PR are sometimes red, sometimes green, due to a bug in the combination of how we implemented the AddRootDictionary macro, CMake, and ninja dependency files, leading to false cyclic dependencies for incremental builds when libraries are moved. Could not find a solution, propose to just merge it as is.
- Marked PR as draft, in order not to break other PRs.
- 300 files have changed in the meantime, which easily creates merge conflicts, e.g. after merging the ITS PR yesterday.
- For the final merging, want to wait until we have converged on a final map format, since otherwise if we change it again we need another compatibility layer for reading.
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
- Check for unnecessary f64 instructions in GPU code.
- NVIDIA Blackwell
- Performance regression on Blackwell (RTX 5090, RTX Pro 6000) was also due to cache effects, and disappears with the new merged TPC transformation maps.
- Solved / solving a couple of issues validating the results in deterministic mode:
- Compiler bug or hardware problem on Blackwell with precise floating point division in combination with denormals flushing to zero. (FTZ is enabled in deterministic mode, since some GPUs do not support denormals and I wanted identical deterministic results. To be fair, there is no specification that hardware must support IEEE-compliant float rounding with FTZ enabled, and the fact that x86 supports it does not mean NVIDIA has to…)
- Created a minimal reproducer and will report to NVIDIA, but can only report as normal developer via the forum, since we do not have a special technical contact.
- Avoided by introducing 2 different deterministic modes, with FTZ enabled and disabled. Unfortunately, this means we have to choose the right one for each GPU, and we have 2 (marginally) different reference results on CPU.
- Found one compiler bug (an if-condition branch not taken correctly), which I did not manage to reduce to a minimal example. Worked around it by some code restructuring / simplification, which was on my todo list anyway.
- Had to include one bug-fix by Felix for same CPU + GPU behavior.
- Parallel developments from Felix on clusterization and Gabriele for parameter tuning were incompatible. Fixed and added constraints to be safe in the future.
- In summary, NVIDIA Blackwell performance looks really good now. Still waiting for final results, but currently ~0.6 s processing time compared to ~3.7 s on the MI50.
- Performed quite some restructuring in the GPU code to reduce the number of preprocessor defines, and use constexpr variables where possible. Removed GPUCA defines that people copy&pasted into code that is never compiled for GPU.
Other topics:
- Need to bump ONNXRuntime to 1.24, Giulio is checking, needed for ROCm 7.2 - Status?
- Test at NERSC still ongoing, Irakli is checking why jobs cannot be scheduled.
EPN GPU Topics:
-
10:20
→
10:25
TPC ML Clustering 5m, Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
Cluster error parameterization
- Identified a bug in NN training. Fixed, but did not change much.
- Find out where the NN struggles the most:
- Take all training data and evaluate NN for each point
- Check the loss values per data point (MSE loss), take all points above threshold

- Take histogram of each NN input dimension and normalize...
- ...full input data distribution w/o threshold cut
- ..."loss-outlier" distribution
- Take ratio between the two normalized distributions: hist(outliers) / hist(full data)
→ Resulting trends will indicate where the network struggles the most

[Plots: hist(outliers) / hist(full data) ratio per NN input dimension]
(Similar for other cov matrix elements)
Network struggles most at:
- Low x coordinates (IROC)
- Large clusters: high sigma pad and/or sigma time
- Clusters with small charge (minor effect) and high charge (more pronounced)
- Highly inclined tracks
- High sin(phi), high tan(lambda), high sign(q)/pT (so low pT)
- high multiplicity
- Low cov matrix values → this would hint that the cov matrix is underestimated (maybe?)
-
10:25
→
10:30
GPU Parameter Optimizations 5m, Speaker: Gabriele Cimador (CERN, Università and INFN Torino)
Big refactoring of the tuner (link)
- Now user friendly
- All dependencies installed via pip (except for profilers)
- Automatic GPU vendor detection
- Possibility to add a time budget to indicate a desired duration of the tuning
- Biggest change: independent steps (ensembles of kernels) are now tuned together using multiple Optuna studies at the same time
- Single run of the standalone benchmark
- Profiling of multiple steps with the same run
- One Optuna study per step, each suggesting a new configuration
- This heavily reduces the time needed for tuning
-
10:30
→
10:35
Efficient Data Structures 5m, Speaker: Dr Oliver Gregor Rietmann (CERN)
NextGenTrigger Task 1.7
- CHEP talk together with Jolly got accepted.
- Co-supervision of a summer student. Topic: Improve the clustering algorithm CLUE.
Implement NGT SoA Code in O2 standalone benchmark
- Working on this fork of the AliceO2 repo, with a CI pipeline:
- Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
- Extended the CI pipeline to fail if GPU.out changes
- Implemented SoA in:
- SectorTracker
- GPUTPCBaseTrackParam
- GPUTPCTrackParam
- GPUTPCTracklet
- GPUTPCTrack
- Some issues with AMD W7900 (mostly due to WARP_SIZE and because the architecture gfx1100 is not built-in)
- Works fine if I add architecture "gfx1100" in FindO2GPU.cmake and a new column "RDNA3" in GPUParameters.csv.
- If I don't change FindO2GPU.cmake, but add AMD_EUS_PER_CU=4 as default in GPUParameters.csv, we have the following issues:
- Terminal spammed with "This sorting variant is disabled for RTC" even without RTC.
- Runs with RTC without failure, is deterministic, but GPU.out is (slightly) wrong.
- Runtime (segfault) error with (and only with) custom .par file that otherwise works.
- Just putting COMP_GATHER_MODE=3 (previously 0) in the default column of GPUParameter.csv fixes the segfault and yields correct GPU.out with custom .par file (still wrong GPU.out without custom .par file).
- Detailed guide for reproduction is in the appended files.
- Some things I noticed along the way:
- WARP_SIZE seems not to be set from .par file. (Can be worked around with --RTCoverrideWarpSize.)
- In my case, --PROChipOverrideAMDEUSperCU is useless because already the build fails due to a static_assert in GPUReconstructionKernelMacros.h, line 37.
- In GPU/GPUTracking/CMakeLists.txt, the variable GPU_PARAM_JSON_FILE actually points to a CSV file.
- In the same file, we create an actual JSON file CONVOUTFILE out of it. It's called "GPUParameters_GPUParameters_0.json". If this is not intended, we should change line 282.
- Should we make the variables in GPU/GPUTracking/Standalone/cmake/config.cmake overrideable during the CMake step (e.g. -DENABLE_HIP=1)?
- Next Steps:
- Build and run O2 with a C++26 compiler to integrate reflection (needed for CHEP benchmark)
- Write minimal reproducer for the (weird) behavior I observed with AMD W7900 and custom .par file.
- Make better use of SoA to improve performance
- Try David's suggestion
-
10:35
→
10:40
Following up GPU to-dos 5m, Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
-
10:40
→
10:45
TPC Clusterization / OpenCL / Highly Ionizing Particles 5m, Speaker: Felix Weiglhofer (CERN)
OpenCL
No news.
GPU Servers
CI Server: No update. Sergio away this week.
Highly Ionizing Particles
Opened PR for the current implementation.
Simulation Settings
Mesut: Ion tail disabled in simulation as it's filtered in FPGA.
(Defaults in FST are other way around: doIonTailPerPad=1, doSaturationTail=0)
Filter Performance
Tail filter performs poorly, even with very low cutoff (ADC=5).
Likely a bug; still tried to better understand how the ADCs are distributed in the tail:

How values behave wrt peak distance:

Tail length:

Misc
David discovered bug in noisy pad filter. Fixed in #15278.
Caused by a mismatch in pad indexing between GPU and CPU version.
This was never discovered because GPU and CPU used to have the same bug, but accidentally fixed on GPU in #15001.
- 10:45 → 10:50
-
10:50
→
10:55
System Run Coordination Topics 5m, Speaker: Ernst Hellbar (CERN)
-