ALICE Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
    • 10:00 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
      • Test with GPU GRID jobs at NERSC pending.
      • Will tune the existing 16-core settings, add a SITEARCH for 16-core CPU and for 16-core CPU + generic NVIDIA / AMD GPU, as already done for 8 cores.
      • Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.

       

      GPU ROCm / compiler topics:

      • Problem with building ONNXRuntime with MigraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
        • Serialization bug pending.
        • Miscompilation on MI100 leading to a memory error pending.
        • New miscompilation on MI50 with ROCm 7.0 when RTC is disabled.
        • New miscompilation on MI50 with ROCm 6.3 and 7.0 when RTC is enabled, with the latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.
        • Waiting for ROCm 7.2, which could fix the MI100 serialization issue for good. Not clear yet with regards to miscompilation problems.
      • Dev2 server available with MI210 and ROCm 7.2: MI50 and MI210 running but need further validation; MI100 randomly crashing, but with a different problem, not the serialization issue.
      • Seeing a crash on RTX6000 NVIDIA GPU with some data sets, must be either a bug in our code or on the NVIDIA side.
      • New GPU Builder Container with CUDA 13.1.1 available, supporting GCC 15.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Maps now yield correct results, but there is a 1.5x performance regression when running on GPUs; must be investigated.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542

       

      Other topics:

      • Molex connectors and cables for GPU CI server arrived, need to assemble them.
      • All GPU users must update to aliBuild 1.17.40; GPU builds are disabled on older aliBuild versions to avoid random build errors without a proper error message.
      • Major improvement of GPU CMake and parameter loading:
        • Instead of detecting individual architectures, now performing a >= comparison on compute capabilities: if no tuning for the exact architecture is available, take the parameters of the closest previous architecture with tuned values.
        • Support building O2 with database of tuned parameters provided in CSV format.
        • Support to merge on-the-fly multiple database files in CSV and JSON format.
        • Can create binary .par files for loading at RTC from CSV and JSON file with simple script.
        • Sped up GPU CMake from ~2sec to ~0sec.
      • Switched away from hard-exporting ALIBUILD_O2_FORCE_GPU=1 in the GPU builder container; moved it to the env files of CI jobs and to the defaults of Jenkins builders where GPU is needed, so that it can e.g. be disabled for dataflow-defaults CI jobs / Jenkins builders.
        • Everyone using the slc9-gpu-builder container, or Jenkins, to build with GPU: note that to get the old behavior, ALIBUILD_O2_FORCE_GPU=1 must be exported.
        • Note that by default GPUs are autodetected, so all backends are available, but builds target the fallback architectures if no GPU is detected (MI50 for AMD; sm_75-virtual for NVIDIA), not our default list of production architectures (including MI100, RTX Pro 6000, ...).
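      The architecture fallback described above (take the tuned parameters of the closest previous compute capability when the exact one is not in the database) can be sketched as follows. This is an illustrative sketch only; the struct and function names are hypothetical, not the actual O2 CMake / parameter-loading code.

      ```cpp
      #include <map>

      // Hypothetical tuned-parameter record (illustrative fields).
      struct TunedParams {
        int blockSize;
        int gridFactor;
      };

      // Database keyed by compute capability (e.g. 75 for sm_75).
      // Returns the entry for the closest capability <= the requested one,
      // or nullptr if no tuned architecture lies at or below it.
      inline const TunedParams* findTunedParams(const std::map<int, TunedParams>& db,
                                                int computeCapability)
      {
        // First entry strictly above the requested capability, then step back one.
        auto it = db.upper_bound(computeCapability);
        if (it == db.begin()) {
          return nullptr; // nothing tuned at or below this capability
        }
        return &std::prev(it)->second;
      }
      ```

      For example, with tuned entries for capabilities 75 and 86, a query for 89 falls back to the 86 parameters, while a query for 70 yields no match.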

       

      EPN GPU Topics:

       

    • 10:20 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Cluster error parameterisation

       

      Goal: Train the cluster errors to approximate a Gaussian error around the track tube for any given cluster

      • First idea: cluster error = sqrt((cluster-to-track residual)^2 - track covariance) -> if the track covariance gets larger than the residual, we have a problem (negative argument)
      • Improved idea for loss function
        • Assume Gaussianity of cluster-to-track residuals and cluster errors: maximize the probability of the Gaussian by tuning its standard deviation
        • p(d, S) = 1/sqrt(2\pi S) exp(-d^2/(2S)), with d = cluster-to-track residual and S = track covariance + cluster_error^2
        • Maximizing p(d, S) == minimizing -ln(p(d, S)), i.e. minimize: Loss = 0.5*ln(S) + d^2/(2S), where S = track covariance + NN output
      • Input for NN: 
        ["clusterState", "xx", "yy", "zz", "cluster.getSigmaPad()", "cluster.getSigmaTime()", "mP[2]", "mP[3]", "mP[4]", "mC[0]", "mC[2]", "mC[5]", "mC[9]", "mC[14]"]
        • Change clusterState into individual 0/1 inputs using (clusterState >> 0) & 1, (clusterState >> 1) & 1, ... to avoid large input values and allow better distinction
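      A minimal sketch of the loss and input encoding described above (function names are illustrative, not the actual training code):

      ```cpp
      #include <cmath>
      #include <cstdint>
      #include <vector>

      // Gaussian negative log-likelihood as in the loss above:
      // Loss = 0.5*ln(S) + d^2/(2S), with S = track covariance + (NN cluster error)^2.
      // The constant 0.5*ln(2*pi) is dropped, as it does not affect the minimization.
      inline float clusterErrorLoss(float residual, float trackCov, float nnError)
      {
        const float S = trackCov + nnError * nnError;
        return 0.5f * std::log(S) + residual * residual / (2.0f * S);
      }

      // Decompose the clusterState bit mask into individual 0/1 inputs,
      // (clusterState >> k) & 1, instead of feeding the raw integer to the NN.
      inline std::vector<float> clusterStateBits(uint8_t clusterState, int nBits = 8)
      {
        std::vector<float> bits(nBits);
        for (int k = 0; k < nBits; ++k) {
          bits[k] = static_cast<float>((clusterState >> k) & 1);
        }
        return bits;
      }
      ```

      With a perfect prediction (S matching the residual spread), the loss is minimized; e.g. for d = 0, track covariance 1 and NN error 0, the loss evaluates to 0.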

       

      • NN implemented in reco chain:
        • Simulation with standard reco: 388k tracks
        • Simulation with new NN cluster errors (otherwise identical): 120k tracks
          • Also tried scaling the error but still far off

       

      -> Something's fishy...

      • The problem is clusters for which the learned cluster error is 0
      • A potential cause is the training data selection: https://github.com/ChSonnabend/AliceO2/blob/7208318395ce370d4ba522d05e1d31b00ebe29a3/GPU/GPUTracking/Merger/GPUTPCGMTrackParam.cxx#L305

       

      New implementation

      • Removed ReLU in final output layer -> Removes entries at prediction = 0
      • Top: default reco; bottom: with the new cluster error NN
      • 388k tracks (default) vs. 392k tracks (new NN)

       

    • 10:25 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      See table 

      In summary: finished simulated Pb-Pb tuning for the RTX6000, started tuning on the dev pdp server

    • 10:30 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      NextGenTrigger Task 1.7

      • Implementing our SoA code in MadGraph, together with
        • Stefan Roiser
        • Daniele Massaro
      • No news

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo, with a CI pipeline:
        • Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
        • Extended the CI pipeline to fail if GPU.out changes
      • Implemented SoA in:
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTracklet
          • GPUTPCTrack
      • Went back from CRTP to classical inheritance
      • Implemented new way of providing iterators in O2: https://godbolt.org/z/haas6YP6c
      • Next Steps:
        • Adapt unit test to the changes
        • Adapt benchmarks to the changes
        • Make better use of SoA to improve performance
        • Try David's suggestion
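      The SoA-with-iterators approach linked above can be sketched with a proxy-based iterator, in this spirit (field and class names are illustrative, not the actual NGT / O2 code):

      ```cpp
      #include <cstddef>
      #include <vector>

      // Minimal SoA container: one contiguous array per field instead of an
      // array of structs, with a lightweight proxy so that range-based for
      // loops still look like iteration over individual tracks.
      struct TrackSoA {
        std::vector<float> x, y; // illustrative fields

        struct Proxy { // view of one logical "element"
          TrackSoA* soa;
          std::size_t i;
          float& x() { return soa->x[i]; }
          float& y() { return soa->y[i]; }
        };

        struct Iterator {
          TrackSoA* soa;
          std::size_t i;
          Proxy operator*() { return {soa, i}; }
          Iterator& operator++() { ++i; return *this; }
          bool operator!=(const Iterator& o) const { return i != o.i; }
        };

        Iterator begin() { return {this, 0}; }
        Iterator end() { return {this, x.size()}; }
      };
      ```

      A loop like `for (auto t : soa) t.x() += 1;` then touches one contiguous field array at a time, which is the memory-layout benefit SoA is after.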
    • 10:35 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (CERN)

      OpenCL

      No news.

      GPU Servers

      pdp-dev00 and epn100 are in the same Slurm partition. Access to both is protected via an e-group, for both ssh and Slurm.

      TODO: Add documentation to PDP docs.

      Highly Ionizing Particles

      Tail masking is almost done; currently testing the implementation.

    • 10:45 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      Felix: no news, besides noticing that in the Pb-Pb production we did not use the mean-vertex as a constraint (added in #15054)

      Gabriele:

      • Tested the vertex function on simulated PbPb data
      • Reminder: the vertex function gives a score for each point in the ITS volume, summing a per-track contribution f_i, where i is the i-th track from the first three inner layers and r - p is the distance of the track to the point
      • Currently tested this strategy: voxelize the 3d volume around the beampipe:
        const int nZ = 2000;
        const int nX = 40;
        const int nY = 40;
        const float xMin = -.3f, xMax = .3f;
        const float yMin = -.3f, yMax = .3f;
        const float zMin = -11, zMax = 11;
      • Compute the vertex function for each point (i.e. sum over all tracks)
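      The computation above (sum a per-track contribution f_i at each voxel) can be sketched as follows. This is an illustrative sketch only: f_i is taken here as a Gaussian of the track-to-point distance, which may differ from the actual functional form used in the study, and all names are hypothetical.

      ```cpp
      #include <array>
      #include <cmath>
      #include <vector>

      // A track approximated as a straight line: a point on it and a
      // normalized direction vector.
      struct Line {
        std::array<float, 3> point, dir;
      };

      // Distance from point p to the line: |(p - point) - ((p - point)·dir) dir|
      inline float distToLine(const Line& t, const std::array<float, 3>& p)
      {
        std::array<float, 3> d{p[0] - t.point[0], p[1] - t.point[1], p[2] - t.point[2]};
        const float proj = d[0] * t.dir[0] + d[1] * t.dir[1] + d[2] * t.dir[2];
        float s = 0;
        for (int k = 0; k < 3; ++k) {
          const float c = d[k] - proj * t.dir[k];
          s += c * c;
        }
        return std::sqrt(s);
      }

      // Vertex function V(p) = sum_i f_i(p); f_i here is illustratively a
      // Gaussian of the track-to-point distance with width sigma (in cm).
      inline float vertexFunction(const std::vector<Line>& tracks,
                                  const std::array<float, 3>& p, float sigma = 0.05f)
      {
        float v = 0;
        for (const auto& t : tracks) {
          const float d = distToLine(t, p);
          v += std::exp(-0.5f * d * d / (sigma * sigma));
        }
        return v;
      }
      ```

      Evaluating this at every voxel of the grid above, each track contributes ~1 near its line and ~0 elsewhere, so points where many tracks cross (the vertex) show a clear peak.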

       

      Simulated PbPb data

      Max over x, y of V(x, y, z) vs. z; the vertical line is the z of the true vertex in Read-Out Frame (ROF) 31:
      Clear peak of the function around the true vertex

      So the function can carry the signal of the presence of a vertex in PbPb conditions. However, for the next ROF, n. 32:

      In this ROF there were no vertices, but the signal of the previous vertex is still present (even if with less intensity). The old seeding vertexer was also finding a vertex there. Is this a known problem?

      Pitfalls

      • Extremely heavy to compute on the CPU:

        Old CPU implementation, 20 threads, vertexing part serial:
        Vertex seeding total elapsed time: 1666.8417 ms for 133 (320 + 85) vertices found in 36/191 ROFs

        Vertex function on CPU with 20 threads, computing the vertex function for a single ROF:
        Celled vertex function time: 1811.853949 ms

      • Result dependent on binning

      Next steps

      • Since the vertex function seems informative, try it on GPU and check the performance
      • If the performance is good, proceed in this way:
      • Reduce the number of computations by evaluating the function only at points where a pair of tracks passes close by
      • Select the tracks with a high contribution
      • Perform a fit of the vertex with those tracks
    • 10:50 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)