Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
    • 10:00 AM 10:20 AM
      Discussion 20m
      Speaker: David Michael Rohr (CERN)

      Color code: (critical news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
      • Test with GPU GRID jobs at NERSC pending.
      • Will tune the existing 16-core settings and add a SITEARCH for 16-core CPU, and for 16-core CPU + generic NVIDIA / AMD GPU, as done for 8 cores.
      • Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.

       

      GPU ROCm / compiler topics:

      • Problem with building ONNXRuntime with MigraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Tested ROCm 7.2 on MI50 / MI100 / MI210. Runs stably on MI50 / MI210, not yet checked for correctness. Crashes randomly on the MI100, but the pattern seems different from the serialization bug we had before.
      • Need to understand and fix crash on RTX Pro 6000.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix alignment of monte carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Maps now yield correct results, but there is a 1.5x performance regression when running on GPUs; must be investigated.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
      • TODO: workaround for the wrong field used for encoding online; make memory scaling factors configurable via ConfigurableParam.

       

      Other topics:

      • Need to bump ONNXRuntime to 1.24 (needed for ROCm 7.2); Giulio is checking. Status?
      • Status of bumping CMake and boost (https://github.com/alisw/alidist/pull/6135):
        • Required adapting / bumping ~30 packages; now nearly done.
        • Remaining issues: one problem in O2 (only on Mac) and one in O2Physics (wrong boost usage); PRs with fixes are open.
        • Need new DD tag, PR open.
        • Problem with the new libwebsocket on RHEL7 due to bogus kernel headers in that version colliding with glibc. Must either switch the AliRoot CI to SLC9 and drop SLC7 support, or disable IPv6 for SLC7.

       

      EPN GPU Topics:

       

    • 10:20 AM 10:25 AM
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Cluster errors

      • General observations
        • NLL (negative log-likelihood) loss works better for convergence than MSE (mean-square error) loss
        • Scaling necessary (requires tuning of a parameter; currently only one parameter, but optimally two separate ones)
        • Training now rather stable and testing different configurations
          • Also tested the sigma / sqrt(qTot) idea. Worked decently well and reaches a similar number of tracks, but efficiency is down by 10-15%.
            • Feeding in both cluster and track position completely deteriorated the fit; no long tracks found.
            • Getting rather good results now for long tracks, even though the total number of tracks is still not as high as with the default method -> need to adjust the scaling parameter.

       

      • Next try: Retuning x and y scaling separately
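      The NLL-vs-MSE observation above can be illustrated with a minimal numpy sketch, assuming a toy Gaussian error model (all names and numbers here are hypothetical, not from the actual training setup). The key difference is that the NLL lets the network predict a per-cluster uncertainty sigma alongside the position, which down-weights residuals on genuinely noisy clusters:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: penalizes all residuals uniformly."""
    return np.mean((y_true - y_pred) ** 2)

def gaussian_nll_loss(y_true, mu, sigma):
    """Negative log-likelihood of a Gaussian: the model predicts both the
    cluster position mu and its uncertainty sigma, so a large residual on
    a noisy cluster is down-weighted by a large predicted sigma."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + 0.5 * ((y_true - mu) / sigma) ** 2)

# Toy example: two clusters, the second intrinsically noisier.
y_true = np.array([0.0, 0.0])
mu     = np.array([0.1, 1.0])   # predicted positions
sigma  = np.array([0.2, 1.5])   # predicted per-cluster errors

print(mse_loss(y_true, mu))
print(gaussian_nll_loss(y_true, mu, sigma))
```

      A well-calibrated sigma lowers the NLL; an over-confident (too small) sigma raises it sharply, which is what drives the convergence behavior mentioned above.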

       

       

    • 10:25 AM 10:30 AM
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      No recent news

      Step / Kernel                           MI50 32GB [ms]   Radeon VII [ms]   Speedup
      clusterizer step                              685.15           740.56        1.08
      mergerCollect                                  70.74            68.61        1.03
      mergerTrackFit                                640.83          1452.46        2.26
      tracklet step                                1139.16          1305.79        1.15
      compression unattached                        643.22           645.79        1.00
      mergerSectorRefit                             303.76           628.68        2.07
      FollowLoopers + compression attached         1116.16          1077.43        1.03
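      As a sanity check of the table, the Speedup column appears to be the ratio of the slower to the faster device per step (a reading inferred from the numbers, not stated explicitly in the notes):

```python
# Consistency check of the timing table: speedup = max(t1, t2) / min(t1, t2),
# i.e. the slowdown of whichever device is slower on that step.
rows = {
    "clusterizer step":                     (685.15, 740.56, 1.08),
    "mergerCollect":                        (70.74, 68.61, 1.03),
    "mergerTrackFit":                       (640.83, 1452.46, 2.26),
    "tracklet step":                        (1139.16, 1305.79, 1.15),
    "compression unattached":               (643.22, 645.79, 1.00),
    "mergerSectorRefit":                    (303.76, 628.68, 2.07),
    "FollowLoopers + compression attached": (1116.16, 1077.43, 1.03),
}
for name, (mi50, radeon, quoted) in rows.items():
    ratio = max(mi50, radeon) / min(mi50, radeon)
    assert abs(ratio - quoted) < 0.01, name  # matches to within rounding
```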
    • 10:30 AM 10:35 AM
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      NextGenTrigger Task 1.7

      • CHEP talk together with Jolly got accepted

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo, with a CI pipeline:
        • Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
        • Extended the CI pipeline to fail if GPU.out changes
      • Implemented SoA in:
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTracklet
          • GPUTPCTrack
      • The SoA library changed a lot in the course of implementing it in O2; it has now been adapted accordingly.
      • Next Steps:
        • Make better use of SoA to improve performance
        • Try David's suggestion

      Comparison to other SoA approaches

      Issues with AMD W7900 after updating to latest O2

      1. Providing a .par file with certain parameters results in non-deterministic behavior, despite deterministic mode.
      2. Cannot create my own .par files with dumpGPUDefParam.C of the newest O2; yields an "incompatible launch bounds" error.
      3. The only parameters that worked on the W7900 are the ones for the MI100 with WARP_SIZE changed to 32.
      4. Other parameters (e.g. RDNA) yield deterministic behavior, but with a (slightly) different GPU.out than obtained with other GPUs.

      Other issues

      1. In my setting, I couldn't just add new GPUs to GPUParameters.csv: The headers of the newly added GPUs were created, but not the corresponding .par files.
      2. A particular build dependency of O2 is missing on CVMFS: ninja/fortran-v1.11.1.g9-16 (only .g9-15 is present). I am in touch with Sergio.
      3. In GPUParameters.csv, the AMD GPUs RDNA and MI210 are missing, but they show up in FindO2GPU.cmake. Is this intended?

       

    • 10:35 AM 10:40 AM
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 AM 10:45 AM
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (CERN)

      OpenCL

      No news.

      GPU Servers

      Waiting for the PCIe 8-pin cable; have not heard from Guy yet.

      Highly Ionizing Particles

      Investigated the crash on NVIDIA (happened even when not triggered): FIXED.

      Tail masking working with injected tails.

      Next steps: 

      • Test with simulated TF
      • Add flag to toggle tail masking

       

    • 10:45 AM 10:50 AM
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      Felix: worked on porting the ITS2 staggered-readout tracking to the GPU code (not done yet; there is a crash I still have to understand, but the CPU version works).

       

      Gabriele:

      • Talked with Max Puccio
      • Restricting the computation of the vertex function to V(0,0,z) might be problematic due to a possible shift of the beam line
      • Currently implemented a 3D version:
        • For each voxel centre, the vertex function is computed
        • Then a peak finder is run to find peaks and obtain vertex candidates
        • Currently tuning parameters with Optuna
        • Precision: 1.0000, Recall: 0.7133; let's see where the plateau is
      • The 3D version is too expensive anyway (~16 seconds per PbPb TF)
      • If tuning gives good results, I will try to find a way to preselect tracks per voxel to restrict computations
      • Will try another implementation with 3d histogramming
        • Basically where the track contributes only for bins where it passes through
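      The voxel-scan idea can be sketched as follows, assuming straight-line toy tracks and a Gaussian per-track contribution (all geometry and parameters here are hypothetical; a real peak finder would search for local maxima to get multiple vertex candidates, rather than a single argmax):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
true_vtx = np.array([0.0, 0.0, 2.0])

# Toy tracks: straight lines through the true vertex with random directions.
tracks = []
for _ in range(20):
    d = rng.normal(size=3)
    d /= np.linalg.norm(d)
    tracks.append((true_vtx.copy(), d))

def vertex_function(x, sigma=0.1):
    """Sum of Gaussian track contributions at voxel centre x: each track
    contributes according to its perpendicular distance to the voxel."""
    total = 0.0
    for p, d in tracks:
        v = x - p
        dist = np.linalg.norm(v - np.dot(v, d) * d)  # point-to-line distance
        total += np.exp(-0.5 * (dist / sigma) ** 2)
    return total

# Evaluate the vertex function on a coarse voxel grid around the beam line,
# then take the highest voxel as the vertex candidate.
xs = ys = np.linspace(-0.5, 0.5, 5)
zs = np.linspace(-10.0, 10.0, 41)
best_val, best_c = -1.0, None
for cx, cy, cz in itertools.product(xs, ys, zs):
    val = vertex_function(np.array([cx, cy, cz]))
    if val > best_val:
        best_val, best_c = val, (cx, cy, cz)
print(best_c)  # close to the true vertex (0, 0, 2)
```

      The cost scales as (voxels x tracks), which is why preselecting tracks per voxel, or histogramming so a track only fills the bins it passes through, cuts the work substantially.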
    • 10:50 AM 10:55 AM
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)

      P2 tests with replay pp 750 kHz

      • MI100 fully serialized
      • O2_GPU_RTC=0

       

      gpu-reco TF processing rates:

      MI50 - 1.97 Hz
      MI100 - 1.74 Hz

      • theoretically (using simple scaling) sufficient for 1374 kHz with the full farm (270 MI50s, 65 MI100s), but not for the TPC high-rate tests for FC stability
        • can simply use the 2025 PbPb software for the high-rate tests
          • can enable the GPU RTC; we expect it to be fast enough for 50 kHz PbPb when disabling looper following

       

       

      P2 tests with replay PbPb 50 kHz

      • MI100 fully serialized
      • O2_GPU_RTC=0

       

      gpu-reco TF processing rates:

      MI50 - 0.64 Hz
      MI100 - 0.78 Hz

      • slowdown of 28% on MI50 and 23% on MI100 compared to the 2025 PbPb SW version (with O2_GPU_RTC=1)
      • theoretically sufficient for 30 kHz with full farm (270 MI50s, 65 MI100s)
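      The "simple scaling" above can be reproduced with back-of-envelope arithmetic, assuming a total TF rate of ~352 Hz (32-orbit TFs; this rate is an assumption, not stated in the notes) and linear scaling of per-node throughput with interaction rate:

```python
# Farm capacity estimate from the per-node gpu-reco TF rates quoted above.
# ASSUMPTION: ~352 Hz total TF rate (32-orbit TFs at the LHC orbit period);
# throughput is assumed to scale linearly with interaction rate.
N_MI50, N_MI100 = 270, 65
TF_RATE_HZ = 352.0

def max_interaction_rate(khz_replayed, mi50_hz, mi100_hz):
    """Highest interaction rate the full farm could keep up with."""
    farm_hz = N_MI50 * mi50_hz + N_MI100 * mi100_hz
    return khz_replayed * farm_hz / TF_RATE_HZ

print(max_interaction_rate(750, 1.97, 1.74))  # pp:   ~1374 kHz, as quoted
print(max_interaction_rate(50, 0.64, 0.78))   # PbPb: ~32 kHz, quoted as ~30 kHz
```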