Alice Weekly Meeting: Software for Hardware Accelerators
-
-
10:00 AM
→
10:20 AM
Discussion 20m, Speaker: David Rohr (CERN)
Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
CHEP Abstracts: https://indico.cern.ch/event/1471803/abstracts/ Deadline Dec. 19.
- Please upload to https://docs.google.com/document/d/1eek6kv_SqHE6b5k0KHs-6wcjHIxCKyQr6eldnhsptvY/edit?tab=t.0
Sync reconstruction
- Crash in TPC ZS decoding when receiving bad data; should check and improve the protection so that we do not crash.
Async reconstruction
- Need to investigate short GPU stall problem.
- The limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
- New GPU architecture selection for async in O2DPG looks good, should be merged.
- Test with GPU GRID jobs at NERSC pending.
- Asked DPG to run first test with ITS tracking on GPU on EPNs.
GPU ROCm / compiler topics:
- Issues that disappeared but not yet understood: random server reboot with alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when DMA engine turned off, MI100 stalling with ROCm 5.5.
- Problem with building ONNXRuntime with MigraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
- Serialization bug pending.
- Miscompilation on MI 100 leading to memory error pending.
- New miscompilation on MI 50 with ROCm 7.0 when RTC disabled.
- New miscompilation on MI 50 on ROCm 6.3 and 7.0 when RTC enabled, with the latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.
- AMD is changing their support structure; we shall file reports via GitHub (which I like, since it is also better traceable). But they will no longer assign an engineer to follow up on all our issues; everything has to go through their normal support process. In particular, that means their compiler team might not be able to fix issues if we do not provide a minimal reproducer.
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
- Final solution: merging transformation maps on the fly into a single flat object:
- Sergey opened a new PR with the fixes and compatibility layer in, currently fails in the CI. Must be fixed, then Matthias can continue commissioning.
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
- New safer, dynamic cluster protection working and deployed at P2.
Other topics:
- GRID memory monitoring: discussed with Maksim; the problem with incorrect values from smaps is already fixed, since they switched to cgroup monitoring.
EPN GPU Topics:
-
10:20 AM
→
10:25 AM
TPC ML Clustering 5m, Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
CCDB fetching
- PR pending: https://github.com/AliceO2Group/AliceO2/pull/14841 (MacOS CI problem unrelated)
- Works in full system test
- What to do about: *mConfParam = mConfig->ReadConfigurableParam()
- What to do about metadata -> Would be good to have in order to separate networks (e.g. with different inputs)
- Uploaded right now: Networks from the commissioning runs. Do we need more?
- Upload sheet: https://docs.google.com/spreadsheets/d/1BGgDFqKnvYLlCK05hn5paQsDaiiE5HrwCErekvDqTv4/edit?usp=sharing
pp simulation
- 3000 min bias events, LHC24af, 1 MHz
- Evaluation with PbPb networks
Almost no occupancy coverage with this data. No advantage to be gained on the cluster properties (regression).

Some improvement for the qTot estimation for wide clusters (qTot / qMax large):

Still, clusters are being rejected with higher thresholds

Efficiency increases in the region where the highest cluster rejection occurs. No fake-rate improvement: the fake rate is already extremely low!

Tracks are so well separated that there is no real improvement to be found
Cluster error / split clusters
10 EV, 38 kHz PbPb, 0-5% centrality enforced
Investigation of split clusters with the NN
- Option 1: Use all MC charges, search for maxima that have no assigned ideal label
- Problem: This can find maxima per MC label which might not correspond to location of maxima in digits
- Option 2: Check training data: Find all training data inputs that have class label 0 (no attached ideal cluster) but exactly one peak in the 5x5 neighbourhood with assignment (red), or multiple peaks each with assignment (blue). If the network rejects such maxima, this will correlate to a reduction in split clusters.
Examples:



- For all clusters with class label 0 (regardless of neighbouring peaks) at 0.1 threshold: 44.7% rejection. This includes looper clusters and noise peaks -> Only small drop in efficiency for split clusters: They are similarly well identifiable.
-
10:25 AM
→
10:30 AM
GPU Parameter Optimizations 5m, Speaker: Gabriele Cimador (CERN, Università and INFN Torino)
-
10:30 AM
→
10:35 AM
Efficient Data Structures 5m, Speaker: Dr Oliver Gregor Rietmann (CERN)
NextGenTrigger Task 1.7
- Presented at two Workshops:
- Started implementing our SoA in MadGraph, together with
- Stefan Roiser
- Daniele Massaro
Implement NGT SoA Code in O2 standalone benchmark
- Working on this fork of the AliceO2 repo, with a CI pipeline:
- Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
- Extended the CI pipeline to fail if GPU.out changes
- Changed the SoA / AoS code to better fit ALICE O2
- Implemented SoA in:
- SectorTracker
- GPUTPCBaseTrackParam
- GPUTPCTrackParam
- GPUTPCTracklet
- Performance is still the same (or maybe 2% slower)
- Next Steps:
- Check if AoS has no overhead due to the new abstraction
- Make better use of SoA to improve performance
-
10:35 AM
→
10:40 AM
Following up GPU to-dos 5m, Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
TODO follow-up:
- Base/GPUReconstructionProcessing.h: namespace gpu_reconstruction_kernels // TODO: Get rid of this namespace
  - Do I need to delete the entire namespace, including the struct deviceEvent and the threadContext class? Case 1: removing only the namespace; Case 2: removing the entire namespace. In both cases I was getting errors during make. What is the logic behind this TODO?
  - Answer: I would like to remove the namespace completely. The point is: right now GPUReconstruction.h only forward-declares deviceEvent and threadContext, and I want to keep it like that. Perhaps you can change it to something like https://godbolt.org/z/K6h53TnPa, then we can get rid of the namespace.
- DataTypes/GPUTPCGMMergedTrackHit.h: // TODO: take them directly from clusterNative header
  - Why are we trying to take the states directly from ClusterNative? Not all of the flag states are declared in ClusterNative.h, so we cannot replace all of them with those from ClusterNative. I tried to remove them one by one but got make errors.
  - Answer: I would like to avoid the copy and paste.
Perhaps you can include clusterNative.h and then reuse the defines from there via flagSplitPad = ClusterNative::flagSplitPad;
- TPCClusterFinder/GPUTPCCFDecodeZS.cxx: for (int32_t l = 0; l < hdr->nTimeBinSpan; l++) { // TODO: Parallelize over time bins
  - Compiled without problems; the performance (timing of the particular kernel) still needs to be checked.
  - Answer: this TODO is no longer needed. The decoder now runs entirely on the GPU, so an OMP parallel for is not meaningful.
-
10:40 AM
→
10:45 AM
TPC Clusterization / OpenCL / Highly Ionizing Particles 5m, Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))
Highly Ionizing Particles
- Rewrote the pad filter kernel in preparation.
- Use one block per TPC row, use tiling to cache in shared memory (on GPU)
- CPU performance is comparable
- GPU performance improved (2x), but with slightly different results
- -> Result differences fixed: they were caused by missing boundary checks
- But now GPU performance degrades by several factors...
GPU Servers
Waiting for last parts.
OpenCL
No news.
Other
- time bin cut in ZS decoding
- raw data encoding seems broken (?)
- Can't test yet for older ZS formats
-
10:45 AM
→
10:50 AM
ITS Tracking 5m, Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)
ITS seeding vertexing news
Motivation
- Algorithm that attempts to find an approximation of vertices before tracking
- Used by ITS tracking to reduce combinatorics (thus ITS seeding vertexer --> ITS tracking)
- Basically tries to roughly find vertices without using full tracking
Current state
Algorithm uses first three layers of ITS, simplified explanation:
1. Compute tracklets made of two clusters on consecutive layers (0-1 and 1-2)
2. Validate the tracklets between layers
3. Extend the tracklets with straight lines
4. For each line i:
   - For each line j = i+1:
     - If the line is already used, skip it
     - If the Distance of Closest Approach (DCA) < cut, create a vertex and mark the lines as used
     - For each line k:
       - If the line is used, skip it
       - If DCA < cut, add it to the current vertex and mark it as used
5. Sort all vertices by number of contributors; for each "vertex cluster" k:
   - For each "vertex cluster" m = k+1:
     - If distance < cut, merge them
6. Sort all vertices by number of contributors; for each cluster k:
   - Promote the biggest one that passes some cuts as the primary vertex
   - Promote the others as vertices if they have low multiplicity and are close to the beam line
- Step 1 and 2 (tracklet creation and tracklet matching) already parallelized on CPU via TBB and ported to GPU by Felix S.
- Rest of the vertexing is purely serial (many sequential dependencies)
- Cannot directly parallelize
- Result dependent on order of evaluation of the lines
- Might miss some better associations because the lines had been already "used"
- This algorithm should be the last step for bringing all ITS tracking to GPUs
What to do
- Talked also with Matteo C.
- He tried to implement a histogram-based algorithm, he told me that it was not ideal due to too many assumptions
- There must be other ways to do this step
My idea
- Use this vertexing algorithm:
Jackson, David (1997). "A topological vertex reconstruction algorithm for hadronic jets". Nuclear Instruments and Methods in Physics Research Section A 388, 247-253. doi:10.1016/S0168-9002(97)00341-0
- Basically, for each track a "gaussian tube" is computed:
  f_i(r) = exp( -1/2 · (r - p_i)^T V^{-1} (r - p_i) )
- Where r is a point in 3D space, p_i is the point of closest approach of track i to the point r, and V is a covariance matrix to adjust the shape of the tube
- The closer the track is to a point, the greater the value (Gaussian-shaped function)
- To find the vertices, compute the vertex function:
  V(r) = Σ_i f_i(r) - ( Σ_i f_i(r)^2 ) / ( Σ_i f_i(r) )
- So high peaks in the function automatically indicate vertex candidates
- The second term suppresses contributions coming from a single line
- Tested this function with some pp TFs. Below: projection onto the transverse plane, integrating over the beam line:

- Clearly, most of the non-zero region of the function peaks around the beam line (0,0)
- Computed the vertex function exactly at the points where the old vertexer was finding vertices, showing that the function indeed signals the presence of vertices (example for a ROF where 6 vertices were found):






- Meanwhile, it can also signal secondary vertices (different ROF than in the previous plots):

The algorithm in a nutshell
- With this function, it is necessary to find the peaks in the 3D space and cluster them to identify vertices
- High multiplicity vertices automatically signaled by high peaks
- By tuning the covariance matrix, the shape of the tube can be optimized, and thus the shape of the vertex function (more or less sensitive to noise)
- Since the function is > 0 only where two or more lines pass close to each other, it is not necessary to scan the whole 3D volume
- Just compute the positions of each vertex made from a pair of lines and cluster the close candidates
- The algorithm is parallelizable over pairs of lines
- Every thread computes the function for a pair of lines --> high compute load (good for GPU, let's see for CPU)
- Or even every block takes a candidate and all threads compute the vertex function in that candidate point
- In this way every pair gets a "chance to shine"
- Merge candidates that are close... still have to think about this
- Also talked with Ruben to understand how global vertexing works --> Density-Based Scan (DBSCAN) for clustering
AOB
Asked DPG for another test of the async production, now with memory clearing implemented (thanks David; hopefully this goes well :))
Rewriting the tracking right now to implement the staggering (no ETA on this; first have to show that it works)
-
10:50 AM
→
10:55 AM
System Run Coordination Topics 5m, Speaker: Ernst Hellbar (CERN)
-