Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 10:00 → 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical news from this week: blue, news from last week: purple, no news: black)

      CHEP Abstracts: https://indico.cern.ch/event/1471803/abstracts/ Deadline Dec. 19.

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
      • New GPU architecture selection for async in O2DPG looks good, should be merged.
      • Test with GPU GRID jobs at NERSC pending.
      • Asked DPG to run first test with ITS tracking on GPU on EPNs.

       

      GPU ROCm / compiler topics:

      • Issues that disappeared but are not yet understood: random server reboot with Alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when the DMA engine is turned off, MI100 stalling with ROCm 5.5.
      • Problem with building ONNXRuntime with MIGraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
        • Serialization bug pending.
        • Miscompilation on MI100 leading to memory error pending.
        • New miscompilation on MI50 with ROCm 7.0 when RTC is disabled.
        • New miscompilation on MI50 with ROCm 6.3 and 7.0 when RTC is enabled, with the latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.
        • No update from AMD, but AMD wrote that they are reorganizing their support process, and we should have a meeting in the coming weeks.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels (a generic alignas() illustration follows this list): https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Sergey opened a new PR with the fixes and compatibility layer in, currently fails in the CI. Must be fixed, then Matthias can continue commissioning.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
        • New tracking is actually faster in the refit, but slower in the looper following. Raw data has more loopers than MC --> faster in MC, slower in raw data overall. Can disable looper following to gain some speed for 50 kHz.
        • TPC laser calib issue fixed; it was a bug in the tracking developments affecting only triggered data.
        • Large memory usage in pp on the GRID - fixed.
        • Gain calib issue was due to new miscompilation.
        • Next iteration of tracking improvements in draft PR: https://github.com/AliceO2Group/AliceO2/pull/14651
      • Should get Gabriele's new parameters merged for Pb-Pb.
        • Merged; we definitely needed that speedup. Gabriele can report next week.
      • Investigating deficiencies with TPC cluster removal.
        • With the new tracking improvements, we are rejecting more clusters (reducing fakes), and such rejected clusters were not protected from removal. This led to some degradation in the async reco from CTF, since apparently such clusters were needed for the seeding. Fixed in latest O2.
        • In sync reco, we were not using the relaxed cuts to find tracks. Thus we found fewer tracks in async, and some tracks were not protected. That is the main reason we lose tracks in reco from CTF compared to reco from Raw.
        • Immediate measure is to use relaxed cuts in sync reco, which will increase processing time by ~20%. Doing some tests with Ernst. Still need to decide on a final approach to cluster rejection.
        • Added more settings to O2 to steer cluster rejection, and run it in less aggressive mode.
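
      As a reference for the alignas() approach above, a generic illustration of the mechanism; the label layout here is a made-up placeholder, not the actual Monte Carlo label class from O2-5314:

      #include <cstdint>

      struct MCLabelPacked {             // naturally 1-byte aligned: fine on the
        std::uint8_t event, source, id;  // host, but word-wise GPU access needs
      };                                 // stronger alignment guarantees

      struct alignas(4) MCLabel {        // force 4-byte alignment; the struct is
        std::uint8_t event, source, id;  // padded from 3 to 4 bytes
      };

      static_assert(alignof(MCLabelPacked) == 1, "packed layout");
      static_assert(alignof(MCLabel) == 4, "safe for 32-bit word access");
      static_assert(sizeof(MCLabel) == 4, "size rounded up to the alignment");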

       

      Other topics:

      • Assembled the development server yesterday. Unfortunately, got only 32 GB DIMMs, and we have only 4 slots, so we'll have 128 GB, not 256. But that should be OK. (We cannot return these modules and replace them with 64 GB modules.)
        • Mass storage disks were ordered as SAS drives, checking if we can switch to SATA, otherwise will add a cheap SAS controller.
        • Otherwise, server seems fully working. Felix can take care of installation once he is back after his PhD defense.
      • GRID memory monitoring: if PSS is monitored, GPU memory is shown as host memory, thus our GPU jobs are reported as using excessive host memory.

       

      EPN GPU Topics:

       

    • 10:20 → 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      CPU benchmarking

      • --resources-monitoring 2
      • Pb-Pb, 1 simulation: 25 events
        • 10x processing per algorithmic setting
        • tpc-tracker
      Setting                                    CPU wall time mean [s]   std [s]
      default                                    144.6415                 0.8035
      Class: 399 (FC), 0.03;  Reg. 399 (FC)      148.6452                 5.8687
      Class: 399 (FC), 0.1;   Reg. 399 (FC)      147.1118                 8.466
      Class: 399 (FC), 0.03;  Reg. 399 (CNN)     154.8815                 8.3
      Class: 599 (FC), 0.03;  Reg. 599 (FC)      153.6998                 6.6834

       

      • More fluctuation when NN is used
      • For realistic case (399, 0.1): CPU wall time increases by ~2%, but fully compatible within uncertainty

      • Memory profiling (example)

       

      Momentum vector estimation

       

    • 10:25 → 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      ITS seeding vertexing news

      Motivation

      • Algorithm that attempts to find an approximation of vertices before tracking
      • Used by ITS tracking to reduce combinatorics (thus ITS seeding vertexer --> ITS tracking)
      • Basically tries to roughly find vertices without using full tracking

      Current state

      The algorithm uses the first three layers of the ITS; simplified explanation (a schematic sketch follows the list):

      1. Compute tracklets made of two clusters on consecutive layers (0-1 and 1-2)
      2. Validate tracklets between layers
      3. Extend tracklets with a straight line
      4. For each line i
        1. For each line j = i+1
          1. If the line is already used, skip
          2. If the distance of closest approach (DCA) < cut, create a vertex, mark the lines as used, then:
            1. For each line k
              1. If the line is used, skip
              2. If DCA < cut, add it to the current vertex and mark it as used
      5. Sort all vertices by number of contributors
      6. For each "vertex cluster" k
        1. For each "vertex cluster" m = k+1
          1. If distance < cut, merge them
      7. Sort all vertices by number of contributors
      8. For each cluster k
        1. Promote the biggest one that passes some cuts as primary vertex
        2. Promote the others as vertices if they have low multiplicity and are close to the beam line
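
      A schematic C++ sketch of the serial step 4 above, to make the sequential dependencies explicit; the Line and Vertex types, the dca() helper, and the single cut value are illustrative placeholders, not the actual O2 ITS vertexer interfaces:

      #include <cmath>
      #include <cstddef>
      #include <vector>

      struct Line {
        float p[3]; // a point on the line
        float d[3]; // unit direction
      };

      struct Vertex {
        std::vector<int> contributors; // indices of the contributing lines
      };

      // Distance of closest approach between two infinite 3D lines.
      static float dca(const Line& a, const Line& b)
      {
        float n[3] = {a.d[1] * b.d[2] - a.d[2] * b.d[1],
                      a.d[2] * b.d[0] - a.d[0] * b.d[2],
                      a.d[0] * b.d[1] - a.d[1] * b.d[0]};
        float w[3] = {b.p[0] - a.p[0], b.p[1] - a.p[1], b.p[2] - a.p[2]};
        float nn = std::sqrt(n[0] * n[0] + n[1] * n[1] + n[2] * n[2]);
        if (nn < 1e-6f) { // near-parallel: distance of b's point to line a
          float c[3] = {w[1] * a.d[2] - w[2] * a.d[1],
                        w[2] * a.d[0] - w[0] * a.d[2],
                        w[0] * a.d[1] - w[1] * a.d[0]};
          return std::sqrt(c[0] * c[0] + c[1] * c[1] + c[2] * c[2]);
        }
        return std::fabs(w[0] * n[0] + w[1] * n[1] + w[2] * n[2]) / nn;
      }

      // Step 4: the shared 'used' flags are the sequential dependency, and the
      // outcome depends on the iteration order over (i, j).
      std::vector<Vertex> findCandidates(const std::vector<Line>& lines, float cut)
      {
        std::vector<Vertex> vertices;
        std::vector<bool> used(lines.size(), false);
        for (std::size_t i = 0; i < lines.size(); ++i) {
          for (std::size_t j = i + 1; j < lines.size(); ++j) {
            if (used[i] || used[j]) {
              continue; // line already assigned to a vertex
            }
            if (dca(lines[i], lines[j]) < cut) {
              Vertex v;
              v.contributors = {int(i), int(j)};
              used[i] = used[j] = true;
              // Attach further lines (schematic: the real code cuts on the
              // DCA to the vertex candidate, not to line i).
              for (std::size_t k = 0; k < lines.size(); ++k) {
                if (!used[k] && dca(lines[k], lines[i]) < cut) {
                  v.contributors.push_back(int(k));
                  used[k] = true;
                }
              }
              vertices.push_back(std::move(v));
            }
          }
        }
        return vertices;
      }

      Marking lines as "used" as soon as they are attached is exactly what makes the result depend on the evaluation order, as noted below.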

       

      • Steps 1 and 2 (tracklet creation and tracklet matching) are already parallelized on CPU via TBB and ported to GPU by Felix S.
      • Rest of the vertexing is purely serial (many sequential dependencies)
      • Cannot directly parallelize
      • Result dependent on order of evaluation of the lines
        • Might miss some better associations because the lines had already been "used"
      • This algorithm should be the last step for bringing all ITS tracking to GPUs

      What to do

      • Talked also with Matteo C.
      • He tried to implement a histogram-based algorithm; he told me that it was not ideal due to too many assumptions
      • There must be other ways to do this step

      My idea

      • Use this vertexing algorithm: D. Jackson, "A topological vertex reconstruction algorithm for hadronic jets", Nucl. Instrum. Meth. A 388 (1997) 247-253, doi:10.1016/S0168-9002(97)00341-0.
      • Basically, for each track i a "Gaussian tube" is computed:
        f_i(r) = exp( -1/2 (r - p_i)^T V_i^{-1} (r - p_i) )
        where r is a point in 3D space, p_i is the point of closest approach of track i to point r, and V_i is a covariance matrix to adjust the shape of the tube
      • The closer the track is to a point, the greater the value (Gaussian-shaped function)
      • To find the vertices, compute the vertex function
        V(r) = sum_i f_i(r) - ( sum_i f_i(r)^2 ) / ( sum_i f_i(r) )
      • High peaks in the function automatically indicate vertex candidates
      • The second term suppresses contributions coming from only a single line
      • Tested this function with some pp TFs. Below: projection onto the transverse plane, integrating over the beam line:
      • Clearly, most of the non-zero regions of the function peak around the beam line (0,0)
      • Computed the vertex function exactly at the points where the old vertexer was finding vertices, showing that the function indeed signals the presence of vertices (example for a ROF where 6 vertices were found):
      • Meanwhile, it can also signal secondary vertices (different ROF than in the previous plots):

      The algorithm in a nutshell

      • With this function, one needs to find the peaks in 3D space and cluster them so as to identify vertices
      • High-multiplicity vertices are automatically signaled by high peaks
      • By tuning the covariance matrix, the shape of the tube can be optimized, and thus the shape of the vertex function (more or less sensitive to noise)
      • Since the function is > 0 only where two or more lines pass close to each other, it is not necessary to scan the whole 3D volume
        • Just compute the position of each vertex candidate made from a pair of lines and cluster the close candidates
      • Algorithm parallelizable over pairs of lines (see the sketch after this list)
      • Every thread computes the function for a pair of lines --> high compute load (good for GPU, let's see for CPU)
        • Or even: every block takes a candidate and all threads compute the vertex function at that candidate point
      • In this way every pair gets a "chance to shine"
      • Merging candidates that are close still has to be thought through
      • Talked also with Ruben to understand how global vertexing works --> uses a density-based scan (DBSCAN) for clustering
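
      A minimal C++ sketch of the pair-parallel idea, assuming an isotropic Gaussian tube of width sigma instead of the full covariance matrix; the names (Line, pairCandidate, scanPairs) and the threshold are illustrative, not existing O2 code. Every (i, j) pair is independent, so the outer loop maps directly to one GPU thread (or work item) per pair:

      #include <array>
      #include <cmath>
      #include <cstddef>
      #include <vector>

      struct Line {
        std::array<float, 3> p; // a point on the line
        std::array<float, 3> d; // unit direction
      };

      // Squared distance from point r to line l (via the point of closest approach).
      static float distSq(const Line& l, const std::array<float, 3>& r)
      {
        float t = 0.f;
        for (int a = 0; a < 3; ++a) {
          t += (r[a] - l.p[a]) * l.d[a];
        }
        float dsq = 0.f;
        for (int a = 0; a < 3; ++a) {
          float c = r[a] - l.p[a] - t * l.d[a];
          dsq += c * c;
        }
        return dsq;
      }

      // V(r) = sum_i f_i(r) - (sum_i f_i(r)^2) / (sum_i f_i(r)), with the
      // isotropic tube f_i(r) = exp(-d_i^2 / (2 sigma^2)).
      static float vertexFunction(const std::vector<Line>& ls, const std::array<float, 3>& r, float sigma)
      {
        float s = 0.f, s2 = 0.f;
        for (const auto& l : ls) {
          float f = std::exp(-distSq(l, r) / (2.f * sigma * sigma));
          s += f;
          s2 += f * f;
        }
        return s > 0.f ? s - s2 / s : 0.f;
      }

      // Candidate position for a pair: midpoint of the closest-approach segment.
      static std::array<float, 3> pairCandidate(const Line& a, const Line& b)
      {
        float B = 0.f, D = 0.f, E = 0.f;
        for (int k = 0; k < 3; ++k) {
          B += a.d[k] * b.d[k];
          D += a.d[k] * (a.p[k] - b.p[k]);
          E += b.d[k] * (a.p[k] - b.p[k]);
        }
        float den = 1.f - B * B; // unit directions, so A = C = 1
        float s = den > 1e-6f ? (B * E - D) / den : 0.f;
        float t = den > 1e-6f ? (E - B * D) / den : 0.f;
        std::array<float, 3> m;
        for (int k = 0; k < 3; ++k) {
          m[k] = 0.5f * (a.p[k] + s * a.d[k] + b.p[k] + t * b.d[k]);
        }
        return m;
      }

      // Embarrassingly parallel over pairs; candidates above threshold would then
      // be merged by a DBSCAN-like clustering step.
      std::vector<std::array<float, 4>> scanPairs(const std::vector<Line>& ls, float sigma, float thr)
      {
        std::vector<std::array<float, 4>> cands; // x, y, z, V(r)
        for (std::size_t i = 0; i < ls.size(); ++i) {
          for (std::size_t j = i + 1; j < ls.size(); ++j) {
            auto r = pairCandidate(ls[i], ls[j]);
            float v = vertexFunction(ls, r, sigma);
            if (v > thr) {
              cands.push_back({r[0], r[1], r[2], v});
            }
          }
        }
        return cands;
      }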

       

    • 10:30 → 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      NextGenTrigger Task 1.7

      • Workshop November 19 - 21
        • Need to prepare talk about Task 1.7.
        • Collected summaries from people working on Task 1.7 last week
      • Workshop November 6
      • Discussing implementation of our template SoA code in MadGraph with Stefan Roiser
        • Had another discussion with Daniele Massaro about this 
        • Next Steps:
          • Help Daniele Massaro by turning the first class into SoA (beginning of November)

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo
      • Simplified the AoS-SoA wrapper code (see the schematic sketch at the end of this section)
      • Started to apply our AoS-SoA code to: 
        • Merger
          • GPUTPCGMSectorTrack
          • GPUTPCGMTrackParam 
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTrack
      • Started with SectorTracker
        • Get same results (same GPU.out files)
        • Execution time was the same =(
          • Changed more classes to SoA
          • Running with large event set (thanks to Gabriele)
          • Performance still same (or maybe 2% slower)
      • Found a solution for
        • Explicit instantiations
        • GPURestrict()
      • Fixed CI pipelines: they now fail if O2 standalone yields different results
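
      For reference, a minimal sketch of the AoS-to-SoA wrapper idea: one contiguous array per member behind an AoS-style proxy, so kernels keep their familiar syntax. The names (PointSoA, Ref, shift) are illustrative, not the actual NGT/O2 wrapper code:

      #include <cstddef>
      #include <vector>

      struct PointSoA {
        std::vector<float> x, y; // one contiguous array per member

        explicit PointSoA(std::size_t n) : x(n), y(n) {}

        // Lightweight proxy returned by operator[]; behaves like a reference
        // to an AoS element while reading/writing the SoA arrays.
        struct Ref {
          PointSoA& s;
          std::size_t i;
          float& x() { return s.x[i]; }
          float& y() { return s.y[i]; }
        };
        Ref operator[](std::size_t i) { return {*this, i}; }
      };

      // Kernels written against the proxy look like their AoS counterparts, but
      // consecutive iterations (or GPU threads) touch consecutive memory per member.
      void shift(PointSoA& pts, float dx)
      {
        for (std::size_t i = 0; i < pts.x.size(); ++i) {
          pts[i].x() += dx; // contiguous accesses over pts.x
        }
      }
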
    • 10:35 → 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 → 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))
    • 10:45 → 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      no news

    • 10:50 → 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbär (CERN)